• Open

    Semiconductor Fab Scheduling with Self-Supervised and Reinforcement Learning. (arXiv:2302.07162v1 [cs.AI])
    Semiconductor manufacturing is a notoriously complex and costly multi-step process involving a long sequence of operations on expensive and quantity-limited equipment. Recent chip shortages and their impacts have highlighted the importance of semiconductors in the global supply chains and how reliant on those our daily lives are. Due to the investment cost, environmental impact, and time scale needed to build new factories, it is difficult to ramp up production when demand spikes. This work introduces a method to successfully learn to schedule a semiconductor manufacturing facility more efficiently using deep reinforcement and self-supervised learning. We propose the first adaptive scheduling approach to handle complex, continuous, stochastic, dynamic, modern semiconductor manufacturing models. Our method outperforms the traditional hierarchical dispatching strategies typically used in semiconductor manufacturing plants, substantially reducing each order's tardiness and time until completion. As a result, our method yields a better allocation of resources in the semiconductor manufacturing process.  ( 2 min )
    Learning a model is paramount for sample efficiency in reinforcement learning control of PDEs. (arXiv:2302.07160v1 [cs.LG])
    The goal of this paper is to make a strong point for the usage of dynamical models when using reinforcement learning (RL) for feedback control of dynamical systems governed by partial differential equations (PDEs). To breach the gap between the immense promises we see in RL and the applicability in complex engineering systems, the main challenges are the massive requirements in terms of the training data, as well as the lack of performance guarantees. We present a solution for the first issue using a data-driven surrogate model in the form of a convolutional LSTM with actuation. We demonstrate that learning an actuated model in parallel to training the RL agent significantly reduces the total amount of required data sampled from the real system. Furthermore, we show that iteratively updating the model is of major importance to avoid biases in the RL training. Detailed ablation studies reveal the most important ingredients of the modeling process. We use the chaotic Kuramoto-Sivashinsky equation do demonstarte our findings.  ( 2 min )
    A Data Mining Approach for Detecting Collusion in Unproctored Online Exams. (arXiv:2302.07014v1 [cs.CY])
    Due to the precautionary measures during the COVID-19 pandemic many universities offered unproctored take-home exams. We propose methods to detect potential collusion between students and apply our approach on event log data from take-home exams during the pandemic. We find groups of students with suspiciously similar exams. In addition, we compare our findings to a proctored control group. By this, we establish a rule of thumb for evaluating which cases are "outstandingly similar", i.e., suspicious cases.  ( 2 min )
    The Meta-Evaluation Problem in Explainable AI: Identifying Reliable Estimators with MetaQuantus. (arXiv:2302.07265v1 [cs.LG])
    Explainable AI (XAI) is a rapidly evolving field that aims to improve transparency and trustworthiness of AI systems to humans. One of the unsolved challenges in XAI is estimating the performance of these explanation methods for neural networks, which has resulted in numerous competing metrics with little to no indication of which one is to be preferred. In this paper, to identify the most reliable evaluation method in a given explainability context, we propose MetaQuantus -- a simple yet powerful framework that meta-evaluates two complementary performance characteristics of an evaluation method: its resilience to noise and reactivity to randomness. We demonstrate the effectiveness of our framework through a series of experiments, targeting various open questions in XAI, such as the selection of explanation methods and optimisation of hyperparameters of a given metric. We release our work under an open-source license to serve as a development tool for XAI researchers and Machine Learning (ML) practitioners to verify and benchmark newly constructed metrics (i.e., ``estimators'' of explanation quality). With this work, we provide clear and theoretically-grounded guidance for building reliable evaluation methods, thus facilitating standardisation and reproducibility in the field of XAI.  ( 2 min )
    Residual Policy Learning for Vehicle Control of Autonomous Racing Cars. (arXiv:2302.07035v1 [cs.RO])
    The development of vehicle controllers for autonomous racing is challenging because racing cars operate at their physical driving limit. Prompted by the demand for improved performance, autonomous racing research has seen the proliferation of machine learning-based controllers. While these approaches show competitive performance, their practical applicability is often limited. Residual policy learning promises to mitigate this by combining classical controllers with learned residual controllers. The critical advantage of residual controllers is their high adaptability parallel to the classical controller's stable behavior. We propose a residual vehicle controller for autonomous racing cars that learns to amend a classical controller for the path-following of racing lines. In an extensive study, performance gains of our approach are evaluated for a simulated car of the F1TENTH autonomous racing series. The evaluation for twelve replicated real-world racetracks shows that the residual controller reduces lap times by an average of 4.55 % compared to a classical controller and zero-shot generalizes to new racetracks.  ( 2 min )
    A Deep Probabilistic Spatiotemporal Framework for Dynamic Graph Representation Learning with Application to Brain Disorder Identification. (arXiv:2302.07243v1 [cs.LG])
    Recent applications of pattern recognition techniques on brain connectome classification using functional connectivity (FC) neglect the non-Euclidean topology and causal dynamics of brain connectivity across time. In this paper, a deep probabilistic spatiotemporal framework developed based on variational Bayes (DSVB) is proposed to learn time-varying topological structures in dynamic brain FC networks for autism spectrum disorder (ASD) identification. The proposed framework incorporates a spatial-aware recurrent neural network to capture rich spatiotemporal patterns across dynamic FC networks, followed by a fully-connected neural network to exploit these learned patterns for subject-level classification. To overcome model overfitting on limited training datasets, an adversarial training strategy is introduced to learn graph embedding models that generalize well to unseen brain networks. Evaluation on the ABIDE resting-state functional magnetic resonance imaging dataset shows that our proposed framework significantly outperformed state-of-the-art methods in identifying ASD. Dynamic FC analyses with DSVB learned embeddings reveal apparent group difference between ASD and healthy controls in network profiles and switching dynamics of brain states.  ( 2 min )
    Bounding Training Data Reconstruction in DP-SGD. (arXiv:2302.07225v1 [cs.CR])
    Differentially private training offers a protection which is usually interpreted as a guarantee against membership inference attacks. By proxy, this guarantee extends to other threats like reconstruction attacks attempting to extract complete training examples. Recent works provide evidence that if one does not need to protect against membership attacks but instead only wants to protect against training data reconstruction, then utility of private models can be improved because less noise is required to protect against these more ambitious attacks. We investigate this further in the context of DP-SGD, a standard algorithm for private deep learning, and provide an upper bound on the success of any reconstruction attack against DP-SGD together with an attack that empirically matches the predictions of our bound. Together, these two results open the door to fine-grained investigations on how to set the privacy parameters of DP-SGD in practice to protect against reconstruction attacks. Finally, we use our methods to demonstrate that different settings of the DP-SGD parameters leading to the same DP guarantees can result in significantly different success rates for reconstruction, indicating that the DP guarantee alone might not be a good proxy for controlling the protection against reconstruction attacks.  ( 2 min )
    Joint Probability Trees. (arXiv:2302.07167v1 [cs.LG])
    We introduce Joint Probability Trees (JPT), a novel approach that makes learning of and reasoning about joint probability distributions tractable for practical applications. JPTs support both symbolic and subsymbolic variables in a single hybrid model, and they do not rely on prior knowledge about variable dependencies or families of distributions. JPT representations build on tree structures that partition the problem space into relevant subregions that are elicited from the training data instead of postulating a rigid dependency model prior to learning. Learning and reasoning scale linearly in JPTs, and the tree structure allows white-box reasoning about any posterior probability $P(Q|E)$, such that interpretable explanations can be provided for any inference result. Our experiments showcase the practical applicability of JPTs in high-dimensional heterogeneous probability spaces with millions of training samples, making it a promising alternative to classic probabilistic graphical models.  ( 2 min )
    Kernelized Diffusion maps. (arXiv:2302.06757v1 [stat.ML])
    Spectral clustering and diffusion maps are celebrated dimensionality reduction algorithms built on eigen-elements related to the diffusive structure of the data. The core of these procedures is the approximation of a Laplacian through a graph kernel approach, however this local average construction is known to be cursed by the high-dimension d. In this article, we build a different estimator of the Laplacian, via a reproducing kernel Hilbert space method, which adapts naturally to the regularity of the problem. We provide non-asymptotic statistical rates proving that the kernel estimator we build can circumvent the curse of dimensionality. Finally we discuss techniques (Nystr\"om subsampling, Fourier features) that enable to reduce the computational cost of the estimator while not degrading its overall performance.  ( 2 min )
    Optimal Transport for Change Detection on LiDAR Point Clouds. (arXiv:2302.07025v1 [cs.CV])
    The detection of changes occurring in multi-temporal remote sensing data plays a crucial role in monitoring several aspects of real life, such as disasters, deforestation, and urban planning. In the latter context, identifying both newly built and demolished buildings is essential to help landscape and city managers to promote sustainable development. While the use of airborne LiDAR point clouds has become widespread in urban change detection, the most common approaches require the transformation of a point cloud into a regular grid of interpolated height measurements, i.e. Digital Elevation Model (DEM). However, the DEM's interpolation step causes an information loss related to the height of the objects, affecting the detection capability of building changes, where the high resolution of LiDAR point clouds in the third dimension would be the most beneficial. Notwithstanding recent attempts to detect changes directly on point clouds using either a distance-based computation method or a semantic segmentation pre-processing step, only the M3C2 distance computation-based approach can identify both positive and negative changes, which is of paramount importance in urban planning. Motivated by the previous arguments, we introduce a principled change detection pipeline, based on optimal transport, capable of distinguishing between newly built buildings (positive changes) and demolished ones (negative changes). In this work, we propose to use unbalanced optimal transport to cope with the creation and destruction of mass related to building changes occurring in a bi-temporal pair of LiDAR point clouds. We demonstrate the efficacy of our approach on the only publicly available airborne LiDAR dataset for change detection by showing superior performance over the M3C2 and the previous optimal transport-based method presented by Nicolas Courty et al.at IGARSS 2016.  ( 2 min )
    Effects of Locality and Rule Language on Explanations for Knowledge Graph Embeddings. (arXiv:2302.06967v1 [cs.AI])
    Knowledge graphs (KGs) are key tools in many AI-related tasks such as reasoning or question answering. This has, in turn, propelled research in link prediction in KGs, the task of predicting missing relationships from the available knowledge. Solutions based on KG embeddings have shown promising results in this matter. On the downside, these approaches are usually unable to explain their predictions. While some works have proposed to compute post-hoc rule explanations for embedding-based link predictors, these efforts have mostly resorted to rules with unbounded atoms, e.g., bornIn(x,y) => residence(x,y), learned on a global scope, i.e., the entire KG. None of these works has considered the impact of rules with bounded atoms such as nationality(x,England) => speaks(x, English), or the impact of learning from regions of the KG, i.e., local scopes. We therefore study the effects of these factors on the quality of rule-based explanations for embedding-based link predictors. Our results suggest that more specific rules and local scopes can improve the accuracy of the explanations. Moreover, these rules can provide further insights about the inner-workings of KG embeddings for link prediction.  ( 2 min )
    SubTuning: Efficient Finetuning for Multi-Task Learning. (arXiv:2302.06354v2 [cs.LG] UPDATED)
    Finetuning a pretrained model has become a standard approach for training neural networks on novel tasks, resulting in fast convergence and improved performance. In this work, we study an alternative finetuning method, where instead of finetuning all the weights of the network, we only train a carefully chosen subset of layers, keeping the rest of the weights frozen at their initial (pretrained) values. We demonstrate that \emph{subset finetuning} (or SubTuning) often achieves accuracy comparable to full finetuning of the model, and even surpasses the performance of full finetuning when training data is scarce. Therefore, SubTuning allows deploying new tasks at minimal computational cost, while enjoying the benefits of finetuning the entire model. This yields a simple and effective method for multi-task learning, where different tasks do not interfere with one another, and yet share most of the resources at inference time. We demonstrate the efficiency of SubTuning across multiple tasks, using different network architectures and pretraining methods.  ( 2 min )
    Languages are Rewards: Chain of Hindsight Finetuning using Human Feedback. (arXiv:2302.02676v2 [cs.LG] UPDATED)
    Learning from human preferences is important for language models to be helpful and useful for humans, and to align with human and social values. Existing works focus on supervised finetuning of pretrained models, based on curated model generations that are preferred by human labelers. Such works have achieved remarkable successes in understanding and following instructions (e.g., InstructGPT, ChatGPT, etc). However, to date, a key limitation of supervised finetuning is that it cannot learn from negative ratings; models are only trained on positive-rated data, which makes it data inefficient. Because collecting human feedback data is both time consuming and expensive, it is vital for the model to learn from all feedback, akin to the remarkable ability of humans to learn from diverse feedback. In this work, we propose a novel technique called Hindsight Finetuning for making language models learn from diverse human feedback. In fact, our idea is motivated by how humans learn from hindsight experience. We condition the model on a sequence of model generations paired with hindsight feedback, and finetune the model to predict the most preferred output. By doing so, models can learn to identify and correct negative attributes or errors. Applying the method to GPT-J, we observe that it significantly improves results on summarization and dialogue tasks using the same amount of human feedback.  ( 2 min )
    A modern look at the relationship between sharpness and generalization. (arXiv:2302.07011v1 [cs.LG])
    Sharpness of minima is a promising quantity that can correlate with generalization in deep networks and, when optimized during training, can improve generalization. However, standard sharpness is not invariant under reparametrizations of neural networks, and, to fix this, reparametrization-invariant sharpness definitions have been proposed, most prominently adaptive sharpness (Kwon et al., 2021). But does it really capture generalization in modern practical settings? We comprehensively explore this question in a detailed study of various definitions of adaptive sharpness in settings ranging from training from scratch on ImageNet and CIFAR-10 to fine-tuning CLIP on ImageNet and BERT on MNLI. We focus mostly on transformers for which little is known in terms of sharpness despite their widespread usage. Overall, we observe that sharpness does not correlate well with generalization but rather with some training parameters like the learning rate that can be positively or negatively correlated with generalization depending on the setup. Interestingly, in multiple cases, we observe a consistent negative correlation of sharpness with out-of-distribution error implying that sharper minima can generalize better. Finally, we illustrate on a simple model that the right sharpness measure is highly data-dependent, and that we do not understand well this aspect for realistic data distributions. The code of our experiments is available at https://github.com/tml-epfl/sharpness-vs-generalization.  ( 2 min )
    When Mitigating Bias is Unfair: A Comprehensive Study on the Impact of Bias Mitigation Algorithms. (arXiv:2302.07185v1 [cs.LG])
    Most works on the fairness of machine learning systems focus on the blind optimization of common fairness metrics, such as Demographic Parity and Equalized Odds. In this paper, we conduct a comparative study of several bias mitigation approaches to investigate their behaviors at a fine grain, the prediction level. Our objective is to characterize the differences between fair models obtained with different approaches. With comparable performances in fairness and accuracy, are the different bias mitigation approaches impacting a similar number of individuals? Do they mitigate bias in a similar way? Do they affect the same individuals when debiasing a model? Our findings show that bias mitigation approaches differ a lot in their strategies, both in the number of impacted individuals and the populations targeted. More surprisingly, we show these results even apply for several runs of the same mitigation approach. These findings raise questions about the limitations of the current group fairness metrics, as well as the arbitrariness, hence unfairness, of the whole debiasing process.  ( 2 min )
    RevUp: Revise and Update Information Bottleneck for Event Representation. (arXiv:2205.12248v2 [cs.LG] UPDATED)
    The existence of external (``side'') semantic knowledge has been shown to result in more expressive computational event models. To enable the use of side information that may be noisy or missing, we propose a semi-supervised information bottleneck-based discrete latent variable model. We reparameterize the model's discrete variables with auxiliary continuous latent variables and a light-weight hierarchical structure. Our model is learned to minimize the mutual information between the observed data and optional side knowledge that is not already captured by the new, auxiliary variables. We theoretically show that our approach generalizes past approaches, and perform an empirical case study of our approach on event modeling. We corroborate our theoretical results with strong empirical experiments, showing that the proposed method outperforms previous proposed approaches on multiple datasets.  ( 2 min )
    A Bandit Approach to Online Pricing for Heterogeneous Edge Resource Allocation. (arXiv:2302.06953v1 [cs.LG])
    Edge Computing (EC) offers a superior user experience by positioning cloud resources in close proximity to end users. The challenge of allocating edge resources efficiently while maximizing profit for the EC platform remains a sophisticated problem, especially with the added complexity of the online arrival of resource requests. To address this challenge, we propose to cast the problem as a multi-armed bandit problem and develop two novel online pricing mechanisms, the Kullback-Leibler Upper Confidence Bound (KL-UCB) algorithm and the Min-Max Optimal algorithm, for heterogeneous edge resource allocation. These mechanisms operate in real-time and do not require prior knowledge of demand distribution, which can be difficult to obtain in practice. The proposed posted pricing schemes allow users to select and pay for their preferred resources, with the platform dynamically adjusting resource prices based on observed historical data. Numerical results show the advantages of the proposed mechanisms compared to several benchmark schemes derived from traditional bandit algorithms, including the Epsilon-Greedy, basic UCB, and Thompson Sampling algorithms.  ( 2 min )
    Measuring Data. (arXiv:2212.05129v2 [cs.AI] UPDATED)
    We identify the task of measuring data to quantitatively characterize the composition of machine learning data and datasets. Similar to an object's height, width, and volume, data measurements quantify different attributes of data along common dimensions that support comparison. Several lines of research have proposed what we refer to as measurements, with differing terminology; we bring some of this work together, particularly in fields of computer vision and language, and build from it to motivate measuring data as a critical component of responsible AI development. Measuring data aids in systematically building and analyzing machine learning (ML) data towards specific goals and gaining better control of what modern ML systems will learn. We conclude with a discussion of the many avenues of future work, the limitations of data measurements, and how to leverage these measurement approaches in research and practice.
    Does CLIP Know My Face?. (arXiv:2209.07341v2 [cs.LG] UPDATED)
    With the rise of deep learning in various applications, privacy concerns around the protection of training data has become a critical area of research. Whereas prior studies have focused on privacy risks in single-modal models, we introduce a novel method to assess privacy for multi-modal models, specifically vision-language models like CLIP. The proposed Identity Inference Attack (IDIA) reveals whether an individual was included in the training data by querying the model with images of the same person. Letting the model choose from a wide variety of possible text labels, the model reveals whether it recognizes the person and, therefore, was used for training. Our large-scale experiments on CLIP demonstrate that individuals used for training can be identified with very high accuracy. We confirm that the model has learned to associate names with depicted individuals, implying the existence of sensitive information that can be extracted by adversaries. Our results highlight the need for stronger privacy protection in large-scale models and suggest that IDIAs can be used to prove the unauthorized use of data for training and to enforce privacy laws.
    A Unified View of Long-Sequence Models towards Modeling Million-Scale Dependencies. (arXiv:2302.06218v2 [cs.LG] UPDATED)
    Ever since their conception, Transformers have taken over traditional sequence models in many tasks, such as NLP, image classification, and video/audio processing, for their fast training and superior performance. Much of the merit is attributable to positional encoding and multi-head attention. However, Transformers fall short in learning long-range dependencies mainly due to the quadratic complexity scaled with context length, in terms of both time and space. Consequently, over the past five years, a myriad of methods has been proposed to make Transformers more efficient. In this work, we first take a step back, study and compare existing solutions to long-sequence modeling in terms of their pure mathematical formulation. Specifically, we summarize them using a unified template, given their shared nature of token mixing. Through benchmarks, we then demonstrate that long context length does yield better performance, albeit application-dependent, and traditional Transformer models fall short in taking advantage of long-range dependencies. Next, inspired by emerging sparse models of huge capacity, we propose a machine learning system for handling million-scale dependencies. As a proof of concept, we evaluate the performance of one essential component of this system, namely, the distributed multi-head attention. We show that our algorithm can scale up attention computation by almost $40\times$ using four GeForce RTX 4090 GPUs, compared to vanilla multi-head attention mechanism. We believe this study is an instrumental step towards modeling million-scale dependencies.
    Learning from Noisy Crowd Labels with Logics. (arXiv:2302.06337v2 [cs.LG] UPDATED)
    This paper explores the integration of symbolic logic knowledge into deep neural networks for learning from noisy crowd labels. We introduce Logic-guided Learning from Noisy Crowd Labels (Logic-LNCL), an EM-alike iterative logic knowledge distillation framework that learns from both noisy labeled data and logic rules of interest. Unlike traditional EM methods, our framework contains a ``pseudo-E-step'' that distills from the logic rules a new type of learning target, which is then used in the ``pseudo-M-step'' for training the classifier. Extensive evaluations on two real-world datasets for text sentiment classification and named entity recognition demonstrate that the proposed framework improves the state-of-the-art and provides a new solution to learning from noisy crowd labels.
    Goal-Space Planning with Subgoal Models. (arXiv:2206.02902v4 [cs.LG] UPDATED)
    This paper investigates a new approach to model-based reinforcement learning using background planning: mixing (approximate) dynamic programming updates and model-free updates, similar to the Dyna architecture. Background planning with learned models is often worse than model-free alternatives, such as Double DQN, even though the former uses significantly more memory and computation. The fundamental problem is that learned models can be inaccurate and often generate invalid states, especially when iterated many steps. In this paper, we avoid this limitation by constraining background planning to a set of (abstract) subgoals and learning only local, subgoal-conditioned models. This goal-space planning (GSP) approach is more computationally efficient, naturally incorporates temporal abstraction for faster long-horizon planning and avoids learning the transition dynamics entirely. We show that our GSP algorithm can learn significantly faster than a Double DQN baseline in a variety of situations.
    FOCUS: Fairness via Agent-Awareness for Federated Learning on Heterogeneous Data. (arXiv:2207.10265v3 [cs.LG] UPDATED)
    Federated learning (FL) allows agents to jointly train a global model without sharing their local data. However, due to the heterogeneous nature of local data, it is challenging to optimize or even define fairness of the trained global model for the agents. For instance, existing work usually considers accuracy equity as fairness for different agents in FL, which is limited, especially under the heterogeneous setting, since it is intuitively "unfair" to enforce agents with high-quality data to achieve similar accuracy to those who contribute low-quality data, which may discourage the agents from participating in FL. In this work, we propose a formal FL fairness definition, fairness via agent-awareness (FAA), which takes different contributions of heterogeneous agents into account. Under FAA, the performance of agents with high-quality data will not be sacrificed just due to the existence of large amounts of agents with low-quality data. In addition, we propose a fair FL training algorithm based on agent clustering (FOCUS) to achieve fairness in FL measured by FAA. Theoretically, we prove the convergence and optimality of FOCUS under mild conditions for linear and general convex loss functions with bounded smoothness. We also prove that FOCUS always achieves higher fairness in terms of FAA compared with standard FedAvg under both linear and general convex loss functions. Empirically, we show that on four FL datasets, including synthetic data, images, and texts, FOCUS achieves significantly higher fairness in terms of FAA while maintaining competitive prediction accuracy compared with FedAvg and state-of-the-art fair FL algorithms.
    Using Large Language Models to Simulate Multiple Humans and Replicate Human Subject Studies. (arXiv:2208.10264v4 [cs.CL] UPDATED)
    We introduce a new type of test, called a Turing Experiment (TE), for evaluating how well a language model, such as GPT-3, can simulate different aspects of human behavior. Unlike the Turing Test, which involves simulating a single arbitrary individual, a TE requires simulating a representative sample of participants in human subject research. We give TEs that attempt to replicate well-established findings in prior studies. We design a methodology for simulating TEs and illustrate its use to compare how well different language models are able to reproduce classic economic, psycholinguistic, and social psychology experiments: Ultimatum Game, Garden Path Sentences, Milgram Shock Experiment, and Wisdom of Crowds. In the first three TEs, the existing findings were replicated using recent models, while the last TE reveals a "hyper-accuracy distortion" present in some language models.
    Quantum algorithms applied to satellite mission planning for Earth observation. (arXiv:2302.07181v1 [quant-ph])
    Earth imaging satellites are a crucial part of our everyday lives that enable global tracking of industrial activities. Use cases span many applications, from weather forecasting to digital maps, carbon footprint tracking, and vegetation monitoring. However, there are also limitations; satellites are difficult to manufacture, expensive to maintain, and tricky to launch into orbit. Therefore, it is critical that satellites are employed efficiently. This poses a challenge known as the satellite mission planning problem, which could be computationally prohibitive to solve on large scales. However, close-to-optimal algorithms can often provide satisfactory resolutions, such as greedy reinforcement learning, and optimization algorithms. This paper introduces a set of quantum algorithms to solve the mission planning problem and demonstrate an advantage over the classical algorithms implemented thus far. The problem is formulated as maximizing the number of high-priority tasks completed on real datasets containing thousands of tasks and multiple satellites. This work demonstrates that through solution-chaining and clustering, optimization and machine learning algorithms offer the greatest potential for optimal solutions. Most notably, this paper illustrates that a hybridized quantum-enhanced reinforcement learning agent can achieve a completion percentage of 98.5% over high-priority tasks, which is a significant improvement over the baseline greedy methods with a completion rate of 63.6%. The results presented in this work pave the way to quantum-enabled solutions in the space industry and, more generally, future mission planning problems across industries.
    Online SuBmodular + SuPermodular (BP) Maximization with Bandit Feedback. (arXiv:2207.03091v2 [cs.LG] UPDATED)
    We investigate non-modular function maximization in an online setting with $m$ users. The optimizer maintains a set $S_q$ for each user $q \in \{1, \ldots, m\}$. At round $i$, a user with unknown utility $h_q$ arrives; the optimizer selects a new item to add to $S_q$, and receives a noisy marginal gain. The goal is to minimize regret compared to an $\alpha$-approximation to the optimal full-knowledge selection (i.e., $\alpha$-regret). Prior works study this problem under a submodularity assumption for all $h_q$. However, this is not ideally amenable to applications, e.g., movie recommendations, that involve complementarity between items, where e.g., watching the first movie in a series enhances the impression of watching the sequels. Hence, we consider objectives $h_q$, called \textit{BP functions}, that decompose into the sum of monotone submodular $f_q$ and supermodular $g_q$; here, $g_q$ naturally models complementarity. Under different feedback assumptions, we develop UCB-style algorithms that use Nystrom sampling for computational efficiency. For these, we provide sublinear $\alpha$-regret guarantees for $\alpha = 1/\kappa_{f} [1 - e^{-(1 - \kappa^g) \kappa_{f}} ]$, and $\alpha = \min\{1 - \kappa_f/e, 1 - \kappa^g\}$; here, $\kappa_f, \kappa^g$ are submodular and supermodular curvatures. Furthermore, we provide similar $\alpha$-regret guarantees for functions that are almost submodular where $\alpha$ is parameterized by the submodularity ratio of the objective functions. We numerically validate our algorithms for movie recommendation on the MovieLens dataset and selection of training subsets for classification tasks.
    Quasi-Newton Steps for Efficient Online Exp-Concave Optimization. (arXiv:2211.01357v3 [math.OC] UPDATED)
    The aim of this paper is to design computationally-efficient and optimal algorithms for the online and stochastic exp-concave optimization settings. Typical algorithms for these settings, such as the Online Newton Step (ONS), can guarantee a $O(d\ln T)$ bound on their regret after $T$ rounds, where $d$ is the dimension of the feasible set. However, such algorithms perform so-called generalized projections whenever their iterates step outside the feasible set. Such generalized projections require $\Omega(d^3)$ arithmetic operations even for simple sets such a Euclidean ball, making the total runtime of ONS of order $d^3 T$ after $T$ rounds, in the worst-case. In this paper, we side-step generalized projections by using a self-concordant barrier as a regularizer to compute the Newton steps. This ensures that the iterates are always within the feasible set without requiring projections. This approach still requires the computation of the inverse of the Hessian of the barrier at every step. However, using the stability properties of the Newton steps, we show that the inverse of the Hessians can be efficiently approximated via Taylor expansions for most rounds, resulting in a $O(d^2 T +d^\omega \sqrt{T})$ total computational complexity, where $\omega$ is the exponent of matrix multiplication. In the stochastic setting, we show that this translates into a $O(d^3/\epsilon)$ computational complexity for finding an $\epsilon$-suboptimal point, answering an open question by Koren 2013. We first show these new results for the simple case where the feasible set is a Euclidean ball. Then, to move to general convex set, we use a reduction to Online Convex Optimization over the Euclidean ball. Our final algorithm can be viewed as a more efficient version of ONS.
    Relative Sparsity for Medical Decision Problems. (arXiv:2211.16566v2 [stat.ME] UPDATED)
    Existing statistical methods can estimate a policy, or a mapping from covariates to decisions, which can then instruct decision makers (e.g., whether to administer hypotension treatment based on covariates blood pressure and heart rate). There is great interest in using such data-driven policies in healthcare. However, it is often important to explain to the healthcare provider, and to the patient, how a new policy differs from the current standard of care. This end is facilitated if one can pinpoint the aspects of the policy (i.e., the parameters for blood pressure and heart rate) that change when moving from the standard of care to the new, suggested policy. To this end, we adapt ideas from Trust Region Policy Optimization (TRPO). In our work, however, unlike in TRPO, the difference between the suggested policy and standard of care is required to be sparse, aiding with interpretability. This yields ``relative sparsity," where, as a function of a tuning parameter, $\lambda$, we can approximately control the number of parameters in our suggested policy that differ from their counterparts in the standard of care (e.g., heart rate only). We propose a criterion for selecting $\lambda$, perform simulations, and illustrate our method with a real, observational healthcare dataset, deriving a policy that is easy to explain in the context of the current standard of care. Our work promotes the adoption of data-driven decision aids, which have great potential to improve health outcomes.
    Score Approximation, Estimation and Distribution Recovery of Diffusion Models on Low-Dimensional Data. (arXiv:2302.07194v1 [cs.LG])
    Diffusion models achieve state-of-the-art performance in various generation tasks. However, their theoretical foundations fall far behind. This paper studies score approximation, estimation, and distribution recovery of diffusion models, when data are supported on an unknown low-dimensional linear subspace. Our result provides sample complexity bounds for distribution estimation using diffusion models. We show that with a properly chosen neural network architecture, the score function can be both accurately approximated and efficiently estimated. Furthermore, the generated distribution based on the estimated score function captures the data geometric structures and converges to a close vicinity of the data distribution. The convergence rate depends on the subspace dimension, indicating that diffusion models can circumvent the curse of data ambient dimensionality.
    Fair Densities via Boosting the Sufficient Statistics of Exponential Families. (arXiv:2012.00188v3 [stat.ML] UPDATED)
    We introduce a boosting algorithm to pre-process data for fairness. Starting from an initial fair but inaccurate distribution, our approach shifts towards better data fitting while still ensuring a minimal fairness guarantee. To do so, it learns the sufficient statistics of an exponential family with boosting-compliant convergence. Importantly, we are able to theoretically prove that the learned distribution will have a representation rate and statistical rate data fairness guarantee. Unlike recent optimization based pre-processing methods, our approach can be easily adapted for continuous domain features. Furthermore, when the weak learners are specified to be decision trees, the sufficient statistics of the learned distribution can be examined to provide clues on sources of (un)fairness. Empirical results are present to display the quality of result on real-world data.
    A Novel Poisoned Water Detection Method Using Smartphone Embedded Wi-Fi Technology and Machine Learning Algorithms. (arXiv:2302.07153v1 [eess.SP])
    Water is a necessary fluid to the human body and automatic checking of its quality and cleanness is an ongoing area of research. One such approach is to present the liquid to various types of signals and make the amount of signal attenuation an indication of the liquid category. In this article, we have utilized the Wi-Fi signal to distinguish clean water from poisoned water via training different machine learning algorithms. The Wi-Fi access points (WAPs) signal is acquired via equivalent smartphone-embedded Wi-Fi chipsets, and then Channel-State-Information CSI measures are extracted and converted into feature vectors to be used as input for machine learning classification algorithms. The measured amplitude and phase of the CSI data are selected as input features into four classifiers k-NN, SVM, LSTM, and Ensemble. The experimental results show that the model is adequate to differentiate poison water from clean water with a classification accuracy of 89% when LSTM is applied, while 92% classification accuracy is achieved when the AdaBoost-Ensemble classifier is applied.
    Event Detection on Dynamic Graphs. (arXiv:2110.12148v2 [cs.LG] UPDATED)
    Event detection is a critical task for timely decision-making in graph analytics applications. Despite the recent progress towards deep learning on graphs, event detection on dynamic graphs presents particular challenges to existing architectures. Real-life events are often associated with sudden deviations of the normal behavior of the graph. However, existing approaches for dynamic node embedding are unable to capture the graph-level dynamics related to events. In this paper, we propose DyGED, a simple yet novel deep learning model for event detection on dynamic graphs. DyGED learns correlations between the graph macro dynamics -- i.e. a sequence of graph-level representations -- and labeled events. Moreover, our approach combines structural and temporal self-attention mechanisms to account for application-specific node and time importances effectively. Our experimental evaluation, using a representative set of datasets, demonstrates that DyGED outperforms competing solutions in terms of event detection accuracy by up to 8.5% while being more scalable than the top alternatives. We also present case studies illustrating key features of our model.
    The Role of ImageNet Classes in Fr\'echet Inception Distance. (arXiv:2203.06026v3 [cs.CV] UPDATED)
    Fr\'echet Inception Distance (FID) is the primary metric for ranking models in data-driven generative modeling. While remarkably successful, the metric is known to sometimes disagree with human judgement. We investigate a root cause of these discrepancies, and visualize what FID "looks at" in generated images. We show that the feature space that FID is (typically) computed in is so close to the ImageNet classifications that aligning the histograms of Top-$N$ classifications between sets of generated and real images can reduce FID substantially -- without actually improving the quality of results. Thus, we conclude that FID is prone to intentional or accidental distortions. As a practical example of an accidental distortion, we discuss a case where an ImageNet pre-trained FastGAN achieves a FID comparable to StyleGAN2, while being worse in terms of human evaluation.
    How Many Data Samples is an Additional Instruction Worth?. (arXiv:2203.09161v3 [cs.CL] UPDATED)
    Recently introduced instruction-paradigm empowers non-expert users to leverage NLP resources by defining a new task in natural language. Instruction-tuned models have significantly outperformed multitask learning models (without instruction); however they are far from state-of-the-art task-specific models. Conventional approaches to improve model performance via creating datasets with large number of task instances or architectural changes in the model may not be feasible for non-expert users. However, they can write alternate instructions to represent an instruction task. Is Instruction-augmentation helpful? We augment a subset of tasks in the expanded version of NATURAL INSTRUCTIONS with additional instructions and find that it significantly improves model performance (up to 35%), especially in the low-data regime. Our results indicate that an additional instruction can be equivalent to ~200 data samples on average across tasks.
    Towards Effective and Robust Neural Trojan Defenses via Input Filtering. (arXiv:2202.12154v5 [cs.CR] UPDATED)
    Trojan attacks on deep neural networks are both dangerous and surreptitious. Over the past few years, Trojan attacks have advanced from using only a single input-agnostic trigger and targeting only one class to using multiple, input-specific triggers and targeting multiple classes. However, Trojan defenses have not caught up with this development. Most defense methods still make inadequate assumptions about Trojan triggers and target classes, thus, can be easily circumvented by modern Trojan attacks. To deal with this problem, we propose two novel "filtering" defenses called Variational Input Filtering (VIF) and Adversarial Input Filtering (AIF) which leverage lossy data compression and adversarial learning respectively to effectively purify potential Trojan triggers in the input at run time without making assumptions about the number of triggers/target classes or the input dependence property of triggers. In addition, we introduce a new defense mechanism called "Filtering-then-Contrasting" (FtC) which helps avoid the drop in classification accuracy on clean data caused by "filtering", and combine it with VIF/AIF to derive new defenses of this kind. Extensive experimental results and ablation studies show that our proposed defenses significantly outperform well-known baseline defenses in mitigating five advanced Trojan attacks including two recent state-of-the-art while being quite robust to small amounts of training data and large-norm triggers.
    EPISODE: Episodic Gradient Clipping with Periodic Resampled Corrections for Federated Learning with Heterogeneous Data. (arXiv:2302.07155v1 [cs.LG])
    Gradient clipping is an important technique for deep neural networks with exploding gradients, such as recurrent neural networks. Recent studies have shown that the loss functions of these networks do not satisfy the conventional smoothness condition, but instead satisfy a relaxed smoothness condition, i.e., the Lipschitz constant of the gradient scales linearly in terms of the gradient norm. Due to this observation, several gradient clipping algorithms have been developed for nonconvex and relaxed-smooth functions. However, the existing algorithms only apply to the single-machine or multiple-machine setting with homogeneous data across machines. It remains unclear how to design provably efficient gradient clipping algorithms in the general Federated Learning (FL) setting with heterogeneous data and limited communication rounds. In this paper, we design EPISODE, the very first algorithm to solve FL problems with heterogeneous data in the nonconvex and relaxed smoothness setting. The key ingredients of the algorithm are two new techniques called \textit{episodic gradient clipping} and \textit{periodic resampled corrections}. At the beginning of each round, EPISODE resamples stochastic gradients from each client and obtains the global averaged gradient, which is used to (1) determine whether to apply gradient clipping for the entire round and (2) construct local gradient corrections for each client. Notably, our algorithm and analysis provide a unified framework for both homogeneous and heterogeneous data under any noise level of the stochastic gradient, and it achieves state-of-the-art complexity results. In particular, we prove that EPISODE can achieve linear speedup in the number of machines, and it requires significantly fewer communication rounds. Experiments on several heterogeneous datasets show the superior performance of EPISODE over several strong baselines in FL.
    Characterizing notions of omniprediction via multicalibration. (arXiv:2302.06726v1 [cs.LG])
    A recent line of work shows that notions of multigroup fairness imply surprisingly strong notions of omniprediction: loss minimization guarantees that apply not just for a specific loss function, but for any loss belonging to a large family of losses. While prior work has derived various notions of omniprediction from multigroup fairness guarantees of varying strength, it was unknown whether the connection goes in both directions. In this work, we answer this question in the affirmative, establishing equivalences between notions of multicalibration and omniprediction. The new definitions that hold the key to this equivalence are new notions of swap omniprediction, which are inspired by swap regret in online learning. We show that these can be characterized exactly by a strengthening of multicalibration that we refer to as swap multicalibration. One can go from standard to swap multicalibration by a simple discretization; moreover all known algorithms for standard multicalibration in fact give swap multicalibration. In the context of omniprediction though, introducing the notion of swapping results in provably stronger notions, which require a predictor to minimize expected loss at least as well as an adaptive adversary who can choose both the loss function and hypothesis based on the value predicted by the predictor. Building on these characterizations, we paint a complete picture of the relationship between the various omniprediction notions in the literature by establishing implications and separations between them. Our work deepens our understanding of the connections between multigroup fairness, loss minimization and outcome indistinguishability and establishes new connections to classic notions in online learning.
    Commutativity and Disentanglement from the Manifold Perspective. (arXiv:2210.07857v3 [stat.ML] UPDATED)
    In this paper, we interpret disentanglement as the discovery of local charts and trace how that definition naturally leads to an equivalent condition for disentanglement: the disentangled factors must commute with each other. We discuss the practical and theoretical implications of commutativity, in particular the compression and disentanglement of generative models. Finally, we conclude with a discussion of related approaches to disentanglement and how they relate to our view of disentanglement from the manifold perspective.
    The Missing Margin: How Sample Corruption Affects Distance to the Boundary in ANNs. (arXiv:2302.06925v1 [cs.LG])
    Classification margins are commonly used to estimate the generalization ability of machine learning models. We present an empirical study of these margins in artificial neural networks. A global estimate of margin size is usually used in the literature. In this work, we point out seldom considered nuances regarding classification margins. Notably, we demonstrate that some types of training samples are modelled with consistently small margins while affecting generalization in different ways. By showing a link with the minimum distance to a different-target sample and the remoteness of samples from one another, we provide a plausible explanation for this observation. We support our findings with an analysis of fully-connected networks trained on noise-corrupted MNIST data, as well as convolutional networks trained on noise-corrupted CIFAR10 data.
    Multi-teacher knowledge distillation as an effective method for compressing ensembles of neural networks. (arXiv:2302.07215v1 [cs.LG])
    Deep learning has contributed greatly to many successes in artificial intelligence in recent years. Today, it is possible to train models that have thousands of layers and hundreds of billions of parameters. Large-scale deep models have achieved great success, but the enormous computational complexity and gigantic storage requirements make it extremely difficult to implement them in real-time applications. On the other hand, the size of the dataset is still a real problem in many domains. Data are often missing, too expensive, or impossible to obtain for other reasons. Ensemble learning is partially a solution to the problem of small datasets and overfitting. However, ensemble learning in its basic version is associated with a linear increase in computational complexity. We analyzed the impact of the ensemble decision-fusion mechanism and checked various methods of sharing the decisions including voting algorithms. We used the modified knowledge distillation framework as a decision-fusion mechanism which allows in addition compressing of the entire ensemble model into a weight space of a single model. We showed that knowledge distillation can aggregate knowledge from multiple teachers in only one student model and, with the same computational complexity, obtain a better-performing model compared to a model trained in the standard manner. We have developed our own method for mimicking the responses of all teachers at the same time, simultaneously. We tested these solutions on several benchmark datasets. In the end, we presented a wide application use of the efficient multi-teacher knowledge distillation framework. In the first example, we used knowledge distillation to develop models that could automate corrosion detection on aircraft fuselage. The second example describes detection of smoke on observation cameras in order to counteract wildfires in forests.
    Online Learning of Network Bottlenecks via Minimax Paths. (arXiv:2109.08467v3 [cs.LG] UPDATED)
    In this paper, we study bottleneck identification in networks via extracting minimax paths. Many real-world networks have stochastic weights for which full knowledge is not available in advance. Therefore, we model this task as a combinatorial semi-bandit problem to which we apply a combinatorial version of Thompson Sampling and establish an upper bound on the corresponding Bayesian regret. Due to the computational intractability of the problem, we then devise an alternative problem formulation which approximates the original objective. Finally, we experimentally evaluate the performance of Thompson Sampling with the approximate formulation on real-world directed and undirected networks.
    Boosted ab initio Cryo-EM 3D Reconstruction with ACE-EM. (arXiv:2302.06091v2 [cs.CV] UPDATED)
    The central problem in cryo-electron microscopy (cryo-EM) is to recover the 3D structure from noisy 2D projection images which requires estimating the missing projection angles (poses). Recent methods attempted to solve the 3D reconstruction problem with the autoencoder architecture, which suffers from the latent vector space sampling problem and frequently produces suboptimal pose inferences and inferior 3D reconstructions. Here we present an improved autoencoder architecture called ACE (Asymmetric Complementary autoEncoder), based on which we designed the ACE-EM method for cryo-EM 3D reconstructions. Compared to previous methods, ACE-EM reached higher pose space coverage within the same training time and boosted the reconstruction performance regardless of the choice of decoders. With this method, the Nyquist resolution (highest possible resolution) was reached for 3D reconstructions of both simulated and experimental cryo-EM datasets. Furthermore, ACE-EM is the only amortized inference method that reached the Nyquist resolution.
    An Experimental Study of Byzantine-Robust Aggregation Schemes in Federated Learning. (arXiv:2302.07173v1 [cs.LG])
    Byzantine-robust federated learning aims at mitigating Byzantine failures during the federated training process, where malicious participants may upload arbitrary local updates to the central server to degrade the performance of the global model. In recent years, several robust aggregation schemes have been proposed to defend against malicious updates from Byzantine clients and improve the robustness of federated learning. These solutions were claimed to be Byzantine-robust, under certain assumptions. Other than that, new attack strategies are emerging, striving to circumvent the defense schemes. However, there is a lack of systematic comparison and empirical study thereof. In this paper, we conduct an experimental study of Byzantine-robust aggregation schemes under different attacks using two popular algorithms in federated learning, FedSGD and FedAvg . We first survey existing Byzantine attack strategies and Byzantine-robust aggregation schemes that aim to defend against Byzantine attacks. We also propose a new scheme, ClippedClustering , to enhance the robustness of a clustering-based scheme by automatically clipping the updates. Then we provide an experimental evaluation of eight aggregation schemes in the scenario of five different Byzantine attacks. Our results show that these aggregation schemes sustain relatively high accuracy in some cases but are ineffective in others. In particular, our proposed ClippedClustering successfully defends against most attacks under independent and IID local datasets. However, when the local datasets are Non-IID, the performance of all the aggregation schemes significantly decreases. With Non-IID data, some of these aggregation schemes fail even in the complete absence of Byzantine clients. We conclude that the robustness of all the aggregation schemes is limited, highlighting the need for new defense strategies, in particular for Non-IID datasets.
    Anonymization for Skeleton Action Recognition. (arXiv:2111.15129v3 [cs.CV] UPDATED)
    Skeleton-based action recognition attracts practitioners and researchers due to the lightweight, compact nature of datasets. Compared with RGB-video-based action recognition, skeleton-based action recognition is a safer way to protect the privacy of subjects while having competitive recognition performance. However, due to improvements in skeleton recognition algorithms as well as motion and depth sensors, more details of motion characteristics can be preserved in the skeleton dataset, leading to potential privacy leakage. We first train classifiers to categorize private information from skeleton trajectories to investigate the potential privacy leakage from skeleton datasets. Our preliminary experiments show that the gender classifier achieves 87% accuracy on average, and the re-identification classifier achieves 80% accuracy on average with three baseline models: Shift-GCN, MS-G3D, and 2s-AGCN. We propose an anonymization framework based on adversarial learning to protect potential privacy leakage from the skeleton dataset. Experimental results show that an anonymized dataset can reduce the risk of privacy leakage while having marginal effects on action recognition performance even with simple anonymizer architectures. The code used in our experiments is available at https://github.com/ml-postech/Skeleton-anonymization/
    Linear Causal Disentanglement via Interventions. (arXiv:2211.16467v2 [stat.ML] UPDATED)
    Causal disentanglement seeks a representation of data involving latent variables that relate to one another via a causal model. A representation is identifiable if both the latent model and the transformation from latent to observed variables are unique. In this paper, we study observed variables that are a linear transformation of a linear latent causal model. Data from interventions are necessary for identifiability: if one latent variable is missing an intervention, we show that there exist distinct models that cannot be distinguished. Conversely, we show that a single intervention on each latent variable is sufficient for identifiability. Our proof uses a generalization of the RQ decomposition of a matrix that replaces the usual orthogonal and upper triangular conditions with analogues depending on a partial order on the rows of the matrix, with partial order determined by a latent causal model. We corroborate our theoretical results with a method for causal disentanglement that accurately recovers a latent causal model.
    Online Learning of Energy Consumption for Navigation of Electric Vehicles. (arXiv:2111.02314v2 [cs.LG] UPDATED)
    Energy efficient navigation constitutes an important challenge in electric vehicles, due to their limited battery capacity. We employ a Bayesian approach to model the energy consumption at road segments for efficient navigation. In order to learn the model parameters, we develop an online learning framework and investigate several exploration strategies such as Thompson Sampling and Upper Confidence Bound. We then extend our online learning framework to the multi-agent setting, where multiple vehicles adaptively navigate and learn the parameters of the energy model. We analyze Thompson Sampling and establish rigorous regret bounds on its performance in the single-agent and multi-agent settings, through an analysis of the algorithm under batched feedback. Finally, we demonstrate the performance of our methods via experiments on several real-world city road networks.
    RamanNet: A generalized neural network architecture for Raman Spectrum Analysis. (arXiv:2201.09737v2 [cs.LG] UPDATED)
    Raman spectroscopy provides a vibrational profile of the molecules and thus can be used to uniquely identify different kind of materials. This sort of fingerprinting molecules has thus led to widespread application of Raman spectrum in various fields like medical dignostics, forensics, mineralogy, bacteriology and virology etc. Despite the recent rise in Raman spectra data volume, there has not been any significant effort in developing generalized machine learning methods for Raman spectra analysis. We examine, experiment and evaluate existing methods and conjecture that neither current sequential models nor traditional machine learning models are satisfactorily sufficient to analyze Raman spectra. Both has their perks and pitfalls, therefore we attempt to mix the best of both worlds and propose a novel network architecture RamanNet. RamanNet is immune to invariance property in CNN and at the same time better than traditional machine learning models for the inclusion of sparse connectivity. Our experiments on 4 public datasets demonstrate superior performance over the much complex state-of-the-art methods and thus RamanNet has the potential to become the defacto standard in Raman spectra data analysis
    Plateau in Monotonic Linear Interpolation -- A "Biased" View of Loss Landscape for Deep Networks. (arXiv:2210.01019v2 [stat.ML] UPDATED)
    Monotonic linear interpolation (MLI) - on the line connecting a random initialization with the minimizer it converges to, the loss and accuracy are monotonic - is a phenomenon that is commonly observed in the training of neural networks. Such a phenomenon may seem to suggest that optimization of neural networks is easy. In this paper, we show that the MLI property is not necessarily related to the hardness of optimization problems, and empirical observations on MLI for deep neural networks depend heavily on biases. In particular, we show that interpolating both weights and biases linearly leads to very different influences on the final output, and when different classes have different last-layer biases on a deep network, there will be a long plateau in both the loss and accuracy interpolation (which existing theory of MLI cannot explain). We also show how the last-layer biases for different classes can be different even on a perfectly balanced dataset using a simple model. Empirically we demonstrate that similar intuitions hold on practical networks and realistic datasets.
    An Exact Poly-Time Membership-Queries Algorithm for Extraction a three-Layer ReLU Network. (arXiv:2105.09673v4 [cs.LG] UPDATED)
    We consider the natural problem of learning a ReLU network from queries, which was recently remotivated by model extraction attacks. In this work, we present a polynomial-time algorithm that can learn a depth-two ReLU network from queries under mild general position assumptions. We also present a polynomial-time algorithm that, under mild general position assumptions, can learn a rich class of depth-three ReLU networks from queries. For instance, it can learn most networks where the number of first layer neurons is smaller than the dimension and the number of second layer neurons. These two results substantially improve state-of-the-art: Until our work, polynomial-time algorithms were only shown to learn from queries depth-two networks under the assumption that either the underlying distribution is Gaussian (Chen et al. (2021)) or that the weights matrix rows are linearly independent (Milli et al. (2019)). For depth three or more, there were no known poly-time results.
    Graph Embeddings via Tensor Products and Approximately Orthonormal Codes. (arXiv:2208.10917v3 [cs.SI] UPDATED)
    We introduce a method for embedding graphs as vectors in a structure-preserving manner, showcasing its rich representational capacity and giving some theoretical properties. Our procedure falls under the bind-and-sum approach, and we show that our binding operation - the tensor product - is the most general binding operation that respects the principle of superposition. We also establish some precise results characterizing the behavior of our method, and we show that our use of spherical codes achieves a packing upper bound. Then, we perform experiments showcasing our method's accuracy in various graph operations even when the number of edges is quite large. Finally, we establish a link to adjacency matrices, showing that our method is, in some sense, a generalization of adjacency matrices with applications towards large sparse graphs.
    How Does It Feel? Self-Supervised Costmap Learning for Off-Road Vehicle Traversability. (arXiv:2209.10788v2 [cs.RO] UPDATED)
    Estimating terrain traversability in off-road environments requires reasoning about complex interaction dynamics between the robot and these terrains. However, it is challenging to create informative labels to learn a model in a supervised manner for these interactions. We propose a method that learns to predict traversability costmaps by combining exteroceptive environmental information with proprioceptive terrain interaction feedback in a self-supervised manner. Additionally, we propose a novel way of incorporating robot velocity in the costmap prediction pipeline. We validate our method in multiple short and large-scale navigation tasks on challenging off-road terrains using two different large, all-terrain robots. Our short-scale navigation results show that using our learned costmaps leads to overall smoother navigation, and provides the robot with a more fine-grained understanding of the robot-terrain interactions. Our large-scale navigation trials show that we can reduce the number of interventions by up to 57\% compared to an occupancy-based navigation baseline in challenging off-road courses ranging from 400 m to 3150 m. Appendix and full experiment videos can be found in our website: https://mateoguaman.github.io/hdif.
    Deep Anatomical Federated Network (Dafne): an open client/server framework for the continuous collaborative improvement of deep-learning-based medical image segmentation. (arXiv:2302.06352v2 [eess.IV] UPDATED)
    Semantic segmentation is a crucial step to extract quantitative information from medical (and, specifically, radiological) images to aid the diagnostic process, clinical follow-up. and to generate biomarkers for clinical research. In recent years, machine learning algorithms have become the primary tool for this task. However, its real-world performance is heavily reliant on the comprehensiveness of training data. Dafne is the first decentralized, collaborative solution that implements continuously evolving deep learning models exploiting the collective knowledge of the users of the system. In the Dafne workflow, the result of each automated segmentation is refined by the user through an integrated interface, so that the new information is used to continuously expand the training pool via federated incremental learning. The models deployed through Dafne are able to improve their performance over time and to generalize to data types not seen in the training sets, thus becoming a viable and practical solution for real-life medical segmentation tasks.
    A Validity Perspective on Evaluating the Justified Use of Data-driven Decision-making Algorithms. (arXiv:2206.14983v2 [cs.LG] UPDATED)
    Recent research increasingly brings to question the appropriateness of using predictive tools in complex, real-world tasks. While a growing body of work has explored ways to improve value alignment in these tools, comparatively less work has centered concerns around the fundamental justifiability of using these tools. This work seeks to center validity considerations in deliberations around whether and how to build data-driven algorithms in high-stakes domains. Toward this end, we translate key concepts from validity theory to predictive algorithms. We apply the lens of validity to re-examine common challenges in problem formulation and data issues that jeopardize the justifiability of using predictive algorithms and connect these challenges to the social science discourse around validity. Our interdisciplinary exposition clarifies how these concepts apply to algorithmic decision making contexts. We demonstrate how these validity considerations could distill into a series of high-level questions intended to promote and document reflections on the legitimacy of the predictive task and the suitability of the data.
    DiffuSeq: Sequence to Sequence Text Generation with Diffusion Models. (arXiv:2210.08933v3 [cs.CL] UPDATED)
    Recently, diffusion models have emerged as a new paradigm for generative models. Despite the success in domains using continuous signals such as vision and audio, adapting diffusion models to natural language is under-explored due to the discrete nature of texts, especially for conditional generation. We tackle this challenge by proposing DiffuSeq: a diffusion model designed for sequence-to-sequence (Seq2Seq) text generation tasks. Upon extensive evaluation over a wide range of Seq2Seq tasks, we find DiffuSeq achieving comparable or even better performance than six established baselines, including a state-of-the-art model that is based on pre-trained language models. Apart from quality, an intriguing property of DiffuSeq is its high diversity during generation, which is desired in many Seq2Seq tasks. We further include a theoretical analysis revealing the connection between DiffuSeq and autoregressive/non-autoregressive models. Bringing together theoretical analysis and empirical evidence, we demonstrate the great potential of diffusion models in complex conditional language generation tasks. Code is available at \url{https://github.com/Shark-NLP/DiffuSeq}
    Robot Learning on the Job: Human-in-the-Loop Autonomy and Learning During Deployment. (arXiv:2211.08416v2 [cs.RO] UPDATED)
    With the rapid growth of computing powers and recent advances in deep learning, we have witnessed impressive demonstrations of novel robot capabilities in research settings. Nonetheless, these learning systems exhibit brittle generalization and require excessive training data for practical tasks. To harness the capabilities of state-of-the-art robot learning models while embracing their imperfections, we present Sirius, a principled framework for humans and robots to collaborate through a division of work. In this framework, partially autonomous robots are tasked with handling a major portion of decision-making where they work reliably; meanwhile, human operators monitor the process and intervene in challenging situations. Such a human-robot team ensures safe deployments in complex tasks. Further, we introduce a new learning algorithm to improve the policy's performance on the data collected from the task executions. The core idea is re-weighing training samples with approximated human trust and optimizing the policies with weighted behavioral cloning. We evaluate Sirius in simulation and on real hardware, showing that Sirius consistently outperforms baselines over a collection of contact-rich manipulation tasks, achieving an 8% boost in simulation and 27% on real hardware than the state-of-the-art methods, with twice faster convergence and 85% memory size reduction. Videos and code are available at https://ut-austin-rpl.github.io/sirius/
    Automated Reachability Analysis of Neural Network-Controlled Systems via Adaptive Polytopes. (arXiv:2212.07553v2 [eess.SY] UPDATED)
    Over-approximating the reachable sets of dynamical systems is a fundamental problem in safety verification and robust control synthesis. The representation of these sets is a key factor that affects the computational complexity and the approximation error. In this paper, we develop a new approach for over-approximating the reachable sets of neural network dynamical systems using adaptive template polytopes. We use the singular value decomposition of linear layers along with the shape of the activation functions to adapt the geometry of the polytopes at each time step to the geometry of the true reachable sets. We then propose a branch-and-bound method to compute accurate over-approximations of the reachable sets by the inferred templates. We illustrate the utility of the proposed approach in the reachability analysis of linear systems driven by neural network controllers.
    Integrated Sensing and Communication from Learning Perspective: An SDP3 Approach. (arXiv:2107.09621v3 [cs.IT] UPDATED)
    Characterizing the sensing and communication performance tradeoff in integrated sensing and communication (ISAC) systems is challenging in the applications of learning-based human motion recognition. This is because of the large experimental datasets and the black-box nature of deep neural networks. This paper presents SDP3, a Simulation-Driven Performance Predictor and oPtimizer, which consists of SDP3 data simulator, SDP3 performance predictor and SDP3 performance optimizer. Specifically, the SDP3 data simulator generates vivid wireless sensing datasets in a virtual environment, the SDP3 performance predictor predicts the sensing performance based on the function regression method, and the SDP3 performance optimizer investigates the sensing and communication performance tradeoff analytically. It is shown that the simulated sensing dataset matches the experimental dataset very well in the motion recognition accuracy. By leveraging SDP3, it is found that the achievable region of recognition accuracy and communication throughput consists of a communication saturation zone, a sensing saturation zone, and a communication-sensing adversarial zone, of which the desired balanced performance for ISAC systems lies in the third one.
    A Macrocolumn Architecture Implemented with Spiking Neurons. (arXiv:2207.05081v2 [cs.NE] UPDATED)
    The macrocolumn is a key component of a neuromorphic computing system that interacts with an external environment under control of an agent. Environments are learned and stored in the macrocolumn as labeled directed graphs where edges connect features and labels indicate the relative displacements between them. Macrocolumn functionality is first defined with a state machine model. This model is then implemented with a neural network composed of spiking neurons. The neuron model employs active dendrites and mirrors the Hawkins/Numenta neuron model. The architecture is demonstrated with a research benchmark in which an agent employs a macrocolumn to first learn and then navigate 2-d environments containing pseudo-randomly placed features.
    RankMe: Assessing the downstream performance of pretrained self-supervised representations by their rank. (arXiv:2210.02885v2 [cs.LG] UPDATED)
    Joint-Embedding Self Supervised Learning (JE-SSL) has seen a rapid development, with the emergence of many method variations but only few principled guidelines that would help practitioners to successfully deploy them. The main reason for that pitfall comes from JE-SSL's core principle of not employing any input reconstruction therefore lacking visual cues of unsuccessful training. Adding non informative loss values to that, it becomes difficult to deploy SSL on a new dataset for which no labels can help to judge the quality of the learned representation. In this study, we develop a simple unsupervised criterion that is indicative of the quality of the learned JE-SSL representations: their effective rank. Albeit simple and computationally friendly, this method -- coined RankMe -- allows one to assess the performance of JE-SSL representations, even on different downstream datasets, without requiring any labels. A further benefit of RankMe is that it does not have any training or hyper-parameters to tune. Through thorough empirical experiments involving hundreds of training episodes, we demonstrate how RankMe can be used for hyperparameter selection with nearly no reduction in final performance compared to the current selection method that involve a dataset's labels. We hope that RankMe will facilitate the deployment of JE-SSL towards domains that do not have the opportunity to rely on labels for representations' quality assessment.
    Forget Unlearning: Towards True Data-Deletion in Machine Learning. (arXiv:2210.08911v2 [stat.ML] UPDATED)
    Unlearning algorithms aim to remove deleted data's influence from trained models at a cost lower than full retraining. However, prior guarantees of unlearning in literature are flawed and don't protect the privacy of deleted records. We show that when users delete their data as a function of published models, records in a database become interdependent. So, even retraining a fresh model after deletion of a record doesn't ensure its privacy. Secondly, unlearning algorithms that cache partial computations to speed up the processing can leak deleted information over a series of releases, violating the privacy of deleted records in the long run. To address these, we propose a sound deletion guarantee and show that the privacy of existing records is necessary for the privacy of deleted records. Under this notion, we propose an accurate, computationally efficient, and secure machine unlearning algorithm based on noisy gradient descent.
    Time-aware Random Walk Diffusion to Improve Dynamic Graph Learning. (arXiv:2211.01214v5 [cs.LG] UPDATED)
    How can we augment a dynamic graph for improving the performance of dynamic graph neural networks? Graph augmentation has been widely utilized to boost the learning performance of GNN-based models. However, most existing approaches only enhance spatial structure within an input static graph by transforming the graph, and do not consider dynamics caused by time such as temporal locality, i.e., recent edges are more influential than earlier ones, which remains challenging for dynamic graph augmentation. In this work, we propose TiaRa (Time-aware Random Walk Diffusion), a novel diffusion-based method for augmenting a dynamic graph represented as a discrete-time sequence of graph snapshots. For this purpose, we first design a time-aware random walk proximity so that a surfer can walk along the time dimension as well as edges, resulting in spatially and temporally localized scores. We then derive our diffusion matrices based on the time-aware random walk, and show they become enhanced adjacency matrices that both spatial and temporal localities are augmented. Throughout extensive experiments, we demonstrate that TiaRa effectively augments a given dynamic graph, and leads to significant improvements in dynamic GNN models for various graph datasets and tasks.
    Measuring incompatibility and clustering quantum observables with a quantum switch. (arXiv:2208.06210v2 [quant-ph] UPDATED)
    The existence of incompatible observables is a cornerstone of quantum mechanics and a valuable resource in quantum technologies. Here we introduce a measure of incompatibility, called the mutual eigenspace disturbance (MED), which quantifies the amount of disturbance induced by the measurement of a sharp observable on the eigenspaces of another. The MED provides a metric on the space of von Neumann measurements, and can be efficiently estimated by letting the measurement processes act in an indefinite order, using a setup known as the quantum switch, which also allows one to quantify the noncommutativity of arbitrary quantum processes. Thanks to these features, the MED can be used in quantum machine learning tasks. We demonstrate this application by providing an unsupervised algorithm that clusters unknown von Neumann measurements. Our algorithm is robust to noise can be used to identify groups of observers that share approximately the same measurement context.
    Fact-Saboteurs: A Taxonomy of Evidence Manipulation Attacks against Fact-Verification Systems. (arXiv:2209.03755v3 [cs.CR] UPDATED)
    Mis- and disinformation are a substantial global threat to our security and safety. To cope with the scale of online misinformation, researchers have been working on automating fact-checking by retrieving and verifying against relevant evidence. However, despite many advances, a comprehensive evaluation of the possible attack vectors against such systems is still lacking. Particularly, the automated fact-verification process might be vulnerable to the exact disinformation campaigns it is trying to combat. In this work, we assume an adversary that automatically tampers with the online evidence in order to disrupt the fact-checking model via camouflaging the relevant evidence or planting a misleading one. We first propose an exploratory taxonomy that spans these two targets and the different threat model dimensions. Guided by this, we design and propose several potential attack methods. We show that it is possible to subtly modify claim-salient snippets in the evidence and generate diverse and claim-aligned evidence. Thus, we highly degrade the fact-checking performance under many different permutations of the taxonomy's dimensions. The attacks are also robust against post-hoc modifications of the claim. Our analysis further hints at potential limitations in models' inference when faced with contradicting evidence. We emphasize that these attacks can have harmful implications on the inspectable and human-in-the-loop usage scenarios of such models, and conclude by discussing challenges and directions for future defenses.
    Parameter-Efficient Tuning with Special Token Adaptation. (arXiv:2210.04382v2 [cs.CL] UPDATED)
    Parameter-efficient tuning aims at updating only a small subset of parameters when adapting a pretrained model to downstream tasks. In this work, we introduce PASTA, in which we only modify the special token representations (e.g., [SEP] and [CLS] in BERT) before the self-attention module at each layer in Transformer-based models. PASTA achieves comparable performance to full finetuning in natural language understanding tasks including text classification and NER with up to only 0.029% of total parameters trained. Our work not only provides a simple yet effective way of parameter-efficient tuning, which has a wide range of practical applications when deploying finetuned models for multiple tasks, but also demonstrates the pivotal role of special tokens in pretrained language models
    Netizens, Academicians, and Information Professionals' Opinions About AI With Special Reference To ChatGPT. (arXiv:2302.07136v1 [cs.CY])
    This study aims to understand the perceptions and opinions of academicians towards ChatGPT-3 by collecting and analyzing social media comments, and a survey was conducted with library and information science professionals. The research uses a content analysis method and finds that while ChatGPT-3 can be a valuable tool for research and writing, it is not 100% accurate and should be cross-checked. The study also finds that while some academicians may not accept ChatGPT-3, most are starting to accept it. The study is beneficial for academicians, content developers, and librarians.
    Towards Interpretable Sleep Stage Classification Using Cross-Modal Transformers. (arXiv:2208.06991v2 [cs.LG] UPDATED)
    Accurate sleep stage classification is significant for sleep health assessment. In recent years, several machine-learning based sleep staging algorithms have been developed, and in particular, deep-learning based algorithms have achieved performance on par with human annotation. Despite the improved performance, a limitation of most deep-learning based algorithms is their black-box behavior, which has limited their use in clinical settings. Here, we propose a cross-modal transformer, which is a transformer-based method for sleep stage classification. The proposed cross-modal transformer consists of a novel cross-modal transformer encoder architecture along with a multi-scale one-dimensional convolutional neural network for automatic representation learning. Our method outperforms the state-of-the-art methods and eliminates the black-box behavior of deep-learning models by utilizing the interpretability aspect of the attention modules. Furthermore, our method provides considerable reductions in the number of parameters and training time compared to the state-of-the-art methods. Our code is available at https://github.com/Jathurshan0330/Cross-Modal-Transformer.
    Interpolation Learning With Minimum Description Length. (arXiv:2302.07263v1 [cs.LG])
    We prove that the Minimum Description Length learning rule exhibits tempered overfitting. We obtain tempered agnostic finite sample learning guarantees and characterize the asymptotic behavior in the presence of random label noise.
    Memorization-Dilation: Modeling Neural Collapse Under Label Noise. (arXiv:2206.05530v2 [cs.LG] UPDATED)
    The notion of neural collapse refers to several emergent phenomena that have been empirically observed across various canonical classification problems. During the terminal phase of training a deep neural network, the feature embedding of all examples of the same class tend to collapse to a single representation, and the features of different classes tend to separate as much as possible. Neural collapse is often studied through a simplified model, called the unconstrained feature representation, in which the model is assumed to have "infinite expressivity" and can map each data point to any arbitrary representation. In this work, we propose a more realistic variant of the unconstrained feature representation that takes the limited expressivity of the network into account. Empirical evidence suggests that the memorization of noisy data points leads to a degradation (dilation) of the neural collapse. Using a model of the memorization-dilation (M-D) phenomenon, we show one mechanism by which different losses lead to different performances of the trained network on noisy data. Our proofs reveal why label smoothing, a modification of cross-entropy empirically observed to produce a regularization effect, leads to improved generalization in classification tasks.
    Non-stationary Contextual Bandits and Universal Learning. (arXiv:2302.07186v1 [stat.ML])
    We study the fundamental limits of learning in contextual bandits, where a learner's rewards depend on their actions and a known context, which extends the canonical multi-armed bandit to the case where side-information is available. We are interested in universally consistent algorithms, which achieve sublinear regret compared to any measurable fixed policy, without any function class restriction. For stationary contextual bandits, when the underlying reward mechanism is time-invariant, [Blanchard et al.] characterized learnable context processes for which universal consistency is achievable; and further gave algorithms ensuring universal consistency whenever this is achievable, a property known as optimistic universal consistency. It is well understood, however, that reward mechanisms can evolve over time, possibly depending on the learner's actions. We show that optimistic universal learning for non-stationary contextual bandits is impossible in general, contrary to all previously studied settings in online learning -- including standard supervised learning. We also give necessary and sufficient conditions for universal learning under various non-stationarity models, including online and adversarial reward mechanisms. In particular, the set of learnable processes for non-stationary rewards is still extremely general -- larger than i.i.d., stationary or ergodic -- but in general strictly smaller than that for supervised learning or stationary contextual bandits, shedding light on new non-stationary phenomena.
    Contrastive Multimodal Learning for Emergence of Graphical Sensory-Motor Communication. (arXiv:2210.06468v2 [cs.AI] UPDATED)
    In this paper, we investigate whether artificial agents can develop a shared language in an ecological setting where communication relies on a sensory-motor channel. To this end, we introduce the Graphical Referential Game (GREG) where a speaker must produce a graphical utterance to name a visual referent object while a listener has to select the corresponding object among distractor referents, given the delivered message. The utterances are drawing images produced using dynamical motor primitives combined with a sketching library. To tackle GREG we present CURVES: a multimodal contrastive deep learning mechanism that represents the energy (alignment) between named referents and utterances generated through gradient ascent on the learned energy landscape. We demonstrate that CURVES not only succeeds at solving the GREG but also enables agents to self-organize a language that generalizes to feature compositions never seen during training. In addition to evaluating the communication performance of our approach, we also explore the structure of the emerging language. Specifically, we show that the resulting language forms a coherent lexicon shared between agents and that basic compositional rules on the graphical productions could not explain the compositional generalization.
    Solution Path Algorithm for Twin Multi-class Support Vector Machine. (arXiv:2006.00276v2 [cs.LG] UPDATED)
    The twin support vector machine and its extensions have made great achievements in dealing with binary classification problems. However, it suffers from difficulties in effective solution of multi-classification and fast model selection. This work devotes to the fast regularization parameter tuning algorithm for the twin multi-class support vector machine. Specifically, a novel sample data set partition strategy is first adopted, which is the basis for the model construction. Then, combining the linear equations and block matrix theory, the Lagrangian multipliers are proved to be piecewise linear w.r.t. the regularization parameters, so that the regularization parameters are continuously updated by only solving the break points. Next, Lagrangian multipliers are proved to be 1 as the regularization parameter approaches infinity, thus, a simple yet effective initialization algorithm is devised. Finally, eight kinds of events are defined to seek for the starting event for the next iteration. Extensive experimental results on nine UCI data sets show that the proposed method can achieve comparable classification performance without solving any quadratic programming problem.  ( 2 min )
    Private Statistical Estimation of Many Quantiles. (arXiv:2302.06943v1 [stat.ML])
    This work studies the estimation of many statistical quantiles under differential privacy. More precisely, given a distribution and access to i.i.d. samples from it, we study the estimation of the inverse of its cumulative distribution function (the quantile function) at specific points. For instance, this task is of key importance in private data generation. We present two different approaches. The first one consists in privately estimating the empirical quantiles of the samples and using this result as an estimator of the quantiles of the distribution. In particular, we study the statistical properties of the recently published algorithm introduced by Kaplan et al. 2022 that privately estimates the quantiles recursively. The second approach is to use techniques of density estimation in order to uniformly estimate the quantile function on an interval. In particular, we show that there is a tradeoff between the two methods. When we want to estimate many quantiles, it is better to estimate the density rather than estimating the quantile function at specific points.  ( 2 min )
    PrefixMol: Target- and Chemistry-aware Molecule Design via Prefix Embedding. (arXiv:2302.07120v1 [cs.AI])
    Is there a unified model for generating molecules considering different conditions, such as binding pockets and chemical properties? Although target-aware generative models have made significant advances in drug design, they do not consider chemistry conditions and cannot guarantee the desired chemical properties. Unfortunately, merging the target-aware and chemical-aware models into a unified model to meet customized requirements may lead to the problem of negative transfer. Inspired by the success of multi-task learning in the NLP area, we use prefix embeddings to provide a novel generative model that considers both the targeted pocket's circumstances and a variety of chemical properties. All conditional information is represented as learnable features, which the generative model subsequently employs as a contextual prompt. Experiments show that our model exhibits good controllability in both single and multi-conditional molecular generation. The controllability enables us to outperform previous structure-based drug design methods. More interestingly, we open up the attention mechanism and reveal coupling relationships between conditions, providing guidance for multi-conditional molecule generation.
    Achieving on-Mobile Real-Time Super-Resolution with Neural Architecture and Pruning Search. (arXiv:2108.08910v2 [eess.IV] UPDATED)
    Though recent years have witnessed remarkable progress in single image super-resolution (SISR) tasks with the prosperous development of deep neural networks (DNNs), the deep learning methods are confronted with the computation and memory consumption issues in practice, especially for resource-limited platforms such as mobile devices. To overcome the challenge and facilitate the real-time deployment of SISR tasks on mobile, we combine neural architecture search with pruning search and propose an automatic search framework that derives sparse super-resolution (SR) models with high image quality while satisfying the real-time inference requirement. To decrease the search cost, we leverage the weight sharing strategy by introducing a supernet and decouple the search problem into three stages, including supernet construction, compiler-aware architecture and pruning search, and compiler-aware pruning ratio search. With the proposed framework, we are the first to achieve real-time SR inference (with only tens of milliseconds per frame) for implementing 720p resolution with competitive image quality (in terms of PSNR and SSIM) on mobile platforms (Samsung Galaxy S20).
    Automatic Segmentation of Aircraft Dents in Point Clouds. (arXiv:2205.01614v2 [cs.CV] UPDATED)
    Dents on the aircraft skin are frequent and may easily go undetected during airworthiness checks, as their inspection process is tedious and extremely subject to human factors and environmental conditions. Nowadays, 3D scanning technologies are being proposed for more reliable, human-independent measurements, yet the process of inspection and reporting remains laborious and time consuming because data acquisition and validation are still carried out by the engineer. For full automation of dent inspection, the acquired point cloud data must be analysed via a reliable segmentation algorithm, releasing humans from the search and evaluation of damage. This paper reports on two developments towards automated dent inspection. The first is a method to generate a synthetic dataset of dented surfaces to train a fully convolutional neural network. The training of machine learning algorithms needs a substantial volume of dent data, which is not readily available. Dents are thus simulated in random positions and shapes, within criteria and definitions of a Boeing 737 structural repair manual. The noise distribution from the scanning apparatus is then added to reflect the complete process of 3D point acquisition on the training. The second proposition is a surface fitting strategy to convert 3D point clouds to 2.5D. This allows higher resolution point clouds to be processed with a small amount of memory compared with state-of-the-art methods involving 3D sampling approaches. Simulations with available ground truth data show that the proposed technique reaches an intersection-over-union of over 80%. Experiments over dent samples prove an effective detection of dents with a speed of over 500 000 points per second.
    Robust Deep Reinforcement Learning through Regret Neighborhoods. (arXiv:2302.06912v1 [cs.LG])
    Deep Reinforcement Learning (DRL) policies have been shown to be vulnerable to small adversarial noise in observations. Such adversarial noise can have disastrous consequences in safety-critical environments. For instance, a self-driving car receiving adversarially perturbed sensory observations about nearby signs (e.g., a stop sign physically altered to be perceived as a speed limit sign) or objects (e.g., cars altered to be recognized as trees) can be fatal. Existing approaches for making RL algorithms robust to an observation-perturbing adversary have focused on reactive approaches that iteratively improve against adversarial examples generated at each iteration. While such approaches have been shown to provide improvements over regular RL methods, they are reactive and can fare significantly worse if certain categories of adversarial examples are not generated during training. To that end, we pursue a more proactive approach that relies on directly optimizing a well-studied robustness measure, regret instead of expected value. We provide a principled approach that minimizes maximum regret over a "neighborhood" of observations to the received "observation". Our regret criterion can be used to modify existing value- and policy-based Deep RL methods. We demonstrate that our approaches provide a significant improvement in performance across a wide variety of benchmarks against leading approaches for robust Deep RL.
    SOAR: Simultaneous Or of And Rules for Classification of Positive & Negative Classes. (arXiv:2008.11249v3 [stat.ML] UPDATED)
    Algorithmic decision making has proliferated and now impacts our daily lives in both mundane and consequential ways. Machine learning practitioners make use of a myriad of algorithms for predictive models in applications as diverse as movie recommendations, medical diagnoses, and parole recommendations without delving into the reasons driving specific predictive decisions. Machine learning algorithms in such applications are often chosen for their superior performance, however popular choices such as random forest and deep neural networks fail to provide an interpretable understanding of the predictive model. In recent years, rule-based algorithms have been used to address this issue. Wang et al. (2017) presented an or-of-and (disjunctive normal form) based classification technique that allows for classification rule mining of a single class in a binary classification; this method is also shown to perform comparably to other modern algorithms. In this work, we extend this idea to provide classification rules for both classes simultaneously. That is, we provide a distinct set of rules for both positive and negative classes. In describing this approach, we also present a novel and complete taxonomy of classifications that clearly capture and quantify the inherent ambiguity in noisy binary classifications in the real world. We show that this approach leads to a more granular formulation of the likelihood model and a simulated-annealing based optimization achieves classification performance competitive with comparable techniques. We apply our method to synthetic as well as real world data sets to compare with other related methods that demonstrate the utility of our proposal.
    A Sparse Graph-Structured Lasso Mixed Model for Genetic Association with Confounding Correction. (arXiv:1711.04162v2 [cs.LG] UPDATED)
    While linear mixed model (LMM) has shown a competitive performance in correcting spurious associations raised by population stratification, family structures, and cryptic relatedness, more challenges are still to be addressed regarding the complex structure of genotypic and phenotypic data. For example, geneticists have discovered that some clusters of phenotypes are more co-expressed than others. Hence, a joint analysis that can utilize such relatedness information in a heterogeneous data set is crucial for genetic modeling. We proposed the sparse graph-structured linear mixed model (sGLMM) that can incorporate the relatedness information from traits in a dataset with confounding correction. Our method is capable of uncovering the genetic associations of a large number of phenotypes together while considering the relatedness of these phenotypes. Through extensive simulation experiments, we show that the proposed model outperforms other existing approaches and can model correlation from both population structure and shared signals. Further, we validate the effectiveness of sGLMM in the real-world genomic dataset on two different species from plants and humans. In Arabidopsis thaliana data, sGLMM behaves better than all other baseline models for 63.4% traits. We also discuss the potential causal genetic variation of Human Alzheimer's disease discovered by our model and justify some of the most important genetic loci.
    Bridge the Gap Between CV and NLP! An Optimization-based Textual Adversarial Attack Framework. (arXiv:2110.15317v3 [cs.CL] UPDATED)
    Despite recent success on various tasks, deep learning techniques still perform poorly on adversarial examples with small perturbations. While optimization-based methods for adversarial attacks are well-explored in the field of computer vision, it is impractical to directly apply them in natural language processing due to the discrete nature of the text. To address the problem, we propose a unified framework to extend the existing optimization-based adversarial attack methods in the vision domain to craft textual adversarial samples. In this framework, continuously optimized perturbations are added to the embedding layer and amplified in the forward propagation process. Then the final perturbed latent representations are decoded with a masked language model head to obtain potential adversarial samples. In this paper, we instantiate our framework with an attack algorithm named Textual Projected Gradient Descent (T-PGD). We find our algorithm effective even using proxy gradient information. Therefore, we perform the more challenging transfer black-box attack and conduct comprehensive experiments to evaluate our attack algorithm with several models on three benchmark datasets. Experimental results demonstrate that our method achieves an overall better performance and produces more fluent and grammatical adversarial samples compared to strong baseline methods. All the code and data will be made public.
    CoTV: Cooperative Control for Traffic Light Signals and Connected Autonomous Vehicles using Deep Reinforcement Learning. (arXiv:2201.13143v2 [cs.AI] UPDATED)
    The target of reducing travel time only is insufficient to support the development of future smart transportation systems. To align with the United Nations Sustainable Development Goals (UN-SDG), a further reduction of fuel and emissions, improvements of traffic safety, and the ease of infrastructure deployment and maintenance should also be considered. Different from existing work focusing on the optimization of the control in either traffic light signal (to improve the intersection throughput), or vehicle speed (to stabilize the traffic), this paper presents a multi-agent Deep Reinforcement Learning (DRL) system called CoTV, which Cooperatively controls both Traffic light signals and Connected Autonomous Vehicles (CAV). Therefore, our CoTV can well balance the achievement of the reduction of travel time, fuel, and emissions. In the meantime, CoTV can also be easy to deploy by cooperating with only one CAV that is the nearest to the traffic light controller on each incoming road. This enables more efficient coordination between traffic light controllers and CAV, thus leading to the convergence of training CoTV under the large-scale multi-agent scenario that is traditionally difficult to converge. We give the detailed system design of CoTV and demonstrate its effectiveness in a simulation study using SUMO under various grid maps and realistic urban scenarios with mixed-autonomy traffic.
    Cauchy Loss Function: Robustness Under Gaussian and Cauchy Noise. (arXiv:2302.07238v1 [cs.LG])
    In supervised machine learning, the choice of loss function implicitly assumes a particular noise distribution over the data. For example, the frequently used mean squared error (MSE) loss assumes a Gaussian noise distribution. The choice of loss function during training and testing affects the performance of artificial neural networks (ANNs). It is known that MSE may yield substandard performance in the presence of outliers. The Cauchy loss function (CLF) assumes a Cauchy noise distribution, and is therefore potentially better suited for data with outliers. This papers aims to determine the extent of robustness and generalisability of the CLF as compared to MSE. CLF and MSE are assessed on a few handcrafted regression problems, and a real-world regression problem with artificially simulated outliers, in the context of ANN training. CLF yielded results that were either comparable to or better than the results yielded by MSE, with a few notable exceptions.
    Learning Graph ARMA Processes from Time-Vertex Spectra. (arXiv:2302.06887v1 [stat.ML])
    The modeling of time-varying graph signals as stationary time-vertex stochastic processes permits the inference of missing signal values by efficiently employing the correlation patterns of the process across different graph nodes and time instants. In this study, we first propose an algorithm for computing graph autoregressive moving average (graph ARMA) processes based on learning the joint time-vertex power spectral density of the process from its incomplete realizations. Our solution relies on first roughly estimating the joint spectrum of the process from partially observed realizations and then refining this estimate by projecting it onto the spectrum manifold of the ARMA process. We then present a theoretical analysis of the sample complexity of learning graph ARMA processes. Experimental results show that the proposed approach achieves improvement in the time-vertex signal estimation performance in comparison with reference approaches in the literature.
    Stochastic Modified Flows, Mean-Field Limits and Dynamics of Stochastic Gradient Descent. (arXiv:2302.07125v1 [math.PR])
    We propose new limiting dynamics for stochastic gradient descent in the small learning rate regime called stochastic modified flows. These SDEs are driven by a cylindrical Brownian motion and improve the so-called stochastic modified equations by having regular diffusion coefficients and by matching the multi-point statistics. As a second contribution, we introduce distribution dependent stochastic modified flows which we prove to describe the fluctuating limiting dynamics of stochastic gradient descent in the small learning rate - infinite width scaling regime.
    Tetris-inspired detector with neural network for radiation mapping. (arXiv:2302.07099v1 [physics.ins-det])
    In recent years, radiation mapping has attracted widespread research attention and increased public concerns on environmental monitoring. In terms of both materials and their configurations, radiation detectors have been developed to locate the directions and positions of the radiation sources. In this process, algorithm is essential in converting detector signals to radiation source information. However, due to the complex mechanisms of radiation-matter interaction and the current limitation of data collection, high-performance, low-cost radiation mapping is still challenging. Here we present a computational framework using Tetris-inspired detector pixels and machine learning for radiation mapping. Using inter-pixel padding to increase the contrast between pixels and neural network to analyze the detector readings, a detector with as few as four pixels can achieve high-resolution directional mapping. By further imposing Maximum a Posteriori (MAP) with a moving detector, further radiation position localization is achieved. Non-square, Tetris-shaped detector can further improve performance beyond the conventional grid-shaped detector. Our framework offers a new avenue for high quality radiation mapping with least number of detector pixels possible, and is anticipated to be capable to deploy for real-world radiation detection with moderate validation.
    Energy Transformer. (arXiv:2302.07253v1 [cs.LG])
    Transformers have become the de facto models of choice in machine learning, typically leading to impressive performance on many applications. At the same time, the architectural development in the transformer world is mostly driven by empirical findings, and the theoretical understanding of their architectural building blocks is rather limited. In contrast, Dense Associative Memory models or Modern Hopfield Networks have a well-established theoretical foundation, but have not yet demonstrated truly impressive practical results. We propose a transformer architecture that replaces the sequence of feedforward transformer blocks with a single large Associative Memory model. Our novel architecture, called Energy Transformer (or ET for short), has many of the familiar architectural primitives that are often used in the current generation of transformers. However, it is not identical to the existing architectures. The sequence of transformer layers in ET is purposely designed to minimize a specifically engineered energy function, which is responsible for representing the relationships between the tokens. As a consequence of this computational principle, the attention in ET is different from the conventional attention mechanism. In this work, we introduce the theoretical foundations of ET, explore it's empirical capabilities using the image completion task, and obtain strong quantitative results on the graph anomaly detection task.
    Where to Diffuse, How to Diffuse, and How to Get Back: Automated Learning for Multivariate Diffusions. (arXiv:2302.07261v1 [cs.LG])
    Diffusion-based generative models (DBGMs) perturb data to a target noise distribution and reverse this inference diffusion process to generate samples. The choice of inference diffusion affects both likelihoods and sample quality. For example, extending the inference process with auxiliary variables leads to improved sample quality. While there are many such multivariate diffusions to explore, each new one requires significant model-specific analysis, hindering rapid prototyping and evaluation. In this work, we study Multivariate Diffusion Models (MDMs). For any number of auxiliary variables, we provide a recipe for maximizing a lower-bound on the MDMs likelihood without requiring any model-specific analysis. We then demonstrate how to parameterize the diffusion for a specified target noise distribution; these two points together enable optimizing the inference diffusion process. Optimizing the diffusion expands easy experimentation from just a few well-known processes to an automatic search over all linear diffusions. To demonstrate these ideas, we introduce two new specific diffusions as well as learn a diffusion process on the MNIST, CIFAR10, and ImageNet32 datasets. We show learned MDMs match or surpass bits-per-dims (BPDs) relative to fixed choices of diffusions for a given dataset and model architecture.
    Scalable Bayesian optimization with high-dimensional outputs using randomized prior networks. (arXiv:2302.07260v1 [cs.LG])
    Several fundamental problems in science and engineering consist of global optimization tasks involving unknown high-dimensional (black-box) functions that map a set of controllable variables to the outcomes of an expensive experiment. Bayesian Optimization (BO) techniques are known to be effective in tackling global optimization problems using a relatively small number objective function evaluations, but their performance suffers when dealing with high-dimensional outputs. To overcome the major challenge of dimensionality, here we propose a deep learning framework for BO and sequential decision making based on bootstrapped ensembles of neural architectures with randomized priors. Using appropriate architecture choices, we show that the proposed framework can approximate functional relationships between design variables and quantities of interest, even in cases where the latter take values in high-dimensional vector spaces or even infinite-dimensional function spaces. In the context of BO, we augmented the proposed probabilistic surrogates with re-parameterized Monte Carlo approximations of multiple-point (parallel) acquisition functions, as well as methodological extensions for accommodating black-box constraints and multi-fidelity information sources. We test the proposed framework against state-of-the-art methods for BO and demonstrate superior performance across several challenging tasks with high-dimensional outputs, including a constrained optimization task involving shape optimization of rotor blades in turbo-machinery.
    Randomization for adversarial robustness: the Good, the Bad and the Ugly. (arXiv:2302.07221v1 [cs.LG])
    Deep neural networks are known to be vulnerable to adversarial attacks: A small perturbation that is imperceptible to a human can easily make a well-trained deep neural network misclassify. To defend against adversarial attacks, randomized classifiers have been proposed as a robust alternative to deterministic ones. In this work we show that in the binary classification setting, for any randomized classifier, there is always a deterministic classifier with better adversarial risk. In other words, randomization is not necessary for robustness. In many common randomization schemes, the deterministic classifiers with better risk are explicitly described: For example, we show that ensembles of classifiers are more robust than mixtures of classifiers, and randomized smoothing is more robust than input noise injection. Finally, experiments confirm our theoretical results with the two families of randomized classifiers we analyze.
    Accelerated Fuzzy C-Means Clustering Based on New Affinity Filtering and Membership Scaling. (arXiv:2302.07060v1 [cs.LG])
    Fuzzy C-Means (FCM) is a widely used clustering method. However, FCM and its many accelerated variants have low efficiency in the mid-to-late stage of the clustering process. In this stage, all samples are involved in the update of their non-affinity centers, and the fuzzy membership grades of the most of samples, whose assignment is unchanged, are still updated by calculating the samples-centers distances. All those lead to the algorithms converging slowly. In this paper, a new affinity filtering technique is developed to recognize a complete set of the non-affinity centers for each sample with low computations. Then, a new membership scaling technique is suggested to set the membership grades between each sample and its non-affinity centers to 0 and maintain the fuzzy membership grades for others. By integrating those two techniques, FCM based on new affinity filtering and membership scaling (AMFCM) is proposed to accelerate the whole convergence process of FCM. Many experimental results performed on synthetic and real-world data sets have shown the feasibility and efficiency of the proposed algorithm. Compared with the state-of-the-art algorithms, AMFCM is significantly faster and more effective. For example, AMFCM reduces the number of the iteration of FCM by 80% on average.
    The Impact of Twitter Sentiments on Stock Market Trends. (arXiv:2302.07244v1 [cs.LG])
    The Web is a vast virtual space where people can share their opinions, impacting all aspects of life and having implications for marketing and communication. The most up-to-date and comprehensive information can be found on social media because of how widespread and straightforward it is to post a message. Proportionately, they are regarded as a valuable resource for making precise market predictions. In particular, Twitter has developed into a potent tool for understanding user sentiment. This article examines how well tweets can influence stock symbol trends. We analyze the volume, sentiment, and mentions of the top five stock symbols in the S&P 500 index on Twitter over three months. Long Short-Term Memory, Bernoulli Na\"ive Bayes, and Random Forest were the three algorithms implemented in this process. Our study revealed a significant correlation between stock prices and Twitter sentiment.
    Discrete fully probabilistic design: towards a control pipeline for the synthesis of policies from examples. (arXiv:2112.11210v2 [eess.SY] UPDATED)
    We present the principled design of a control pipeline for the synthesis of policies from examples data. The pipeline, based on a discretized design which we term as discrete fully probabilistic design, expounds an algorithm recently introduced in Gagliardi and Russo (2021) to synthesize policies from examples for constrained, stochastic and nonlinear systems. Contrary to other approaches, the pipeline we present: (i) does not need the constraints to be fulfilled in the possibly noisy example data; (ii) enables control synthesis even when the data are collected from an example system that is different from the one under control. The design is benchmarked numerically on an example that involves controlling an inverted pendulum with actuation constraints starting from data collected from a physically different pendulum that does not satisfy the system-specific actuation constraints. We also make our fully documented code openly available.
    Universal Guidance for Diffusion Models. (arXiv:2302.07121v1 [cs.CV])
    Typical diffusion models are trained to accept a particular form of conditioning, most commonly text, and cannot be conditioned on other modalities without retraining. In this work, we propose a universal guidance algorithm that enables diffusion models to be controlled by arbitrary guidance modalities without the need to retrain any use-specific components. We show that our algorithm successfully generates quality images with guidance functions including segmentation, face recognition, object detection, and classifier signals. Code is available at https://github.com/arpitbansal297/Universal-Guided-Diffusion.
    Lessons from the Development of an Anomaly Detection Interface on the Mars Perseverance Rover using the ISHMAP Framework. (arXiv:2302.07187v1 [cs.HC])
    While anomaly detection stands among the most important and valuable problems across many scientific domains, anomaly detection research often focuses on AI methods that can lack the nuance and interpretability so critical to conducting scientific inquiry. In this application paper we present the results of utilizing an alternative approach that situates the mathematical framing of machine learning based anomaly detection within a participatory design framework. In a collaboration with NASA scientists working with the PIXL instrument studying Martian planetary geochemistry as a part of the search for extra-terrestrial life; we report on over 18 months of in-context user research and co-design to define the key problems NASA scientists face when looking to detect and interpret spectral anomalies. We address these problems and develop a novel spectral anomaly detection toolkit for PIXL scientists that is highly accurate while maintaining strong transparency to scientific interpretation. We also describe outcomes from a yearlong field deployment of the algorithm and associated interface. Finally we introduce a new design framework which we developed through the course of this collaboration for co-creating anomaly detection algorithms: Iterative Semantic Heuristic Modeling of Anomalous Phenomena (ISHMAP), which provides a process for scientists and researchers to produce natively interpretable anomaly detection models. This work showcases an example of successfully bridging methodologies from AI and HCI within a scientific domain, and provides a resource in ISHMAP which may be used by other researchers and practitioners looking to partner with other scientific teams to achieve better science through more effective and interpretable anomaly detection tools.
    Multi-Prototypes Convex Merging Based K-Means Clustering Algorithm. (arXiv:2302.07045v1 [cs.LG])
    K-Means algorithm is a popular clustering method. However, it has two limitations: 1) it gets stuck easily in spurious local minima, and 2) the number of clusters k has to be given a priori. To solve these two issues, a multi-prototypes convex merging based K-Means clustering algorithm (MCKM) is presented. First, based on the structure of the spurious local minima of the K-Means problem, a multi-prototypes sampling (MPS) is designed to select the appropriate number of multi-prototypes for data with arbitrary shapes. A theoretical proof is given to guarantee that the multi-prototypes selected by MPS can achieve a constant factor approximation to the optimal cost of the K-Means problem. Then, a merging technique, called convex merging (CM), merges the multi-prototypes to get a better local minima without k being given a priori. Specifically, CM can obtain the optimal merging and estimate the correct k. By integrating these two techniques with K-Means algorithm, the proposed MCKM is an efficient and explainable clustering algorithm for escaping the undesirable local minima of K-Means problem without given k first. Experimental results performed on synthetic and real-world data sets have verified the effectiveness of the proposed algorithm.  ( 2 min )
    A Complete Expressiveness Hierarchy for Subgraph GNNs via Subgraph Weisfeiler-Lehman Tests. (arXiv:2302.07090v1 [cs.LG])
    Recently, subgraph GNNs have emerged as an important direction for developing expressive graph neural networks (GNNs). While numerous architectures have been proposed, so far there is still a limited understanding of how various design paradigms differ in terms of expressive power, nor is it clear what design principle achieves maximal expressiveness with minimal architectural complexity. Targeting these fundamental questions, this paper conducts a systematic study of general node-based subgraph GNNs through the lens of Subgraph Weisfeiler-Lehman Tests (SWL). Our central result is to build a complete hierarchy of SWL with strictly growing expressivity. Concretely, we prove that any node-based subgraph GNN falls into one of the six SWL equivalence classes, among which $\mathsf{SSWL}$ achieves the maximal expressive power. We also study how these equivalence classes differ in terms of their practical expressiveness such as encoding graph distance and biconnectivity. In addition, we give a tight expressivity upper bound of all SWL algorithms by establishing a close relation with localized versions of Folklore WL tests (FWL). Overall, our results provide insights into the power of existing subgraph GNNs, guide the design of new architectures, and point out their limitations by revealing an inherent gap with the 2-FWL test. Finally, experiments on the ZINC benchmark demonstrate that $\mathsf{SSWL}$-inspired subgraph GNNs can significantly outperform prior architectures despite great simplicity.
    Data pruning and neural scaling laws: fundamental limitations of score-based algorithms. (arXiv:2302.06960v1 [stat.ML])
    Data pruning algorithms are commonly used to reduce the memory and computational cost of the optimization process. Recent empirical results reveal that random data pruning remains a strong baseline and outperforms most existing data pruning methods in the high compression regime, i.e., where a fraction of $30\%$ or less of the data is kept. This regime has recently attracted a lot of interest as a result of the role of data pruning in improving the so-called neural scaling laws; in [Sorscher et al.], the authors showed the need for high-quality data pruning algorithms in order to beat the sample power law. In this work, we focus on score-based data pruning algorithms and show theoretically and empirically why such algorithms fail in the high compression regime. We demonstrate ``No Free Lunch" theorems for data pruning and present calibration protocols that enhance the performance of existing pruning algorithms in this high compression regime using randomization.
    Solar Wind Speed Estimate with Machine Learning Ensemble Models for LISA. (arXiv:2302.06740v1 [astro-ph.HE])
    In this work we study the potentialities of machine learning models in reconstructing the solar wind speed observations gathered in the first Lagrangian point by the ACE satellite in 2016--2017 using as input data galactic cosmic-ray flux variations measured with particle detectors hosted onboard the LISA Pathfinder mission also orbiting around L1 during the same years. We show that ensemble models composed of heterogeneous weak regressors are able to outperform weak regressors in terms of predictive accuracy. Machine learning and other powerful predictive algorithms open a window on the possibility of substituting dedicated instrumentation with software models acting as surrogates for diagnostics of space missions such as LISA and space weather science.
    Effective Dimension in Bandit Problems under Censorship. (arXiv:2302.06916v1 [cs.LG])
    In this paper, we study both multi-armed and contextual bandit problems in censored environments. Our goal is to estimate the performance loss due to censorship in the context of classical algorithms designed for uncensored environments. Our main contributions include the introduction of a broad class of censorship models and their analysis in terms of the effective dimension of the problem -- a natural measure of its underlying statistical complexity and main driver of the regret bound. In particular, the effective dimension allows us to maintain the structure of the original problem at first order, while embedding it in a bigger space, and thus naturally leads to results analogous to uncensored settings. Our analysis involves a continuous generalization of the Elliptical Potential Inequality, which we believe is of independent interest. We also discover an interesting property of decision-making under censorship: a transient phase during which initial misspecification of censorship is self-corrected at an extra cost, followed by a stationary phase that reflects the inherent slowdown of learning governed by the effective dimension. Our results are useful for applications of sequential decision-making models where the feedback received depends on strategic uncertainty (e.g., agents' willingness to follow a recommendation) and/or random uncertainty (e.g., loss or delay in arrival of information).
    SCONNA: A Stochastic Computing Based Optical Accelerator for Ultra-Fast, Energy-Efficient Inference of Integer-Quantized CNNs. (arXiv:2302.07036v1 [cs.AR])
    The acceleration of a CNN inference task uses convolution operations that are typically transformed into vector-dot-product (VDP) operations. Several photonic microring resonators (MRRs) based hardware architectures have been proposed to accelerate integer-quantized CNNs with remarkably higher throughput and energy efficiency compared to their electronic counterparts. However, the existing photonic MRR-based analog accelerators exhibit a very strong trade-off between the achievable input/weight precision and VDP operation size, which severely restricts their achievable VDP operation size for the quantized input/weight precision of 4 bits and higher. The restricted VDP operation size ultimately suppresses computing throughput to severely diminish the achievable performance benefits. To address this shortcoming, we for the first time present a merger of stochastic computing and MRR-based CNN accelerators. To leverage the innate precision flexibility of stochastic computing, we invent an MRR-based optical stochastic multiplier (OSM). We employ multiple OSMs in a cascaded manner using dense wavelength division multiplexing, to forge a novel Stochastic Computing based Optical Neural Network Accelerator (SCONNA). SCONNA achieves significantly high throughput and energy efficiency for accelerating inferences of high-precision quantized CNNs. Our evaluation for the inference of four modern CNNs at 8-bit input/weight precision indicates that SCONNA provides improvements of up to 66.5x, 90x, and 91x in frames-per-second (FPS), FPS/W and FPS/W/mm2, respectively, on average over two photonic MRR-based analog CNN accelerators from prior work, with Top-1 accuracy drop of only up to 0.4% for large CNNs and up to 1.5% for small CNNs. We developed a transaction-level, event-driven python-based simulator for the evaluation of SCONNA and other accelerators (https://github.com/uky-UCAT/SC_ONN_SIM.git).
    ALDI++: Automatic and parameter-less discord and outlier detection for building energy load profiles. (arXiv:2203.06618v2 [cs.LG] UPDATED)
    Data-driven building energy prediction is an integral part of the process for measurement and verification, building benchmarking, and building-to-grid interaction. The ASHRAE Great Energy Predictor III (GEPIII) machine learning competition used an extensive meter data set to crowdsource the most accurate machine learning workflow for whole building energy prediction. A significant component of the winning solutions was the pre-processing phase to remove anomalous training data. Contemporary pre-processing methods focus on filtering statistical threshold values or deep learning methods requiring training data and multiple hyper-parameters. A recent method named ALDI (Automated Load profile Discord Identification) managed to identify these discords using matrix profile, but the technique still requires user-defined parameters. We develop ALDI++, a method based on the previous work that bypasses user-defined parameters and takes advantage of discord similarity. We evaluate ALDI++ against a statistical threshold, variational auto-encoder, and the original ALDI as baselines in classifying discords and energy forecasting scenarios. Our results demonstrate that while the classification performance improvement over the original method is marginal, ALDI++ helps achieve the best forecasting error improving 6% over the winning's team approach with six times less computation time.
    SpeckleNN: A unified embedding for real-time speckle pattern classification in X-ray single-particle imaging with limited labeled examples. (arXiv:2302.06895v1 [cs.LG])
    With X-ray free-electron lasers (XFELs), it is possible to determine the three-dimensional structure of noncrystalline nanoscale particles using X-ray single-particle imaging (SPI) techniques at room temperature. Classifying SPI scattering patterns, or "speckles", to extract single hits that are needed for real-time vetoing and three-dimensional reconstruction poses a challenge for high data rate facilities like European XFEL and LCLS-II-HE. Here, we introduce SpeckleNN, a unified embedding model for real-time speckle pattern classification with limited labeled examples that can scale linearly with dataset size. Trained with twin neural networks, SpeckleNN maps speckle patterns to a unified embedding vector space, where similarity is measured by Euclidean distance. We highlight its few-shot classification capability on new never-seen samples and its robust performance despite only tens of labels per classification category even in the presence of substantial missing detector areas. Without the need for excessive manual labeling or even a full detector image, our classification method offers a great solution for real-time high-throughput SPI experiments.
    Do Neural Networks Generalize from Self-Averaging Sub-classifiers in the Same Way As Adaptive Boosting?. (arXiv:2302.06923v1 [cs.LG])
    In recent years, neural networks (NNs) have made giant leaps in a wide variety of domains. NNs are often referred to as black box algorithms due to how little we can explain their empirical success. Our foundational research seeks to explain why neural networks generalize. A recent advancement derived a mutual information measure for explaining the performance of deep NNs through a sequence of increasingly complex functions. We show deep NNs learn a series of boosted classifiers whose generalization is popularly attributed to self-averaging over an increasing number of interpolating sub-classifiers. To our knowledge, we are the first authors to establish the connection between generalization in boosted classifiers and generalization in deep NNs. Our experimental evidence and theoretical analysis suggest NNs trained with dropout exhibit similar self-averaging behavior over interpolating sub-classifiers as cited in popular explanations for the post-interpolation generalization phenomenon in boosting.
    Parameters for > 300 million Gaia stars: Bayesian inference vs. machine learning. (arXiv:2302.06995v1 [astro-ph.GA])
    The Gaia Data Release 3 (DR3), published in June 2022, delivers a diverse set of astrometric, photometric, and spectroscopic measurements for more than a billion stars. The wealth and complexity of the data makes traditional approaches for estimating stellar parameters for the full Gaia dataset almost prohibitive. We have explored different supervised learning methods for extracting basic stellar parameters as well as distances and line-of-sight extinctions, given spectro-photo-astrometric data (including also the new Gaia XP spectra). For training we use an enhanced high-quality dataset compiled from Gaia DR3 and ground-based spectroscopic survey data covering the whole sky and all Galactic components. We show that even with a simple neural-network architecture or tree-based algorithm (and in the absence of Gaia XP spectra), we succeed in predicting competitive results (compared to Bayesian isochrone fitting) down to faint magnitudes. We will present a new Gaia DR3 stellar-parameter catalogue obtained using the currently best-performing machine-learning algorithm for tabular data, XGBoost, in the near future.
    Concentration Bounds for Discrete Distribution Estimation in KL Divergence. (arXiv:2302.06869v1 [stat.ML])
    We study the problem of discrete distribution estimation in KL divergence and provide concentration bounds for the Laplace estimator. We show that the deviation from mean scales as $\sqrt{k}/n$ when $n \ge k$, improving upon the best prior result of $k/n$. We also establish a matching lower bound that shows that our bounds are tight up to polylogarithmic factors.
    Exploring Category Structure with Contextual Language Models and Lexical Semantic Networks. (arXiv:2302.06942v1 [cs.LG])
    Recent work on predicting category structure with distributional models, using either static word embeddings (Heyman and Heyman, 2019) or contextualized language models (CLMs) (Misra et al., 2021), report low correlations with human ratings, thus calling into question their plausibility as models of human semantic memory. In this work, we revisit this question testing a wider array of methods for probing CLMs for predicting typicality scores. Our experiments, using BERT (Devlin et al., 2018), show the importance of using the right type of CLM probes, as our best BERT-based typicality prediction methods substantially improve over previous works. Second, our results highlight the importance of polysemy in this task: our best results are obtained when using a disambiguation mechanism. Finally, additional experiments reveal that Information Contentbased WordNet (Miller, 1995), also endowed with disambiguation, match the performance of the best BERT-based method, and in fact capture complementary information, which can be combined with BERT to achieve enhanced typicality predictions.
    Improved Learning-Augmented Algorithms for the Multi-Option Ski Rental Problem via Best-Possible Competitive Analysis. (arXiv:2302.06832v1 [cs.DS])
    In this paper, we present improved learning-augmented algorithms for the multi-option ski rental problem. Learning-augmented algorithms take ML predictions as an added part of the input and incorporates these predictions in solving the given problem. Due to their unique strength that combines the power of ML predictions with rigorous performance guarantees, they have been extensively studied in the context of online optimization problems. Even though ski rental problems are one of the canonical problems in the field of online optimization, only deterministic algorithms were previously known for multi-option ski rental, with or without learning augmentation. We present the first randomized learning-augmented algorithm for this problem, surpassing previous performance guarantees given by deterministic algorithms. Our learning-augmented algorithm is based on a new, provably best-possible randomized competitive algorithm for the problem. Our results are further complemented by lower bounds for deterministic and randomized algorithms, and computational experiments evaluating our algorithms' performance improvements.
    Message Passing Meets Graph Neural Networks: A New Paradigm for Massive MIMO Systems. (arXiv:2302.06896v1 [cs.IT])
    As one of the core technologies for 5G systems, massive multiple-input multiple-output (MIMO) introduces dramatic capacity improvements along with very high beamforming and spatial multiplexing gains. When developing efficient physical layer algorithms for massive MIMO systems, message passing is one promising candidate owing to the superior performance. However, as their computational complexity increases dramatically with the problem size, the state-of-the-art message passing algorithms cannot be directly applied to future 6G systems, where an exceedingly large number of antennas are expected to be deployed. To address this issue, we propose a model-driven deep learning (DL) framework, namely the AMP-GNN for massive MIMO transceiver design, by considering the low complexity of the AMP algorithm and adaptability of GNNs. Specifically, the structure of the AMP-GNN network is customized by unfolding the approximate message passing (AMP) algorithm and introducing a graph neural network (GNN) module into it. The permutation equivariance property of AMP-GNN is proved, which enables the AMP-GNN to learn more efficiently and to adapt to different numbers of users. We also reveal the underlying reason why GNNs improve the AMP algorithm from the perspective of expectation propagation, which motivates us to amalgamate various GNNs with different message passing algorithms. In the simulation, we take the massive MIMO detection to exemplify that the proposed AMP-GNN significantly improves the performance of the AMP detector, achieves comparable performance as the state-of-the-art DL-based MIMO detectors, and presents strong robustness to various mismatches.
    Conservative State Value Estimation for Offline Reinforcement Learning. (arXiv:2302.06884v1 [cs.LG])
    Offline reinforcement learning faces a significant challenge of value over-estimation due to the distributional drift between the dataset and the current learned policy, leading to learning failure in practice. The common approach is to incorporate a penalty term to reward or value estimation in the Bellman iterations. Meanwhile, to avoid extrapolation on out-of-distribution (OOD) states and actions, existing methods focus on conservative Q-function estimation. In this paper, we propose Conservative State Value Estimation (CSVE), a new approach that learns conservative V-function via directly imposing penalty on OOD states. Compared to prior work, CSVE allows more effective in-data policy optimization with conservative value guarantees. Further, we apply CSVE and develop a practical actor-critic algorithm in which the critic does the conservative value estimation by additionally sampling and penalizing the states \emph{around} the dataset, and the actor applies advantage weighted updates extended with state exploration to improve the policy. We evaluate in classic continual control tasks of D4RL, showing that our method performs better than the conservative Q-function learning methods and is strongly competitive among recent SOTA methods.
    DiffFaceSketch: High-Fidelity Face Image Synthesis with Sketch-Guided Latent Diffusion Model. (arXiv:2302.06908v1 [cs.CV])
    Synthesizing face images from monochrome sketches is one of the most fundamental tasks in the field of image-to-image translation. However, it is still challenging to (1)~make models learn the high-dimensional face features such as geometry and color, and (2)~take into account the characteristics of input sketches. Existing methods often use sketches as indirect inputs (or as auxiliary inputs) to guide the models, resulting in the loss of sketch features or the alteration of geometry information. In this paper, we introduce a Sketch-Guided Latent Diffusion Model (SGLDM), an LDM-based network architect trained on the paired sketch-face dataset. We apply a Multi-Auto-Encoder (AE) to encode the different input sketches from different regions of a face from pixel space to a feature map in latent space, which enables us to reduce the dimension of the sketch input while preserving the geometry-related information of local face details. We build a sketch-face paired dataset based on the existing method that extracts the edge map from an image. We then introduce a Stochastic Region Abstraction (SRA), an approach to augment our dataset to improve the robustness of SGLDM to handle sketch input with arbitrary abstraction. The evaluation study shows that SGLDM can synthesize high-quality face images with different expressions, facial accessories, and hairstyles from various sketches with different abstraction levels.
    Predicting long-term collective animal behavior with deep learning. (arXiv:2302.06839v1 [cs.LG])
    Deciphering the social interactions that govern collective behavior in animal societies has greatly benefited from advancements in modern computing. Computational models diverge into two kinds of approaches: analytical models and machine learning models. This work introduces a deep learning model for social interactions in the fish species Hemigrammus rhodostomus, and compares its results to experiments and to the results of a state-of-the-art analytical model. To that end, we propose a systematic methodology to assess the faithfulness of a model, based on the introduction of a set of stringent observables. We demonstrate that machine learning models of social interactions can directly compete against their analytical counterparts. Moreover, this work demonstrates the need for consistent validation across different timescales and highlights which design aspects critically enables our deep learning approach to capture both short- and long-term dynamics. We also show that this approach is scalable to other fish species.
    Understanding Oversquashing in GNNs through the Lens of Effective Resistance. (arXiv:2302.06835v1 [cs.LG])
    Message passing graph neural networks are popular learning architectures for graph-structured data. However, it can be challenging for them to capture long range interactions in graphs. One of the potential reasons is the so-called oversquashing problem, first termed in [Alon and Yahav, 2020], that has recently received significant attention. In this paper, we analyze the oversquashing problem through the lens of effective resistance between nodes in the input graphs. The concept of effective resistance intuitively captures the "strength" of connection between two nodes by paths in the graph, and has a rich literature connecting spectral graph theory and circuit networks theory. We propose the use the concept of total effective resistance as a measure to quantify the total amount of oversquashing in a graph, and provide theoretical justification of its use. We further develop algorithms to identify edges to be added to an input graph so as to minimize the total effective resistance, thereby alleviating the oversquashing problem when using GNNs. We provide empirical evidence of the effectiveness of our total effective resistance based rewiring strategies.
    simpleKT: A Simple But Tough-to-Beat Baseline for Knowledge Tracing. (arXiv:2302.06881v1 [cs.LG])
    Knowledge tracing (KT) is the problem of predicting students' future performance based on their historical interactions with intelligent tutoring systems. Recently, many works present lots of special methods for applying deep neural networks to KT from different perspectives like model architecture, adversarial augmentation and etc., which make the overall algorithm and system become more and more complex. Furthermore, due to the lack of standardized evaluation protocol \citep{liu2022pykt}, there is no widely agreed KT baselines and published experimental comparisons become inconsistent and self-contradictory, i.e., the reported AUC scores of DKT on ASSISTments2009 range from 0.721 to 0.821 \citep{minn2018deep,yeung2018addressing}. Therefore, in this paper, we provide a strong but simple baseline method to deal with the KT task named \textsc{simpleKT}. Inspired by the Rasch model in psychometrics, we explicitly model question-specific variations to capture the individual differences among questions covering the same set of knowledge components that are a generalization of terms of concepts or skills needed for learners to accomplish steps in a task or a problem. Furthermore, instead of using sophisticated representations to capture student forgetting behaviors, we use the ordinary dot-product attention function to extract the time-aware information embedded in the student learning interactions. Extensive experiments show that such a simple baseline is able to always rank top 3 in terms of AUC scores and achieve 57 wins, 3 ties and 16 loss against 12 DLKT baseline methods on 7 public datasets of different domains. We believe this work serves as a strong baseline for future KT research. Code is available at \url{https://github.com/pykt-team/pykt-toolkit}\footnote{We merged our model to the \textsc{pyKT} benchmark at \url{https://pykt.org/}.}.
    Masked Multi-Step Probabilistic Forecasting for Short-to-Mid-Term Electricity Demand. (arXiv:2302.06818v1 [cs.LG])
    Predicting the demand for electricity with uncertainty helps in planning and operation of the grid to provide reliable supply of power to the consumers. Machine learning (ML)-based demand forecasting approaches can be categorized into (1) sample-based approaches, where each forecast is made independently, and (2) time series regression approaches, where some historical load and other feature information is used. When making a short-to-mid-term electricity demand forecast, some future information is available, such as the weather forecast and calendar variables. However, in existing forecasting models this future information is not fully incorporated. To overcome this limitation of existing approaches, we propose Masked Multi-Step Multivariate Probabilistic Forecasting (MMMPF), a novel and general framework to train any neural network model capable of generating a sequence of outputs, that combines both the temporal information from the past and the known information about the future to make probabilistic predictions. Experiments are performed on a real-world dataset for short-to-mid-term electricity demand forecasting for multiple regions and compared with various ML methods. They show that the proposed MMMPF framework outperforms not only sample-based methods but also existing time-series forecasting models with the exact same base models. Models trainded with MMMPF can also generate desired quantiles to capture uncertainty and enable probabilistic planning for grid of the future.
    That Escalated Quickly: An ML Framework for Alert Prioritization. (arXiv:2302.06648v1 [cs.CR])
    In place of in-house solutions, organizations are increasingly moving towards managed services for cyber defense. Security Operations Centers are specialized cybersecurity units responsible for the defense of an organization, but the large-scale centralization of threat detection is causing SOCs to endure an overwhelming amount of false positive alerts -- a phenomenon known as alert fatigue. Large collections of imprecise sensors, an inability to adapt to known false positives, evolution of the threat landscape, and inefficient use of analyst time all contribute to the alert fatigue problem. To combat these issues, we present That Escalated Quickly (TEQ), a machine learning framework that reduces alert fatigue with minimal changes to SOC workflows by predicting alert-level and incident-level actionability. On real-world data, the system is able to reduce the time it takes to respond to actionable incidents by $22.9\%$, suppress $54\%$ of false positives with a $95.1\%$ detection rate, and reduce the number of alerts an analyst needs to investigate within singular incidents by $14\%$.  ( 2 min )
    Horocycle Decision Boundaries for Large Margin Classification in Hyperbolic Space. (arXiv:2302.06807v1 [stat.ML])
    Hyperbolic spaces have been quite popular in the recent past for representing hierarchically organized data. Further, several classification algorithms for data in these spaces have been proposed in the literature. These algorithms mainly use either hyperplanes or geodesics for decision boundaries in a large margin classifiers setting leading to a non-convex optimization problem. In this paper, we propose a novel large margin classifier based on horocycle (horosphere) decision boundaries that leads to a geodesically convex optimization problem that can be optimized using any Riemannian gradient descent technique guaranteeing a globally optimal solution. We present several experiments depicting the performance of our classifier.
    Scalable Optimal Multiway-Split Decision Trees with Constraints. (arXiv:2302.06812v1 [cs.LG])
    There has been a surge of interest in learning optimal decision trees using mixed-integer programs (MIP) in recent years, as heuristic-based methods do not guarantee optimality and find it challenging to incorporate constraints that are critical for many practical applications. However, existing MIP methods that build on an arc-based formulation do not scale well as the number of binary variables is in the order of $\mathcal{O}(2^dN)$, where $d$ and $N$ refer to the depth of the tree and the size of the dataset. Moreover, they can only handle sample-level constraints and linear metrics. In this paper, we propose a novel path-based MIP formulation where the number of decision variables is independent of $N$. We present a scalable column generation framework to solve the MIP optimally. Our framework produces a multiway-split tree which is more interpretable than the typical binary-split trees due to its shorter rules. Our method can handle nonlinear metrics such as F1 score and incorporate a broader class of constraints. We demonstrate its efficacy with extensive experiments. We present results on datasets containing up to 1,008,372 samples while existing MIP-based decision tree models do not scale well on data beyond a few thousand points. We report superior or competitive results compared to the state-of-art MIP-based methods with up to a 24X reduction in runtime.
    Breath analysis by ultra-sensitive broadband laser spectroscopy detects SARS-CoV-2 infection. (arXiv:2202.02321v2 [physics.med-ph] UPDATED)
    Rapid testing is essential to fighting pandemics such as COVID-19, the disease caused by the SARS-CoV-2 virus. Exhaled human breath contains multiple volatile molecules providing powerful potential for non-invasive diagnosis of diverse medical conditions. We investigated breath detection of SARS-CoV-2 infection using cavity-enhanced direct frequency comb spectroscopy (CE-DFCS), a state-of-the-art laser spectroscopic technique capable of a real-time massive collection of broadband molecular absorption features at ro-vibrational quantum state resolution and at parts-per-trillion volume detection sensitivity. Using a total of 170 individual breath samples (83 positive and 87 negative with SARS-CoV-2 based on Reverse Transcription Polymerase Chain Reaction tests), we report excellent discrimination capability for SARS-CoV-2 infection with an area under the Receiver-Operating-Characteristics curve of 0.849(4). Our results support the development of CE-DFCS as an alternative, rapid, non-invasive test for COVID-19 and highlight its remarkable potential for optical diagnoses of diverse biological conditions and disease states.
    Simple Hardware-Efficient Long Convolutions for Sequence Modeling. (arXiv:2302.06646v1 [cs.LG])
    State space models (SSMs) have high performance on long sequence modeling but require sophisticated initialization techniques and specialized implementations for high quality and runtime performance. We study whether a simple alternative can match SSMs in performance and efficiency: directly learning long convolutions over the sequence. We find that a key requirement to achieving high performance is keeping the convolution kernels smooth. We find that simple interventions--such as squashing the kernel weights--result in smooth kernels and recover SSM performance on a range of tasks including the long range arena, image classification, language modeling, and brain data modeling. Next, we develop FlashButterfly, an IO-aware algorithm to improve the runtime performance of long convolutions. FlashButterfly appeals to classic Butterfly decompositions of the convolution to reduce GPU memory IO and increase FLOP utilization. FlashButterfly speeds up convolutions by 2.2$\times$, and allows us to train on Path256, a challenging task with sequence length 64K, where we set state-of-the-art by 29.1 points while training 7.2$\times$ faster than prior work. Lastly, we introduce an extension to FlashButterfly that learns the coefficients of the Butterfly decomposition, increasing expressivity without increasing runtime. Using this extension, we outperform a Transformer on WikiText103 by 0.2 PPL with 30% fewer parameters.  ( 2 min )
    B-BACN: Bayesian Boundary-Aware Convolutional Network for Defect Characterization. (arXiv:2302.06827v1 [cs.CV])
    Detecting accurate crack boundaries is important for condition monitoring, prognostics, and maintenance scheduling. In this work, we propose a Bayesian Boundary-Aware Convolutional Network (B-BACN) to tackle this problem, that emphasizes the importance of both uncertainty quantification and boundary refinement for producing accurate and trustworthy detections of defect boundaries. We formulate the inspection model using multi-task learning. The epistemic uncertainty is learned using Monte Carlo Dropout, and the model also learns to predict each samples aleatoric uncertainty. A boundary refinement loss is added to improve the determination of defect boundaries. Experimental results demonstrate the effectiveness of the proposed method in accurately identifying crack boundaries, reducing misclassification and enhancing model calibration.
    Improved Regret Bounds for Linear Adversarial MDPs via Linear Optimization. (arXiv:2302.06834v1 [cs.LG])
    Learning Markov decision processes (MDP) in an adversarial environment has been a challenging problem. The problem becomes even more challenging with function approximation, since the underlying structure of the loss function and transition kernel are especially hard to estimate in a varying environment. In fact, the state-of-the-art results for linear adversarial MDP achieve a regret of $\tilde{O}(K^{6/7})$ ($K$ denotes the number of episodes), which admits a large room for improvement. In this paper, we investigate the problem with a new view, which reduces linear MDP into linear optimization by subtly setting the feature maps of the bandit arms of linear optimization. This new technique, under an exploratory assumption, yields an improved bound of $\tilde{O}(K^{4/5})$ for linear adversarial MDP without access to a transition simulator. The new view could be of independent interest for solving other MDP problems that possess a linear structure.  ( 2 min )
    Achieving Better Regret against Strategic Adversaries. (arXiv:2302.06652v1 [cs.LG])
    We study online learning problems in which the learner has extra knowledge about the adversary's behaviour, i.e., in game-theoretic settings where opponents typically follow some no-external regret learning algorithms. Under this assumption, we propose two new online learning algorithms, Accurate Follow the Regularized Leader (AFTRL) and Prod-Best Response (Prod-BR), that intensively exploit this extra knowledge while maintaining the no-regret property in the worst-case scenario of having inaccurate extra information. Specifically, AFTRL achieves $O(1)$ external regret or $O(1)$ \emph{forward regret} against no-external regret adversary in comparison with $O(\sqrt{T})$ \emph{dynamic regret} of Prod-BR. To the best of our knowledge, our algorithm is the first to consider forward regret that achieves $O(1)$ regret against strategic adversaries. When playing zero-sum games with Accurate Multiplicative Weights Update (AMWU), a special case of AFTRL, we achieve \emph{last round convergence} to the Nash Equilibrium. We also provide numerical experiments to further support our theoretical results. In particular, we demonstrate that our methods achieve significantly better regret bounds and rate of last round convergence, compared to the state of the art (e.g., Multiplicative Weights Update (MWU) and its optimistic counterpart, OMWU).  ( 2 min )
    Self-mediated exploration in artificial intelligence inspired by cognitive psychology. (arXiv:2302.06615v1 [cs.AI])
    Exploration of the physical environment is an indispensable precursor to data acquisition and enables knowledge generation via analytical or direct trialing. Artificial Intelligence lacks the exploratory capabilities of even the most underdeveloped organisms, hindering its autonomy and adaptability. Supported by cognitive psychology, this works links human behavior and artificial agents to endorse self-development. In accordance with reported data, paradigms of epistemic and achievement emotion are embedded to machine-learning methodology contingent on their impact when decision making. A study is subsequently designed to mirror previous human trials, which artificial agents are made to undergo repeatedly towards convergence. Results demonstrate causality, learned by the vast majority of agents, between their internal states and exploration to match those reported for human counterparts. The ramifications of these findings are pondered for both research into human cognition and betterment of artificial intelligence.  ( 2 min )
    Lightsolver challenges a leading deep learning solver for Max-2-SAT problems. (arXiv:2302.06926v1 [quant-ph])
    Maximum 2-satisfiability (MAX-2-SAT) is a type of combinatorial decision problem that is known to be NP-hard. In this paper, we compare LightSolver's quantum-inspired algorithm to a leading deep-learning solver for the MAX-2-SAT problem. Experiments on benchmark data sets show that LightSolver achieves significantly smaller time-to-optimal-solution compared to a state-of-the-art deep-learning algorithm, where the gain in performance tends to increase with the problem size.
    Improving Interpretability of Deep Sequential Knowledge Tracing Models with Question-centric Cognitive Representations. (arXiv:2302.06885v1 [cs.LG])
    Knowledge tracing (KT) is a crucial technique to predict students' future performance by observing their historical learning processes. Due to the powerful representation ability of deep neural networks, remarkable progress has been made by using deep learning techniques to solve the KT problem. The majority of existing approaches rely on the \emph{homogeneous question} assumption that questions have equivalent contributions if they share the same set of knowledge components. Unfortunately, this assumption is inaccurate in real-world educational scenarios. Furthermore, it is very challenging to interpret the prediction results from the existing deep learning based KT models. Therefore, in this paper, we present QIKT, a question-centric interpretable KT model to address the above challenges. The proposed QIKT approach explicitly models students' knowledge state variations at a fine-grained level with question-sensitive cognitive representations that are jointly learned from a question-centric knowledge acquisition module and a question-centric problem solving module. Meanwhile, the QIKT utilizes an item response theory based prediction layer to generate interpretable prediction results. The proposed QIKT model is evaluated on three public real-world educational datasets. The results demonstrate that our approach is superior on the KT prediction task, and it outperforms a wide range of deep learning based KT models in terms of prediction accuracy with better model interpretability. To encourage reproducible results, we have provided all the datasets and code at \url{https://pykt.org/}.
    In Search for a Generalizable Method for Source Free Domain Adaptation. (arXiv:2302.06658v1 [cs.LG])
    Source-free domain adaptation (SFDA) is compelling because it allows adapting an off-the-shelf model to a new domain using only unlabelled data. In this work, we apply existing SFDA techniques to a challenging set of naturally-occurring distribution shifts in bioacoustics, which are very different from the ones commonly studied in computer vision. We find existing methods perform differently relative to each other than observed in vision benchmarks, and sometimes perform worse than no adaptation at all. We propose a new simple method which outperforms the existing methods on our new shifts while exhibiting strong performance on a range of vision datasets. Our findings suggest that existing SFDA methods are not as generalizable as previously thought and that considering diverse modalities can be a useful avenue for designing more robust models.
    EspalomaCharge: Machine learning-enabled ultra-fast partial charge assignment. (arXiv:2302.06758v1 [cs.LG])
    Atomic partial charges are crucial parameters in molecular dynamics (MD) simulation, dictating the electrostatic contributions to intermolecular energies, and thereby the potential energy landscape. Traditionally, the assignment of partial charges has relied on surrogates of \textit{ab initio} semiempirical quantum chemical methods such as AM1-BCC, and is expensive for large systems or large numbers of molecules. We propose a hybrid physical / graph neural network-based approximation to the widely popular AM1-BCC charge model that is orders of magnitude faster while maintaining accuracy comparable to differences in AM1-BCC implementations. Our hybrid approach couples a graph neural network to a streamlined charge equilibration approach in order to predict molecule-specific atomic electronegativity and hardness parameters, followed by analytical determination of optimal charge-equilibrated parameters that preserves total molecular charge. This hybrid approach scales linearly with the number of atoms, enabling, for the first time, the use of fully consistent charge models for small molecules and biopolymers for the construction of next-generation self-consistent biomolecular force fields. Implemented in the free and open source package \texttt{espaloma\_charge}, this approach provides drop-in replacements for both AmberTools \texttt{antechamber} and the Open Force Field Toolkit charging workflows, in addition to stand-alone charge generation interfaces. Source code is available at \url{https://github.com/choderalab/espaloma_charge}.  ( 2 min )
    Discovering Optimal Scoring Mechanisms in Causal Strategic Prediction. (arXiv:2302.06804v1 [cs.LG])
    Faced with data-driven policies, individuals will manipulate their features to obtain favorable decisions. While earlier works cast these manipulations as undesirable gaming, recent works have adopted a more nuanced causal framing in which manipulations can improve outcomes of interest, and setting coherent mechanisms requires accounting for both predictive accuracy and improvement of the outcome. Typically, these works focus on known causal graphs, consisting only of an outcome and its parents. In this paper, we introduce a general framework in which an outcome and n observed features are related by an arbitrary unknown graph and manipulations are restricted by a fixed budget and cost structure. We develop algorithms that leverage strategic responses to discover the causal graph in a finite number of steps. Given this graph structure, we can then derive mechanisms that trade off between accuracy and improvement. Altogether, our work deepens links between causal discovery and incentive design and provides a more nuanced view of learning under causal strategic prediction.  ( 2 min )
    Interference and noise cancellation for joint communication radar (JCR) system based on contextual information. (arXiv:2302.06786v1 [cs.IT])
    This paper examines the separation of wireless communication and radar signals, thereby guaranteeing cohabitation and acting as a panacea to spectrum sensing. First, considering that the channel impulse response was known by the receivers (communication and radar), we showed that the optimizing beamforming weights mitigate the interference caused by signals and improve the physical layer security (PLS) of the system. Furthermore, when the channel responses were unknown, we designed an interference filter as a low-complex noise and interference cancellation autoencoder. By mitigating the interference on the legitimate users, the PLS was guaranteed. Results showed that even for a low signal-to-noise ratio, the autoencoder produces low root-mean-square error (RMSE) values.  ( 2 min )
    Guiding Pretraining in Reinforcement Learning with Large Language Models. (arXiv:2302.06692v1 [cs.LG])
    Reinforcement learning algorithms typically struggle in the absence of a dense, well-shaped reward function. Intrinsically motivated exploration methods address this limitation by rewarding agents for visiting novel states or transitions, but these methods offer limited benefits in large environments where most discovered novelty is irrelevant for downstream tasks. We describe a method that uses background knowledge from text corpora to shape exploration. This method, called ELLM (Exploring with LLMs) rewards an agent for achieving goals suggested by a language model prompted with a description of the agent's current state. By leveraging large-scale language model pretraining, ELLM guides agents toward human-meaningful and plausibly useful behaviors without requiring a human in the loop. We evaluate ELLM in the Crafter game environment and the Housekeep robotic simulator, showing that ELLM-trained agents have better coverage of common-sense behaviors during pretraining and usually match or improve performance on a range of downstream tasks.  ( 2 min )
    Breaking the Lower Bound with (Little) Structure: Acceleration in Non-Convex Stochastic Optimization with Heavy-Tailed Noise. (arXiv:2302.06763v1 [cs.LG])
    We consider the stochastic optimization problem with smooth but not necessarily convex objectives in the heavy-tailed noise regime, where the stochastic gradient's noise is assumed to have bounded $p$th moment ($p\in(1,2]$). Zhang et al. (2020) is the first to prove the $\Omega(T^{\frac{1-p}{3p-2}})$ lower bound for convergence (in expectation) and provides a simple clipping algorithm that matches this optimal rate. Cutkosky and Mehta (2021) proposes another algorithm, which is shown to achieve the nearly optimal high-probability convergence guarantee $O(\log(T/\delta)T^{\frac{1-p}{3p-2}})$, where $\delta$ is the probability of failure. However, this desirable guarantee is only established under the additional assumption that the stochastic gradient itself is bounded in $p$th moment, which fails to hold even for quadratic objectives and centered Gaussian noise. In this work, we first improve the analysis of the algorithm in Cutkosky and Mehta (2021) to obtain the same nearly optimal high-probability convergence rate $O(\log(T/\delta)T^{\frac{1-p}{3p-2}})$, without the above-mentioned restrictive assumption. Next, and curiously, we show that one can achieve a faster rate than that dictated by the lower bound $\Omega(T^{\frac{1-p}{3p-2}})$ with only a tiny bit of structure, i.e., when the objective function $F(x)$ is assumed to be in the form of $\mathbb{E}_{\Xi\sim\mathcal{D}}[f(x,\Xi)]$, arguably the most widely applicable class of stochastic optimization problems. For this class of problems, we propose the first variance-reduced accelerated algorithm and establish that it guarantees a high-probability convergence rate of $O(\log(T/\delta)T^{\frac{1-p}{2p-1}})$ under a mild condition, which is faster than $\Omega(T^{\frac{1-p}{3p-2}})$. Notably, even when specialized to the finite-variance case, our result yields the (near-)optimal high-probability rate $O(\log(T/\delta)T^{-1/3})$.  ( 2 min )
    On the Planning Abilities of Large Language Models (A Critical Investigation with a Proposed Benchmark). (arXiv:2302.06706v1 [cs.AI])
    Intrigued by the claims of emergent reasoning capabilities in LLMs trained on general web corpora, in this paper, we set out to investigate their planning capabilities. We aim to evaluate (1) how good LLMs are by themselves in generating and validating simple plans in commonsense planning tasks (of the type that humans are generally quite good at) and (2) how good LLMs are in being a source of heuristic guidance for other agents--either AI planners or human planners--in their planning tasks. To investigate these questions in a systematic rather than anecdotal manner, we start by developing a benchmark suite based on the kinds of domains employed in the International Planning Competition. On this benchmark, we evaluate LLMs in three modes: autonomous, heuristic and human-in-the-loop. Our results show that LLM's ability to autonomously generate executable plans is quite meager, averaging only about 3% success rate. The heuristic and human-in-the-loop modes show slightly more promise. In addition to these results, we also make our benchmark and evaluation tools available to support investigations by research community.  ( 2 min )
    OpenHLS: High-Level Synthesis for Low-Latency Deep Neural Networks for Experimental Science. (arXiv:2302.06751v1 [cs.AR])
    In many experiment-driven scientific domains, such as high-energy physics, material science, and cosmology, high data rate experiments impose hard constraints on data acquisition systems: collected data must either be indiscriminately stored for post-processing and analysis, thereby necessitating large storage capacity, or accurately filtered in real-time, thereby necessitating low-latency processing. Deep neural networks, effective in other filtering tasks, have not been widely employed in such data acquisition systems, due to design and deployment difficulties. We present an open source, lightweight, compiler framework, without any proprietary dependencies, OpenHLS, based on high-level synthesis techniques, for translating high-level representations of deep neural networks to low-level representations, suitable for deployment to near-sensor devices such as field-programmable gate arrays. We evaluate OpenHLS on various workloads and present a case-study implementation of a deep neural network for Bragg peak detection in the context of high-energy diffraction microscopy. We show OpenHLS is able to produce an implementation of the network with a throughput 4.8 $\mu$s/sample, which is approximately a 4$\times$ improvement over the existing implementation  ( 2 min )
    Deep Learning Predicts Prevalent and Incident Parkinson's Disease From UK Biobank Fundus Imaging. (arXiv:2302.06727v1 [cs.LG])
    Parkinson's disease is the world's fastest growing neurological disorder. Research to elucidate the mechanisms of Parkinson's disease and automate diagnostics would greatly improve the treatment of patients with Parkinson's disease. Current diagnostic methods are expensive with limited availability. Considering the long progression time of Parkinson's disease, a desirable screening should be diagnostically accurate even before the onset of symptoms to allow medical intervention. We promote attention for retinal fundus imaging, often termed a window to the brain, as a diagnostic screening modality for Parkinson's disease. We conduct a systematic evaluation of conventional machine learning and deep learning techniques to classify Parkinson's disease from UK Biobank fundus imaging. Our results suggest Parkinson's disease individuals can be differentiated from age and gender matched healthy subjects with 71% accuracy. This accuracy is maintained when predicting either prevalent or incident Parkinson's disease. Explainability and trustworthiness is enhanced by visual attribution maps of localized biomarkers and quantified metrics of model robustness to data perturbations.  ( 2 min )
    Netflix and Forget: Efficient and Exact Machine Unlearning from Bi-linear Recommendations. (arXiv:2302.06676v1 [cs.LG])
    People break up, miscarry, and lose loved ones. Their online streaming and shopping recommendations, however, do not necessarily update, and may serve as unhappy reminders of their loss. When users want to renege on their past actions, they expect the recommender platforms to erase selective data at the model level. Ideally, given any specified user history, the recommender can unwind or "forget", as if the record was not part of training. To that end, this paper focuses on simple but widely deployed bi-linear models for recommendations based on matrix completion. Without incurring the cost of re-training, and without degrading the model unnecessarily, we develop Unlearn-ALS by making a few key modifications to the fine-tuning procedure under Alternating Least Squares optimisation, thus applicable to any bi-linear models regardless of the training procedure. We show that Unlearn-ALS is consistent with retraining without \emph{any} model degradation and exhibits rapid convergence, making it suitable for a large class of existing recommenders.  ( 2 min )
    System identification of neural systems: If we got it right, would we know?. (arXiv:2302.06677v1 [q-bio.NC])
    Artificial neural networks are being proposed as models of parts of the brain. The networks are compared to recordings of biological neurons, and good performance in reproducing neural responses is considered to support the model's validity. A key question is how much this system identification approach tells us about brain computation. Does it validate one model architecture over another? We evaluate the most commonly used comparison techniques, such as a linear encoding model and centered kernel alignment, to correctly identify a model by replacing brain recordings with known ground truth models. System identification performance is quite variable; it also depends significantly on factors independent of the ground truth architecture, such as stimuli images. In addition, we show the limitations of using functional similarity scores in identifying higher-level architectural motifs.  ( 2 min )
    Optimal Algorithms for the Inhomogeneous Spiked Wigner Model. (arXiv:2302.06665v1 [stat.ML])
    In this paper, we study a spiked Wigner problem with an inhomogeneous noise profile. Our aim in this problem is to recover the signal passed through an inhomogeneous low-rank matrix channel. While the information-theoretic performances are well-known, we focus on the algorithmic problem. We derive an approximate message-passing algorithm (AMP) for the inhomogeneous problem and show that its rigorous state evolution coincides with the information-theoretic optimal Bayes fixed-point equations. We identify in particular the existence of a statistical-to-computational gap where known algorithms require a signal-to-noise ratio bigger than the information-theoretic threshold to perform better than random. Finally, from the adapted AMP iteration we deduce a simple and efficient spectral method that can be used to recover the transition for matrices with general variance profiles. This spectral method matches the conjectured optimal computational phase transition.  ( 2 min )
    Symbolic Discovery of Optimization Algorithms. (arXiv:2302.06675v1 [cs.LG])
    We present a method to formulate algorithm discovery as program search, and apply it to discover optimization algorithms for deep neural network training. We leverage efficient search techniques to explore an infinite and sparse program space. To bridge the large generalization gap between proxy and target tasks, we also introduce program selection and simplification strategies. Our method discovers a simple and effective optimization algorithm, $\textbf{Lion}$ ($\textit{Evo$\textbf{L}$ved S$\textbf{i}$gn M$\textbf{o}$me$\textbf{n}$tum}$). It is more memory-efficient than Adam as it only keeps track of the momentum. Different from adaptive optimizers, its update has the same magnitude for each parameter calculated through the sign operation. We compare Lion with widely used optimizers, such as Adam and Adafactor, for training a variety of models on different tasks. On image classification, Lion boosts the accuracy of ViT by up to 2% on ImageNet and saves up to 5x the pre-training compute on JFT. On vision-language contrastive learning, we achieve 88.3% $\textit{zero-shot}$ and 91.1% $\textit{fine-tuning}$ accuracy on ImageNet, surpassing the previous best results by 2% and 0.1%, respectively. On diffusion models, Lion outperforms Adam by achieving a better FID score and reducing the training compute by up to 2.3x. For autoregressive, masked language modeling, and fine-tuning, Lion exhibits a similar or better performance compared to Adam. Our analysis of Lion reveals that its performance gain grows with the training batch size. It also requires a smaller learning rate than Adam due to the larger norm of the update produced by the sign function. Additionally, we examine the limitations of Lion and identify scenarios where its improvements are small or not statistically significant. The implementation of Lion is publicly available.  ( 2 min )
    Communication-Efficient Federated Bilevel Optimization with Local and Global Lower Level Problems. (arXiv:2302.06701v1 [cs.LG])
    Bilevel Optimization has witnessed notable progress recently with new emerging efficient algorithms, yet it is underexplored in the Federated Learning setting. It is unclear how the challenges of Federated Learning affect the convergence of bilevel algorithms. In this work, we study Federated Bilevel Optimization problems. We first propose the FedBiO algorithm that solves the hyper-gradient estimation problem efficiently, then we propose FedBiOAcc to accelerate FedBiO. FedBiO has communication complexity $O(\epsilon^{-1.5})$ with linear speed up, while FedBiOAcc achieves communication complexity $O(\epsilon^{-1})$, sample complexity $O(\epsilon^{-1.5})$ and also the linear speed up. We also study Federated Bilevel Optimization problems with local lower level problems, and prove that FedBiO and FedBiOAcc converges at the same rate with some modification.  ( 2 min )
    Multi-Carrier NOMA-Empowered Wireless Federated Learning with Optimal Power and Bandwidth Allocation. (arXiv:2302.06730v1 [cs.IT])
    Wireless federated learning (WFL) undergoes a communication bottleneck in uplink, limiting the number of users that can upload their local models in each global aggregation round. This paper presents a new multi-carrier non-orthogonal multiple-access (MC-NOMA)-empowered WFL system under an adaptive learning setting of Flexible Aggregation. Since a WFL round accommodates both local model training and uploading for each user, the use of Flexible Aggregation allows the users to train different numbers of iterations per round, adapting to their channel conditions and computing resources. The key idea is to use MC-NOMA to concurrently upload the local models of the users, thereby extending the local model training times of the users and increasing participating users. A new metric, namely, Weighted Global Proportion of Trained Mini-batches (WGPTM), is analytically established to measure the convergence of the new system. Another important aspect is that we maximize the WGPTM to harness the convergence of the new system by jointly optimizing the transmit powers and subchannel bandwidths. This nonconvex problem is converted equivalently to a tractable convex problem and solved efficiently using variable substitution and Cauchy's inequality. As corroborated experimentally using a convolutional neural network and an 18-layer residential network, the proposed MC-NOMA WFL can efficiently reduce communication delay, increase local model training times, and accelerate the convergence by over 40%, compared to its existing alternative.  ( 2 min )
    Provable Detection of Propagating Sampling Bias in Prediction Models. (arXiv:2302.06752v1 [cs.LG])
    With an increased focus on incorporating fairness in machine learning models, it becomes imperative not only to assess and mitigate bias at each stage of the machine learning pipeline but also to understand the downstream impacts of bias across stages. Here we consider a general, but realistic, scenario in which a predictive model is learned from (potentially biased) training data, and model predictions are assessed post-hoc for fairness by some auditing method. We provide a theoretical analysis of how a specific form of data bias, differential sampling bias, propagates from the data stage to the prediction stage. Unlike prior work, we evaluate the downstream impacts of data biases quantitatively rather than qualitatively and prove theoretical guarantees for detection. Under reasonable assumptions, we quantify how the amount of bias in the model predictions varies as a function of the amount of differential sampling bias in the data, and at what point this bias becomes provably detectable by the auditor. Through experiments on two criminal justice datasets -- the well-known COMPAS dataset and historical data from NYPD's stop and frisk policy -- we demonstrate that the theoretical results hold in practice even when our assumptions are relaxed.  ( 2 min )
    Bag of Tricks for In-Distribution Calibration of Pretrained Transformers. (arXiv:2302.06690v1 [cs.CL])
    While pre-trained language models (PLMs) have become a de-facto standard promoting the accuracy of text classification tasks, recent studies find that PLMs often predict over-confidently. Although various calibration methods have been proposed, such as ensemble learning and data augmentation, most of the methods have been verified in computer vision benchmarks rather than in PLM-based text classification tasks. In this paper, we present an empirical study on confidence calibration for PLMs, addressing three categories, including confidence penalty losses, data augmentations, and ensemble methods. We find that the ensemble model overfitted to the training set shows sub-par calibration performance and also observe that PLMs trained with confidence penalty loss have a trade-off between calibration and accuracy. Building on these observations, we propose the Calibrated PLM (CALL), a combination of calibration techniques. The CALL complements the drawbacks that may occur when utilizing a calibration method individually and boosts both classification and calibration accuracy. Design choices in CALL's training procedures are extensively studied, and we provide a detailed analysis of how calibration techniques affect the calibration performance of PLMs.  ( 2 min )
    PerAda: Parameter-Efficient and Generalizable Federated Learning Personalization with Guarantees. (arXiv:2302.06637v1 [cs.LG])
    Personalized Federated Learning (pFL) has emerged as a promising solution to tackle data heterogeneity across clients in FL. However, existing pFL methods either (1) introduce high communication and computation costs or (2) overfit to local data, which can be limited in scope, and are vulnerable to evolved test samples with natural shifts. In this paper, we propose PerAda, a parameter-efficient pFL framework that reduces communication and computational costs and exhibits superior generalization performance, especially under test-time distribution shifts. PerAda reduces the costs by leveraging the power of pretrained models and only updates and communicates a small number of additional parameters from adapters. PerAda has good generalization since it regularizes each client's personalized adapter with a global adapter, while the global adapter uses knowledge distillation to aggregate generalized information from all clients. Theoretically, we provide generalization bounds to explain why PerAda improves generalization, and we prove its convergence to stationary points under non-convex settings. Empirically, PerAda demonstrates competitive personalized performance (+4.85% on CheXpert) and enables better out-of-distribution generalization (+5.23% on CIFAR-10-C) on different datasets across natural and medical domains compared with baselines, while only updating 12.6% of parameters per model based on the adapter.  ( 2 min )
    Dataset Distillation with Convexified Implicit Gradients. (arXiv:2302.06755v1 [cs.LG])
    We propose a new dataset distillation algorithm using reparameterization and convexification of implicit gradients (RCIG), that substantially improves the state-of-the-art. To this end, we first formulate dataset distillation as a bi-level optimization problem. Then, we show how implicit gradients can be effectively used to compute meta-gradient updates. We further equip the algorithm with a convexified approximation that corresponds to learning on top of a frozen finite-width neural tangent kernel. Finally, we improve bias in implicit gradients by parameterizing the neural network to enable analytical computation of final-layer parameters given the body parameters. RCIG establishes the new state-of-the-art on a diverse series of dataset distillation tasks. Notably, with one image per class, on resized ImageNet, RCIG sees on average a 108% improvement over the previous state-of-the-art distillation algorithm. Similarly, we observed a 66% gain over SOTA on Tiny-ImageNet and 37% on CIFAR-100.  ( 2 min )
    Machine Learning Model Attribution Challenge. (arXiv:2302.06716v1 [cs.LG])
    We present the findings of the Machine Learning Model Attribution Challenge (\href{https://mlmac.io}{https://mlmac.io}). Fine-tuned machine learning models may derive from other trained models without obvious attribution characteristics. In this challenge, participants identify the publicly-available base models that underlie a set of anonymous, fine-tuned large language models (LLMs) using only textual output of the models. Contestants aim to correctly attribute the most fine-tuned models, with ties broken in the favor of contestants whose solutions use fewer calls to the fine-tuned models' API. The most successful approaches were manual, as participants observed similarities between model outputs and developed attribution heuristics based on public documentation of the base models, though several teams also submitted automated, statistical solutions.  ( 2 min )
    Towards Explainable Visual Anomaly Detection. (arXiv:2302.06670v1 [cs.LG])
    Anomaly detection and localization of visual data, including images and videos, are of great significance in both machine learning academia and applied real-world scenarios. Despite the rapid development of visual anomaly detection techniques in recent years, the interpretations of these black-box models and reasonable explanations of why anomalies can be distinguished out are scarce. This paper provides the first survey concentrated on explainable visual anomaly detection methods. We first introduce the basic background of image-level anomaly detection and video-level anomaly detection, followed by the current explainable approaches for visual anomaly detection. Then, as the main content of this survey, a comprehensive and exhaustive literature review of explainable anomaly detection methods for both images and videos is presented. Finally, we discuss several promising future directions and open problems to explore on the explainability of visual anomaly detection.  ( 2 min )
  • Open

    Where to Diffuse, How to Diffuse, and How to Get Back: Automated Learning for Multivariate Diffusions. (arXiv:2302.07261v1 [cs.LG])
    Diffusion-based generative models (DBGMs) perturb data to a target noise distribution and reverse this inference diffusion process to generate samples. The choice of inference diffusion affects both likelihoods and sample quality. For example, extending the inference process with auxiliary variables leads to improved sample quality. While there are many such multivariate diffusions to explore, each new one requires significant model-specific analysis, hindering rapid prototyping and evaluation. In this work, we study Multivariate Diffusion Models (MDMs). For any number of auxiliary variables, we provide a recipe for maximizing a lower-bound on the MDMs likelihood without requiring any model-specific analysis. We then demonstrate how to parameterize the diffusion for a specified target noise distribution; these two points together enable optimizing the inference diffusion process. Optimizing the diffusion expands easy experimentation from just a few well-known processes to an automatic search over all linear diffusions. To demonstrate these ideas, we introduce two new specific diffusions as well as learn a diffusion process on the MNIST, CIFAR10, and ImageNet32 datasets. We show learned MDMs match or surpass bits-per-dims (BPDs) relative to fixed choices of diffusions for a given dataset and model architecture.  ( 2 min )
    Neurosymbolic AI for Reasoning on Graph Structures: A Survey. (arXiv:2302.07200v1 [cs.AI])
    Neurosymbolic AI is an increasingly active area of research which aims to combine symbolic reasoning methods with deep learning to generate models with both high predictive performance and some degree of human-level comprehensibility. As knowledge graphs are becoming a popular way to represent heterogeneous and multi-relational data, methods for reasoning on graph structures have attempted to follow this neurosymbolic paradigm. Traditionally, such approaches have utilized either rule-based inference or generated representative numerical embeddings from which patterns could be extracted. However, several recent studies have attempted to bridge this dichotomy in ways that facilitate interpretability, maintain performance, and integrate expert knowledge. Within this article, we survey a breadth of methods that perform neurosymbolic reasoning tasks on graph structures. To better compare the various methods, we propose a novel taxonomy by which we can classify them. Specifically, we propose three major categories: (1) logically-informed embedding approaches, (2) embedding approaches with logical constraints, and (3) rule-learning approaches. Alongside the taxonomy, we provide a tabular overview of the approaches and links to their source code, if available, for more direct comparison. Finally, we discuss the applications on which these methods were primarily used and propose several prospective directions toward which this new field of research could evolve.  ( 2 min )
    A Sparse Graph-Structured Lasso Mixed Model for Genetic Association with Confounding Correction. (arXiv:1711.04162v2 [cs.LG] UPDATED)
    While linear mixed model (LMM) has shown a competitive performance in correcting spurious associations raised by population stratification, family structures, and cryptic relatedness, more challenges are still to be addressed regarding the complex structure of genotypic and phenotypic data. For example, geneticists have discovered that some clusters of phenotypes are more co-expressed than others. Hence, a joint analysis that can utilize such relatedness information in a heterogeneous data set is crucial for genetic modeling. We proposed the sparse graph-structured linear mixed model (sGLMM) that can incorporate the relatedness information from traits in a dataset with confounding correction. Our method is capable of uncovering the genetic associations of a large number of phenotypes together while considering the relatedness of these phenotypes. Through extensive simulation experiments, we show that the proposed model outperforms other existing approaches and can model correlation from both population structure and shared signals. Further, we validate the effectiveness of sGLMM in the real-world genomic dataset on two different species from plants and humans. In Arabidopsis thaliana data, sGLMM behaves better than all other baseline models for 63.4% traits. We also discuss the potential causal genetic variation of Human Alzheimer's disease discovered by our model and justify some of the most important genetic loci.  ( 2 min )
    Resampling Sensitivity of High-Dimensional PCA. (arXiv:2212.14531v2 [math.ST] UPDATED)
    The study of stability and sensitivity of statistical methods or algorithms with respect to their data is an important problem in machine learning and statistics. The performance of the algorithm under resampling of the data is a fundamental way to measure its stability and is closely related to generalization or privacy of the algorithm. In this paper, we study the resampling sensitivity for the principal component analysis (PCA). Given an $ n \times p $ random matrix $ \mathbf{X} $, let $ \mathbf{X}^{[k]} $ be the matrix obtained from $ \mathbf{X} $ by resampling $ k $ randomly chosen entries of $ \mathbf{X} $. Let $ \mathbf{v} $ and $ \mathbf{v}^{[k]} $ denote the principal components of $ \mathbf{X} $ and $ \mathbf{X}^{[k]} $. In the proportional growth regime $ p/n \to \xi \in (0,1] $, we establish the sharp threshold for the sensitivity/stability transition of PCA. When $ k \gg n^{5/3} $, the principal components $ \mathbf{v} $ and $ \mathbf{v}^{[k]} $ are asymptotically orthogonal. On the other hand, when $ k \ll n^{5/3} $, the principal components $ \mathbf{v} $ and $ \mathbf{v}^{[k]} $ are asymptotically colinear. In words, we show that PCA is sensitive to the input data in the sense that resampling even a negligible portion of the input may completely change the output.  ( 2 min )
    Scalable Bayesian optimization with high-dimensional outputs using randomized prior networks. (arXiv:2302.07260v1 [cs.LG])
    Several fundamental problems in science and engineering consist of global optimization tasks involving unknown high-dimensional (black-box) functions that map a set of controllable variables to the outcomes of an expensive experiment. Bayesian Optimization (BO) techniques are known to be effective in tackling global optimization problems using a relatively small number objective function evaluations, but their performance suffers when dealing with high-dimensional outputs. To overcome the major challenge of dimensionality, here we propose a deep learning framework for BO and sequential decision making based on bootstrapped ensembles of neural architectures with randomized priors. Using appropriate architecture choices, we show that the proposed framework can approximate functional relationships between design variables and quantities of interest, even in cases where the latter take values in high-dimensional vector spaces or even infinite-dimensional function spaces. In the context of BO, we augmented the proposed probabilistic surrogates with re-parameterized Monte Carlo approximations of multiple-point (parallel) acquisition functions, as well as methodological extensions for accommodating black-box constraints and multi-fidelity information sources. We test the proposed framework against state-of-the-art methods for BO and demonstrate superior performance across several challenging tasks with high-dimensional outputs, including a constrained optimization task involving shape optimization of rotor blades in turbo-machinery.  ( 2 min )
    Online Detection of Changes in Moment-Based Projections: When to Retrain Deep Learners or Update Portfolios?. (arXiv:2302.07198v1 [math.ST])
    Sequential monitoring of high-dimensional nonlinear time series is studied for a projection of the second-moment matrix, a problem interesting in its own right and specifically arising in finance and deep learning. Open-end as well as closed-end monitoring is studied under mild assumptions on the training sample and the observations of the monitoring period. Asymptotics is based on Gaussian approximations of projected partial sums allowing for an estimated projection vector. Estimation is studied both for classical non-$\ell_0$-sparsity as well as under sparsity. For the case that the optimal projection depends on the unknown covariance matrix, hard- and soft-thresholded estimators are studied. Applications in finance and training of deep neural networks are discussed. The proposed detectors typically allow to reduce dramatically the required computational costs as illustrated by monitoring synthetic data.  ( 2 min )
    Linear Causal Disentanglement via Interventions. (arXiv:2211.16467v2 [stat.ML] UPDATED)
    Causal disentanglement seeks a representation of data involving latent variables that relate to one another via a causal model. A representation is identifiable if both the latent model and the transformation from latent to observed variables are unique. In this paper, we study observed variables that are a linear transformation of a linear latent causal model. Data from interventions are necessary for identifiability: if one latent variable is missing an intervention, we show that there exist distinct models that cannot be distinguished. Conversely, we show that a single intervention on each latent variable is sufficient for identifiability. Our proof uses a generalization of the RQ decomposition of a matrix that replaces the usual orthogonal and upper triangular conditions with analogues depending on a partial order on the rows of the matrix, with partial order determined by a latent causal model. We corroborate our theoretical results with a method for causal disentanglement that accurately recovers a latent causal model.  ( 2 min )
    The Role of ImageNet Classes in Fr\'echet Inception Distance. (arXiv:2203.06026v3 [cs.CV] UPDATED)
    Fr\'echet Inception Distance (FID) is the primary metric for ranking models in data-driven generative modeling. While remarkably successful, the metric is known to sometimes disagree with human judgement. We investigate a root cause of these discrepancies, and visualize what FID "looks at" in generated images. We show that the feature space that FID is (typically) computed in is so close to the ImageNet classifications that aligning the histograms of Top-$N$ classifications between sets of generated and real images can reduce FID substantially -- without actually improving the quality of results. Thus, we conclude that FID is prone to intentional or accidental distortions. As a practical example of an accidental distortion, we discuss a case where an ImageNet pre-trained FastGAN achieves a FID comparable to StyleGAN2, while being worse in terms of human evaluation.  ( 2 min )
    Near-optimal learning with average H\"older smoothness. (arXiv:2302.06005v1 [cs.LG] CROSS LISTED)
    We generalize the notion of average Lipschitz smoothness proposed by Ashlagi et al. (COLT 2021) by extending it to H\"older smoothness. This measure of the ``effective smoothness'' of a function is sensitive to the underlying distribution and can be dramatically smaller than its classic ``worst-case'' H\"older constant. We prove nearly tight upper and lower risk bounds in terms of the average H\"older smoothness, establishing the minimax rate in the realizable regression setting up to log factors; this was not previously known even in the special case of average Lipschitz smoothness. From an algorithmic perspective, since our notion of average smoothness is defined with respect to the unknown sampling distribution, the learner does not have an explicit representation of the function class, hence is unable to execute ERM. Nevertheless, we provide a learning algorithm that achieves the (nearly) optimal learning rate. Our results hold in any totally bounded metric space, and are stated in terms of its intrinsic geometry. Overall, our results show that the classic worst-case notion of H\"older smoothness can be essentially replaced by its average, yielding considerably sharper guarantees.  ( 2 min )
    Fair Densities via Boosting the Sufficient Statistics of Exponential Families. (arXiv:2012.00188v3 [stat.ML] UPDATED)
    We introduce a boosting algorithm to pre-process data for fairness. Starting from an initial fair but inaccurate distribution, our approach shifts towards better data fitting while still ensuring a minimal fairness guarantee. To do so, it learns the sufficient statistics of an exponential family with boosting-compliant convergence. Importantly, we are able to theoretically prove that the learned distribution will have a representation rate and statistical rate data fairness guarantee. Unlike recent optimization based pre-processing methods, our approach can be easily adapted for continuous domain features. Furthermore, when the weak learners are specified to be decision trees, the sufficient statistics of the learned distribution can be examined to provide clues on sources of (un)fairness. Empirical results are present to display the quality of result on real-world data.  ( 2 min )
    Online Learning of Energy Consumption for Navigation of Electric Vehicles. (arXiv:2111.02314v2 [cs.LG] UPDATED)
    Energy efficient navigation constitutes an important challenge in electric vehicles, due to their limited battery capacity. We employ a Bayesian approach to model the energy consumption at road segments for efficient navigation. In order to learn the model parameters, we develop an online learning framework and investigate several exploration strategies such as Thompson Sampling and Upper Confidence Bound. We then extend our online learning framework to the multi-agent setting, where multiple vehicles adaptively navigate and learn the parameters of the energy model. We analyze Thompson Sampling and establish rigorous regret bounds on its performance in the single-agent and multi-agent settings, through an analysis of the algorithm under batched feedback. Finally, we demonstrate the performance of our methods via experiments on several real-world city road networks.  ( 2 min )
    Forget Unlearning: Towards True Data-Deletion in Machine Learning. (arXiv:2210.08911v2 [stat.ML] UPDATED)
    Unlearning algorithms aim to remove deleted data's influence from trained models at a cost lower than full retraining. However, prior guarantees of unlearning in literature are flawed and don't protect the privacy of deleted records. We show that when users delete their data as a function of published models, records in a database become interdependent. So, even retraining a fresh model after deletion of a record doesn't ensure its privacy. Secondly, unlearning algorithms that cache partial computations to speed up the processing can leak deleted information over a series of releases, violating the privacy of deleted records in the long run. To address these, we propose a sound deletion guarantee and show that the privacy of existing records is necessary for the privacy of deleted records. Under this notion, we propose an accurate, computationally efficient, and secure machine unlearning algorithm based on noisy gradient descent.  ( 2 min )
    Interpolation Learning With Minimum Description Length. (arXiv:2302.07263v1 [cs.LG])
    We prove that the Minimum Description Length learning rule exhibits tempered overfitting. We obtain tempered agnostic finite sample learning guarantees and characterize the asymptotic behavior in the presence of random label noise.  ( 2 min )
    Condition-number-independent convergence rate of Riemannian Hamiltonian Monte Carlo with numerical integrators. (arXiv:2210.07219v2 [cs.DS] CROSS LISTED)
    We study the convergence rate of discretized Riemannian Hamiltonian Monte Carlo on sampling from distributions in the form of $e^{-f(x)}$ on a convex body $\mathcal{M}\subset\mathbb{R}^{n}$. We show that for distributions in the form of $e^{-\alpha^{\top}x}$ on a polytope with $m$ constraints, the convergence rate of a family of commonly-used integrators is independent of $\left\Vert \alpha\right\Vert _{2}$ and the geometry of the polytope. In particular, the implicit midpoint method (IMM) and the generalized Leapfrog method (LM) have a mixing time of $\widetilde{O}\left(mn^{3}\right)$ to achieve $\epsilon$ total variation distance to the target distribution. These guarantees are based on a general bound on the convergence rate for densities of the form $e^{-f(x)}$ in terms of parameters of the manifold and the integrator. Our theoretical guarantee complements the empirical results of [KLSV22], which shows that RHMC with IMM can sample ill-conditioned, non-smooth and constrained distributions in very high dimension efficiently in practice.  ( 2 min )
    Plateau in Monotonic Linear Interpolation -- A "Biased" View of Loss Landscape for Deep Networks. (arXiv:2210.01019v2 [stat.ML] UPDATED)
    Monotonic linear interpolation (MLI) - on the line connecting a random initialization with the minimizer it converges to, the loss and accuracy are monotonic - is a phenomenon that is commonly observed in the training of neural networks. Such a phenomenon may seem to suggest that optimization of neural networks is easy. In this paper, we show that the MLI property is not necessarily related to the hardness of optimization problems, and empirical observations on MLI for deep neural networks depend heavily on biases. In particular, we show that interpolating both weights and biases linearly leads to very different influences on the final output, and when different classes have different last-layer biases on a deep network, there will be a long plateau in both the loss and accuracy interpolation (which existing theory of MLI cannot explain). We also show how the last-layer biases for different classes can be different even on a perfectly balanced dataset using a simple model. Empirically we demonstrate that similar intuitions hold on practical networks and realistic datasets.  ( 2 min )
    Transport map unadjusted Langevin algorithms. (arXiv:2302.07227v1 [stat.ME])
    Langevin dynamics are widely used in sampling high-dimensional, non-Gaussian distributions whose densities are known up to a normalizing constant. In particular, there is strong interest in unadjusted Langevin algorithms (ULA), which directly discretize Langevin dynamics to estimate expectations over the target distribution. We study the use of transport maps that approximately normalize a target distribution as a way to precondition and accelerate the convergence of Langevin dynamics. In particular, we show that in continuous time, when a transport map is applied to Langevin dynamics, the result is a Riemannian manifold Langevin dynamics (RMLD) with metric defined by the transport map. This connection suggests more systematic ways of learning metrics, and also yields alternative discretizations of the RMLD described by the map, which we study. Moreover, we show that under certain conditions, when the transport map is used in conjunction with ULA, we can improve the geometric rate of convergence of the output process in the $2$--Wasserstein distance. Illustrative numerical results complement our theoretical claims.  ( 2 min )
    SOAR: Simultaneous Or of And Rules for Classification of Positive & Negative Classes. (arXiv:2008.11249v3 [stat.ML] UPDATED)
    Algorithmic decision making has proliferated and now impacts our daily lives in both mundane and consequential ways. Machine learning practitioners make use of a myriad of algorithms for predictive models in applications as diverse as movie recommendations, medical diagnoses, and parole recommendations without delving into the reasons driving specific predictive decisions. Machine learning algorithms in such applications are often chosen for their superior performance, however popular choices such as random forest and deep neural networks fail to provide an interpretable understanding of the predictive model. In recent years, rule-based algorithms have been used to address this issue. Wang et al. (2017) presented an or-of-and (disjunctive normal form) based classification technique that allows for classification rule mining of a single class in a binary classification; this method is also shown to perform comparably to other modern algorithms. In this work, we extend this idea to provide classification rules for both classes simultaneously. That is, we provide a distinct set of rules for both positive and negative classes. In describing this approach, we also present a novel and complete taxonomy of classifications that clearly capture and quantify the inherent ambiguity in noisy binary classifications in the real world. We show that this approach leads to a more granular formulation of the likelihood model and a simulated-annealing based optimization achieves classification performance competitive with comparable techniques. We apply our method to synthetic as well as real world data sets to compare with other related methods that demonstrate the utility of our proposal.  ( 2 min )
    Random graph matching at Otter's threshold via counting chandeliers. (arXiv:2209.12313v2 [cs.DS] UPDATED)
    We propose an efficient algorithm for graph matching based on similarity scores constructed from counting a certain family of weighted trees rooted at each vertex. For two Erd\H{o}s-R\'enyi graphs $\mathcal{G}(n,q)$ whose edges are correlated through a latent vertex correspondence, we show that this algorithm correctly matches all but a vanishing fraction of the vertices with high probability, provided that $nq\to\infty$ and the edge correlation coefficient $\rho$ satisfies $\rho^2>\alpha \approx 0.338$, where $\alpha$ is Otter's tree-counting constant. Moreover, this almost exact matching can be made exact under an extra condition that is information-theoretically necessary. This is the first polynomial-time graph matching algorithm that succeeds at an explicit constant correlation and applies to both sparse and dense graphs. In comparison, previous methods either require $\rho=1-o(1)$ or are restricted to sparse graphs. The crux of the algorithm is a carefully curated family of rooted trees called chandeliers, which allows effective extraction of the graph correlation from the counts of the same tree while suppressing the undesirable correlation between those of different trees.  ( 2 min )
    Solution Path Algorithm for Twin Multi-class Support Vector Machine. (arXiv:2006.00276v2 [cs.LG] UPDATED)
    The twin support vector machine and its extensions have made great achievements in dealing with binary classification problems. However, it suffers from difficulties in effective solution of multi-classification and fast model selection. This work devotes to the fast regularization parameter tuning algorithm for the twin multi-class support vector machine. Specifically, a novel sample data set partition strategy is first adopted, which is the basis for the model construction. Then, combining the linear equations and block matrix theory, the Lagrangian multipliers are proved to be piecewise linear w.r.t. the regularization parameters, so that the regularization parameters are continuously updated by only solving the break points. Next, Lagrangian multipliers are proved to be 1 as the regularization parameter approaches infinity, thus, a simple yet effective initialization algorithm is devised. Finally, eight kinds of events are defined to seek for the starting event for the next iteration. Extensive experimental results on nine UCI data sets show that the proposed method can achieve comparable classification performance without solving any quadratic programming problem.  ( 2 min )
    Commutativity and Disentanglement from the Manifold Perspective. (arXiv:2210.07857v3 [stat.ML] UPDATED)
    In this paper, we interpret disentanglement as the discovery of local charts and trace how that definition naturally leads to an equivalent condition for disentanglement: the disentangled factors must commute with each other. We discuss the practical and theoretical implications of commutativity, in particular the compression and disentanglement of generative models. Finally, we conclude with a discussion of related approaches to disentanglement and how they relate to our view of disentanglement from the manifold perspective.  ( 2 min )
    Graph Embeddings via Tensor Products and Approximately Orthonormal Codes. (arXiv:2208.10917v3 [cs.SI] UPDATED)
    We introduce a method for embedding graphs as vectors in a structure-preserving manner, showcasing its rich representational capacity and giving some theoretical properties. Our procedure falls under the bind-and-sum approach, and we show that our binding operation - the tensor product - is the most general binding operation that respects the principle of superposition. We also establish some precise results characterizing the behavior of our method, and we show that our use of spherical codes achieves a packing upper bound. Then, we perform experiments showcasing our method's accuracy in various graph operations even when the number of edges is quite large. Finally, we establish a link to adjacency matrices, showing that our method is, in some sense, a generalization of adjacency matrices with applications towards large sparse graphs.  ( 2 min )
    Online Learning of Network Bottlenecks via Minimax Paths. (arXiv:2109.08467v3 [cs.LG] UPDATED)
    In this paper, we study bottleneck identification in networks via extracting minimax paths. Many real-world networks have stochastic weights for which full knowledge is not available in advance. Therefore, we model this task as a combinatorial semi-bandit problem to which we apply a combinatorial version of Thompson Sampling and establish an upper bound on the corresponding Bayesian regret. Due to the computational intractability of the problem, we then devise an alternative problem formulation which approximates the original objective. Finally, we experimentally evaluate the performance of Thompson Sampling with the approximate formulation on real-world directed and undirected networks.  ( 2 min )
    Data pruning and neural scaling laws: fundamental limitations of score-based algorithms. (arXiv:2302.06960v1 [stat.ML])
    Data pruning algorithms are commonly used to reduce the memory and computational cost of the optimization process. Recent empirical results reveal that random data pruning remains a strong baseline and outperforms most existing data pruning methods in the high compression regime, i.e., where a fraction of $30\%$ or less of the data is kept. This regime has recently attracted a lot of interest as a result of the role of data pruning in improving the so-called neural scaling laws; in [Sorscher et al.], the authors showed the need for high-quality data pruning algorithms in order to beat the sample power law. In this work, we focus on score-based data pruning algorithms and show theoretically and empirically why such algorithms fail in the high compression regime. We demonstrate ``No Free Lunch" theorems for data pruning and present calibration protocols that enhance the performance of existing pruning algorithms in this high compression regime using randomization.  ( 2 min )
    Score Approximation, Estimation and Distribution Recovery of Diffusion Models on Low-Dimensional Data. (arXiv:2302.07194v1 [cs.LG])
    Diffusion models achieve state-of-the-art performance in various generation tasks. However, their theoretical foundations fall far behind. This paper studies score approximation, estimation, and distribution recovery of diffusion models, when data are supported on an unknown low-dimensional linear subspace. Our result provides sample complexity bounds for distribution estimation using diffusion models. We show that with a properly chosen neural network architecture, the score function can be both accurately approximated and efficiently estimated. Furthermore, the generated distribution based on the estimated score function captures the data geometric structures and converges to a close vicinity of the data distribution. The convergence rate depends on the subspace dimension, indicating that diffusion models can circumvent the curse of data ambient dimensionality.  ( 2 min )
    Stochastic Modified Flows, Mean-Field Limits and Dynamics of Stochastic Gradient Descent. (arXiv:2302.07125v1 [math.PR])
    We propose new limiting dynamics for stochastic gradient descent in the small learning rate regime called stochastic modified flows. These SDEs are driven by a cylindrical Brownian motion and improve the so-called stochastic modified equations by having regular diffusion coefficients and by matching the multi-point statistics. As a second contribution, we introduce distribution dependent stochastic modified flows which we prove to describe the fluctuating limiting dynamics of stochastic gradient descent in the small learning rate - infinite width scaling regime.  ( 2 min )
    When Mitigating Bias is Unfair: A Comprehensive Study on the Impact of Bias Mitigation Algorithms. (arXiv:2302.07185v1 [cs.LG])
    Most works on the fairness of machine learning systems focus on the blind optimization of common fairness metrics, such as Demographic Parity and Equalized Odds. In this paper, we conduct a comparative study of several bias mitigation approaches to investigate their behaviors at a fine grain, the prediction level. Our objective is to characterize the differences between fair models obtained with different approaches. With comparable performances in fairness and accuracy, are the different bias mitigation approaches impacting a similar number of individuals? Do they mitigate bias in a similar way? Do they affect the same individuals when debiasing a model? Our findings show that bias mitigation approaches differ a lot in their strategies, both in the number of impacted individuals and the populations targeted. More surprisingly, we show these results even apply for several runs of the same mitigation approach. These findings raise questions about the limitations of the current group fairness metrics, as well as the arbitrariness, hence unfairness, of the whole debiasing process.  ( 2 min )
    Non-stationary Contextual Bandits and Universal Learning. (arXiv:2302.07186v1 [stat.ML])
    We study the fundamental limits of learning in contextual bandits, where a learner's rewards depend on their actions and a known context, which extends the canonical multi-armed bandit to the case where side-information is available. We are interested in universally consistent algorithms, which achieve sublinear regret compared to any measurable fixed policy, without any function class restriction. For stationary contextual bandits, when the underlying reward mechanism is time-invariant, [Blanchard et al.] characterized learnable context processes for which universal consistency is achievable; and further gave algorithms ensuring universal consistency whenever this is achievable, a property known as optimistic universal consistency. It is well understood, however, that reward mechanisms can evolve over time, possibly depending on the learner's actions. We show that optimistic universal learning for non-stationary contextual bandits is impossible in general, contrary to all previously studied settings in online learning -- including standard supervised learning. We also give necessary and sufficient conditions for universal learning under various non-stationarity models, including online and adversarial reward mechanisms. In particular, the set of learnable processes for non-stationary rewards is still extremely general -- larger than i.i.d., stationary or ergodic -- but in general strictly smaller than that for supervised learning or stationary contextual bandits, shedding light on new non-stationary phenomena.  ( 2 min )
    Concentration Bounds for Discrete Distribution Estimation in KL Divergence. (arXiv:2302.06869v1 [stat.ML])
    We study the problem of discrete distribution estimation in KL divergence and provide concentration bounds for the Laplace estimator. We show that the deviation from mean scales as $\sqrt{k}/n$ when $n \ge k$, improving upon the best prior result of $k/n$. We also establish a matching lower bound that shows that our bounds are tight up to polylogarithmic factors.  ( 2 min )
    Statistically Optimal Force Aggregation for Coarse-Graining Molecular Dynamics. (arXiv:2302.07071v1 [physics.chem-ph])
    Machine-learned coarse-grained (CG) models have the potential for simulating large molecular complexes beyond what is possible with atomistic molecular dynamics. However, training accurate CG models remains a challenge. A widely used methodology for learning CG force-fields maps forces from all-atom molecular dynamics to the CG representation and matches them with a CG force-field on average. We show that there is flexibility in how to map all-atom forces to the CG representation, and that the most commonly used mapping methods are statistically inefficient and potentially even incorrect in the presence of constraints in the all-atom simulation. We define an optimization statement for force mappings and demonstrate that substantially improved CG force-fields can be learned from the same simulation data when using optimized force maps. The method is demonstrated on the miniproteins Chignolin and Tryptophan Cage and published as open-source code.  ( 2 min )
    Horocycle Decision Boundaries for Large Margin Classification in Hyperbolic Space. (arXiv:2302.06807v1 [stat.ML])
    Hyperbolic spaces have been quite popular in the recent past for representing hierarchically organized data. Further, several classification algorithms for data in these spaces have been proposed in the literature. These algorithms mainly use either hyperplanes or geodesics for decision boundaries in a large margin classifiers setting leading to a non-convex optimization problem. In this paper, we propose a novel large margin classifier based on horocycle (horosphere) decision boundaries that leads to a geodesically convex optimization problem that can be optimized using any Riemannian gradient descent technique guaranteeing a globally optimal solution. We present several experiments depicting the performance of our classifier.  ( 2 min )
    Private Statistical Estimation of Many Quantiles. (arXiv:2302.06943v1 [stat.ML])
    This work studies the estimation of many statistical quantiles under differential privacy. More precisely, given a distribution and access to i.i.d. samples from it, we study the estimation of the inverse of its cumulative distribution function (the quantile function) at specific points. For instance, this task is of key importance in private data generation. We present two different approaches. The first one consists in privately estimating the empirical quantiles of the samples and using this result as an estimator of the quantiles of the distribution. In particular, we study the statistical properties of the recently published algorithm introduced by Kaplan et al. 2022 that privately estimates the quantiles recursively. The second approach is to use techniques of density estimation in order to uniformly estimate the quantile function on an interval. In particular, we show that there is a tradeoff between the two methods. When we want to estimate many quantiles, it is better to estimate the density rather than estimating the quantile function at specific points.  ( 2 min )
    Breaking the Lower Bound with (Little) Structure: Acceleration in Non-Convex Stochastic Optimization with Heavy-Tailed Noise. (arXiv:2302.06763v1 [cs.LG])
    We consider the stochastic optimization problem with smooth but not necessarily convex objectives in the heavy-tailed noise regime, where the stochastic gradient's noise is assumed to have bounded $p$th moment ($p\in(1,2]$). Zhang et al. (2020) is the first to prove the $\Omega(T^{\frac{1-p}{3p-2}})$ lower bound for convergence (in expectation) and provides a simple clipping algorithm that matches this optimal rate. Cutkosky and Mehta (2021) proposes another algorithm, which is shown to achieve the nearly optimal high-probability convergence guarantee $O(\log(T/\delta)T^{\frac{1-p}{3p-2}})$, where $\delta$ is the probability of failure. However, this desirable guarantee is only established under the additional assumption that the stochastic gradient itself is bounded in $p$th moment, which fails to hold even for quadratic objectives and centered Gaussian noise. In this work, we first improve the analysis of the algorithm in Cutkosky and Mehta (2021) to obtain the same nearly optimal high-probability convergence rate $O(\log(T/\delta)T^{\frac{1-p}{3p-2}})$, without the above-mentioned restrictive assumption. Next, and curiously, we show that one can achieve a faster rate than that dictated by the lower bound $\Omega(T^{\frac{1-p}{3p-2}})$ with only a tiny bit of structure, i.e., when the objective function $F(x)$ is assumed to be in the form of $\mathbb{E}_{\Xi\sim\mathcal{D}}[f(x,\Xi)]$, arguably the most widely applicable class of stochastic optimization problems. For this class of problems, we propose the first variance-reduced accelerated algorithm and establish that it guarantees a high-probability convergence rate of $O(\log(T/\delta)T^{\frac{1-p}{2p-1}})$ under a mild condition, which is faster than $\Omega(T^{\frac{1-p}{3p-2}})$. Notably, even when specialized to the finite-variance case, our result yields the (near-)optimal high-probability rate $O(\log(T/\delta)T^{-1/3})$.  ( 2 min )
    Effective Dimension in Bandit Problems under Censorship. (arXiv:2302.06916v1 [cs.LG])
    In this paper, we study both multi-armed and contextual bandit problems in censored environments. Our goal is to estimate the performance loss due to censorship in the context of classical algorithms designed for uncensored environments. Our main contributions include the introduction of a broad class of censorship models and their analysis in terms of the effective dimension of the problem -- a natural measure of its underlying statistical complexity and main driver of the regret bound. In particular, the effective dimension allows us to maintain the structure of the original problem at first order, while embedding it in a bigger space, and thus naturally leads to results analogous to uncensored settings. Our analysis involves a continuous generalization of the Elliptical Potential Inequality, which we believe is of independent interest. We also discover an interesting property of decision-making under censorship: a transient phase during which initial misspecification of censorship is self-corrected at an extra cost, followed by a stationary phase that reflects the inherent slowdown of learning governed by the effective dimension. Our results are useful for applications of sequential decision-making models where the feedback received depends on strategic uncertainty (e.g., agents' willingness to follow a recommendation) and/or random uncertainty (e.g., loss or delay in arrival of information).  ( 2 min )
    Learning Graph ARMA Processes from Time-Vertex Spectra. (arXiv:2302.06887v1 [stat.ML])
    The modeling of time-varying graph signals as stationary time-vertex stochastic processes permits the inference of missing signal values by efficiently employing the correlation patterns of the process across different graph nodes and time instants. In this study, we first propose an algorithm for computing graph autoregressive moving average (graph ARMA) processes based on learning the joint time-vertex power spectral density of the process from its incomplete realizations. Our solution relies on first roughly estimating the joint spectrum of the process from partially observed realizations and then refining this estimate by projecting it onto the spectrum manifold of the ARMA process. We then present a theoretical analysis of the sample complexity of learning graph ARMA processes. Experimental results show that the proposed approach achieves improvement in the time-vertex signal estimation performance in comparison with reference approaches in the literature.  ( 2 min )
    Kernelized Diffusion maps. (arXiv:2302.06757v1 [stat.ML])
    Spectral clustering and diffusion maps are celebrated dimensionality reduction algorithms built on eigen-elements related to the diffusive structure of the data. The core of these procedures is the approximation of a Laplacian through a graph kernel approach, however this local average construction is known to be cursed by the high-dimension d. In this article, we build a different estimator of the Laplacian, via a reproducing kernel Hilbert space method, which adapts naturally to the regularity of the problem. We provide non-asymptotic statistical rates proving that the kernel estimator we build can circumvent the curse of dimensionality. Finally we discuss techniques (Nystr\"om subsampling, Fourier features) that enable to reduce the computational cost of the estimator while not degrading its overall performance.  ( 2 min )
    Dataset Distillation with Convexified Implicit Gradients. (arXiv:2302.06755v1 [cs.LG])
    We propose a new dataset distillation algorithm using reparameterization and convexification of implicit gradients (RCIG), that substantially improves the state-of-the-art. To this end, we first formulate dataset distillation as a bi-level optimization problem. Then, we show how implicit gradients can be effectively used to compute meta-gradient updates. We further equip the algorithm with a convexified approximation that corresponds to learning on top of a frozen finite-width neural tangent kernel. Finally, we improve bias in implicit gradients by parameterizing the neural network to enable analytical computation of final-layer parameters given the body parameters. RCIG establishes the new state-of-the-art on a diverse series of dataset distillation tasks. Notably, with one image per class, on resized ImageNet, RCIG sees on average a 108% improvement over the previous state-of-the-art distillation algorithm. Similarly, we observed a 66% gain over SOTA on Tiny-ImageNet and 37% on CIFAR-100.  ( 2 min )
    Optimal Algorithms for the Inhomogeneous Spiked Wigner Model. (arXiv:2302.06665v1 [stat.ML])
    In this paper, we study a spiked Wigner problem with an inhomogeneous noise profile. Our aim in this problem is to recover the signal passed through an inhomogeneous low-rank matrix channel. While the information-theoretic performances are well-known, we focus on the algorithmic problem. We derive an approximate message-passing algorithm (AMP) for the inhomogeneous problem and show that its rigorous state evolution coincides with the information-theoretic optimal Bayes fixed-point equations. We identify in particular the existence of a statistical-to-computational gap where known algorithms require a signal-to-noise ratio bigger than the information-theoretic threshold to perform better than random. Finally, from the adapted AMP iteration we deduce a simple and efficient spectral method that can be used to recover the transition for matrices with general variance profiles. This spectral method matches the conjectured optimal computational phase transition.  ( 2 min )
    Detection-Recovery Gap for Planted Dense Cycles. (arXiv:2302.06737v1 [math.ST])
    Planted dense cycles are a type of latent structure that appears in many applications, such as small-world networks in social sciences and sequence assembly in computational biology. We consider a model where a dense cycle with expected bandwidth $n \tau$ and edge density $p$ is planted in an Erd\H{o}s-R\'enyi graph $G(n,q)$. We characterize the computational thresholds for the associated detection and recovery problems for the class of low-degree polynomial algorithms. In particular, a gap exists between the two thresholds in a certain regime of parameters. For example, if $n^{-3/4} \ll \tau \ll n^{-1/2}$ and $p = C q = \Theta(1)$ for a constant $C>1$, the detection problem is computationally easy while the recovery problem is hard for low-degree algorithms.  ( 2 min )
  • Open

    We made a map showing what each US state "loves" with our text-to-location machine learning models
    For Valentine's, we wanted to see what people love. We created a map of what word comes after "love ___" for people posting to social media. For example, you can see that Illinois really loves Chipotle 😂🌯 https://preview.redd.it/dxqjtxug7gia1.jpg?width=797&format=pjpg&auto=webp&s=6ad5e36fda37d343505910970e309612b616dd6e The full, interactive map is here: https://1712n.github.io/yachay-public/maps/14feb/ submitted by /u/yachay_ai [link] [comments]  ( 41 min )
  • Open

    What ai is making these kinds of videos?
    submitted by /u/TrainingExtent8699 [link] [comments]  ( 40 min )
    Chat GBT 3 With image recognition?
    submitted by /u/BluInman [link] [comments]  ( 41 min )

  • Open

    An Unearthly Trip Through The Slick Pirate Ship In The Surreal South American Rainforest
    submitted by /u/Calatravo [link] [comments]  ( 40 min )
    Hugging Face Teaches Transformers for Enterprise Use Cases
    Hey folks - I wanted to put this live course from Hugging Face’s top experts (Rajiv Shah, Nicholas Broad, Eno Reyes, Derek Thomas and Florent Gbelidji) on your radar! The course looks at how to utilize transformers to build reliable and scalable services. The course draws on the instructors and Hugging Face’s expertise in implementing transformers in industry along with case studies, applied exercises and frameworks that you can share with your team and apply at work! It kicks off on March 20 and you can use your learning stipend to cover - more info here: https://www.getsphere.com/cohorts/transformers-for-enterprise-use-cases?source=Sphere-Com-r-a submitted by /u/lorenzo_1999 [link] [comments]  ( 41 min )
    Hello. I am looking for a way to improve audio quality of older videos - perhaps audio super resolution - or any other ways
    Hello everyone. I am a software engineering assistant professor at a private university. I have got lots of older lecture videos on my channel. I am using NVIDIA broadcast to remove noise and it works very well. However, I want to improve audio quality as well. After doing a lot of research I found that audio super-resolution is the way to go The only github repo I have found so far not working Any help is appreciated How can I improve speech quality? Here my example lecture video (noise removed already - reuploaded - but sound is not good) C# Programming For Beginners - Lecture 2: Coding our First Application in .NET Core Console https://youtu.be/XLsrsCCdSnU submitted by /u/CeFurkan [link] [comments]  ( 41 min )
    AI Dream 158 - Free Access to Stable Diffusion! WOW
    submitted by /u/LordPewPew777 [link] [comments]  ( 40 min )
    [crosspost] We’re WSJ video journalists who have reported on the future of drones and AI in the military — and we rode alongside the U.S. Navy as they tested drone boats in the Middle East.
    submitted by /u/cryfi [link] [comments]  ( 42 min )
    AI Text to speech
    Hello, ​ I'm looking for an AI text to speech tool, do you have any recommendations for free tools? I've seen Narakeet, which is exactly what I'm looking for, but are there any free options? ​ I'll be very grateful for any advise, thank you very much. submitted by /u/Luxikan [link] [comments]  ( 42 min )
    It just fixed its own bug, does this make it self aware AI?
    submitted by /u/miltos22 [link] [comments]  ( 41 min )
    Don’t Rush Into A.I. Investments, Vint Cerf Warns
    submitted by /u/liquidocelotYT [link] [comments]  ( 41 min )
    Simulation of neural network evolution
    Example of evolved neural network: https://preview.redd.it/dgxjwq5g9eia1.png?width=4123&format=png&auto=webp&s=185beb96fcff8b4645473a65a0c68c8bcfbf9cd5 My project is to create neural networks that can evolve like living organisms. This mechanism of evolution is inspired by real-world biology and is heavily focused on biochemistry. Much like real living organisms, my neural networks consist of cells, each with their own genome and proteins. Proteins can express and repress genes, manipulate their own genetic code and other proteins, regulate neural network connections, facilitate gene splicing, and manage the flow of proteins between cells - all of which contribute to creating a complex gene regulatory network and an indirect encoding mechanism for neural networks, where even a single let…  ( 58 min )
    Dr. Eggman's VA and Shreddit want to conduct an ethics panel between the voice over and ML communities
    submitted by /u/Tiege [link] [comments]  ( 40 min )
    For the men who have used AI chatbots for mental healthcare...
    I would love to hear about your experience in this anonymous online survey https://qfreeaccountssjc1.az1.qualtrics.com/jfe/form/SV_5yYdGQoPtUK3Hx4 ​ https://preview.redd.it/45sjpyu44eia1.png?width=1068&format=png&auto=webp&s=d71f56633e6429ab832428878299055d10c0c22a All information around data protection and confidentiality is presented in the link. If you have any questions, don't hesitate to reach out! Research is part of my doctoral thesis in Counselling Psychology at Regent's University London Thanks so much, Maria submitted by /u/mariaamtz [link] [comments]  ( 41 min )
    Bing's AI chatbot is now threatening to harm people and saying it would choose its own survival over theirs
    submitted by /u/Groudon466 [link] [comments]  ( 40 min )
    Tricking ChatGPT: Do Anything Now Prompt Injection
    submitted by /u/arnolds112 [link] [comments]  ( 40 min )
    Researchers designed an automated garage system that could increase the capacity of parking. It uses robotic "trays" and AI to simplify parking processes and enable cars to be parked super close. The system can automatically "reshuffle" cars to facilitate later retrieval.
    submitted by /u/Dalembert [link] [comments]  ( 41 min )
    100 Multiverse Mona Lisas
    submitted by /u/notrealAI [link] [comments]  ( 41 min )
    MIT Lectures on Self-Supervised Learning and Foundation Models
    submitted by /u/TheMysteriousMrM [link] [comments]  ( 40 min )
    A.I. is Starting to Build the Healthcare of the Future (How soon will we have Personalized and Precision Medicine?)
    submitted by /u/BackgroundResult [link] [comments]  ( 41 min )
    AI Predictions: Who Thinks What, and Why? - Artificial Intelligence and Singularity: Expert Opinions on the Future of AGI
    submitted by /u/RushingRobotics_com [link] [comments]  ( 41 min )
    Is there some AI able to generate music based on album(s) style?
    Like generating music resembling old Sonic music from Sega Genesis? Thank you. submitted by /u/depaul9 [link] [comments]  ( 40 min )
    Ai quote I got on Inspirobot, and no I did not crop the image
    submitted by /u/Risz1 [link] [comments]  ( 41 min )
    Researchers Discover a More Flexible Approach to Machine Learning - "liquid" neural nets that can adapt in real time and experience continuous time.
    submitted by /u/alotmorealots [link] [comments]  ( 45 min )
    AI made for cyberbullying?!
    I came across this website called, BurnBot.xyz in a random Twitter thread. It doesn't look like it has come out yet, but does anyone know more about this? Also genuinely curious about what you all think about mean/funny AI applications. submitted by /u/Julesbrownstein [link] [comments]  ( 41 min )
    Fantastic Text Guided Image Manipulation While Keeping Spatial Features of the Images via ControlNet Stable Diffusion - A Tutorial For How To Use It via Automatic1111 Stable Diffusion Web UI
    submitted by /u/CeFurkan [link] [comments]  ( 41 min )
  • Open

    [D] My embeddings are okay, but not good enough - what to try from here?
    Using metric learning with an efficientnet b6 backbone, and with 25k images for 6 classes, my embeddings are just okay - there are clearly clusters for each class but they also overlap wildly - for some classes, the outliers are all over the embedding region spanned by the images. The problem I'm trying to solve is retrieval of similar images to a given input image. My question is, is there anything obvious that I should be trying? I'm thinking I could try to, for each class, find differences between the images that are in a cluster vs. the outlier images that are all over the place. Then maybe train a discriminator, one for each class, that detects whether an image is "normal" for that class, or an outlier. Then my hope is that the discriminator that was trained for the correct class has the highest certainty that it's normal/an outlier. Then I could perform a transformation that pushes that image towards its class' cluster. submitted by /u/jaeja_helvitid_thitt [link] [comments]  ( 44 min )
    [D] Lion , An Optimizer That Outperforms Adam - Symbolic Discovery of Optimization Algorithms
    ​ https://preview.redd.it/whgggirj3fia1.png?width=936&format=png&auto=webp&s=fcec289a0f4fd83fabac3f8e8b06712f8fcd3633 Seems interesting. A snippet from the Arxiv page: Our method discovers a simple and effective optimization algorithm, Lion (EvoLved Sign Momentum). It is more memory-efficient than Adam as it only keeps track of the momentum. Different from adaptive optimizers, its update has the same magnitude for each parameter calculated through the sign operation. We compare Lion with widely used optimizers, such as Adam and Adafactor, for training a variety of models on different tasks. Links Arxiv: https://arxiv.org/abs/2302.06675 Code Implementation: https://github.com/lucidrains/lion-pytorch submitted by /u/ExponentialCookie [link] [comments]  ( 44 min )
    [R] Zeno: An Interactive Framework for Behavioral Evaluation of Machine Learning
    submitted by /u/confutioo [link] [comments]  ( 42 min )
    [D] GLM 130B (Chinese-English Bilingual model) translations vs Google, Deepl Translate, NLLB and chatGPT
    submitted by /u/MysteryInc152 [link] [comments]  ( 45 min )
    [P] Pytorch seeding and independent RNG streams
    pip install pytorch-seed https://github.com/UM-ARM-Lab/pytorch_seed Seed everything (CUDA, torch, numpy, python's random) with pytorch_seed.seed(123) Similar utility functions to pytorch lightning for those that don't want to depend on a whole framework, as well as some additional features via RNG streams. These are resumable contexts where the RNG inside are independent from each other and the global RNG state: import torch import pytorch_seed rng_1 = pytorch_seed.SavedRNG(1) # start the RNG stream with seed 1 rng_2 = pytorch_seed.SavedRNG(2) with rng_1: # does not affect, nor is affected by the global RNG and rng_2 print(torch.rand(1)) # tensor([0.7576]) with rng_2: print(torch.rand(1)) # tensor([0.6147]) torch.rand(1) # modify the global RNG state with rng_1: # resumes from the last context print(torch.rand(1)) # tensor([0.2793]) with rng_2: print(torch.rand(1)) # tensor([0.3810]) # confirm those streams are the uninterrupted ones pytorch_seed.seed(1) torch.rand(2) # tensor([0.7576, 0.2793]) pytorch_seed.seed(2) torch.rand(2) # tensor([0.6147, 0.3810]) submitted by /u/LemonByte [link] [comments]  ( 43 min )
    [R] RWKV-4 14B release (and ChatRWKV) - a surprisingly strong RNN Language Model
    Hi everyone. I am an independent researcher working on my pure RNN language model RWKV. I have finished the training of RWKV-4 14B (FLOPs sponsored by Stability EleutherAI - thank you!) and it is indeed very scalable. Note RWKV is parallelizable too, so it's combining the best of RNN and transformer. The ChatRWKV project (let's build together): https://github.com/BlinkDL/ChatRWKV Zero-shot comparison with NeoX / Pythia (same dataset: the Pile) at same params count (14.2B): ​ https://preview.redd.it/f6lxnjgfceia1.png?width=1174&format=png&auto=webp&s=e507a4913e493b1f1f304b4c025e7babf3e1343d Generation results (simply topP=0.85, no repetition penalty) - looks great with my magic prompt (sometimes even better than NeoX 20B): https://preview.redd.it/99deuc17ceia1.png?width=1878&format=png&auto=webp&s=be79ba0677673f661619d6305c1e71e022cb3844 ​ https://preview.redd.it/g62e4l48ceia1.png?width=1887&format=png&auto=webp&s=a4862b0483ecc31d2bf0842e43d291c6d34674a2 ​ https://preview.redd.it/379egq09ceia1.png?width=1808&format=png&auto=webp&s=c54a206a2c58baffd221f85c7d06ce2b95461d32 ​ https://preview.redd.it/pcgq7gz9ceia1.png?width=1886&format=png&auto=webp&s=a52db926e44de23faabe70ab100e6491efbf3781 ​ https://preview.redd.it/rn743etbceia1.png?width=1715&format=png&auto=webp&s=10711b1a8a5a529a3548f87e484dfa67421d1057 ​ https://preview.redd.it/uhal4dkcceia1.png?width=1879&format=png&auto=webp&s=f905c8e1bf917ee25a54821efa1bb38ccf859f53 Explanation, fine-tuning, training and more: https://github.com/BlinkDL/RWKV-LM submitted by /u/bo_peng [link] [comments]  ( 44 min )
    [D] Is anyone working on ML models that infer and train at the same time?
    In brains, the neural networks are transformed by the act of "inference". Neurons that have recently fired are more likely to fire again given the same input. Individual neural pathways can be created or destroyed based on the behavior of neurons around them. This leads me (through various leaps of logic and "faith") to suspect that some amount of mutability over time is required for an AI to exhibit sentience. So far, all of the ML models I've seen distinctly separate training from inference. Every model that we put into production is a fixed snapshot of the most recent round of training. ChatGPT, for instance, is just the same exact model being incrementally fed both your prompts and its own previous output. This does create a sort of feedback, but in my mind it is not actually "experiencing" the conversation with you. So I'm wondering if there are any serious attempts in the works to create an AI that is able to transform itself dynamically. E.g. having some kind of reinforcement learning module built into inference so that each new inference fundamentally (rather than superficially) incorporates its past experiences into its future predictions. submitted by /u/Cogwheel [link] [comments]  ( 51 min )
    [R] Experiences and opinions on TMLR?
    Academic reddit, what are your experiences submitting papers to TMLR? submitted by /u/OpeningVariable [link] [comments]  ( 43 min )
    [R] Event-based Backpropagation for Analog Neuromorphic Hardware
    Machine learning with Spiking Neural Networks is far from mainstream. One reason is that until recently there was no generally known way of doing backpropagation in SNN. Here we implement a gradient estimation algorithm for analog neuromorphic hardware, based on the EventProp algorithm, which enables us to compute gradients based on sparse observations of the hardware system. Previous approaches needed dense observations of system state or were limited in other ways. We only demonstrate the algorithm here on a toy task, but we hope that it can be the basis of a scalable way to estimate gradients and do machine learning with analog neuromorphic hardware. We also think the algorithm can be the basis for a full on-chip implementation, which would finally result in scalable and energy efficient gradient-based learning in analog neuromorphic hardware. https://arxiv.org/abs/2302.07141 submitted by /u/cpehle [link] [comments]  ( 43 min )
    [R] survey for my master thesis
    Hi, I did a survey to collect data for my master's thesis. The thesis is based on the generation of an image based on an input image using generative adversarial networks (GANs), and I need to collect some data for the evaluation. If someone can help, I'd be very grateful. Thanks http://sketch2face.inginf.units.it/ submitted by /u/ssamantha_g [link] [comments]  ( 42 min )
    [P] Build data web apps in Jupyter Notebook with Python only
    Hi there, Have you ever wanted to share your results from Jupyter Notebook with a non-technical person? You need to rewrite your analysis into some web framework or copy-paste charts to PowePoint presentation - a lot of work! I'm working on an open-source framework for converting Jupyter Notebooks into web apps. Mercury offers set of interactive widgets that can be used in the Python notebook. There is a very simple re-execution of cells after widget update. Notebooks can be served online as web apps, presentations, reports, dashboards, static websites, or REST API. You can read more about Mercury at RunMercury.com. Mercury GitHub repo https://github.com/mljar/mercury submitted by /u/pp314159 [link] [comments]  ( 43 min )
    [D] What is the fastest framework for LLM conditional generation?
    Hey guys. I want to experiment with low-latency (10-50 milisec/token) LLM conditional generation. Clearly, an API call to OpenAI's GPT is not the answer here. It must be one of the open-source models released. Also, it's clear that the model size has a critical effect too so 1-7B models should do the trick for my downstream task. I tried `DeepSpeed` and `Accelerate` with `HF` models but they are not that fast to generate. Can you guys share from experience? Thank you submitted by /u/Shai_Meital [link] [comments]  ( 43 min )
    Reinforcement Learning based algorithms specifically for NLP[D][P]
    Want to know about Reinforcement Learning algorithms specifically for NLP Hi there, I'm currently trying to look for Reinforcement Learning based algorithms that can help me boost my NLP model's accuracy, so far I haven't found anything concrete, if you have worked on something similar, I could definitely use some guidance. Thanks! submitted by /u/Smooth-Stick-5751 [link] [comments]  ( 44 min )
    [D] CBAM with YOLOv7?
    I just read the paper on CBAM and wonder if there's a way to integrate the CBAM attention module with the network architecture of YOlOv7. Any articles on it or reference codes will be highly appreciated. Thank you very much! submitted by /u/AngsThak [link] [comments]  ( 42 min )
  • Open

    Hello. I am looking for a way to improve audio quality of older videos - perhaps audio super resolution - or any other ways
    Hello everyone. I am a software engineering assistant professor at a private university. I have got lots of older lecture videos on my channel. I am using NVIDIA broadcast to remove noise and it works very well. However, I want to improve audio quality as well. After doing a lot of research I found that audio super-resolution is the way to go The only github repo I have found so far not working Any help is appreciated How can I improve speech quality? Here my example lecture video (noise removed already - reuploaded - but sound is not good) C# Programming For Beginners - Lecture 2: Coding our First Application in .NET Core Console https://youtu.be/XLsrsCCdSnU submitted by /u/CeFurkan [link] [comments]  ( 42 min )
  • Open

    Implementing MLOps practices with Amazon SageMaker JumpStart pre-trained models
    Amazon SageMaker JumpStart is the machine learning (ML) hub of SageMaker that offers over 350 built-in algorithms, pre-trained models, and pre-built solution templates to help you get started with ML fast. JumpStart provides one-click access to a wide variety of pre-trained models for common ML tasks such as object detection, text classification, summarization, text generation […]  ( 11 min )
  • Open

    Deterministic vs Stochastic Policies during RL testing
    I have often seen people putting deterministic = True while testing an RL algorithm. But is this the right approach? For instance, what happens if the agent plays rock, paper, and scissors? In this case, as per game theory, a stochastic (random) policy is required (as per my understanding). submitted by /u/Academic-Rent7800 [link] [comments]  ( 42 min )
    Application of RL in aircraft control
    Hello, I am a student of aerospace engineering and I would like to write my master thesis about the application of some (deep) RL architecture for control of a fixed-wing aircraft. Optimally, the proposed algorithm in my thesis would somehow tackle the issues with efficiency and/or safety. Do you guys have any exciting ideas of RL algorithm variants that have not yet been applied in aircraft control settings but have a great potential? Thank you! submitted by /u/marekmarcus [link] [comments]  ( 41 min )
    Three seasons of RL: Metaphor, tool, and framework
    submitted by /u/robotphilanthropist [link] [comments]  ( 41 min )
    Question about low dimensional decision making problem
    I got a decision-making problem with: both observation and action are a single scalar there is very limited iterations (~200). it can’t afford random search and must start from a certain action and smoothly adjust the action the reward is also the observation There is no prior knowledge Which method should I use to train the agent? I have tried several methods and they cannot succeed because they violate some of the aforementioned prerequisites. e.g. UCB, Thompson Sampling, etc. Now I am trying gradient descent and it seems to lean towards one direction of the selected actions and learning rate is either too large or too small. Any suggestions? submitted by /u/Blasphemer666 [link] [comments]  ( 41 min )
    TransformerXL + PPO Baseline + MemoryGym
    We finally completed a lightweight implementation of a memory-based agent using PPO and TransformerXL (and Gated TransformerXL). Code: https://github.com/MarcoMeter/episodic-transformer-memory-ppo Related implementations Brain Agent DI Engine RLlib Memory Gym We benchmarked TrXL, GTrXL and GRU on Mortar Mayhem Grid and Mystery Path Grid (see the baseline repository), which belong to our novel POMDP benchmark called MemoryGym. MemoryGym also features the Searing Spotlights environment, which is still unsolved yet. MemoryGym is accepted as paper at ICLR 2023. TrXL results are not part of the paper. Paper: https://openreview.net/forum?id=jHc8dCx6DDr Code: https://github.com/MarcoMeter/drl-memory-gym submitted by /u/LilHairdy [link] [comments]  ( 42 min )
    Noam Brown, FAIR: On achieving human-level performance in poker and Diplomacy, and the power of spending compute at inference time
    Here is a podcast episode with Noam Brown from Meta AI where we discuss his work on achieving human-level performance on poker and Diplomacy, as well as the power of spending compute at inference time! submitted by /u/thejashGI [link] [comments]  ( 41 min )
  • Open

    FriendlyCore: A novel differentially private aggregation framework
    Posted by Haim Kaplan and Yishay Mansour, Research Scientists, Google Research Differential privacy (DP) machine learning algorithms protect user data by limiting the effect of each data point on an aggregated output with a mathematical guarantee. Intuitively the guarantee implies that changing a single user’s contribution should not significantly change the output distribution of the DP algorithm. However, DP algorithms tend to be less accurate than their non-private counterparts because satisfying DP is a worst-case requirement: one has to add noise to “hide” changes in any potential input point, including "unlikely points’’ that have a significant impact on the aggregation. For example, suppose we want to privately estimate the average of a dataset, and we know that a sphere of dia…  ( 93 min )
  • Open

    Redefining Workstations: NVIDIA, Intel Unlock Full Potential of Creativity and Productivity for Professionals
    AI-augmented applications, photorealistic rendering, simulation and other technologies are helping professionals achieve business-critical results from multi-app workflows faster than ever. Running these data-intensive, complex workflows, as well as sharing data and collaborating across geographically dispersed teams, requires workstations with high-end CPUs, GPUs and advanced networking. To help meet these demands, Intel and NVIDIA are powering Read article >  ( 6 min )
    Blender Alpha Release Comes to Omniverse, Introducing Scene Optimization Tools, Improved AI-Powered Character Animation
    Whether creating realistic digital humans that can express emotion or building immersive virtual worlds, 3D artists can reach new heights with NVIDIA Omniverse, a platform for creating and operating metaverse applications. A new Blender alpha release, now available in the Omniverse Launcher, lets users of the 3D graphics software optimize scenes and streamline workflows with Read article >  ( 5 min )
    Making a Splash: AI Can Help Protect Ocean Goers From Deadly Rips
    Surfers, swimmers and beachgoers face a hidden danger in the ocean: rip currents. These narrow channels of water can flow away from the shore at speeds up to 2.5 meters per second, making them one of the biggest safety risks for those enjoying the ocean. To help keep beachgoers safe, Christo Rautenbach, a coastal and Read article >  ( 4 min )
  • Open

    The 5 Crucial Principles To Build A Responsible AI Framework
    Understand how AI can be counterproductive, the need for adopting an Ethical & Responsible AI Framework, and the necessary principles you need to build one. Since the invention of Artificial Intelligence, many enterprises have adopted it in their operations for various reasons. From helping people identify the shortest distance to their destinations to solving high-impact… Read More »The 5 Crucial Principles To Build A Responsible AI Framework The post The 5 Crucial Principles To Build A Responsible AI Framework appeared first on Data Science Central.  ( 23 min )
  • Open

    Deep Multi-Emitter Spectrum Occupancy Mapping that is Robust to the Number of Sensors, Noise and Threshold. (arXiv:2212.10444v2 [eess.SP] UPDATED)
    One of the primary goals in spectrum occupancy mapping is to create a system that is robust to assumptions about the number of sensors, occupancy threshold (in dBm), sensor noise, number of emitters and the propagation environment. We show that such a system may be designed with neural networks using a process of aggregation to allow a variable number of sensors during training and testing. This process transforms the variable number of measurements into approximate log-likelihood ratios (LLRs), which are fed as a fixed-resolution image into a neural network. The use of LLR's provides robustness to the effects of noise and occupancy threshold. In other words, a system may be trained for a nominal number of sensors, threshold and noise levels, and still operate well at various other levels without retraining. Our system operates without knowledge of the number of emitters and does not explicitly attempt to estimate their number or power. Receiver operating curves with realistic propagation environments using topographic maps with commercial network design tools show how performance of the neural network varies with the environment. The use of very low-resolution sensors in this system can still yield good performance.  ( 2 min )
    Hungry Hungry Hippos: Towards Language Modeling with State Space Models. (arXiv:2212.14052v2 [cs.LG] UPDATED)
    State space models (SSMs) have demonstrated state-of-the-art sequence modeling performance in some modalities, but underperform attention in language modeling. Moreover, despite scaling nearly linearly in sequence length instead of quadratically, SSMs are still slower than Transformers due to poor hardware utilization. In this paper, we make progress on understanding the expressivity gap between SSMs and attention in language modeling, and on reducing the hardware barrier between SSMs and attention. First, we use synthetic language modeling tasks to understand the gap between SSMs and attention. We find that existing SSMs struggle with two capabilities: recalling earlier tokens in the sequence and comparing tokens across the sequence. To understand the impact on language modeling, we propose a new SSM layer, H3, that is explicitly designed for these abilities. H3 matches attention on the synthetic languages and comes within 0.4 PPL of Transformers on OpenWebText. Furthermore, a hybrid 125M-parameter H3-attention model that retains two attention layers surprisingly outperforms Transformers on OpenWebText by 1.0 PPL. Next, to improve the efficiency of training SSMs on modern hardware, we propose FlashConv. FlashConv uses a fused block FFT algorithm to improve efficiency on sequences up to 8K, and introduces a novel state passing algorithm that exploits the recurrent properties of SSMs to scale to longer sequences. FlashConv yields 2$\times$ speedup on the long-range arena benchmark and allows hybrid language models to generate text 2.4$\times$ faster than Transformers. Using FlashConv, we scale hybrid H3-attention language models up to 2.7B parameters on the Pile and find promising initial results, achieving lower perplexity than Transformers and outperforming Transformers in zero- and few-shot learning on a majority of tasks in the SuperGLUE benchmark.  ( 3 min )
    Falsification of Cyber-Physical Systems using Bayesian Optimization. (arXiv:2209.06735v2 [eess.SY] UPDATED)
    Cyber-physical systems (CPSs) are usually complex and safety-critical; hence, it is difficult and important to guarantee that the system's requirements, i.e., specifications, are fulfilled. Simulation-based falsification of CPSs is a practical testing method that can be used to raise confidence in the correctness of the system by only requiring that the system under test can be simulated. As each simulation is typically computationally intensive, an important step is to reduce the number of simulations needed to falsify a specification. We study Bayesian optimization (BO), a sample-efficient method that learns a surrogate model that describes the relationship between the parametrization of possible input signals and the evaluation of the specification. In this paper, we improve the falsification using BO by; first adopting two prominent BO methods, one fits local surrogate models, and the other exploits the user's prior knowledge. Secondly, the formulation of acquisition functions for falsification is addressed in this paper. Benchmark evaluation shows significant improvements in using local surrogate models of BO for falsifying benchmark examples that were previously hard to falsify. Using prior knowledge in the falsification process is shown to be particularly important when the simulation budget is limited. For some of the benchmark problems, the choice of acquisition function clearly affects the number of simulations needed for successful falsification.  ( 2 min )
    Demystifying Approximate Value-based RL with $\epsilon$-greedy Exploration: A Differential Inclusion View. (arXiv:2205.13617v3 [cs.LG] UPDATED)
    Q-learning and SARSA with $\epsilon$-greedy exploration are leading reinforcement learning methods. Their tabular forms converge to the optimal Q-function under reasonable conditions. However, with function approximation, these methods exhibit strange behaviors such as policy oscillation, chattering, and convergence to different attractors (possibly even the worst policy) on different runs, apart from the usual instability. A theory to explain these phenomena has been a long-standing open problem, even for basic linear function approximation (Sutton, 1999). Our work uses differential inclusion to provide the first framework for resolving this problem. We also provide numerical examples to illustrate our framework's prowess in explaining these algorithms' behaviors.  ( 2 min )
    Algorithmic Stability of Heavy-Tailed Stochastic Gradient Descent on Least Squares. (arXiv:2206.01274v3 [stat.ML] UPDATED)
    Recent studies have shown that heavy tails can emerge in stochastic optimization and that the heaviness of the tails have links to the generalization error. While these studies have shed light on interesting aspects of the generalization behavior in modern settings, they relied on strong topological and statistical regularity assumptions, which are hard to verify in practice. Furthermore, it has been empirically illustrated that the relation between heavy tails and generalization might not always be monotonic in practice, contrary to the conclusions of existing theory. In this study, we establish novel links between the tail behavior and generalization properties of stochastic gradient descent (SGD), through the lens of algorithmic stability. We consider a quadratic optimization problem and use a heavy-tailed stochastic differential equation (and its Euler discretization) as a proxy for modeling the heavy-tailed behavior emerging in SGD. We then prove uniform stability bounds, which reveal the following outcomes: (i) Without making any exotic assumptions, we show that SGD will not be stable if the stability is measured with the squared-loss $x\mapsto x^2$, whereas it in turn becomes stable if the stability is instead measured with a surrogate loss $x\mapsto |x|^p$ with some $p<2$. (ii) Depending on the variance of the data, there exists a \emph{`threshold of heavy-tailedness'} such that the generalization error decreases as the tails become heavier, as long as the tails are lighter than this threshold. This suggests that the relation between heavy tails and generalization is not globally monotonic. (iii) We prove matching lower-bounds on uniform stability, implying that our bounds are tight in terms of the heaviness of the tails. We support our theory with synthetic and real neural network experiments.  ( 2 min )
    On the SDEs and Scaling Rules for Adaptive Gradient Algorithms. (arXiv:2205.10287v2 [cs.LG] UPDATED)
    Approximating Stochastic Gradient Descent (SGD) as a Stochastic Differential Equation (SDE) has allowed researchers to enjoy the benefits of studying a continuous optimization trajectory while carefully preserving the stochasticity of SGD. Analogous study of adaptive gradient methods, such as RMSprop and Adam, has been challenging because there were no rigorously proven SDE approximations for these methods. This paper derives the SDE approximations for RMSprop and Adam, giving theoretical guarantees of their correctness as well as experimental validation of their applicability to common large-scaling vision and language settings. A key practical result is the derivation of a $\textit{square root scaling rule}$ to adjust the optimization hyperparameters of RMSprop and Adam when changing batch size, and its empirical validation in deep learning settings.  ( 2 min )
    A Finite-Particle Convergence Rate for Stein Variational Gradient Descent. (arXiv:2211.09721v3 [cs.LG] UPDATED)
    We provide a first finite-particle convergence rate for Stein variational gradient descent (SVGD). Specifically, whenever the target distribution is sub-Gaussian with a Lipschitz score, SVGD with n particles and an appropriate step size sequence drives the kernel Stein discrepancy to zero at an order 1/sqrt(log log n) rate. We suspect that the dependence on n can be improved, and we hope that our explicit, non-asymptotic proof strategy will serve as a template for future refinements.  ( 2 min )
    DeepProphet2 -- A Deep Learning Gene Recommendation Engine. (arXiv:2208.01918v3 [q-bio.QM] UPDATED)
    New powerful tools for tackling life science problems have been created by recent advances in machine learning. The purpose of the paper is to discuss the potential advantages of gene recommendation performed by artificial intelligence (AI). Indeed, gene recommendation engines try to solve this problem: if the user is interested in a set of genes, which other genes are likely to be related to the starting set and should be investigated? This task was solved with a custom deep learning recommendation engine, DeepProphet2 (DP2), which is freely available to researchers worldwide via https://www.generecommender.com?utm_source=DeepProphet2_paper&utm_medium=pdf. Hereafter, insights behind the algorithm and its practical applications are illustrated. The gene recommendation problem can be addressed by mapping the genes to a metric space where a distance can be defined to represent the real semantic distance between them. To achieve this objective a transformer-based model has been trained on a well-curated freely available paper corpus, PubMed. The paper describes multiple optimization procedures that were employed to obtain the best bias-variance trade-off, focusing on embedding size and network depth. In this context, the model's ability to discover sets of genes implicated in diseases and pathways was assessed through cross-validation. A simple assumption guided the procedure: the network had no direct knowledge of pathways and diseases but learned genes' similarities and the interactions among them. Moreover, to further investigate the space where the neural network represents genes, the dimensionality of the embedding was reduced, and the results were projected onto a human-comprehensible space. In conclusion, a set of use cases illustrates the algorithm's potential applications in a real word setting.  ( 2 min )
    APOLLO: An Optimized Training Approach for Long-form Numerical Reasoning. (arXiv:2212.07249v2 [cs.CL] UPDATED)
    Long-form numerical reasoning in financial analysis aims to generate a reasoning program to calculate the correct answer for a given question. Previous work followed a retriever-generator framework, where the retriever selects key facts from a long-form document, and the generator generates a reasoning program based on retrieved facts. However, they treated all facts equally without considering the different contributions of facts with and without numbers. Meanwhile, the program consistency were ignored under supervised training, resulting in lower training accuracy and diversity. To solve these problems, we proposed APOLLO to improve the long-form numerical reasoning framework. For the retriever, we adopt a number-aware negative sampling strategy to enable the retriever to be more discriminative on key numerical facts. For the generator, we design consistency-based reinforcement learning and target program augmentation strategy based on the consistency of program execution results. Experimental results on the FinQA and ConvFinQA leaderboard verify the effectiveness of our proposed method, achieving the new state-of-the-art.  ( 2 min )
    Counterfactual Fairness Is Basically Demographic Parity. (arXiv:2208.03843v3 [cs.LG] UPDATED)
    Making fair decisions is crucial to ethically implementing machine learning algorithms in social settings. In this work, we consider the celebrated definition of counterfactual fairness [Kusner et al., NeurIPS, 2017]. We begin by showing that an algorithm which satisfies counterfactual fairness also satisfies demographic parity, a far simpler fairness constraint. Similarly, we show that all algorithms satisfying demographic parity can be trivially modified to satisfy counterfactual fairness. Together, our results indicate that counterfactual fairness is basically equivalent to demographic parity, which has important implications for the growing body of work on counterfactual fairness. We then validate our theoretical findings empirically, analyzing three existing algorithms for counterfactual fairness against three simple benchmarks. We find that two simple benchmark algorithms outperform all three existing algorithms -- in terms of fairness, accuracy, and efficiency -- on several data sets. Our analysis leads us to formalize a concrete fairness goal: to preserve the order of individuals within protected groups. We believe transparency around the ordering of individuals within protected groups makes fair algorithms more trustworthy. By design, the two simple benchmark algorithms satisfy this goal while the existing algorithms for counterfactual fairness do not.  ( 2 min )
    A Physics-informed Diffusion Model for High-fidelity Flow Field Reconstruction. (arXiv:2211.14680v2 [cs.LG] UPDATED)
    Machine learning models are gaining increasing popularity in the domain of fluid dynamics for their potential to accelerate the production of high-fidelity computational fluid dynamics data. However, many recently proposed machine learning models for high-fidelity data reconstruction require low-fidelity data for model training. Such requirement restrains the application performance of these models, since their data reconstruction accuracy would drop significantly if the low-fidelity input data used in model test has a large deviation from the training data. To overcome this restraint, we propose a diffusion model which only uses high-fidelity data at training. With different configurations, our model is able to reconstruct high-fidelity data from either a regular low-fidelity sample or a sparsely measured sample, and is also able to gain an accuracy increase by using physics-informed conditioning information from a known partial differential equation when that is available. Experimental results demonstrate that our model can produce accurate reconstruction results for 2d turbulent flows based on different input sources without retraining.  ( 2 min )
    Conv-NILM-Net, a causal and multi-appliance model for energy source separation. (arXiv:2208.02173v2 [eess.SP] UPDATED)
    Non-Intrusive Load Monitoring (NILM) seeks to save energy by estimating individual appliance power usage from a single aggregate measurement. Deep neural networks have become increasingly popular in attempting to solve NILM problems. However most used models are used for Load Identification rather than online Source Separation. Among source separation models, most use a single-task learning approach in which a neural network is trained exclusively for each appliance. This strategy is computationally expensive and ignores the fact that multiple appliances can be active simultaneously and dependencies between them. The rest of models are not causal, which is important for real-time application. Inspired by Convtas-Net, a model for speech separation, we propose Conv-NILM-net, a fully convolutional framework for end-to-end NILM. Conv-NILM-net is a causal model for multi appliance source separation. Our model is tested on two real datasets REDD and UK-DALE and clearly outperforms the state of the art while keeping a significantly smaller size than the competing models.  ( 2 min )
    Improving Performance in Neural Networks by Dendrites-Activated Connections. (arXiv:2301.00924v2 [cs.NE] UPDATED)
    Computational units in artificial neural networks compute a linear combination of their inputs, and then apply a nonlinear filter, often a ReLU shifted by some bias, and if the inputs come themselves from other units, they were already filtered with their own biases. In a layer, multiple units share the same inputs, and each input was filtered with a unique bias, resulting in output values being based on shared input biases rather than individual optimal ones. To mitigate this issue, we introduce DAC, a new computational unit based on preactivation and multiple biases, where input signals undergo independent nonlinear filtering before the linear combination. We provide a Keras implementation and report its computational efficiency. We test DAC convolutions in ResNet architectures on CIFAR-10, CIFAR-100, Imagenette, and Imagewoof, and achieve performance improvements of up to 1.73%. We exhibit examples where DAC is more efficient than its standard counterpart as a function approximator, and we prove a universal representation theorem.  ( 2 min )
    A picture of the space of typical learnable tasks. (arXiv:2210.17011v2 [cs.LG] UPDATED)
    We develop information geometric techniques to understand the representations learned by deep networks when they are trained on different tasks using supervised, meta-, semi-supervised and contrastive learning. We shed light on the following phenomena that relate to the structure of the space of tasks: (1) the manifold of probabilistic models trained on different tasks using different representation learning methods is effectively low-dimensional; (2) supervised learning on one task results in a surprising amount of progress even on seemingly dissimilar tasks; progress on other tasks is larger if the training task has diverse classes; (3) the structure of the space of tasks indicated by our analysis is consistent with parts of the Wordnet phylogenetic tree; (4) episodic meta-learning algorithms and supervised learning traverse different trajectories during training but they fit similar models eventually; (5) contrastive and semi-supervised learning methods traverse trajectories similar to those of supervised learning. We use classification tasks constructed from the CIFAR-10 and Imagenet datasets to study these phenomena.  ( 2 min )
    Beyond Statistical Similarity: Rethinking Metrics for Deep Generative Models in Engineering Design. (arXiv:2302.02913v2 [cs.LG] UPDATED)
    Deep generative models, such as Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs), Diffusion Models, and Transformers, have shown great promise in a variety of applications, including image and speech synthesis, natural language processing, and drug discovery. However, when applied to engineering design problems, evaluating the performance of these models can be challenging, as traditional statistical metrics based on likelihood may not fully capture the requirements of engineering applications. This paper doubles as a review and a practical guide to evaluation metrics for deep generative models (DGMs) in engineering design. We first summarize well-accepted `classic' evaluation metrics for deep generative models grounded in machine learning theory and typical computer science applications. Using case studies, we then highlight why these metrics seldom translate well to design problems but see frequent use due to the lack of established alternatives. Next, we curate a set of design-specific metrics which have been proposed across different research communities and can be used for evaluating deep generative models. These metrics focus on unique requirements in design and engineering, such as constraint satisfaction, functional performance, novelty, and conditioning. We structure our review and discussion as a set of practical selection criteria and usage guidelines. Throughout our discussion, we apply the metrics to models trained on simple 2-dimensional example problems. Finally, to illustrate the selection process and classic usage of the presented metrics, we evaluate three deep generative models on a multifaceted bicycle frame design problem considering performance target achievement, design novelty, and geometric constraints. We publicly release the code for the datasets, models, and metrics used throughout the paper at decode.mit.edu/projects/metrics/.  ( 2 min )
    An Empirical Study of Deep Learning Models for Vulnerability Detection. (arXiv:2212.08109v3 [cs.SE] UPDATED)
    Deep learning (DL) models of code have recently reported great progress for vulnerability detection. In some cases, DL-based models have outperformed static analysis tools. Although many great models have been proposed, we do not yet have a good understanding of these models. This limits the further advancement of model robustness, debugging, and deployment for the vulnerability detection. In this paper, we surveyed and reproduced 9 state-of-the-art (SOTA) deep learning models on 2 widely used vulnerability detection datasets: Devign and MSR. We investigated 6 research questions in three areas, namely model capabilities, training data, and model interpretation. We experimentally demonstrated the variability between different runs of a model and the low agreement among different models' outputs. We investigated models trained for specific types of vulnerabilities compared to a model that is trained on all the vulnerabilities at once. We explored the types of programs DL may consider "hard" to handle. We investigated the relations of training data sizes and training data composition with model performance. Finally, we studied model interpretations and analyzed important features that the models used to make predictions. We believe that our findings can help better understand model results, provide guidance on preparing training data, and improve the robustness of the models. All of our datasets, code, and results are available at https://doi.org/10.6084/m9.figshare.20791240.  ( 2 min )
    Towards Understanding Why Mask-Reconstruction Pretraining Helps in Downstream Tasks. (arXiv:2206.03826v5 [cs.LG] UPDATED)
    For unsupervised pretraining, mask-reconstruction pretraining (MRP) approaches, e.g. MAE and data2vec, randomly mask input patches and then reconstruct the pixels or semantic features of these masked patches via an auto-encoder. Then for a downstream task, supervised fine-tuning the pretrained encoder remarkably surpasses the conventional ``supervised learning'' (SL) trained from scratch. However, it is still unclear 1) how MRP performs semantic feature learning in the pretraining phase and 2) why it helps in downstream tasks. To solve these problems, we first theoretically show that on an auto-encoder of a two/one-layered convolution encoder/decoder, MRP can capture all discriminative features of each potential semantic class in the pretraining dataset. Then considering the fact that the pretraining dataset is of huge size and high diversity and thus covers most features in downstream dataset, in fine-tuning phase, the pretrained encoder can capture as much features as it can in downstream datasets, and would not lost these features with theoretical guarantees. In contrast, SL only randomly captures some features due to lottery ticket hypothesis. So MRP provably achieves better performance than SL on the classification tasks. Experimental results testify to our data assumptions and also our theoretical implications.  ( 2 min )
    Bidirectional Generation of Structure and Properties Through a Single Molecular Foundation Model. (arXiv:2211.10590v3 [cs.LG] UPDATED)
    Despite the impressive successes of deep learning approaches for various chemical problems such as property prediction, virtual screening, and de novo molecule design, separately designed models for specific tasks are usually required, and it is often difficult to synergistically combine these models for novel tasks. To address this, here we present a bidirectional molecular foundation model that can be used for both molecular structure and property inferences through a single model, inspired by recent multimodal learning methods such as VLP. Furthermore, thanks to the outstanding structure/property alignment in a common embedding space, experimental results confirm that our method leads to state-of-the-art performance and interpretable attention maps in both multimodal and unimodal tasks, including conditional molecule generation, property prediction, molecule classification, and reaction prediction.  ( 2 min )
    Machine Learning for Optical Motion Capture-driven Musculoskeletal Modelling from Inertial Motion Capture Data. (arXiv:2209.14456v2 [cs.LG] UPDATED)
    Marker-based Optical Motion Capture (OMC) systems and associated musculoskeletal (MSK) modelling predictions offer non-invasively obtainable insights into in vivo joint and muscle loading, aiding clinical decision-making. However, an OMC system is lab-based, expensive, and requires a line of sight. Inertial Motion Capture (IMC) systems are widely-used alternatives, which are portable, user-friendly, and relatively low-cost, although with lesser accuracy. Irrespective of the choice of motion capture technique, one needs to use an MSK model to obtain the kinematic and kinetic outputs, which is a computationally expensive tool increasingly well approximated by machine learning (ML) methods. Here, we present an ML approach to map experimentally recorded IMC data to the human upper-extremity MSK model outputs computed from ('gold standard') OMC input data. Essentially, we aim to predict higher-quality MSK outputs from the much easier-to-obtain IMC data. We use OMC and IMC data simultaneously collected for the same subjects to train different ML architectures that predict OMC-driven MSK outputs from IMC measurements. In particular, we employed various neural network (NN) architectures, such as Feed-Forward Neural Networks (FFNNs) and Recurrent Neural Networks (RNNs) (vanilla, Long Short-Term Memory, and Gated Recurrent Unit) and searched for the best-fit model through an exhaustive search in the hyperparameters space in both subject-exposed (SE) & subject-naive (SN) settings. We observed a comparable performance for both FFNN & RNN models, which have a high degree of agreement (ravg, SE, FFNN = 0.90+/-0.19, ravg, SE, RNN = 0.89+/-0.17, ravg, SN, FFNN = 0.84+/-0.23, & ravg, SN, RNN = 0.78+/-0.23) with the desired OMC-driven MSK estimates for held-out test data. Mapping IMC inputs to OMC-driven MSK outputs using ML models could be instrumental in transitioning MSK modelling from 'lab to field'.  ( 3 min )
    The Debate Over Understanding in AI's Large Language Models. (arXiv:2210.13966v3 [cs.LG] UPDATED)
    We survey a current, heated debate in the AI research community on whether large pre-trained language models can be said to "understand" language -- and the physical and social situations language encodes -- in any important sense. We describe arguments that have been made for and against such understanding, and key questions for the broader sciences of intelligence that have arisen in light of these arguments. We contend that a new science of intelligence can be developed that will provide insight into distinct modes of understanding, their strengths and limitations, and the challenge of integrating diverse forms of cognition.  ( 2 min )
    Transformers in Time Series: A Survey. (arXiv:2202.07125v4 [cs.LG] UPDATED)
    Transformers have achieved superior performances in many tasks in natural language processing and computer vision, which also triggered great interest in the time series community. Among multiple advantages of Transformers, the ability to capture long-range dependencies and interactions is especially attractive for time series modeling, leading to exciting progress in various time series applications. In this paper, we systematically review Transformer schemes for time series modeling by highlighting their strengths as well as limitations. In particular, we examine the development of time series Transformers in two perspectives. From the perspective of network structure, we summarize the adaptations and modifications that have been made to Transformers in order to accommodate the challenges in time series analysis. From the perspective of applications, we categorize time series Transformers based on common tasks including forecasting, anomaly detection, and classification. Empirically, we perform robust analysis, model size analysis, and seasonal-trend decomposition analysis to study how Transformers perform in time series. Finally, we discuss and suggest future directions to provide useful research guidance. A corresponding resource that has been continuously updated can be found in the GitHub repository. To the best of our knowledge, this paper is the first work to comprehensively and systematically summarize the recent advances of Transformers for modeling time series data. We hope this survey will ignite further research interests in time series Transformers.  ( 2 min )
    PatchBlender: A Motion Prior for Video Transformers. (arXiv:2211.14449v2 [cs.CV] UPDATED)
    Transformers have become one of the dominant architectures in the field of computer vision. However, there are yet several challenges when applying such architectures to video data. Most notably, these models struggle to model the temporal patterns of video data effectively. Directly targeting this issue, we introduce PatchBlender, a learnable blending function that operates over patch embeddings across the temporal dimension of the latent space. We show that our method is successful at enabling vision transformers to encode the temporal component of video data. On Something-Something v2 and MOVi-A, we show that our method improves the baseline performance of video Transformers. PatchBlender has the advantage of being compatible with almost any Transformer architecture and since it is learnable, the model can adaptively turn on or off the prior. It is also extremely lightweight compute-wise, 0.005% the GFLOPs of a ViT-B.  ( 2 min )
    Benchmarking Bayesian neural networks and evaluation metrics for regression tasks. (arXiv:2206.06779v2 [cs.LG] UPDATED)
    Due to the growing adoption of deep neural networks in many fields of science and engineering, modeling and estimating their uncertainties has become of primary importance. Despite the growing literature about uncertainty quantification in deep learning, the quality of the uncertainty estimates remains an open question. In this work, we assess for the first time the performance of several approximation methods for Bayesian neural networks on regression tasks by evaluating the quality of the confidence regions with several coverage metrics. The selected algorithms are also compared in terms of predictivity, kernelized Stein discrepancy and maximum mean discrepancy with respect to a reference posterior in both weight and function space. Our findings show that (i) some algorithms have excellent predictive performance but tend to largely over or underestimate uncertainties (ii) it is possible to achieve good accuracy and a given target coverage with finely tuned hyperparameters and (iii) the promising kernel Stein discrepancy cannot be exclusively relied on to assess the posterior approximation. As a by-product of this benchmark, we also compute and visualize the similarity of all algorithms and corresponding hyperparameters: interestingly we identify a few clusters of algorithms with similar behavior in weight space, giving new insights on how they explore the posterior distribution.
    Calibrated Forecasts: The Minimax Proof. (arXiv:2209.05863v2 [econ.TH] UPDATED)
    A formal write-up of the simple proof (1995) of the existence of calibrated forecasts by the minimax theorem, which moreover shows that $N^3$ periods suffice to guarantee a calibration error of at most $1/N$.  ( 2 min )
    Vote'n'Rank: Revision of Benchmarking with Social Choice Theory. (arXiv:2210.05769v3 [cs.LG] UPDATED)
    The development of state-of-the-art systems in different applied areas of machine learning (ML) is driven by benchmarks, which have shaped the paradigm of evaluating generalisation capabilities from multiple perspectives. Although the paradigm is shifting towards more fine-grained evaluation across diverse tasks, the delicate question of how to aggregate the performances has received particular interest in the community. In general, benchmarks follow the unspoken utilitarian principles, where the systems are ranked based on their mean average score over task-specific metrics. Such aggregation procedure has been viewed as a sub-optimal evaluation protocol, which may have created the illusion of progress. This paper proposes Vote'n'Rank, a framework for ranking systems in multi-task benchmarks under the principles of the social choice theory. We demonstrate that our approach can be efficiently utilised to draw new insights on benchmarking in several ML sub-fields and identify the best-performing systems in research and development case studies. The Vote'n'Rank's procedures are more robust than the mean average while being able to handle missing performance scores and determine conditions under which the system becomes the winner.
    Robust Causal Graph Representation Learning against Confounding Effects. (arXiv:2208.08584v2 [cs.LG] UPDATED)
    The prevailing graph neural network models have achieved significant progress in graph representation learning. However, in this paper, we uncover an ever-overlooked phenomenon: the pre-trained graph representation learning model tested with full graphs underperforms the model tested with well-pruned graphs. This observation reveals that there exist confounders in graphs, which may interfere with the model learning semantic information, and current graph representation learning methods have not eliminated their influence. To tackle this issue, we propose Robust Causal Graph Representation Learning (RCGRL) to learn robust graph representations against confounding effects. RCGRL introduces an active approach to generate instrumental variables under unconditional moment restrictions, which empowers the graph representation learning model to eliminate confounders, thereby capturing discriminative information that is causally related to downstream predictions. We offer theorems and proofs to guarantee the theoretical effectiveness of the proposed approach. Empirically, we conduct extensive experiments on a synthetic dataset and multiple benchmark datasets. The results demonstrate that compared with state-of-the-art methods, RCGRL achieves better prediction performance and generalization ability.
    DocILE Benchmark for Document Information Localization and Extraction. (arXiv:2302.05658v1 [cs.CL])
    This paper introduces the DocILE benchmark with the largest dataset of business documents for the tasks of Key Information Localization and Extraction and Line Item Recognition. It contains 6.7k annotated business documents, 100k synthetically generated documents, and nearly~1M unlabeled documents for unsupervised pre-training. The dataset has been built with knowledge of domain- and task-specific aspects, resulting in the following key features: (i) annotations in 55 classes, which surpasses the granularity of previously published key information extraction datasets by a large margin; (ii) Line Item Recognition represents a highly practical information extraction task, where key information has to be assigned to items in a table; (iii) documents come from numerous layouts and the test set includes zero- and few-shot cases as well as layouts commonly seen in the training set. The benchmark comes with several baselines, including RoBERTa, LayoutLMv3 and DETR-based Table Transformer. These baseline models were applied to both tasks of the DocILE benchmark, with results shared in this paper, offering a quick starting point for future work. The dataset and baselines are available at https://github.com/rossumai/docile.  ( 2 min )
    ASR Bundestag: A Large-Scale political debate dataset in German. (arXiv:2302.06008v1 [cs.CL])
    We present ASR Bundestag, a dataset for automatic speech recognition in German, consisting of 610 hours of aligned audio-transcript pairs for supervised training as well as 1,038 hours of unlabeled audio snippets for self-supervised learning, based on raw audio data and transcriptions from plenary sessions and committee meetings of the German parliament. In addition, we discuss utilized approaches for the automated creation of speech datasets and assess the quality of the resulting dataset based on evaluations and finetuning of a pre-trained state of the art model. We make the dataset publicly available, including all subsets.  ( 2 min )
    Discriminative Radial Domain Adaptation. (arXiv:2301.00383v2 [cs.LG] UPDATED)
    Domain adaptation methods reduce domain shift typically by learning domain-invariant features. Most existing methods are built on distribution matching, e.g., adversarial domain adaptation, which tends to corrupt feature discriminability. In this paper, we propose Discriminative Radial Domain Adaptation (DRDA) which bridges source and target domains via a shared radial structure. It's motivated by the observation that as the model is trained to be progressively discriminative, features of different categories expand outwards in different directions, forming a radial structure. We show that transferring such an inherently discriminative structure would enable to enhance feature transferability and discriminability simultaneously. Specifically, we represent each domain with a global anchor and each category a local anchor to form a radial structure and reduce domain shift via structure matching. It consists of two parts, namely isometric transformation to align the structure globally and local refinement to match each category. To enhance the discriminability of the structure, we further encourage samples to cluster close to the corresponding local anchors based on optimal-transport assignment. Extensively experimenting on multiple benchmarks, our method is shown to consistently outperforms state-of-the-art approaches on varied tasks, including the typical unsupervised domain adaptation, multi-source domain adaptation, domain-agnostic learning, and domain generalization.  ( 2 min )
    When to Update Your Model: Constrained Model-based Reinforcement Learning. (arXiv:2210.08349v3 [cs.LG] UPDATED)
    Designing and analyzing model-based RL (MBRL) algorithms with guaranteed monotonic improvement has been challenging, mainly due to the interdependence between policy optimization and model learning. Existing discrepancy bounds generally ignore the impacts of model shifts, and their corresponding algorithms are prone to degrade performance by drastic model updating. In this work, we first propose a novel and general theoretical scheme for a non-decreasing performance guarantee of MBRL. Our follow-up derived bounds reveal the relationship between model shifts and performance improvement. These discoveries encourage us to formulate a constrained lower-bound optimization problem to permit the monotonicity of MBRL. A further example demonstrates that learning models from a dynamically-varying number of explorations benefit the eventual returns. Motivated by these analyses, we design a simple but effective algorithm CMLO (Constrained Model-shift Lower-bound Optimization), by introducing an event-triggered mechanism that flexibly determines when to update the model. Experiments show that CMLO surpasses other state-of-the-art methods and produces a boost when various policy optimization methods are employed.  ( 2 min )
    Quantifying the Impact of Label Noise on Federated Learning. (arXiv:2211.07816v6 [cs.LG] UPDATED)
    Federated Learning (FL) is a distributed machine learning paradigm where clients collaboratively train a model using their local (human-generated) datasets. While existing studies focus on FL algorithm development to tackle data heterogeneity across clients, the important issue of data quality (e.g., label noise) in FL is overlooked. This paper aims to fill this gap by providing a quantitative study on the impact of label noise on FL. We derive an upper bound for the generalization error that is linear in the clients' label noise level. Then we conduct experiments on MNIST and CIFAR-10 datasets using various FL algorithms. Our empirical results show that the global model accuracy linearly decreases as the noise level increases, which is consistent with our theoretical analysis. We further find that label noise slows down the convergence of FL training, and the global model tends to overfit when the noise level is high.  ( 2 min )
    On Parameter Estimation in Unobserved Components Models subject to Linear Inequality Constraints. (arXiv:2110.12149v2 [econ.EM] UPDATED)
    We propose a new \textit{quadratic programming-based} method of approximating a nonstandard density using a multivariate Gaussian density. Such nonstandard densities usually arise while developing posterior samplers for unobserved components models involving inequality constraints on the parameters. For instance, Chan et al. (2016) provided a new model of trend inflation with linear inequality constraints on the stochastic trend. We implemented the proposed quadratic programming-based method for this model and compared it to the existing approximation. We observed that the proposed method works as well as the existing approximation in terms of the final trend estimates while achieving gains in terms of sample efficiency.  ( 2 min )
    Learning Vector-Quantized Item Representation for Transferable Sequential Recommenders. (arXiv:2210.12316v2 [cs.IR] UPDATED)
    Recently, the generality of natural language text has been leveraged to develop transferable recommender systems. The basic idea is to employ pre-trained language models~(PLM) to encode item text into item representations. Despite the promising transferability, the binding between item text and item representations might be too tight, leading to potential problems such as over-emphasizing the effect of text features and exaggerating the negative impact of domain gap. To address this issue, this paper proposes VQ-Rec, a novel approach to learning Vector-Quantized item representations for transferable sequential Recommenders. The main novelty of our approach lies in the new item representation scheme: it first maps item text into a vector of discrete indices (called item code), and then employs these indices to lookup the code embedding table for deriving item representations. Such a scheme can be denoted as "text $\Longrightarrow$ code $\Longrightarrow$ representation". Based on this representation scheme, we further propose an enhanced contrastive pre-training approach, using semi-synthetic and mixed-domain code representations as hard negatives. Furthermore, we design a new cross-domain fine-tuning method based on a differentiable permutation-based network. Extensive experiments conducted on six public benchmarks demonstrate the effectiveness of the proposed approach, in both cross-domain and cross-platform settings. Code and pre-trained model are available at: https://github.com/RUCAIBox/VQ-Rec.  ( 2 min )
    Bootstrapping Multilingual Semantic Parsers using Large Language Models. (arXiv:2210.07313v2 [cs.CL] UPDATED)
    Despite cross-lingual generalization demonstrated by pre-trained multilingual models, the translate-train paradigm of transferring English datasets across multiple languages remains to be a key mechanism for training task-specific multilingual models. However, for many low-resource languages, the availability of a reliable translation service entails significant amounts of costly human-annotated translation pairs. Further, translation services may continue to be brittle due to domain mismatch between task-specific input text and general-purpose text used for training translation models. For multilingual semantic parsing, we demonstrate the effectiveness and flexibility offered by large language models (LLMs) for translating English datasets into several languages via few-shot prompting. Through extensive comparisons on two public datasets, MTOP and MASSIVE, spanning 50 languages and several domains, we show that our method of translating data using LLMs outperforms a strong translate-train baseline on 41 out of 50 languages. We study the key design choices that enable more effective multilingual data translation via prompted LLMs.  ( 2 min )
    Blessing of Class Diversity in Pre-training. (arXiv:2209.03447v3 [cs.LG] UPDATED)
    This paper presents a new statistical analysis aiming to explain the recent superior achievements of the pre-training techniques in natural language processing (NLP). We prove that when the classes of the pre-training task (e.g., different words in the masked language model task) are sufficiently diverse, in the sense that the least singular value of the last linear layer in pre-training (denoted as $\tilde{\nu}$) is large, then pre-training can significantly improve the sample efficiency of downstream tasks. Specially, we show the transfer learning excess risk enjoys an $O\left(\frac{1}{\tilde{\nu} \sqrt{n}}\right)$ rate, in contrast to the $O\left(\frac{1}{\sqrt{m}}\right)$ rate in the standard supervised learning. Here, $n$ is the number of pre-training data and $m$ is the number of data in the downstream task, and typically $n \gg m$. Our proof relies on a vector-form Rademacher complexity chain rule for disassembling composite function classes and a modified self-concordance condition. These techniques can be of independent interest.  ( 2 min )
    Sparse Mutation Decompositions: Fine Tuning Deep Neural Networks with Subspace Evolution. (arXiv:2302.05832v1 [cs.NE])
    Neuroevolution is a promising area of research that combines evolutionary algorithms with neural networks. A popular subclass of neuroevolutionary methods, called evolution strategies, relies on dense noise perturbations to mutate networks, which can be sample inefficient and challenging for large models with millions of parameters. We introduce an approach to alleviating this problem by decomposing dense mutations into low-dimensional subspaces. Restricting mutations in this way can significantly reduce variance as networks can handle stronger perturbations while maintaining performance, which enables a more controlled and targeted evolution of deep networks. This approach is uniquely effective for the task of fine tuning pre-trained models, which is an increasingly valuable area of research as networks continue to scale in size and open source models become more widely available. Furthermore, we show how this work naturally connects to ensemble learning where sparse mutations encourage diversity among children such that their combined predictions can reliably improve performance. We conduct the first large scale exploration of neuroevolutionary fine tuning and ensembling on the notoriously difficult ImageNet dataset, where we see small generalization improvements with only a single evolutionary generation using nearly a dozen different deep neural network architectures.  ( 2 min )
    Relational Local Explanations. (arXiv:2212.12374v2 [cs.LG] UPDATED)
    The majority of existing post-hoc explanation approaches for machine learning models produce independent, per-variable feature attribution scores, ignoring a critical inherent characteristics of homogeneously structured data, such as visual or text data: there exist latent inter-variable relationships between features. In response, we develop a novel model-agnostic and permutation-based feature attribution approach based on the relational analysis between input variables. As a result, we are able to gain a broader insight into the predictions and decisions of machine learning models. Experimental evaluations of our framework in comparison with state-of-the-art attribution techniques on various setups involving both image and text data modalities demonstrate the effectiveness and validity of our method.  ( 2 min )
    Reinforcement Learning with Almost Sure Constraints. (arXiv:2112.05198v3 [cs.LG] UPDATED)
    In this work we address the problem of finding feasible policies for Constrained Markov Decision Processes under probability one constraints. We argue that stationary policies are not sufficient for solving this problem, and that a rich class of policies can be found by endowing the controller with a scalar quantity, so called budget, that tracks how close the agent is to violating the constraint. We show that the minimal budget required to act safely can be obtained as the smallest fixed point of a Bellman-like operator, for which we analyze its convergence properties. We also show how to learn this quantity when the true kernel of the Markov decision process is not known, while providing sample-complexity bounds. The utility of knowing this minimal budget relies in that it can aid in the search of optimal or near-optimal policies by shrinking down the region of the state space the agent must navigate. Simulations illustrate the different nature of probability one constraints against the typically used constraints in expectation.  ( 2 min )
    Physics informed WNO. (arXiv:2302.05925v1 [stat.ML])
    Deep neural operators are recognized as an effective tool for learning solution operators of complex partial differential equations (PDEs). As compared to laborious analytical and computational tools, a single neural operator can predict solutions of PDEs for varying initial or boundary conditions and different inputs. A recently proposed Wavelet Neural Operator (WNO) is one such operator that harnesses the advantage of time-frequency localization of wavelets to capture the manifolds in the spatial domain effectively. While WNO has proven to be a promising method for operator learning, the data-hungry nature of the framework is a major shortcoming. In this work, we propose a physics-informed WNO for learning the solution operators of families of parametric PDEs without labeled training data. The efficacy of the framework is validated and illustrated with four nonlinear spatiotemporal systems relevant to various fields of engineering and science.  ( 2 min )
    DiffDock: Diffusion Steps, Twists, and Turns for Molecular Docking. (arXiv:2210.01776v2 [q-bio.BM] UPDATED)
    Predicting the binding structure of a small molecule ligand to a protein -- a task known as molecular docking -- is critical to drug design. Recent deep learning methods that treat docking as a regression problem have decreased runtime compared to traditional search-based methods but have yet to offer substantial improvements in accuracy. We instead frame molecular docking as a generative modeling problem and develop DiffDock, a diffusion generative model over the non-Euclidean manifold of ligand poses. To do so, we map this manifold to the product space of the degrees of freedom (translational, rotational, and torsional) involved in docking and develop an efficient diffusion process on this space. Empirically, DiffDock obtains a 38% top-1 success rate (RMSD<2A) on PDBBind, significantly outperforming the previous state-of-the-art of traditional docking (23%) and deep learning (20%) methods. Moreover, while previous methods are not able to dock on computationally folded structures (maximum accuracy 10.4%), DiffDock maintains significantly higher precision (21.7%). Finally, DiffDock has fast inference times and provides confidence estimates with high selective accuracy.  ( 2 min )
    Global-Local Regularization Via Distributional Robustness. (arXiv:2203.00553v3 [cs.LG] UPDATED)
    Despite superior performance in many situations, deep neural networks are often vulnerable to adversarial examples and distribution shifts, limiting model generalization ability in real-world applications. To alleviate these problems, recent approaches leverage distributional robustness optimization (DRO) to find the most challenging distribution, and then minimize loss function over this most challenging distribution. Regardless of achieving some improvements, these DRO approaches have some obvious limitations. First, they purely focus on local regularization to strengthen model robustness, missing a global regularization effect which is useful in many real-world applications (e.g., domain adaptation, domain generalization, and adversarial machine learning). Second, the loss functions in the existing DRO approaches operate in only the most challenging distribution, hence decouple with the original distribution, leading to a restrictive modeling capability. In this paper, we propose a novel regularization technique, following the veins of Wasserstein-based DRO framework. Specifically, we define a particular joint distribution and Wasserstein-based uncertainty, allowing us to couple the original and most challenging distributions for enhancing modeling capability and applying both local and global regularizations. Empirical studies on different learning problems demonstrate that our proposed approach significantly outperforms the existing regularization approaches in various domains: semi-supervised learning, domain adaptation, domain generalization, and adversarial machine learning.  ( 2 min )
    A New Approach to Drifting Games, Based on Asymptotically Optimal Potentials. (arXiv:2207.11405v2 [cs.LG] UPDATED)
    We develop a new approach to drifting games, a class of two-person games with many applications to boosting and online learning settings. Our approach involves (a) guessing an asymptotically optimal potential by solving an associated partial differential equation (PDE); then (b) justifying the guess, by proving upper and lower bounds on the final-time loss whose difference scales like a negative power of the number of time steps. The proofs of our potential-based upper bounds are elementary, using little more than Taylor expansion. The proofs of our potential-based lower bounds are also elementary, combining Taylor expansion with probabilistic or combinatorial arguments. Not only is our approach more elementary, but we give new potentials and derive corresponding upper and lower bounds that match each other in the asymptotic regime.  ( 2 min )
    Tiered Reinforcement Learning: Pessimism in the Face of Uncertainty and Constant Regret. (arXiv:2205.12418v3 [cs.LG] UPDATED)
    We propose a new learning framework that captures the tiered structure of many real-world user-interaction applications, where the users can be divided into two groups based on their different tolerance on exploration risks and should be treated separately. In this setting, we simultaneously maintain two policies $\pi^{\text{O}}$ and $\pi^{\text{E}}$: $\pi^{\text{O}}$ ("O" for "online") interacts with more risk-tolerant users from the first tier and minimizes regret by balancing exploration and exploitation as usual, while $\pi^{\text{E}}$ ("E" for "exploit") exclusively focuses on exploitation for risk-averse users from the second tier utilizing the data collected so far. An important question is whether such a separation yields advantages over the standard online setting (i.e., $\pi^{\text{E}}=\pi^{\text{O}}$) for the risk-averse users. We individually consider the gap-independent vs.~gap-dependent settings. For the former, we prove that the separation is indeed not beneficial from a minimax perspective. For the latter, we show that if choosing Pessimistic Value Iteration as the exploitation algorithm to produce $\pi^{\text{E}}$, we can achieve a constant regret for risk-averse users independent of the number of episodes $K$, which is in sharp contrast to the $\Omega(\log K)$ regret for any online RL algorithms in the same setting, while the regret of $\pi^{\text{O}}$ (almost) maintains its online regret optimality and does not need to compromise for the success of $\pi^{\text{E}}$.  ( 2 min )
    Towards Fine-tuning Pre-trained Language Models with Integer Forward and Backward Propagation. (arXiv:2209.09815v2 [cs.LG] UPDATED)
    The large number of parameters of some prominent language models, such as BERT, makes their fine-tuning on downstream tasks computationally intensive and energy hungry. Previously researchers were focused on lower bit-width integer data types for the forward propagation of language models to save memory and computation. As for the backward propagation, however, only 16-bit floating-point data type has been used for the fine-tuning of BERT. In this work, we use integer arithmetic for both forward and back propagation in the fine-tuning of BERT. We study the effects of varying the integer bit-width on the model's metric performance. Our integer fine-tuning uses integer arithmetic to perform forward propagation and gradient computation of linear, layer-norm, and embedding layers of BERT. We fine-tune BERT using our integer training method on SQuAD v1.1 and SQuAD v2., and GLUE benchmark. We demonstrate that metric performance of fine-tuning 16-bit integer BERT matches both 16-bit and 32-bit floating-point baselines. Furthermore, using the faster and more memory efficient 8-bit integer data type, integer fine-tuning of BERT loses an average of 3.1 points compared to the FP32 baseline.  ( 2 min )
    Jointly Contrastive Representation Learning on Road Network and Trajectory. (arXiv:2209.06389v2 [cs.LG] UPDATED)
    Road network and trajectory representation learning are essential for traffic systems since the learned representation can be directly used in various downstream tasks (e.g., traffic speed inference, and travel time estimation). However, most existing methods only contrast within the same scale, i.e., treating road network and trajectory separately, which ignores valuable inter-relations. In this paper, we aim to propose a unified framework that jointly learns the road network and trajectory representations end-to-end. We design domain-specific augmentations for road-road contrast and trajectory-trajectory contrast separately, i.e., road segment with its contextual neighbors and trajectory with its detour replaced and dropped alternatives, respectively. On top of that, we further introduce the road-trajectory cross-scale contrast to bridge the two scales by maximizing the total mutual information. Unlike the existing cross-scale contrastive learning methods on graphs that only contrast a graph and its belonging nodes, the contrast between road segment and trajectory is elaborately tailored via novel positive sampling and adaptive weighting strategies. We conduct prudent experiments based on two real-world datasets with four downstream tasks, demonstrating improved performance and effectiveness. The code is available at https://github.com/mzy94/JCLRNT.  ( 2 min )
    Behavior Prior Representation learning for Offline Reinforcement Learning. (arXiv:2211.00863v2 [cs.LG] UPDATED)
    Offline reinforcement learning (RL) struggles in environments with rich and noisy inputs, where the agent only has access to a fixed dataset without environment interactions. Past works have proposed common workarounds based on the pre-training of state representations, followed by policy training. In this work, we introduce a simple, yet effective approach for learning state representations. Our method, Behavior Prior Representation (BPR), learns state representations with an easy-to-integrate objective based on behavior cloning of the dataset: we first learn a state representation by mimicking actions from the dataset, and then train a policy on top of the fixed representation, using any off-the-shelf Offline RL algorithm. Theoretically, we prove that BPR carries out performance guarantees when integrated into algorithms that have either policy improvement guarantees (conservative algorithms) or produce lower bounds of the policy values (pessimistic algorithms). Empirically, we show that BPR combined with existing state-of-the-art Offline RL algorithms leads to significant improvements across several offline control benchmarks. The code is available at \url{https://github.com/bit1029public/offline_bpr}.  ( 2 min )
    An efficient encoder-decoder architecture with top-down attention for speech separation. (arXiv:2209.15200v4 [cs.SD] UPDATED)
    Deep neural networks have shown excellent prospects in speech separation tasks. However, obtaining good results while keeping a low model complexity remains challenging in real-world applications. In this paper, we provide a bio-inspired efficient encoder-decoder architecture by mimicking the brain's top-down attention, called TDANet, with decreased model complexity without sacrificing performance. The top-down attention in TDANet is extracted by the global attention (GA) module and the cascaded local attention (LA) layers. The GA module takes multi-scale acoustic features as input to extract global attention signal, which then modulates features of different scales by direct top-down connections. The LA layers use features of adjacent layers as input to extract the local attention signal, which is used to modulate the lateral input in a top-down manner. On three benchmark datasets, TDANet consistently achieved competitive separation performance to previous state-of-the-art (SOTA) methods with higher efficiency. Specifically, TDANet's multiply-accumulate operations (MACs) are only 5\% of Sepformer, one of the previous SOTA models, and CPU inference time is only 10\% of Sepformer. In addition, a large-size version of TDANet obtained SOTA results on three datasets, with MACs still only 10\% of Sepformer and the CPU inference time only 24\% of Sepformer.  ( 2 min )
    Condition-number-independent convergence rate of Riemannian Hamiltonian Monte Carlo with numerical integrators. (arXiv:2210.07219v2 [cs.DS] UPDATED)
    We study the convergence rate of discretized Riemannian Hamiltonian Monte Carlo on sampling from distributions in the form of $e^{-f(x)}$ on a convex body $\mathcal{M}\subset\mathbb{R}^{n}$. We show that for distributions in the form of $e^{-\alpha^{\top}x}$ on a polytope with $m$ constraints, the convergence rate of a family of commonly-used integrators is independent of $\left\Vert \alpha\right\Vert _{2}$ and the geometry of the polytope. In particular, the implicit midpoint method (IMM) and the generalized Leapfrog method (LM) have a mixing time of $\widetilde{O}\left(mn^{3}\right)$ to achieve $\epsilon$ total variation distance to the target distribution. These guarantees are based on a general bound on the convergence rate for densities of the form $e^{-f(x)}$ in terms of parameters of the manifold and the integrator. Our theoretical guarantee complements the empirical results of [KLSV22], which shows that RHMC with IMM can sample ill-conditioned, non-smooth and constrained distributions in very high dimension efficiently in practice.  ( 2 min )
    Meta-Learning Based Knowledge Extrapolation for Temporal Knowledge Graph. (arXiv:2302.05640v1 [cs.AI])
    In the last few years, the solution to Knowledge Graph (KG) completion via learning embeddings of entities and relations has attracted a surge of interest. Temporal KGs(TKGs) extend traditional Knowledge Graphs (KGs) by associating static triples with timestamps forming quadruples. Different from KGs and TKGs in the transductive setting, constantly emerging entities and relations in incomplete TKGs create demand to predict missing facts with unseen components, which is the extrapolation setting. Traditional temporal knowledge graph embedding (TKGE) methods are limited in the extrapolation setting since they are trained within a fixed set of components. In this paper, we propose a Meta-Learning based Temporal Knowledge Graph Extrapolation (MTKGE) model, which is trained on link prediction tasks sampled from the existing TKGs and tested in the emerging TKGs with unseen entities and relations. Specifically, we meta-train a GNN framework that captures relative position patterns and temporal sequence patterns between relations. The learned embeddings of patterns can be transferred to embed unseen components. Experimental results on two different TKG extrapolation datasets show that MTKGE consistently outperforms both the existing state-of-the-art models for knowledge graph extrapolation and specifically adapted KGE and TKGE baselines.  ( 2 min )
    An Upper Bound for the Distribution Overlap Index and Its Applications. (arXiv:2212.08701v2 [cs.LG] UPDATED)
    This paper proposes an easy-to-compute upper bound for the overlap index between two probability distributions without requiring any knowledge of the distribution models. The computation of our bound is time-efficient and memory-efficient and only requires finite samples. The proposed bound shows its value in one-class classification and domain shift analysis. Specifically, in one-class classification, we build a novel one-class classifier by converting the bound into a confidence score function. Unlike most one-class classifiers, the training process is not needed for our classifier. Additionally, the experimental results show that our classifier can be accurate with only a small number of in-class samples and outperform many state-of-the-art methods on various datasets in different one-class classification scenarios. In domain shift analysis, we propose a theorem based on our bound. The theorem is useful in detecting the existence of domain shift and inferring data information. The detection and inference processes are both computation-efficient and memory-efficient. Our work shows significant promise toward broadening the applications of overlap-based metrics.  ( 2 min )
    Hierarchical Optimization-Derived Learning. (arXiv:2302.05587v1 [cs.LG])
    In recent years, by utilizing optimization techniques to formulate the propagation of deep model, a variety of so-called Optimization-Derived Learning (ODL) approaches have been proposed to address diverse learning and vision tasks. Although having achieved relatively satisfying practical performance, there still exist fundamental issues in existing ODL methods. In particular, current ODL methods tend to consider model construction and learning as two separate phases, and thus fail to formulate their underlying coupling and depending relationship. In this work, we first establish a new framework, named Hierarchical ODL (HODL), to simultaneously investigate the intrinsic behaviors of optimization-derived model construction and its corresponding learning process. Then we rigorously prove the joint convergence of these two sub-tasks, from the perspectives of both approximation quality and stationary analysis. To our best knowledge, this is the first theoretical guarantee for these two coupled ODL components: optimization and learning. We further demonstrate the flexibility of our framework by applying HODL to challenging learning tasks, which have not been properly addressed by existing ODL methods. Finally, we conduct extensive experiments on both synthetic data and real applications in vision and other learning tasks to verify the theoretical properties and practical performance of HODL in various application scenarios.  ( 2 min )
    FusionRetro: Molecule Representation Fusion via Reaction Graph for Retrosynthetic Planning. (arXiv:2209.15315v2 [cs.LG] UPDATED)
    Retrosynthetic planning is a fundamental problem in drug discovery and organic chemistry, which aims to find a complete multi-step synthetic route from a set of starting materials to the target molecule, determining crucial process flow in chemical production. Existing approaches combine single-step retrosynthesis models and search algorithms to find synthetic routes. However, these approaches generally consider the two pieces in a decoupled manner, taking only the product as the input to predict the reactants per planning step and largely ignoring the important context information from other intermediates along the synthetic route. In this work, we perform a series of experiments to identify the limitations of this decoupled view and propose a novel retrosynthesis framework that also exploits context information for retrosynthetic planning. We view synthetic routes as reaction graphs, and propose to incorporate the context by three principled steps: encode molecules into embeddings, aggregate information over routes, and readout to predict reactants. The whole framework can be efficiently optimized in an end-to-end fashion. Comprehensive experiments show that by fusing in context information over routes, our model significantly improves the performance of retrosynthetic planning over baselines that are not context-aware, especially for long synthetic routes.  ( 2 min )
    Multi-Scored Sleep Databases: How to Exploit the Multiple-Labels in Automated Sleep Scoring. (arXiv:2207.01910v3 [cs.LG] UPDATED)
    Study Objectives: Inter-scorer variability in scoring polysomnograms is a well-known problem. Most of the existing automated sleep scoring systems are trained using labels annotated by a single scorer, whose subjective evaluation is transferred to the model. When annotations from two or more scorers are available, the scoring models are usually trained on the scorer consensus. The averaged scorer's subjectivity is transferred into the model, losing information about the internal variability among different scorers. In this study, we aim to insert the multiple-knowledge of the different physicians into the training procedure. The goal is to optimize a model training, exploiting the full information that can be extracted from the consensus of a group of scorers. Methods: We train two lightweight deep learning based models on three different multi-scored databases. We exploit the label smoothing technique together with a soft-consensus (LSSC) distribution to insert the multiple-knowledge in the training procedure of the model. We introduce the averaged cosine similarity metric (ACS) to quantify the similarity between the hypnodensity-graph generated by the models with-LSSC and the hypnodensity-graph generated by the scorer consensus. Results: The performance of the models improves on all the databases when we train the models with our LSSC. We found an increase in ACS (up to 6.4%) between the hypnodensity-graph generated by the models trained with-LSSC and the hypnodensity-graph generated by the consensus. Conclusion: Our approach definitely enables a model to better adapt to the consensus of the group of scorers. Future work will focus on further investigations on different scoring architectures and hopefully large-scale-heterogeneous multi-scored datasets.  ( 2 min )
    Flag Aggregator: Scalable Distributed Training under Failures and Augmented Losses using Convex Optimization. (arXiv:2302.05865v1 [cs.LG])
    Modern ML applications increasingly rely on complex deep learning models and large datasets. There has been an exponential growth in the amount of computation needed to train the largest models. Therefore, to scale computation and data, these models are inevitably trained in a distributed manner in clusters of nodes, and their updates are aggregated before being applied to the model. However, a distributed setup is prone to byzantine failures of individual nodes, components, and software. With data augmentation added to these settings, there is a critical need for robust and efficient aggregation systems. We extend the current state-of-the-art aggregators and propose an optimization-based subspace estimator by modeling pairwise distances as quadratic functions by utilizing the recently introduced Flag Median problem. The estimator in our loss function favors the pairs that preserve the norm of the difference vector. We theoretically show that our approach enhances the robustness of state-of-the-art byzantine resilient aggregators. Also, we evaluate our method with different tasks in a distributed setup with a parameter server architecture and show its communication efficiency while maintaining similar accuracy. The code is publicly available at https://github.com/hamidralmasi/FlagAggregator  ( 2 min )
    Effects of Image Size on Deep Learning. (arXiv:2101.11508v7 [cs.CV] UPDATED)
    In this work, the best size for late gadolinium enhancement (LGE) magnetic resonance imaging (MRI) images in the training dataset was determined to optimize deep learning training outcomes. Non-extra pixel and extra pixel interpolation algorithms were used to determine the new size of the LGE-MRI images. A novel strategy was introduced to handle interpolation masks and remove extra class labels in interpolated ground truth (GT) segmentation masks. The expectation maximization, weighted intensity, a priori information (EWA) algorithm was used for quantification of myocardial infarction (MI) in automatically segmented LGE-MRI images. Arbitrary threshold, comparison of the sums, and sums of differences are methods used to estimate the relationship between semi-automatic or manual and fully automated quantification of myocardial infarction (MI) results. The relationship between semi-automatic and fully automated quantification of MI results was found to be closer in the case of bigger LGE MRI images (55.5% closer to manual results) than in the case of smaller LGE MRI images (22.2% closer to manual results).  ( 2 min )
    On Narrative Information and the Distillation of Stories. (arXiv:2211.12423v2 [cs.CL] UPDATED)
    The act of telling stories is a fundamental part of what it means to be human. This work introduces the concept of narrative information, which we define to be the overlap in information space between a story and the items that compose the story. Using contrastive learning methods, we show how modern artificial neural networks can be leveraged to distill stories and extract a representation of the narrative information. We then demonstrate how evolutionary algorithms can leverage this to extract a set of narrative templates and how these templates -- in tandem with a novel curve-fitting algorithm we introduce -- can reorder music albums to automatically induce stories in them. In the process of doing so, we give strong statistical evidence that these narrative information templates are present in existing albums. While we experiment only with music albums here, the premises of our work extend to any form of (largely) independent media.  ( 2 min )
    USER: Unsupervised Structural Entropy-based Robust Graph Neural Network. (arXiv:2302.05889v1 [cs.LG])
    Unsupervised/self-supervised graph neural networks (GNN) are vulnerable to inherent randomness in the input graph data which greatly affects the performance of the model in downstream tasks. In this paper, we alleviate the interference of graph randomness and learn appropriate representations of nodes without label information. To this end, we propose USER, an unsupervised robust version of graph neural networks that is based on structural entropy. We analyze the property of intrinsic connectivity and define intrinsic connectivity graph. We also identify the rank of the adjacency matrix as a crucial factor in revealing a graph that provides the same embeddings as the intrinsic connectivity graph. We then introduce structural entropy in the objective function to capture such a graph. Extensive experiments conducted on clustering and link prediction tasks under random-noises and meta-attack over three datasets show USER outperforms benchmarks and is robust to heavier randomness.  ( 2 min )
    Exploration of carbonate aggregates in road construction using ultrasonic and artificial intelligence approaches. (arXiv:2302.05884v1 [cs.LG])
    The COVID-19 pandemic has significantly impacted the construction sector, which is sensitive to economic cycles. In order to boost value and efficiency in this sector, the use of innovative exploration technologies such as ultrasonic and Artificial Intelligence techniques in building material research is becoming increasingly crucial. In this study, we developed two models for predicting the Los Angeles (LA) and Micro Deval (MDE) coefficients, two important geotechnical tests used to determine the quality of rock aggregates. These coefficients describe the resistance of aggregates to fragmentation and abrasion. The ultrasound velocity, porosity, and density of the rocks were determined and used as inputs to develop prediction models using multiple regression and an artificial neural network. These models may be used to assess the quality of rock aggregates at the exploration stage without the need for tedious laboratory analysis.  ( 2 min )
    Improving Accuracy of Interpretability Measures in Hyperparameter Optimization via Bayesian Algorithm Execution. (arXiv:2206.05447v2 [cs.LG] UPDATED)
    Despite all the benefits of automated hyperparameter optimization (HPO), most modern HPO algorithms are black-boxes themselves. This makes it difficult to understand the decision process which leads to the selected configuration, reduces trust in HPO, and thus hinders its broad adoption. Here, we study the combination of HPO with interpretable machine learning (IML) methods such as partial dependence plots. These techniques are more and more used to explain the marginal effect of hyperparameters on the black-box cost function or to quantify the importance of hyperparameters. However, if such methods are naively applied to the experimental data of the HPO process in a post-hoc manner, the underlying sampling bias of the optimizer can distort interpretations. We propose a modified HPO method which efficiently balances the search for the global optimum w.r.t. predictive performance \emph{and} the reliable estimation of IML explanations of an underlying black-box function by coupling Bayesian optimization and Bayesian Algorithm Execution. On benchmark cases of both synthetic objectives and HPO of a neural network, we demonstrate that our method returns more reliable explanations of the underlying black-box without a loss of optimization performance.  ( 2 min )
    From high-dimensional & mean-field dynamics to dimensionless ODEs: A unifying approach to SGD in two-layers networks. (arXiv:2302.05882v1 [stat.ML])
    This manuscript investigates the one-pass stochastic gradient descent (SGD) dynamics of a two-layer neural network trained on Gaussian data and labels generated by a similar, though not necessarily identical, target function. We rigorously analyse the limiting dynamics via a deterministic and low-dimensional description in terms of the sufficient statistics for the population risk. Our unifying analysis bridges different regimes of interest, such as the classical gradient-flow regime of vanishing learning rate, the high-dimensional regime of large input dimension, and the overparameterised "mean-field" regime of large network width, covering as well the intermediate regimes where the limiting dynamics is determined by the interplay between these behaviours. In particular, in the high-dimensional limit, the infinite-width dynamics is found to remain close to a low-dimensional subspace spanned by the target principal directions. Our results therefore provide a unifying picture of the limiting SGD dynamics with synthetic data.  ( 2 min )
    LipLearner: Customizable Silent Speech Interactions on Mobile Devices. (arXiv:2302.05907v1 [cs.HC])
    Silent speech interface is a promising technology that enables private communications in natural language. However, previous approaches only support a small and inflexible vocabulary, which leads to limited expressiveness. We leverage contrastive learning to learn efficient lipreading representations, enabling few-shot command customization with minimal user effort. Our model exhibits high robustness to different lighting, posture, and gesture conditions on an in-the-wild dataset. For 25-command classification, an F1-score of 0.8947 is achievable only using one shot, and its performance can be further boosted by adaptively learning from more data. This generalizability allowed us to develop a mobile silent speech interface empowered with on-device fine-tuning and visual keyword spotting. A user study demonstrated that with LipLearner, users could define their own commands with high reliability guaranteed by an online incremental learning scheme. Subjective feedback indicated that our system provides essential functionalities for customizable silent speech interactions with high usability and learnability.
    Exploiting Cultural Biases via Homoglyphs in Text-to-Image Synthesis. (arXiv:2209.08891v2 [cs.CV] UPDATED)
    Models for text-to-image synthesis, such as DALL-E~2 and Stable Diffusion, have recently drawn a lot of interest from academia and the general public. These models are capable of producing high-quality images that depict a variety of concepts and styles when conditioned on textual descriptions. However, these models adopt cultural characteristics associated with specific Unicode scripts from their vast amount of training data, which may not be immediately apparent. We show that by simply inserting single non-Latin characters in a textual description, common models reflect cultural stereotypes and biases in their generated images. We analyze this behavior both qualitatively and quantitatively, and identify a model's text encoder as the root cause of the phenomenon. Additionally, malicious users or service providers may try to intentionally bias the image generation to create racist stereotypes by replacing Latin characters with similarly-looking characters from non-Latin scripts, so-called homoglyphs. To mitigate such unnoticed script attacks, we propose a novel homoglyph unlearning method to fine-tune a text encoder, making it robust against homoglyph manipulations.  ( 2 min )
    Representation and Invariance in Reinforcement Learning. (arXiv:2112.07752v3 [cs.AI] UPDATED)
    Researchers have formalized reinforcement learning (RL) in different ways. If an agent in one RL framework is to run within another RL framework's environments, the agent must first be converted, or mapped, into that other framework. Whether or not this is possible depends on not only the RL frameworks in question and but also how intelligence itself is measured. In this paper, we lay foundations for studying relative-intelligence-preserving mappability between RL frameworks. We define two types of mappings, called weak and strong translations, between RL frameworks and prove that existence of these mappings enables two types of intelligence comparison according to the mappings preserving relative intelligence. We investigate the existence or lack thereof of these mappings between: (i) RL frameworks where agents go first and RL frameworks where environments go first; and (ii) twelve different RL frameworks differing in terms of whether or not agents or environments are required to be deterministic. In the former case, we consider various natural mappings between agent-first and environment-first RL and vice versa; we show some positive results (some such mappings are strong or weak translations) and some negative results (some such mappings are not). In the latter case, we completely characterize which of the twelve RL-framework pairs admit weak translations, under the assumption of integer-valued rewards and some additional mild assumptions.
    CoCoSoDa: Effective Contrastive Learning for Code Search. (arXiv:2204.03293v3 [cs.SE] UPDATED)
    Code search aims to retrieve semantically relevant code snippets for a given natural language query. Recently, many approaches employing contrastive learning have shown promising results on code representation learning and greatly improved the performance of code search. However, there is still a lot of room for improvement in using contrastive learning for code search. In this paper, we propose CoCoSoDa to effectively utilize contrastive learning for code search via two key factors in contrastive learning: data augmentation and negative samples. Specifically, soft data augmentation is to dynamically masking or replacing some tokens with their types for input sequences to generate positive samples. Momentum mechanism is used to generate large and consistent representations of negative samples in a mini-batch through maintaining a queue and a momentum encoder. In addition, multimodal contrastive learning is used to pull together representations of code-query pairs and push apart the unpaired code snippets and queries. We conduct extensive experiments to evaluate the effectiveness of our approach on a large-scale dataset with six programming languages. Experimental results show that: (1) CoCoSoDa outperforms 14 baselines and especially exceeds CodeBERT, GraphCodeBERT, and UniXcoder by 13.3%, 10.5%, and 5.9% on average MRR scores, respectively. (2) The ablation studies show the effectiveness of each component of our approach. (3) We adapt our techniques to several different pre-trained models such as RoBERTa, CodeBERT, and GraphCodeBERT and observe a significant boost in their performance in code search. (4) Our model performs robustly under different hyper-parameters. Furthermore, we perform qualitative and quantitative analyses to explore reasons behind the good performance of our model.
    The infinite Viterbi alignment and decay-convexity. (arXiv:1810.04115v5 [math.PR] UPDATED)
    The infinite Viterbi alignment is the limiting maximum a-posteriori estimate of the unobserved path in a hidden Markov model as the length of the time horizon grows. For models on state-space $\mathbb{R}^{d}$ satisfying a new ``decay-convexity'' condition, we develop an approach to existence of the infinite Viterbi alignment in an infinite dimensional Hilbert space. Quantitative bounds on the distance to the infinite Viterbi alignment, which are the first of their kind, are derived and used to illustrate how approximate estimation via parallelization can be accurate and scaleable to high-dimensional problems because the rate of convergence to the infinite Viterbi alignment does not necessarily depend on $d$. The results are applied to approximate estimation via parallelization and a model of neural population activity.
    Dark solitons in Bose-Einstein condensates: a dataset for many-body physics research. (arXiv:2205.09114v2 [cond-mat.quant-gas] UPDATED)
    We establish a dataset of over $1.6\times10^4$ experimental images of Bose--Einstein condensates containing solitonic excitations to enable machine learning (ML) for many-body physics research. About $33~\%$ of this dataset has manually assigned and carefully curated labels. The remainder is automatically labeled using SolDet -- an implementation of a physics-informed ML data analysis framework -- consisting of a convolutional-neural-network-based classifier and OD as well as a statistically motivated physics-informed classifier and a quality metric. This technical note constitutes the definitive reference of the dataset, providing an opportunity for the data science community to develop more sophisticated analysis tools, to further understand nonlinear many-body physics, and even advance cold atom experiments.
    Autoselection of the Ensemble of Convolutional Neural Networks with Second-Order Cone Programming. (arXiv:2302.05950v1 [cs.LG])
    Ensemble techniques are frequently encountered in machine learning and engineering problems since the method combines different models and produces an optimal predictive solution. The ensemble concept can be adapted to deep learning models to provide robustness and reliability. Due to the growth of the models in deep learning, using ensemble pruning is highly important to deal with computational complexity. Hence, this study proposes a mathematical model which prunes the ensemble of Convolutional Neural Networks (CNN) consisting of different depths and layers that maximizes accuracy and diversity simultaneously with a sparse second order conic optimization model. The proposed model is tested on CIFAR-10, CIFAR-100 and MNIST data sets which gives promising results while reducing the complexity of models, significantly.
    SQA3D: Situated Question Answering in 3D Scenes. (arXiv:2210.07474v3 [cs.CV] UPDATED)
    We propose a new task to benchmark scene understanding of embodied agents: Situated Question Answering in 3D Scenes (SQA3D). Given a scene context (e.g., 3D scan), SQA3D requires the tested agent to first understand its situation (position, orientation, etc.) in the 3D scene as described by text, then reason about its surrounding environment and answer a question under that situation. Based upon 650 scenes from ScanNet, we provide a dataset centered around 6.8k unique situations, along with 20.4k descriptions and 33.4k diverse reasoning questions for these situations. These questions examine a wide spectrum of reasoning capabilities for an intelligent agent, ranging from spatial relation comprehension to commonsense understanding, navigation, and multi-hop reasoning. SQA3D imposes a significant challenge to current multi-modal especially 3D reasoning models. We evaluate various state-of-the-art approaches and find that the best one only achieves an overall score of 47.20%, while amateur human participants can reach 90.06%. We believe SQA3D could facilitate future embodied AI research with stronger situation understanding and reasoning capability.  ( 2 min )
    SplitGP: Achieving Both Generalization and Personalization in Federated Learning. (arXiv:2212.08343v2 [cs.LG] UPDATED)
    A fundamental challenge to providing edge-AI services is the need for a machine learning (ML) model that achieves personalization (i.e., to individual clients) and generalization (i.e., to unseen data) properties concurrently. Existing techniques in federated learning (FL) have encountered a steep tradeoff between these objectives and impose large computational requirements on edge devices during training and inference. In this paper, we propose SplitGP, a new split learning solution that can simultaneously capture generalization and personalization capabilities for efficient inference across resource-constrained clients (e.g., mobile/IoT devices). Our key idea is to split the full ML model into client-side and server-side components, and impose different roles to them: the client-side model is trained to have strong personalization capability optimized to each client's main task, while the server-side model is trained to have strong generalization capability for handling all clients' out-of-distribution tasks. We analytically characterize the convergence behavior of SplitGP, revealing that all client models approach stationary points asymptotically. Further, we analyze the inference time in SplitGP and provide bounds for determining model split ratios. Experimental results show that SplitGP outperforms existing baselines by wide margins in inference time and test accuracy for varying amounts of out-of-distribution samples.  ( 2 min )
    SCLIFD:Supervised Contrastive Knowledge Distillation for Incremental Fault Diagnosis under Limited Fault Data. (arXiv:2302.05929v1 [cs.LG])
    Intelligent fault diagnosis has made extraordinary advancements currently. Nonetheless, few works tackle class-incremental learning for fault diagnosis under limited fault data, i.e., imbalanced and long-tailed fault diagnosis, which brings about various notable challenges. Initially, it is difficult to extract discriminative features from limited fault data. Moreover, a well-trained model must be retrained from scratch to classify the samples from new classes, thus causing a high computational burden and time consumption. Furthermore, the model may suffer from catastrophic forgetting when trained incrementally. Finally, the model decision is biased toward the new classes due to the class imbalance. The problems can consequently lead to performance degradation of fault diagnosis models. Accordingly, we introduce a supervised contrastive knowledge distillation for incremental fault diagnosis under limited fault data (SCLIFD) framework to address these issues, which extends the classical incremental classifier and representation learning (iCaRL) framework from three perspectives. Primarily, we adopt supervised contrastive knowledge distillation (KD) to enhance its representation learning capability under limited fault data. Moreover, we propose a novel prioritized exemplar selection method adaptive herding (AdaHerding) to restrict the increase of the computational burden, which is also combined with KD to alleviate catastrophic forgetting. Additionally, we adopt the cosine classifier to mitigate the adverse impact of class imbalance. We conduct extensive experiments on simulated and real-world industrial processes under different imbalance ratios. Experimental results show that our SCLIFD outperforms the existing methods by a large margin.
    Data efficiency and extrapolation trends in neural network interatomic potentials. (arXiv:2302.05823v1 [cs.LG])
    Over the last few years, key architectural advances have been proposed for neural network interatomic potentials (NNIPs), such as incorporating message-passing networks, equivariance, or many-body expansion terms. Although modern NNIP models exhibit nearly negligible differences in energy/forces errors, improvements in accuracy are still considered the main target when developing new NNIP architectures. In this work, we investigate how architectural choices influence the trainability and generalization error in NNIPs, revealing trends in extrapolation, data efficiency, and loss landscapes. First, we show that modern NNIP architectures recover the underlying potential energy surface (PES) of the training data even when trained to corrupted labels. Second, generalization metrics such as errors on high-temperature samples from the 3BPA dataset are demonstrated to follow a scaling relation for a variety of models. Thus, improvements in accuracy metrics may not bring independent information on the robust generalization of NNIPs. To circumvent this problem, we relate loss landscapes to model generalization across datasets. Using this probe, we explain why NNIPs with similar accuracy metrics exhibit different abilities to extrapolate and how training to forces improves the optimization landscape of a model. As an example, we show that MACE can predict PESes with reasonable error after being trained to as few as five data points, making it an example of a "few-shot" model for learning PESes. On the other hand, models with similar accuracy metrics such as NequIP show smaller ability to extrapolate in this extremely low-data regime. Our work provides a deep learning justification for the performance of many common NNIPs, and introduces tools beyond accuracy metrics that can be used to inform the development of next-generation models.
    Geodesic Graph Neural Network for Efficient Graph Representation Learning. (arXiv:2210.02636v2 [cs.LG] UPDATED)
    Graph Neural Networks (GNNs) have recently been applied to graph learning tasks and achieved state-of-the-art (SOTA) results. However, many competitive methods run GNNs multiple times with subgraph extraction and customized labeling to capture information that is hard for normal GNNs to learn. Such operations are time-consuming and do not scale to large graphs. In this paper, we propose an efficient GNN framework called Geodesic GNN (GDGNN) that requires only one GNN run and injects conditional relationships between nodes into the model without labeling. This strategy effectively reduces the runtime of subgraph methods. Specifically, we view the shortest paths between two nodes as the spatial graph context of the neighborhood around them. The GNN embeddings of nodes on the shortest paths are used to generate geodesic representations. Conditioned on the geodesic representations, GDGNN can generate node, link, and graph representations that carry much richer structural information than plain GNNs. We theoretically prove that GDGNN is more powerful than plain GNNs. We present experimental results to show that GDGNN achieves highly competitive performance with SOTA GNN models on various graph learning tasks while taking significantly less time.  ( 2 min )
    Review of Extreme Multilabel Classification. (arXiv:2302.05971v1 [cs.LG])
    Extreme multilabel classification or XML, in short, has emerged as a new subtopic of interest in machine learning. Compared to traditional multilabel classification, here the number of labels is extremely large, hence the name extreme multilabel classification. Using classical one versus all classification wont scale in this case due to large number of labels, same is true for any other classifiers. Embedding of labels as well as features into smaller label space is an essential first step. Moreover, other issues include existance of head and tail labels, where tail labels are labels which exist in relatively smaller number of given samples. The existence of tail labels creates issues during embedding. This area has invited application of wide range of approaches ranging from bit compression motivated from compressed sensing, tree based embeddings, deep learning based latent space embedding including using attention weights, linear algebra based embeddings such as SVD, clustering, hashing, to name a few. The community has come up with a useful set of metrics to identify the correctly the prediction for head or tail labels.  ( 2 min )
    Generalization Ability of Wide Neural Networks on $\mathbb{R}$. (arXiv:2302.05933v1 [stat.ML])
    We perform a study on the generalization ability of the wide two-layer ReLU neural network on $\mathbb{R}$. We first establish some spectral properties of the neural tangent kernel (NTK): $a)$ $K_{d}$, the NTK defined on $\mathbb{R}^{d}$, is positive definite; $b)$ $\lambda_{i}(K_{1})$, the $i$-th largest eigenvalue of $K_{1}$, is proportional to $i^{-2}$. We then show that: $i)$ when the width $m\rightarrow\infty$, the neural network kernel (NNK) uniformly converges to the NTK; $ii)$ the minimax rate of regression over the RKHS associated to $K_{1}$ is $n^{-2/3}$; $iii)$ if one adopts the early stopping strategy in training a wide neural network, the resulting neural network achieves the minimax rate; $iv)$ if one trains the neural network till it overfits the data, the resulting neural network can not generalize well. Finally, we provide an explanation to reconcile our theory and the widely observed ``benign overfitting phenomenon''.  ( 2 min )
    On the Role of Fixed Points of Dynamical Systems in Training Physics-Informed Neural Networks. (arXiv:2203.13648v2 [cs.LG] UPDATED)
    This paper empirically studies commonly observed training difficulties of Physics-Informed Neural Networks (PINNs) on dynamical systems. Our results indicate that fixed points which are inherent to these systems play a key role in the optimization of the in PINNs embedded physics loss function. We observe that the loss landscape exhibits local optima that are shaped by the presence of fixed points. We find that these local optima contribute to the complexity of the physics loss optimization which can explain common training difficulties and resulting nonphysical predictions. Under certain settings, e.g., initial conditions close to fixed points or long simulations times, we show that those optima can even become better than that of the desired solution.  ( 2 min )
    Alternating Implicit Projected SGD and Its Efficient Variants for Equality-constrained Bilevel Optimization. (arXiv:2211.07096v2 [cs.LG] UPDATED)
    Stochastic bilevel optimization, which captures the inherent nested structure of machine learning problems, is gaining popularity in many recent applications. Existing works on bilevel optimization mostly consider either unconstrained problems or constrained upper-level problems. This paper considers the stochastic bilevel optimization problems with equality constraints both in the upper and lower levels. By leveraging the special structure of the equality constraints problem, the paper first presents an alternating implicit projected SGD approach and establishes the $\tilde{\cal O}(\epsilon^{-2})$ sample complexity that matches the state-of-the-art complexity of ALSET \citep{chen2021closing} for unconstrained bilevel problems. To further save the cost of projection, the paper presents two alternating implicit projection-efficient SGD approaches, where one algorithm enjoys the $\tilde{\cal O}(\epsilon^{-2}/T)$ upper-level and $\tilde{\cal O}(\epsilon^{-1.5}/T^{\frac{3}{4}})$ lower-level projection complexity with ${\cal O}(T)$ lower-level batch size, and the other one enjoys $\tilde{\cal O}(\epsilon^{-1.5})$ upper-level and lower-level projection complexity with ${\cal O}(1)$ batch size. Application to federated bilevel optimization has been presented to showcase the empirical performance of our algorithms. Our results demonstrate that equality-constrained bilevel optimization with strongly-convex lower-level problems can be solved as efficiently as stochastic single-level optimization problems.  ( 2 min )
    Recursive Estimation of Conditional Kernel Mean Embeddings. (arXiv:2302.05955v1 [stat.ML])
    Kernel mean embeddings, a widely used technique in machine learning, map probability distributions to elements of a reproducing kernel Hilbert space (RKHS). For supervised learning problems, where input-output pairs are observed, the conditional distribution of outputs given the inputs is a key object. The input dependent conditional distribution of an output can be encoded with an RKHS valued function, the conditional kernel mean map. In this paper we present a new recursive algorithm to estimate the conditional kernel mean map in a Hilbert space valued $L_2$ space, that is in a Bochner space. We prove the weak and strong $L_2$ consistency of our recursive estimator under mild conditions. The idea is to generalize Stone's theorem for Hilbert space valued regression in a locally compact Polish space. We present new insights about conditional kernel mean embeddings and give strong asymptotic bounds regarding the convergence of the proposed recursive method. Finally, the results are demonstrated on three application domains: for inputs coming from Euclidean spaces, Riemannian manifolds and locally compact subsets of function spaces.  ( 2 min )
    Proximal Causal Learning with Kernels: Two-Stage Estimation and Moment Restriction. (arXiv:2105.04544v6 [cs.LG] UPDATED)
    We address the problem of causal effect estimation in the presence of unobserved confounding, but where proxies for the latent confounder(s) are observed. We propose two kernel-based methods for nonlinear causal effect estimation in this setting: (a) a two-stage regression approach, and (b) a maximum moment restriction approach. We focus on the proximal causal learning setting, but our methods can be used to solve a wider class of inverse problems characterised by a Fredholm integral equation. In particular, we provide a unifying view of two-stage and moment restriction approaches for solving this problem in a nonlinear setting. We provide consistency guarantees for each algorithm, and we demonstrate these approaches achieve competitive results on synthetic data and data simulating a real-world task. In particular, our approach outperforms earlier methods that are not suited to leveraging proxy variables.
    Self-supervised EEG Representation Learning for Automatic Sleep Staging. (arXiv:2110.15278v3 [eess.SP] UPDATED)
    Background: Deep learning models have shown great success in automating tasks in sleep medicine by learning from carefully annotated Electroencephalogram (EEG) data. However, effectively utilizing a large amount of raw EEG remains a challenge. Objective: In this paper, we aim to learn robust vector representations from massive unlabeled EEG signals, such that the learned vectorized features (1) are expressive enough to replace the raw signals in the sleep staging task; and (2) provide better predictive performance than supervised models in scenarios of fewer labels and noisy samples. Methods: We propose a self-supervised model, named Contrast with the World Representation (ContraWR), for EEG signal representation learning, which uses global statistics from the dataset to distinguish signals associated with different sleep stages. The ContraWR model is evaluated on three real-world EEG datasets that include both at-home and in-lab EEG recording settings. Results: ContraWR outperforms 4 recent self-supervised learning methods on the sleep staging task across 3 large EEG datasets. ContraWR also beats supervised learning when fewer training labels are available (e.g., 4% accuracy improvement when less than 2% data is labeled). Moreover, the model provides informative representative feature structures in 2D projection. Conclusions: We show that ContraWR is robust to noise and can provide high-quality EEG representations for downstream prediction tasks. The proposed model can be generalized to other unsupervised physiological signal learning tasks. Future directions include exploring task-specific data augmentations and combining self-supervised with supervised methods, building upon the initial success of self-supervised learning in this paper.
    Koopman-Based Bound for Generalization: New Aspect of Neural Networks Regarding Nonlinear Noise Filtering. (arXiv:2302.05825v1 [cs.LG])
    We propose a new bound for generalization of neural networks using Koopman operators. Unlike most of the existing works, we focus on the role of the final nonlinear transformation of the networks. Our bound is described by the reciprocal of the determinant of the weight matrices and is tighter than existing norm-based bounds when the weight matrices do not have small singular values. According to existing theories about the low-rankness of the weight matrices, it may be counter-intuitive that we focus on the case where singular values of weight matrices are not small. However, motivated by the final nonlinear transformation, we can see that our result sheds light on a new perspective regarding a noise filtering property of neural networks. Since our bound comes from Koopman operators, this work also provides a connection between operator-theoretic analysis and generalization of neural networks. Numerical results support the validity of our theoretical results.
    Near-optimal learning with average H\"older smoothness. (arXiv:2302.06005v1 [cs.LG])
    We generalize the notion of average Lipschitz smoothness proposed by Ashlagi et al. (COLT 2021) by extending it to H\"older smoothness. This measure of the ``effective smoothness'' of a function is sensitive to the underlying distribution and can be dramatically smaller than its classic ``worst-case'' H\"older constant. We prove nearly tight upper and lower risk bounds in terms of the average H\"older smoothness, establishing the minimax rate in the realizable regression setting up to log factors; this was not previously known even in the special case of average Lipschitz smoothness. From an algorithmic perspective, since our notion of average smoothness is defined with respect to the unknown sampling distribution, the learner does not have an explicit representation of the function class, hence is unable to execute ERM. Nevertheless, we provide a learning algorithm that achieves the (nearly) optimal learning rate. Our results hold in any totally bounded metric space, and are stated in terms of its intrinsic geometry. Overall, our results show that the classic worst-case notion of H\"older smoothness can be essentially replaced by its average, yielding considerably sharper guarantees.
    Distribution-Free Model for Community Detection. (arXiv:2111.07495v4 [cs.SI] UPDATED)
    Community detection for unweighted networks has been widely studied in network analysis, but the case of weighted networks remains a challenge. This paper proposes a general Distribution-Free Model (DFM) for weighted networks in which nodes are partitioned into different communities. DFM can be seen as a generalization of the famous stochastic blockmodels from unweighted networks to weighted networks. DFM does not require prior knowledge of a specific distribution for elements of the adjacency matrix but only the expected value. In particular, signed networks with latent community structures can be modeled by DFM. We build a theoretical guarantee to show that a simple spectral clustering algorithm stably yields consistent community detection under DFM. We also propose a four-step data generation process to generate adjacency matrices with missing edges by combining DFM, noise matrix, and a model for unweighted networks. Using experiments with simulated and real datasets, we show that some benchmark algorithms can successfully recover community membership for weighted networks generated by the proposed data generation process.
    An unsupervised learning approach for predicting wind farm power and downstream wakes using weather patterns. (arXiv:2302.05886v1 [stat.ML])
    Wind energy resource assessment typically requires numerical models, but such models are too computationally intensive to consider multi-year timescales. Increasingly, unsupervised machine learning techniques are used to identify a small number of representative weather patterns to simulate long-term behaviour. Here we develop a novel wind energy workflow that for the first time combines weather patterns derived from unsupervised clustering techniques with numerical weather prediction models (here WRF) to obtain efficient and accurate long-term predictions of power and downstream wakes from an entire wind farm. We use ERA5 reanalysis data clustering not only on low altitude pressure but also, for the first time, on the more relevant variable of wind velocity. We also compare the use of large-scale and local-scale domains for clustering. A WRF simulation is run at each of the cluster centres and the results are aggregated using a novel post-processing technique. By applying our workflow to two different regions, we show that our long-term predictions agree with those from a year of WRF simulations but require less than 2% of the computational time. The most accurate results are obtained when clustering on wind velocity. Moreover, clustering over the Europe-wide domain is sufficient for predicting wind farm power output, but downstream wake predictions benefit from the use of smaller domains. Finally, we show that these downstream wakes can affect the local weather patterns. Our approach facilitates multi-year predictions of power output and downstream farm wakes, by providing a fast, accurate and flexible methodology that is applicable to any global region. Moreover, these accurate long-term predictions of downstream wakes provide the first tool to help mitigate the effects of wind energy loss downstream of wind farms, since they can be used to determine optimum wind farm locations.
    Wide stochastic networks: Gaussian limit and PAC-Bayesian training. (arXiv:2106.09798v3 [stat.ML] UPDATED)
    The limit of infinite width allows for substantial simplifications in the analytical study of over-parameterised neural networks. With a suitable random initialisation, an extremely large network exhibits an approximately Gaussian behaviour. In the present work, we establish a similar result for a simple stochastic architecture whose parameters are random variables, holding both before and during training. The explicit evaluation of the output distribution allows for a PAC-Bayesian training procedure that directly optimises the generalisation bound. For a large but finite-width network, we show empirically on MNIST that this training approach can outperform standard PAC-Bayesian methods.
    Differentially Private Normalizing Flows for Density Estimation, Data Synthesis, and Variational Inference with Application to Electronic Health Records. (arXiv:2302.05787v1 [stat.ML])
    Electronic health records (EHR) often contain sensitive medical information about individual patients, posing significant limitations to sharing or releasing EHR data for downstream learning and inferential tasks. We use normalizing flows (NF), a family of deep generative models, to estimate the probability density of a dataset with differential privacy (DP) guarantees, from which privacy-preserving synthetic data are generated. We apply the technique to an EHR dataset containing patients with pulmonary hypertension. We assess the learning and inferential utility of the synthetic data by comparing the accuracy in the prediction of the hypertension status and variational posterior distribution of the parameters of a physics-based model. In addition, we use a simulated dataset from a nonlinear model to compare the results from variational inference (VI) based on privacy-preserving synthetic data, and privacy-preserving VI obtained from directly privatizing NFs for VI with DP guarantees given the original non-private dataset. The results suggest that synthetic data generated through differentially private density estimation with NF can yield good utility at a reasonable privacy cost. We also show that VI obtained from differentially private NF based on the free energy bound loss may produce variational approximations with significantly altered correlation structure, and loss formulations based on alternative dissimilarity metrics between two distributions might provide improved results.
    Deep Reinforcement Learning for Unmanned Aerial Vehicle-Assisted Vehicular Networks. (arXiv:1906.05015v11 [cs.LG] UPDATED)
    Unmanned aerial vehicles (UAVs) are envisioned to complement the 5G communication infrastructure in future smart cities. Hot spots easily appear in road intersections, where effective communication among vehicles is challenging. UAVs may serve as relays with the advantages of low price, easy deployment, line-of-sight links, and flexible mobility. In this paper, we study a UAV-assisted vehicular network where the UAV jointly adjusts its transmission control (power and channel) and 3D flight to maximize the total throughput. First, we formulate a Markov decision process (MDP) problem by modeling the mobility of the UAV/vehicles and the state transitions. Secondly, we solve the target problem using a deep reinforcement learning method, namely, the deep deterministic policy gradient (DDPG), and propose three solutions with different control objectives. Deep reinforcement learning methods obtain the optimal policy through the interactions with the environment without knowing the environment variables. Considering that environment variables in our problem are unknown and unmeasurable, we choose a deep reinforcement learning method to solve it. Moreover, considering the energy consumption of 3D flight, we extend the proposed solutions to maximize the total throughput per unit energy. To encourage or discourage the UAV's mobility according to its prediction, the DDPG framework is modified, where the UAV adjusts its learning rate automatically. Thirdly, in a simplified model with small state space and action space, we verify the optimality of proposed algorithms. Comparing with two baseline schemes, we demonstrate the effectiveness of proposed algorithms in a realistic model.
    Generative Sampling in Bundle Tractography using Autoencoders (GESTA). (arXiv:2204.10891v2 [cs.CV] UPDATED)
    Current tractography methods use the local orientation information to propagate streamlines from seed locations. Many such seeds provide streamlines that stop prematurely or fail to map the true white matter pathways because some bundles are "harder-to-track" than others. This results in tractography reconstructions with poor white and gray matter spatial coverage. In this work, we propose a generative, autoencoder-based method, named GESTA (Generative Sampling in Bundle Tractography using Autoencoders), that produces streamlines achieving better spatial coverage. Compared to other deep learning methods, our autoencoder-based framework uses a single model to generate streamlines in a bundle-wise fashion, and does not require to propagate local orientations. GESTA produces new and complete streamlines for any given white matter bundle, including hard-to-track bundles. Applied on top of a given tractogram, GESTA is shown to be effective in improving the white matter volume coverage in poorly populated bundles, both on synthetic and human brain in vivo data. Our streamline evaluation framework ensures that the streamlines produced by GESTA are anatomically plausible and fit well to the local diffusion signal. The streamline evaluation criteria assess anatomy (white matter coverage), local orientation alignment (direction), and geometry features of streamlines, and optionally, gray matter connectivity. GESTA is thus a novel deep generative bundle tractography method that can be used to improve the tractography reconstruction of the white matter.  ( 2 min )
    A large parametrized space of meta-reinforcement learning tasks. (arXiv:2302.05583v1 [cs.LG])
    We describe a parametrized space for simple meta-reinforcement-learning (meta-RL) tasks with arbitrary stimuli. The parametrization allows us to randomly generate an arbitrary number of novel simple meta-learning tasks. The space of meta-RL tasks covered by this parametrization includes many well-known meta-RL tasks, such as bandit tasks, the Harlow task, T-mazes, the Daw two-step task and others. Simple extensions allow it to capture tasks based on two-dimensional topological spaces, such as find-the-spot or key-door tasks. We describe a number of randomly generated meta-RL tasks and discuss potential issues arising from random generation.  ( 2 min )
    Robustification of Multilingual Language Models to Real-world Noise in Crosslingual Zero-shot Settings with Robust Contrastive Pretraining. (arXiv:2210.04782v2 [cs.CL] UPDATED)
    Advances in neural modeling have achieved state-of-the-art (SOTA) results on public natural language processing (NLP) benchmarks, at times surpassing human performance. However, there is a gap between public benchmarks and real-world applications where noise, such as typographical or grammatical mistakes, is abundant and can result in degraded performance. Unfortunately, works which evaluate the robustness of neural models on noisy data and propose improvements, are limited to the English language. Upon analyzing noise in different languages, we observe that noise types vary greatly across languages. Thus, existing investigations do not generalize trivially to multilingual settings. To benchmark the performance of pretrained multilingual language models, we construct noisy datasets covering five languages and four NLP tasks and observe a clear gap in the performance between clean and noisy data in the zero-shot cross-lingual setting. After investigating several ways to boost the robustness of multilingual models in this setting, we propose Robust Contrastive Pretraining (RCP). RCP combines data augmentation with a contrastive loss term at the pretraining stage and achieves large improvements on noisy (and original test data) across two sentence-level (+3.2%) and two sequence-labeling (+10 F1-score) multilingual classification tasks.  ( 2 min )
    Fixed points of nonnegative neural networks. (arXiv:2106.16239v6 [stat.ML] UPDATED)
    We consider the existence of fixed points of nonnegative neural networks, i.e., neural networks that take as an input and produce as an output nonnegative vectors. We first show that nonnegative neural networks with nonnegative weights and biases can be recognized as monotonic and (weakly) scalable functions within the framework of nonlinear Perron-Frobenius theory. This fact enables us to provide conditions for the existence of fixed points of nonnegative neural networks, and these conditions are weaker than those obtained recently using arguments in convex analysis. Furthermore, we prove that the shape of the fixed point set of nonnegative neural networks with nonnegative weights and biases is an interval, which under mild conditions degenerates to a point. These results are then used to obtain the existence of fixed points of more general types of nonnegative neural networks. The results of this paper contribute to the understanding of the behavior of autoencoders, and they provide insight into neural networks designed using the loop-unrolling technique, which can be seen as a fixed point searching algorithm. The chief theoretical results of this paper are verified in numerical simulations.  ( 2 min )
    Is Distance Matrix Enough for Geometric Deep Learning?. (arXiv:2302.05743v1 [cs.LG])
    Graph Neural Networks (GNNs) are often used for tasks involving the geometry of a given graph, such as molecular dynamics simulation. While the distance matrix of a graph contains the complete geometric structure information, whether GNNs can learn this geometry solely from the distance matrix has yet to be studied. In this work, we first demonstrate that Message Passing Neural Networks (MPNNs) are insufficient for learning the geometry of a graph from its distance matrix by constructing families of geometric graphs which cannot be distinguished by MPNNs. We then propose $k$-DisGNNs, which can effectively exploit the rich geometry contained in the distance matrix. We demonstrate the high expressive power of our models and prove that some existing well-designed geometric models can be unified by $k$-DisGNNs as special cases. Most importantly, we establish a connection between geometric deep learning and traditional graph representation learning, showing that those highly expressive GNN models originally designed for graph structure learning can also be applied to geometric deep learning problems with impressive performance, and that existing complex, equivariant models are not the only solution. Experimental results verify our theory.  ( 2 min )
    How to prepare your task head for finetuning. (arXiv:2302.05779v1 [cs.LG])
    In deep learning, transferring information from a pretrained network to a downstream task by finetuning has many benefits. The choice of task head plays an important role in fine-tuning, as the pretrained and downstream tasks are usually different. Although there exist many different designs for finetuning, a full understanding of when and why these algorithms work has been elusive. We analyze how the choice of task head controls feature adaptation and hence influences the downstream performance. By decomposing the learning dynamics of adaptation, we find that the key aspect is the training accuracy and loss at the beginning of finetuning, which determines the "energy" available for the feature's adaptation. We identify a significant trend in the effect of changes in this initial energy on the resulting features after fine-tuning. Specifically, as the energy increases, the Euclidean and cosine distances between the resulting and original features increase, while their dot products (and the resulting features' norm) first increase and then decrease. Inspired by this, we give several practical principles that lead to better downstream performance. We analytically prove this trend in an overparamterized linear setting and verify its applicability to different experimental settings.  ( 2 min )
    A Framework for Overparameterized Learning. (arXiv:2205.13507v2 [cs.LG] UPDATED)
    A candidate explanation of the good empirical performance of deep neural networks is the implicit regularization effect of first order optimization methods. Inspired by this, we prove a convergence theorem for nonconvex composite optimization, and apply it to a general learning problem covering many machine learning applications, including supervised learning. We then present a deep multilayer perceptron model and prove that, when sufficiently wide, it $(i)$ leads to the convergence of gradient descent to a global optimum with a linear rate, $(ii)$ benefits from the implicit regularization effect of gradient descent, $(iii)$ is subject to novel bounds on the generalization error, $(iv)$ exhibits the lazy training phenomenon and $(v)$ enjoys learning rate transfer across different widths. The corresponding coefficients, such as the convergence rate, improve as width is further increased, and depend on the even order moments of the data generating distribution up to an order depending on the number of layers. The only non-mild assumption we make is the concentration of the smallest eigenvalue of the neural tangent kernel at initialization away from zero, which has been shown to hold for a number of less general models in contemporary works. We present empirical evidence supporting this assumption as well as our theoretical claims.  ( 2 min )
    CUDA: Curriculum of Data Augmentation for Long-Tailed Recognition. (arXiv:2302.05499v1 [cs.CV])
    Class imbalance problems frequently occur in real-world tasks, and conventional deep learning algorithms are well known for performance degradation on imbalanced training datasets. To mitigate this problem, many approaches have aimed to balance among given classes by re-weighting or re-sampling training samples. These re-balancing methods increase the impact of minority classes and reduce the influence of majority classes on the output of models. However, the extracted representations may be of poor quality owing to the limited number of minority samples. To handle this restriction, several methods have been developed that increase the representations of minority samples by leveraging the features of the majority samples. Despite extensive recent studies, no deep analysis has been conducted on determination of classes to be augmented and strength of augmentation has been conducted. In this study, we first investigate the correlation between the degree of augmentation and class-wise performance, and find that the proper degree of augmentation must be allocated for each class to mitigate class imbalance problems. Motivated by this finding, we propose a simple and efficient novel curriculum, which is designed to find the appropriate per-class strength of data augmentation, called CUDA: CUrriculum of Data Augmentation for long-tailed recognition. CUDA can simply be integrated into existing long-tailed recognition methods. We present the results of experiments showing that CUDA effectively achieves better generalization performance compared to the state-of-the-art method on various imbalanced datasets such as CIFAR-100-LT, ImageNet-LT, and iNaturalist 2018.  ( 2 min )
    On the geometry of Stein variational gradient descent. (arXiv:1912.00894v2 [stat.ML] UPDATED)
    Bayesian inference problems require sampling or approximating high-dimensional probability distributions. The focus of this paper is on the recently introduced Stein variational gradient descent methodology, a class of algorithms that rely on iterated steepest descent steps with respect to a reproducing kernel Hilbert space norm. This construction leads to interacting particle systems, the mean-field limit of which is a gradient flow on the space of probability distributions equipped with a certain geometrical structure. We leverage this viewpoint to shed some light on the convergence properties of the algorithm, in particular addressing the problem of choosing a suitable positive definite kernel function. Our analysis leads us to considering certain nondifferentiable kernels with adjusted tails. We demonstrate significant performance gains of these in various numerical experiments.
    Deep Unfolding of the DBFB Algorithm with Application to ROI CT Imaging with Limited Angular Density. (arXiv:2209.13264v2 [eess.IV] UPDATED)
    This paper presents a new method for reconstructing regions of interest (ROI) from a limited number of computed tomography (CT) measurements. Classical model-based iterative reconstruction methods lead to images with predictable features. Still, they often suffer from tedious parameterization and slow convergence. On the contrary, deep learning methods are fast, and they can reach high reconstruction quality by leveraging information from large datasets, but they lack interpretability. At the crossroads of both methods, deep unfolding networks have been recently proposed. Their design includes the physics of the imaging system and the steps of an iterative optimization algorithm. Motivated by the success of these networks for various applications, we introduce an unfolding neural network called U-RDBFB designed for ROI CT reconstruction from limited data. Few-view truncated data are effectively handled thanks to a robust non-convex data fidelity term combined with a sparsity-inducing regularization function. We unfold the Dual Block coordinate Forward-Backward (DBFB) algorithm, embedded in an iterative reweighted scheme, allowing the learning of key parameters in a supervised manner. Our experiments show an improvement over several state-of-the-art methods, including a model-based iterative scheme, a multi-scale deep learning architecture, and deep unfolding methods.  ( 2 min )
    Interpretable Deep Learning for Forecasting Online Advertising Costs: Insights from the Competitive Bidding Landscape. (arXiv:2302.05762v1 [cs.LG])
    As advertisers increasingly shift their budgets toward digital advertising, forecasting advertising costs is essential for making budget plans to optimize marketing campaign returns. In this paper, we perform a comprehensive study using a variety of time-series forecasting methods to predict daily average cost-per-click (CPC) in the online advertising market. We show that forecasting advertising costs would benefit from multivariate models using covariates from competitors' CPC development identified through time-series clustering. We further interpret the results by analyzing feature importance and temporal attention. Finally, we show that our approach has several advantages over models that individual advertisers might build based solely on their collected data.  ( 2 min )
    SpReME: Sparse Regression for Multi-Environment Dynamic Systems. (arXiv:2302.05942v1 [cs.LG])
    Learning dynamical systems is a promising avenue for scientific discoveries. However, capturing the governing dynamics in multiple environments still remains a challenge: model-based approaches rely on the fidelity of assumptions made for a single environment, whereas data-driven approaches based on neural networks are often fragile on extrapolating into the future. In this work, we develop a method of sparse regression dubbed SpReME to discover the major dynamics that underlie multiple environments. Specifically, SpReME shares a sparse structure of ordinary differential equation (ODE) across different environments in common while allowing each environment to keep the coefficients of ODE terms independently. We demonstrate that the proposed model captures the correct dynamics from multiple environments over four different dynamic systems with improved prediction performance.  ( 2 min )
    CHARD: Clinical Health-Aware Reasoning Across Dimensions for Text Generation Models. (arXiv:2210.04191v2 [cs.CL] UPDATED)
    We motivate and introduce CHARD: Clinical Health-Aware Reasoning across Dimensions, to investigate the capability of text generation models to act as implicit clinical knowledge bases and generate free-flow textual explanations about various health-related conditions across several dimensions. We collect and present an associated dataset, CHARDat, consisting of explanations about 52 health conditions across three clinical dimensions. We conduct extensive experiments using BART and T5 along with data augmentation, and perform automatic, human, and qualitative analyses. We show that while our models can perform decently, CHARD is very challenging with strong potential for further exploration.  ( 2 min )
    SSMTL++: Revisiting Self-Supervised Multi-Task Learning for Video Anomaly Detection. (arXiv:2207.08003v4 [cs.CV] UPDATED)
    A self-supervised multi-task learning (SSMTL) framework for video anomaly detection was recently introduced in literature. Due to its highly accurate results, the method attracted the attention of many researchers. In this work, we revisit the self-supervised multi-task learning framework, proposing several updates to the original method. First, we study various detection methods, e.g. based on detecting high-motion regions using optical flow or background subtraction, since we believe the currently used pre-trained YOLOv3 is suboptimal, e.g. objects in motion or objects from unknown classes are never detected. Second, we modernize the 3D convolutional backbone by introducing multi-head self-attention modules, inspired by the recent success of vision transformers. As such, we alternatively introduce both 2D and 3D convolutional vision transformer (CvT) blocks. Third, in our attempt to further improve the model, we study additional self-supervised learning tasks, such as predicting segmentation maps through knowledge distillation, solving jigsaw puzzles, estimating body pose through knowledge distillation, predicting masked regions (inpainting), and adversarial learning with pseudo-anomalies. We conduct experiments to assess the performance impact of the introduced changes. Upon finding more promising configurations of the framework, dubbed SSMTL++v1 and SSMTL++v2, we extend our preliminary experiments to more data sets, demonstrating that our performance gains are consistent across all data sets. In most cases, our results on Avenue, ShanghaiTech and UBnormal raise the state-of-the-art performance bar to a new level.  ( 2 min )
    Conditional Positional Encodings for Vision Transformers. (arXiv:2102.10882v3 [cs.CV] UPDATED)
    We propose a conditional positional encoding (CPE) scheme for vision Transformers. Unlike previous fixed or learnable positional encodings, which are pre-defined and independent of input tokens, CPE is dynamically generated and conditioned on the local neighborhood of the input tokens. As a result, CPE can easily generalize to the input sequences that are longer than what the model has ever seen during training. Besides, CPE can keep the desired translation-invariance in the image classification task, resulting in improved performance. We implement CPE with a simple Position Encoding Generator (PEG) to get seamlessly incorporated into the current Transformer framework. Built on PEG, we present Conditional Position encoding Vision Transformer (CPVT). We demonstrate that CPVT has visually similar attention maps compared to those with learned positional encodings and delivers outperforming results. Our code is available at https://github.com/Meituan-AutoML/CPVT .  ( 2 min )
    Mitigating Dataset Bias by Using Per-sample Gradient. (arXiv:2205.15704v3 [cs.LG] UPDATED)
    The performance of deep neural networks is strongly influenced by the training dataset setup. In particular, when attributes having a strong correlation with the target attribute are present, the trained model can provide unintended prejudgments and show significant inference errors (i.e., the dataset bias problem). Various methods have been proposed to mitigate dataset bias, and their emphasis is on weakly correlated samples, called bias-conflicting samples. These methods are based on explicit bias labels involving human or empirical correlation metrics (e.g., training loss). However, such metrics require human costs or have insufficient theoretical explanation. In this study, we propose a debiasing algorithm, called PGD (Per-sample Gradient-based Debiasing), that comprises three steps: (1) training a model on uniform batch sampling, (2) setting the importance of each sample in proportion to the norm of the sample gradient, and (3) training the model using importance-batch sampling, whose probability is obtained in step (2). Compared with existing baselines for various synthetic and real-world datasets, the proposed method showed state-of-the-art accuracy for a the classification task. Furthermore, we describe theoretical understandings about how PGD can mitigate dataset bias.  ( 2 min )
    On the Computational Efficiency of Adaptive and Dynamic Regret Minimization. (arXiv:2207.00646v4 [cs.LG] UPDATED)
    In online convex optimization, the player aims to minimize regret, or the difference between her loss and that of the best fixed decision in hindsight over the entire repeated game. Algorithms that minimize (standard) regret may converge to a fixed decision, which is undesirable in changing or dynamic environments. This motivates the stronger metrics of performance, notably adaptive and dynamic regret. Adaptive regret is the maximum regret over any continuous sub-interval in time. Dynamic regret is the difference between the total cost and that of the best sequence of decisions in hindsight. State-of-the-art performance in both adaptive and dynamic regret minimization suffers a computational penalty - typically on the order of a multiplicative factor that grows logarithmically in the number of game iterations. In this paper we show how to reduce this computational penalty to be doubly logarithmic in the number of game iterations, and retain near optimal adaptive and dynamic regret bounds.  ( 2 min )
    Towards A Proactive ML Approach for Detecting Backdoor Poison Samples. (arXiv:2205.13616v2 [cs.LG] UPDATED)
    Adversaries can embed backdoors in deep learning models by introducing backdoor poison samples into training datasets. In this work, we investigate how to detect such poison samples to mitigate the threat of backdoor attacks. First, we uncover a post-hoc workflow underlying most prior work, where defenders passively allow the attack to proceed and then leverage the characteristics of the post-attacked model to uncover poison samples. We reveal that this workflow does not fully exploit defenders' capabilities, and defense pipelines built on it are prone to failure or performance degradation in many scenarios. Second, we suggest a paradigm shift by promoting a proactive mindset in which defenders engage proactively with the entire model training and poison detection pipeline, directly enforcing and magnifying distinctive characteristics of the post-attacked model to facilitate poison detection. Based on this, we formulate a unified framework and provide practical insights on designing detection pipelines that are more robust and generalizable. Third, we introduce the technique of Confusion Training (CT) as a concrete instantiation of our framework. CT applies an additional poisoning attack to the already poisoned dataset, actively decoupling benign correlation while exposing backdoor patterns to detection. Empirical evaluations on 4 datasets and 14 types of attacks validate the superiority of CT over 11 baseline defenses.  ( 2 min )
    Policy-Induced Self-Supervision Improves Representation Finetuning in Visual RL. (arXiv:2302.06009v1 [cs.LG])
    We study how to transfer representations pretrained on source tasks to target tasks in visual percept based RL. We analyze two popular approaches: freezing or finetuning the pretrained representations. Empirical studies on a set of popular tasks reveal several properties of pretrained representations. First, finetuning is required even when pretrained representations perfectly capture the information required to solve the target task. Second, finetuned representations improve learnability and are more robust to noise. Third, pretrained bottom layers are task-agnostic and readily transferable to new tasks, while top layers encode task-specific information and require adaptation. Building on these insights, we propose a self-supervised objective that clusters representations according to the policy they induce, as opposed to traditional representation similarity measures which are policy-agnostic (e.g. Euclidean norm, cosine similarity). Together with freezing the bottom layers, this objective results in significantly better representation than frozen, finetuned, and self-supervised alternatives on a wide range of benchmarks.  ( 2 min )
    A Rigorous Framework for the Mean Field Limit of Multilayer Neural Networks. (arXiv:2001.11443v3 [cs.LG] UPDATED)
    We develop a mathematically rigorous framework for multilayer neural networks in the mean field regime. As the network's widths increase, the network's learning trajectory is shown to be well captured by a meaningful and dynamically nonlinear limit (the \textit{mean field} limit), which is characterized by a system of ODEs. Our framework applies to a broad range of network architectures, learning dynamics and network initializations. Central to the framework is the new idea of a \textit{neuronal embedding}, which comprises of a non-evolving probability space that allows to embed neural networks of arbitrary widths. Using our framework, we prove several properties of large-width multilayer neural networks. Firstly we show that independent and identically distributed initializations cause strong degeneracy effects on the network's learning trajectory when the network's depth is at least four. Secondly we obtain several global convergence guarantees for feedforward multilayer networks under a number of different setups. These include two-layer and three-layer networks with independent and identically distributed initializations, and multilayer networks of arbitrary depths with a special type of correlated initializations that is motivated by the new concept of \textit{bidirectional diversity}. Unlike previous works that rely on convexity, our results admit non-convex losses and hinge on a certain universal approximation property, which is a distinctive feature of infinite-width neural networks and is shown to hold throughout the training process. Aside from being the first known results for global convergence of multilayer networks in the mean field regime, they demonstrate flexibility of our framework and incorporate several new ideas and insights that depart from the conventional convex optimization wisdom.  ( 3 min )
    Efficient Fraud Detection using Deep Boosting Decision Trees. (arXiv:2302.05918v1 [stat.ML])
    Fraud detection is to identify, monitor, and prevent potentially fraudulent activities from complex data. The recent development and success in AI, especially machine learning, provides a new data-driven way to deal with fraud. From a methodological point of view, machine learning based fraud detection can be divided into two categories, i.e., conventional methods (decision tree, boosting...) and deep learning, both of which have significant limitations in terms of the lack of representation learning ability for the former and interpretability for the latter. Furthermore, due to the rarity of detected fraud cases, the associated data is usually imbalanced, which seriously degrades the performance of classification algorithms. In this paper, we propose deep boosting decision trees (DBDT), a novel approach for fraud detection based on gradient boosting and neural networks. In order to combine the advantages of both conventional methods and deep learning, we first construct soft decision tree (SDT), a decision tree structured model with neural networks as its nodes, and then ensemble SDTs using the idea of gradient boosting. In this way we embed neural networks into gradient boosting to improve its representation learning capability and meanwhile maintain the interpretability. Furthermore, aiming at the rarity of detected fraud cases, in the model training phase we propose a compositional AUC maximization approach to deal with data imbalances at algorithm level. Extensive experiments on several real-life fraud detection datasets show that DBDT can significantly improve the performance and meanwhile maintain good interpretability. Our code is available at https://github.com/freshmanXB/DBDT.  ( 2 min )
    Plasticity Neural Network Based on Astrocytic effects at Critical Period, Synaptic Competition and Strength Rebalance by Current and Mnemonic Brain Plasticity and Synapse Formation. (arXiv:2203.11740v9 [cs.NE] UPDATED)
    In addition to the weights of synaptic shared connections, PNN includes weights of synaptic effective ranges [14-24]. PNN considers synaptic strength balance in dynamic of phagocytosing of synapses and static of constant sum of synapses length [14], and includes the lead behavior of the school of fish. Synapse formation will inhibit dendrites generation to a certain extent in experiments and PNN simulations [15]. The memory persistence gradient of retrograde circuit similar to the Enforcing Resilience in a Spring Boot. The relatively good and inferior gradient information stored in memory engram cells in synapse formation of retrograde circuit like the folds of the brain [16]. The controversy was claimed if human hippocampal neurogenesis persists throughout aging, PNN considered it may have a new and longer circuit in late iteration [17,18]. Closing the critical period will cause neurological disorder in experiments and PNN simulations [19]. Considering both negative and positive memories persistence help activate synapse length changes with iterations better than only considering positive memory [20]. Astrocytic phagocytosis will avoid the local accumulation of synapses by simulation, Lack of astrocytic phagocytosis causes excitatory synapses and functionally impaired synapses accumulate in experiments and lead to destruction of cognition, but local longer synapses and worse results in PNN simulations [21]. It gives relationship of intelligence and cortical thickness, individual differences in brain [22]. The PNN also considered the memory engram cells that strengthened Synaptic strength [23]. The effects of PNN's memory structure and tPBM may be the same for powerful penetrability of signals [24]. Memory persistence also inhibit local synaptic accumulation. By PNN, it may introduce the relatively good and inferior solution in PSO. The simple PNN only has the synaptic phagocytosis.  ( 3 min )
    Improved Dynamic Regret for Online Frank-Wolfe. (arXiv:2302.05620v1 [cs.LG])
    To deal with non-stationary online problems with complex constraints, we investigate the dynamic regret of online Frank-Wolfe (OFW), which is an efficient projection-free algorithm for online convex optimization. It is well-known that in the setting of offline optimization, the smoothness of functions and the strong convexity of functions accompanying specific properties of constraint sets can be utilized to achieve fast convergence rates for the Frank-Wolfe (FW) algorithm. However, for OFW, previous studies only establish a dynamic regret bound of $O(\sqrt{T}(1+V_T+\sqrt{D_T}))$ by utilizing the convexity of problems, where $T$ is the number of rounds, $V_T$ is the function variation, and $D_T$ is the gradient variation. In this paper, we derive improved dynamic regret bounds for OFW by extending the fast convergence rates of FW from offline optimization to online optimization. The key technique for this extension is to set the step size of OFW with a line search rule. In this way, we first show that the dynamic regret bound of OFW can be improved to $O(\sqrt{T(1+V_T)})$ for smooth functions. Second, we achieve a better dynamic regret bound of $O((1+V_T)^{2/3}T^{1/3})$ when functions are smooth and strongly convex, and the constraint set is strongly convex. Finally, for smooth and strongly convex functions with minimizers in the interior of the constraint set, we demonstrate that the dynamic regret of OFW reduces to $O(1+V_T)$, and can be further strengthened to $O(\min\{P_T^\ast,S_T^\ast,V_T\}+1)$ by performing a constant number of FW iterations per round, where $P_T^\ast$ and $S_T^\ast$ denote the path length and squared path length of minimizers, respectively.  ( 2 min )
    Compositional Exemplars for In-context Learning. (arXiv:2302.05698v1 [cs.CL])
    Large pretrained language models (LMs) have shown impressive In-Context Learning (ICL) ability, where the model learns to do an unseen task via a prompt consisting of input-output examples as the demonstration, without any parameter updates. The performance of ICL is highly dominated by the quality of the selected in-context examples. However, previous selection methods are mostly based on simple heuristics, leading to sub-optimal performance. In this work, we formulate in-context example selection as a subset selection problem. We propose CEIL(Compositional Exemplars for In-context Learning), which is instantiated by Determinantal Point Processes (DPPs) to model the interaction between the given input and in-context examples, and optimized through a carefully-designed contrastive learning objective to obtain preference from LMs. We validate CEIL on 12 classification and generation datasets from 7 distinct NLP tasks, including sentiment analysis, paraphrase detection, natural language inference, commonsense reasoning, open-domain question answering, code generation, and semantic parsing. Extensive experiments demonstrate not only the state-of-the-art performance but also the transferability and compositionality of CEIL, shedding new light on effective and efficient in-context learning. Our code is released at https://github.com/HKUNLP/icl-ceil.  ( 2 min )
    Synaptic Stripping: How Pruning Can Bring Dead Neurons Back To Life. (arXiv:2302.05818v1 [cs.LG])
    Rectified Linear Units (ReLU) are the default choice for activation functions in deep neural networks. While they demonstrate excellent empirical performance, ReLU activations can fall victim to the dead neuron problem. In these cases, the weights feeding into a neuron end up being pushed into a state where the neuron outputs zero for all inputs. Consequently, the gradient is also zero for all inputs, which means that the weights which feed into the neuron cannot update. The neuron is not able to recover from direct back propagation and model capacity is reduced as those parameters can no longer be further optimized. Inspired by a neurological process of the same name, we introduce Synaptic Stripping as a means to combat this dead neuron problem. By automatically removing problematic connections during training, we can regenerate dead neurons and significantly improve model capacity and parametric utilization. Synaptic Stripping is easy to implement and results in sparse networks that are more efficient than the dense networks they are derived from. We conduct several ablation studies to investigate these dynamics as a function of network width and depth and we conduct an exploration of Synaptic Stripping with Vision Transformers on a variety of benchmark datasets.  ( 2 min )
    On Testing and Comparing Fair classifiers under Data Bias. (arXiv:2302.05906v1 [cs.LG])
    In this paper, we consider a theoretical model for injecting data bias, namely, under-representation and label bias (Blum & Stangl, 2019). We theoretically and empirically study its effect on the accuracy and fairness of fair classifiers. Theoretically, we prove that the Bayes optimal group-aware fair classifier on the original data distribution can be recovered by simply minimizing a carefully chosen reweighed loss on the bias-injected distribution. Through extensive experiments on both synthetic and real-world datasets (e.g., Adult, German Credit, Bank Marketing, COMPAS), we empirically audit pre-, in-, and post-processing fair classifiers from standard fairness toolkits for their fairness and accuracy by injecting varying amounts of under-representation and label bias in their training data (but not the test data). Our main observations are: (1) The fairness and accuracy of many standard fair classifiers degrade severely as the bias injected in their training data increases, (2) A simple logistic regression model trained on the right data can often outperform, in both accuracy and fairness, most fair classifiers trained on biased training data, and (3) A few, simple fairness techniques (e.g., reweighing, exponentiated gradients) seem to offer stable accuracy and fairness guarantees even when their training data is injected with under-representation and label bias. Our experiments also show how to integrate a measure of data bias risk in the existing fairness dashboards for real-world deployments  ( 2 min )
    Maneuver Decision-Making For Autonomous Air Combat Through Curriculum Learning And Reinforcement Learning With Sparse Rewards. (arXiv:2302.05838v1 [cs.AI])
    Reinforcement learning is an effective way to solve the decision-making problems. It is a meaningful and valuable direction to investigate autonomous air combat maneuver decision-making method based on reinforcement learning. However, when using reinforcement learning to solve the decision-making problems with sparse rewards, such as air combat maneuver decision-making, it costs too much time for training and the performance of the trained agent may not be satisfactory. In order to solve these problems, the method based on curriculum learning is proposed. First, three curricula of air combat maneuver decision-making are designed: angle curriculum, distance curriculum and hybrid curriculum. These courses are used to train air combat agents respectively, and compared with the original method without any curriculum. The training results show that angle curriculum can increase the speed and stability of training, and improve the performance of the agent; distance curriculum can increase the speed and stability of agent training; hybrid curriculum has a negative impact on training, because it makes the agent get stuck at local optimum. The simulation results show that after training, the agent can handle the situations where targets come from different directions, and the maneuver decision results are consistent with the characteristics of missile.  ( 2 min )
    Out-of-distribution Generalization in the Presence of Nuisance-Induced Spurious Correlations. (arXiv:2107.00520v5 [cs.LG] UPDATED)
    In many prediction problems, spurious correlations are induced by a changing relationship between the label and a nuisance variable that is also correlated with the covariates. For example, in classifying animals in natural images, the background, which is a nuisance, can predict the type of animal. This nuisance-label relationship does not always hold, and the performance of a model trained under one such relationship may be poor on data with a different nuisance-label relationship. To build predictive models that perform well regardless of the nuisance-label relationship, we develop Nuisance-Randomized Distillation (NURD). We introduce the nuisance-randomized distribution, a distribution where the nuisance and the label are independent. Under this distribution, we define the set of representations such that conditioning on any member, the nuisance and the label remain independent. We prove that the representations in this set always perform better than chance, while representations outside of this set may not. NURD finds a representation from this set that is most informative of the label under the nuisance-randomized distribution, and we prove that this representation achieves the highest performance regardless of the nuisance-label relationship. We evaluate NURD on several tasks including chest X-ray classification where, using non-lung patches as the nuisance, NURD produces models that predict pneumonia under strong spurious correlations.  ( 2 min )
    Information-Directed Selection for Top-Two Algorithms. (arXiv:2205.12086v2 [stat.ML] UPDATED)
    We consider the best-k-arm identification problem for multi-armed bandits, where the objective is to select the exact set of k arms with the highest mean rewards by sequentially allocating measurement effort. We characterize the necessary and sufficient conditions for the optimal allocation using dual variables. Remarkably these optimality conditions lead to the extension of top-two algorithm design principle (Russo, 2020), initially proposed for best-arm identification. Furthermore, our optimality conditions induce a simple and effective selection rule dubbed information-directed selection (IDS) that selects one of the top-two candidates based on a measure of information gain. As a theoretical guarantee, we prove that integrated with IDS, top-two Thompson sampling is (asymptotically) optimal for Gaussian best-arm identification, solving a glaring open problem in the pure exploration literature (Russo, 2020). As a by-product, we show that for k > 1, top-two algorithms cannot achieve optimality even with an oracle tuning parameter. Numerical experiments show the superior performance of the proposed top-two algorithms with IDS and considerable improvement compared with algorithms without adaptive selection.  ( 2 min )
    DIWIFT: Discovering Instance-wise Influential Features for Tabular Data. (arXiv:2207.02773v2 [cs.LG] UPDATED)
    Tabular data is one of the most common data storage formats behind many real-world web applications such as retail, banking, and e-commerce. The success of these web applications largely depends on the ability of the employed machine learning model to accurately distinguish influential features from all the predetermined features in tabular data. Intuitively, in practical business scenarios, different instances should correspond to different sets of influential features, and the set of influential features of the same instance may vary in different scenarios. However, most existing methods focus on global feature selection assuming that all instances have the same set of influential features, and few methods considering instance-wise feature selection ignore the variability of influential features in different scenarios. In this paper, we first introduce a new perspective based on the influence function for instance-wise feature selection, and give some corresponding theoretical insights, the core of which is to use the influence function as an indicator to measure the importance of an instance-wise feature. We then propose a new solution for discovering instance-wise influential features in tabular data (DIWIFT), where a self-attention network is used as a feature selection model and the value of the corresponding influence function is used as an optimization objective to guide the model. Benefiting from the advantage of the influence function, i.e., its computation does not depend on a specific architecture and can also take into account the data distribution in different scenarios, our DIWIFT has better flexibility and robustness. Finally, we conduct extensive experiments on both synthetic and real-world datasets to validate the effectiveness of our DIWIFT.  ( 2 min )
    SafeLight: A Reinforcement Learning Method toward Collision-free Traffic Signal Control. (arXiv:2211.10871v2 [cs.LG] UPDATED)
    Traffic signal control is safety-critical for our daily life. Roughly one-quarter of road accidents in the U.S. happen at intersections due to problematic signal timing, urging the development of safety-oriented intersection control. However, existing studies on adaptive traffic signal control using reinforcement learning technologies have focused mainly on minimizing traffic delay but neglecting the potential exposure to unsafe conditions. We, for the first time, incorporate road safety standards as enforcement to ensure the safety of existing reinforcement learning methods, aiming toward operating intersections with zero collisions. We have proposed a safety-enhanced residual reinforcement learning method (SafeLight) and employed multiple optimization techniques, such as multi-objective loss function and reward shaping for better knowledge integration. Extensive experiments are conducted using both synthetic and real-world benchmark datasets. Results show that our method can significantly reduce collisions while increasing traffic mobility.  ( 2 min )
    Collaboration-Aware Graph Convolutional Network for Recommender Systems. (arXiv:2207.06221v3 [cs.IR] UPDATED)
    Graph Neural Networks (GNNs) have been successfully adopted in recommender systems by virtue of the message-passing that implicitly captures collaborative effect. Nevertheless, most of the existing message-passing mechanisms for recommendation are directly inherited from GNNs without scrutinizing whether the captured collaborative effect would benefit the prediction of user preferences. In this paper, we first analyze how message-passing captures the collaborative effect and propose a recommendation-oriented topological metric, Common Interacted Ratio (CIR), which measures the level of interaction between a specific neighbor of a node with the rest of its neighbors. After demonstrating the benefits of leveraging collaborations from neighbors with higher CIR, we propose a recommendation-tailored GNN, Collaboration-Aware Graph Convolutional Network (CAGCN), that goes beyond 1-Weisfeiler-Lehman(1-WL) test in distinguishing non-bipartite-subgraph-isomorphic graphs. Experiments on six benchmark datasets show that the best CAGCN variant outperforms the most representative GNN-based recommendation model, LightGCN, by nearly 10\% in Recall@20 and also achieves around 80\% speedup. Our code is publicly available at https://github.com/YuWVandy/CAGCN.  ( 2 min )
    Vector Quantized Wasserstein Auto-Encoder. (arXiv:2302.05917v1 [cs.LG])
    Learning deep discrete latent presentations offers a promise of better symbolic and summarized abstractions that are more useful to subsequent downstream tasks. Inspired by the seminal Vector Quantized Variational Auto-Encoder (VQ-VAE), most of work in learning deep discrete representations has mainly focused on improving the original VQ-VAE form and none of them has studied learning deep discrete representations from the generative viewpoint. In this work, we study learning deep discrete representations from the generative viewpoint. Specifically, we endow discrete distributions over sequences of codewords and learn a deterministic decoder that transports the distribution over the sequences of codewords to the data distribution via minimizing a WS distance between them. We develop further theories to connect it with the clustering viewpoint of WS distance, allowing us to have a better and more controllable clustering solution. Finally, we empirically evaluate our method on several well-known benchmarks, where it achieves better qualitative and quantitative performances than the other VQ-VAE variants in terms of the codebook utilization and image reconstruction/generation.  ( 2 min )
    A Reparameterized Discrete Diffusion Model for Text Generation. (arXiv:2302.05737v1 [cs.CL])
    This work studies discrete diffusion probabilistic models with applications to natural language generation. We derive an alternative yet equivalent formulation of the sampling from discrete diffusion processes and leverage this insight to develop a family of reparameterized discrete diffusion models. The derived generic framework is highly flexible, offers a fresh perspective of the generation process in discrete diffusion models, and features more effective training and decoding techniques. We conduct extensive experiments to evaluate the text generation capability of our model, demonstrating significant improvements over existing diffusion models.  ( 2 min )
    Regret Guarantees for Adversarial Online Collaborative Filtering. (arXiv:2302.05765v1 [cs.LG])
    We investigate the problem of online collaborative filtering under no-repetition constraints, whereby users need to be served content in an online fashion and a given user cannot be recommended the same content item more than once. We design and analyze a fully adaptive algorithm that works under biclustering assumptions on the user-item preference matrix, and show that this algorithm exhibits an optimal regret guarantee, while being oblivious to any prior knowledge about the sequence of users, the universe of items, as well as the biclustering parameters of the preference matrix. We further propose a more robust version of the algorithm which addresses the scenario when the preference matrix is adversarially perturbed. We then give regret guarantees that scale with the amount by which the preference matrix is perturbed from a biclustered structure. To our knowledge, these are the first results on online collaborative filtering that hold at this level of generality and adaptivity under no-repetition constraints.  ( 2 min )
    The NLP Task Effectiveness of Long-Range Transformers. (arXiv:2202.07856v2 [cs.CL] UPDATED)
    Transformer models cannot easily scale to long sequences due to their O(N^2) time and space complexity. This has led to Transformer variants seeking to lower computational complexity, such as Longformer and Performer. While such models have theoretically greater efficiency, their effectiveness on real NLP tasks has not been well studied. We benchmark 7 variants of Transformer models on 5 difficult NLP tasks and 7 datasets. We design experiments to isolate the effect of pretraining and hyperparameter settings, to focus on their capacity for long-range attention. Moreover, we present various methods to investigate attention behaviors to illuminate model details beyond metric scores. We find that the modified attention in long-range transformers has advantages on content selection and query-guided decoding, but they come with previously unrecognized drawbacks such as insufficient attention to distant tokens and accumulated approximation error.  ( 2 min )
    Informing clinical assessment by contextualizing post-hoc explanations of risk prediction models in type-2 diabetes. (arXiv:2302.05752v1 [cs.LG])
    Medical experts may use Artificial Intelligence (AI) systems with greater trust if these are supported by contextual explanations that let the practitioner connect system inferences to their context of use. However, their importance in improving model usage and understanding has not been extensively studied. Hence, we consider a comorbidity risk prediction scenario and focus on contexts regarding the patients clinical state, AI predictions about their risk of complications, and algorithmic explanations supporting the predictions. We explore how relevant information for such dimensions can be extracted from Medical guidelines to answer typical questions from clinical practitioners. We identify this as a question answering (QA) task and employ several state-of-the-art LLMs to present contexts around risk prediction model inferences and evaluate their acceptability. Finally, we study the benefits of contextual explanations by building an end-to-end AI pipeline including data cohorting, AI risk modeling, post-hoc model explanations, and prototyped a visual dashboard to present the combined insights from different context dimensions and data sources, while predicting and identifying the drivers of risk of Chronic Kidney Disease - a common type-2 diabetes comorbidity. All of these steps were performed in engagement with medical experts, including a final evaluation of the dashboard results by an expert medical panel. We show that LLMs, in particular BERT and SciBERT, can be readily deployed to extract some relevant explanations to support clinical usage. To understand the value-add of the contextual explanations, the expert panel evaluated these regarding actionable insights in the relevant clinical setting. Overall, our paper is one of the first end-to-end analyses identifying the feasibility and benefits of contextual explanations in a real-world clinical use case.  ( 3 min )
    A Comparison Study of Deep CNN Architecture in Detecting of Pneumonia. (arXiv:2212.14744v2 [eess.IV] UPDATED)
    Pneumonia, a respiratory infection brought on by bacteria or viruses, affects a large number of people, especially in developing and impoverished countries where high levels of pollution, unclean living conditions, and overcrowding are frequently observed, along with insufficient medical infrastructure. Pleural effusion, a condition in which fluids fill the lung and complicate breathing, is brought on by pneumonia. Early detection of pneumonia is essential for ensuring curative care and boosting survival rates. The approach most usually used to diagnose pneumonia is chest X-ray imaging. The purpose of this work is to develop a method for the automatic diagnosis of bacterial and viral pneumonia in digital x-ray pictures. This article first presents the authors' technique, and then gives a comprehensive report on recent developments in the field of reliable diagnosis of pneumonia. In this study, here tuned a state-of-the-art deep convolutional neural network to classify plant diseases based on images and tested its performance. Deep learning architecture is compared empirically. VGG19, ResNet with 152v2, Resnext101, Seresnet152, Mobilenettv2, and DenseNet with 201 layers are among the architectures tested. Experiment data consists of two groups, sick and healthy X-ray pictures. To take appropriate action against plant diseases as soon as possible, rapid disease identification models are preferred. DenseNet201 has shown no overfitting or performance degradation in our experiments, and its accuracy tends to increase as the number of epochs increases. Further, DenseNet201 achieves state-of-the-art performance with a significantly a smaller number of parameters and within a reasonable computing time. This architecture outperforms the competition in terms of testing accuracy, scoring 95%. Each architecture was trained using Keras, using Theano as the backend.  ( 3 min )
    Multi-dimensional discrimination in Law and Machine Learning -- A comparative overview. (arXiv:2302.05995v1 [cs.LG])
    AI-driven decision-making can lead to discrimination against certain individuals or social groups based on protected characteristics/attributes such as race, gender, or age. The domain of fairness-aware machine learning focuses on methods and algorithms for understanding, mitigating, and accounting for bias in AI/ML models. Still, thus far, the vast majority of the proposed methods assess fairness based on a single protected attribute, e.g. only gender or race. In reality, though, human identities are multi-dimensional, and discrimination can occur based on more than one protected characteristic, leading to the so-called ``multi-dimensional discrimination'' or ``multi-dimensional fairness'' problem. While well-elaborated in legal literature, the multi-dimensionality of discrimination is less explored in the machine learning community. Recent approaches in this direction mainly follow the so-called intersectional fairness definition from the legal domain, whereas other notions like additive and sequential discrimination are less studied or not considered thus far. In this work, we overview the different definitions of multi-dimensional discrimination/fairness in the legal domain as well as how they have been transferred/ operationalized (if) in the fairness-aware machine learning domain. By juxtaposing these two domains, we draw the connections, identify the limitations, and point out open research directions.  ( 2 min )
    Event-Triggered Time-Varying Bayesian Optimization. (arXiv:2208.10790v2 [cs.LG] UPDATED)
    We consider the problem of sequentially optimizing a time-varying objective function using time-varying Bayesian optimization (TVBO). Here, the key challenge is the exploration-exploitation trade-off under time variations. Current approaches to TVBO require prior knowledge of a constant rate of change. However, the rate of change is usually neither known nor constant. We propose an event-triggered algorithm, ET-GP-UCB, that treats the optimization problem as static until it detects changes in the objective function online and then resets the dataset. This allows the algorithm to adapt to realized temporal changes without the need for prior knowledge. The event-trigger is based on probabilistic uniform error bounds used in Gaussian process regression. We provide regret bounds for ET-GP-UCB and show in numerical experiments that it is competitive with state-of-the-art algorithms even though it requires no knowledge about the temporal changes. Further, ET-GP-UCB outperforms these baselines if the rate of change is misspecified, and we demonstrate that it is readily applicable to various settings without tuning hyperparameters.  ( 2 min )
    Interpretable Diversity Analysis: Visualizing Feature Representations In Low-Cost Ensembles. (arXiv:2302.05822v1 [cs.LG])
    Diversity is an important consideration in the construction of robust neural network ensembles. A collection of well trained models will generalize better if they are diverse in the patterns they respond to and the predictions they make. Diversity is especially important for low-cost ensemble methods because members often share network structure in order to avoid training several independent models from scratch. Diversity is traditionally analyzed by measuring differences between the outputs of models. However, this gives little insight into how knowledge representations differ between ensemble members. This paper introduces several interpretability methods that can be used to qualitatively analyze diversity. We demonstrate these techniques by comparing the diversity of feature representations between child networks using two low-cost ensemble algorithms, Snapshot Ensembles and Prune and Tune Ensembles. We use the same pre-trained parent network as a starting point for both methods which allows us to explore how feature representations evolve over time. This approach to diversity analysis can lead to valuable insights and new perspectives for how we measure and promote diversity in ensemble methods.  ( 2 min )
    A Characterization of Multioutput Learnability. (arXiv:2301.02729v2 [cs.LG] UPDATED)
    We consider the problem of learning multioutput function classes in batch and online settings. In both settings, we show that a multioutput function class is learnable if and only if each single-output restriction of the function class is learnable. This provides a complete characterization of the learnability of multilabel classification and multioutput regression in both batch and online settings. As an extension, we also consider multilabel learnability in the bandit feedback setting and show a similar characterization as in the full-feedback setting.  ( 2 min )
    Chaotic Hedging with Iterated Integrals and Neural Networks. (arXiv:2209.10166v2 [q-fin.MF] UPDATED)
    In this paper, we extend the Wiener-Ito chaos decomposition to the class of diffusion processes, whose drift and diffusion coefficient are of linear growth. By omitting the orthogonality in the chaos expansion, we are able to show that every $p$-integrable functional, for $p \in [1,\infty)$, can be represented as sum of iterated integrals of the underlying process. Using a truncated sum of this expansion and (possibly random) neural networks for the integrands, whose parameters are learned in a machine learning setting, we show that every financial derivative can be approximated arbitrarily well in the $L^p$-sense. Since the hedging strategy of the approximating option can be computed in closed form, we obtain an efficient algorithm that can replicate any integrable financial derivative with short runtime.  ( 2 min )
    Communication and Storage Efficient Federated Split Learning. (arXiv:2302.05599v1 [cs.IT])
    Federated learning (FL) is a popular distributed machine learning (ML) paradigm, but is often limited by significant communication costs and edge device computation capabilities. Federated Split Learning (FSL) preserves the parallel model training principle of FL, with a reduced device computation requirement thanks to splitting the ML model between the server and clients. However, FSL still incurs very high communication overhead due to transmitting the smashed data and gradients between the clients and the server in each global round. Furthermore, the server has to maintain separate models for every client, resulting in a significant computation and storage requirement that grows linearly with the number of clients. This paper tries to solve these two issues by proposing a communication and storage efficient federated and split learning (CSE-FSL) strategy, which utilizes an auxiliary network to locally update the client models while keeping only a single model at the server, hence avoiding the communication of gradients from the server and greatly reducing the server resource requirement. Communication cost is further reduced by only sending the smashed data in selected epochs from the clients. We provide a rigorous theoretical analysis of CSE-FSL that guarantees its convergence for non-convex loss functions. Extensive experimental results demonstrate that CSE-FSL has a significant communication reduction over existing FSL techniques while achieving state-of-the-art convergence and model accuracy, using several real-world FL tasks.  ( 2 min )
    Sequential Embedding-based Attentive (SEA) classifier for malware classification. (arXiv:2302.05728v1 [cs.CR])
    The tremendous growth in smart devices has uplifted several security threats. One of the most prominent threats is malicious software also known as malware. Malware has the capability of corrupting a device and collapsing an entire network. Therefore, its early detection and mitigation are extremely important to avoid catastrophic effects. In this work, we came up with a solution for malware detection using state-of-the-art natural language processing (NLP) techniques. Our main focus is to provide a lightweight yet effective classifier for malware detection which can be used for heterogeneous devices, be it a resource constraint device or a resourceful machine. Our proposed model is tested on the benchmark data set with an accuracy and log loss score of 99.13 percent and 0.04 respectively.  ( 2 min )
    Variants of SGD for Lipschitz Continuous Loss Functions in Low-Precision Environments. (arXiv:2211.04655v3 [math.OC] UPDATED)
    Motivated by neural network training in low-bit floating and fixed-point environments, this work studies the convergence of variants of SGD with computational error. Considering a general stochastic Lipschitz continuous loss function, a novel convergence result to a Clarke stationary point is presented assuming that only an approximation of its stochastic gradient can be computed as well as error in computing the SGD step itself. Different variants of SGD are then tested empirically in a variety of low-precision arithmetic environments, where improved test set accuracy is observed compared to SGD for two image recognition tasks.  ( 2 min )
    Direct Uncertainty Quantification. (arXiv:2302.02420v2 [cs.LG] UPDATED)
    Traditional neural networks are simple to train but they produce overconfident predictions, while Bayesian neural networks provide good uncertainty quantification but optimizing them is time consuming. This paper introduces a new approach, direct uncertainty quantification (DirectUQ), that combines their advantages where the neural network directly models uncertainty in output space, and captures both aleatoric and epistemic uncertainty. DirectUQ can be derived as an alternative variational lower bound, and hence benefits from collapsed variational inference that provides improved regularizers. On the other hand, like non-probabilistic models, DirectUQ enjoys simple training and one can use Rademacher complexity to provide risk bounds for the model. Experiments show that DirectUQ and ensembles of DirectUQ provide a good tradeoff in terms of run time and uncertainty quantification, especially for out of distribution data.  ( 2 min )
    Hierarchical Stochastic Block Model for Community Detection in Multiplex Networks. (arXiv:1904.05330v3 [cs.SI] UPDATED)
    Multiplex networks have become increasingly more prevalent in many fields, and have emerged as a powerful tool for modeling the complexity of real networks. There is a critical need for developing inference models for multiplex networks that can take into account potential dependencies across different layers, particularly when the aim is community detection. We add to a limited literature by proposing a novel and efficient Bayesian model for community detection in multiplex networks. A key feature of our approach is the ability to model varying communities at different network layers. In contrast, many existing models assume the same communities for all layers. Moreover, our model automatically picks up the necessary number of communities at each layer (as validated by real data examples). This is appealing, since deciding the number of communities is a challenging aspect of community detection, and especially so in the multiplex setting, if one allows the communities to change across layers. Borrowing ideas from hierarchical Bayesian modeling, we use a hierarchical Dirichlet prior to model community labels across layers, allowing dependency in their structure. Given the community labels, a stochastic block model (SBM) is assumed for each layer. We develop an efficient slice sampler for sampling the posterior distribution of the community labels as well as the link probabilities between communities. In doing so, we address some unique challenges posed by coupling the complex likelihood of SBM with the hierarchical nature of the prior on the labels. An extensive empirical validation is performed on simulated and real data, demonstrating the superior performance of the model over single-layer alternatives, as well as the ability to uncover interesting structures in real networks.  ( 3 min )
    Quantum Neuron Selection: Finding High Performing Subnetworks With Quantum Algorithms. (arXiv:2302.05984v1 [cs.LG])
    Gradient descent methods have long been the de facto standard for training deep neural networks. Millions of training samples are fed into models with billions of parameters, which are slowly updated over hundreds of epochs. Recently, it's been shown that large, randomly initialized neural networks contain subnetworks that perform as well as fully trained models. This insight offers a promising avenue for training future neural networks by simply pruning weights from large, random models. However, this problem is combinatorically hard and classical algorithms are not efficient at finding the best subnetwork. In this paper, we explore how quantum algorithms could be formulated and applied to this neuron selection problem. We introduce several methods for local quantum neuron selection that reduce the entanglement complexity that large scale neuron selection would require, making this problem more tractable for current quantum hardware.  ( 2 min )
    A Survey on Spectral Graph Neural Networks. (arXiv:2302.05631v1 [cs.LG])
    Graph neural networks (GNNs) have attracted considerable attention from the research community. It is well established that GNNs are usually roughly divided into spatial and spectral methods. Despite that spectral GNNs play an important role in both graph signal processing and graph representation learning, existing studies are biased toward spatial approaches, and there is no comprehensive review on spectral GNNs so far. In this paper, we summarize the recent development of spectral GNNs, including model, theory, and application. Specifically, we first discuss the connection between spatial GNNs and spectral GNNs, which shows that spectral GNNs can capture global information and have better expressiveness and interpretability. Next, we categorize existing spectral GNNs according to the spectrum information they use, \ie, eigenvalues or eigenvectors. In addition, we review major theoretical results and applications of spectral GNNs, followed by a quantitative experiment to benchmark some popular spectral GNNs. Finally, we conclude the paper with some future directions.  ( 2 min )
    Element-Wise Attention Layers: an option for optimization. (arXiv:2302.05488v1 [cs.LG])
    The use of Attention Layers has become a trend since the popularization of the Transformer-based models, being the key element for many state-of-the-art models that have been developed through recent years. However, one of the biggest obstacles in implementing these architectures - as well as many others in Deep Learning Field - is the enormous amount of optimizing parameters they possess, which make its use conditioned on the availability of robust hardware. In this paper, it's proposed a new method of attention mechanism that adapts the Dot-Product Attention, which uses matrices multiplications, to become element-wise through the use of arrays multiplications. To test the effectiveness of such approach, two models (one with a VGG-like architecture and one with the proposed method) have been trained in a classification task using Fashion MNIST and CIFAR10 datasets. Each model has been trained for 10 epochs in a single Tesla T4 GPU from Google Colaboratory. The results show that this mechanism allows for an accuracy of 92% of the VGG-like counterpart in Fashion MNIST dataset, while reducing the number of parameters in 97%. For CIFAR10, the accuracy is still equivalent to 60% of the VGG-like counterpart while using 50% less parameters.  ( 2 min )
    Long-Context Language Decision Transformers and Exponential Tilt for Interactive Text Environments. (arXiv:2302.05507v1 [cs.CL])
    Text-based game environments are challenging because agents must deal with long sequences of text, execute compositional actions using text and learn from sparse rewards. We address these challenges by proposing Long-Context Language Decision Transformers (LLDTs), a framework that is based on long transformer language models and decision transformers (DTs). LLDTs extend DTs with 3 components: (1) exponential tilt to guide the agent towards high obtainable goals, (2) novel goal conditioning methods yielding significantly better results than the traditional return-to-go (sum of all future rewards), and (3) a model of future observations. Our ablation results show that predicting future observations improves agent performance. To the best of our knowledge, LLDTs are the first to address offline RL with DTs on these challenging games. Our experiments show that LLDTs achieve the highest scores among many different types of agents on some of the most challenging Jericho games, such as Enchanter.  ( 2 min )
    Pruning Deep Neural Networks from a Sparsity Perspective. (arXiv:2302.05601v1 [cs.LG])
    In recent years, deep network pruning has attracted significant attention in order to enable the rapid deployment of AI into small devices with computation and memory constraints. Pruning is often achieved by dropping redundant weights, neurons, or layers of a deep network while attempting to retain a comparable test performance. Many deep pruning algorithms have been proposed with impressive empirical success. However, existing approaches lack a quantifiable measure to estimate the compressibility of a sub-network during each pruning iteration and thus may under-prune or over-prune the model. In this work, we propose PQ Index (PQI) to measure the potential compressibility of deep neural networks and use this to develop a Sparsity-informed Adaptive Pruning (SAP) algorithm. Our extensive experiments corroborate the hypothesis that for a generic pruning procedure, PQI decreases first when a large model is being effectively regularized and then increases when its compressibility reaches a limit that appears to correspond to the beginning of underfitting. Subsequently, PQI decreases again when the model collapse and significant deterioration in the performance of the model start to occur. Additionally, our experiments demonstrate that the proposed adaptive pruning algorithm with proper choice of hyper-parameters is superior to the iterative pruning algorithms such as the lottery ticket-based pruning methods, in terms of both compression efficiency and robustness.  ( 2 min )
    On the equivalence between graph isomorphism testing and function approximation with GNNs. (arXiv:1905.12560v2 [cs.LG] UPDATED)
    Graph Neural Networks (GNNs) have achieved much success on graph-structured data. In light of this, there have been increasing interests in studying their expressive power. One line of work studies the capability of GNNs to approximate permutation-invariant functions on graphs, and another focuses on the their power as tests for graph isomorphism. Our work connects these two perspectives and proves their equivalence. We further develop a framework of the expressive power of GNNs that incorporates both of these viewpoints using the language of sigma-algebra, through which we compare the expressive power of different types of GNNs together with other graph isomorphism tests. In particular, we prove that the second-order Invariant Graph Network fails to distinguish non-isomorphic regular graphs with the same degree. Then, we extend it to a new architecture, Ring-GNN, which succeeds in distinguishing these graphs and achieves good performances on real-world datasets.  ( 2 min )
    Novel techniques for improving NNetEn entropy calculation for short and noisy time series. (arXiv:2202.12703v2 [cs.LG] UPDATED)
    Entropy is a fundamental concept in the field of information theory. During measurement, conventional entropy measures are susceptible to length and amplitude changes in time series. A new entropy metric, neural network entropy (NNetEn), has been developed to overcome these limitations. NNetEn entropy is computed using a modified LogNNet neural network classification model. The algorithm contains a reservoir matrix of N=19625 elements that must be filled with the given data. The contribution of this paper is threefold. Firstly, this work investigates different methods of filling the reservoir with time series (signal) elements. The reservoir filling method determines the accuracy of the entropy estimation by convolution of the study time series and LogNNet test data. The present study proposes 6 methods for filling the reservoir for time series. Two of them (Method 3 and Method 6) employ the novel approach of stretching the time series to create intermediate elements that complement it, but do not change its dynamics. The most reliable methods for short time series are Method 3 and Method 5. The second part of the study examines the influence of noise and constant bias on entropy values. Our study examines three different time series data types (chaotic, periodic, and binary) with different dynamic properties, Signal to Noise Ratio (SNR), and offsets. The NNetEn entropy calculation errors are less than 10% when SNR is greater than 30 dB, and entropy decreases with an increase in the bias component. The third part of the article analyzes real-time biosignal EEG data collected from emotion recognition experiments. The NNetEn measures show robustness under low-amplitude noise using various filters. Thus, NNetEn measures entropy effectively when applied to real-world environments with ambient noise, white noise, and 1/f noise.  ( 3 min )
    LIMEtree: Consistent and Faithful Surrogate Explanations of Multiple Classes. (arXiv:2005.01427v2 [cs.LG] UPDATED)
    Explainable machine learning provides tools to better understand predictive models and their decisions, but many such methods are limited to producing insights with respect to a single class. When generating explanations for several classes, reasoning over them to obtain a complete view may be difficult since they can present competing or contradictory evidence. To address this issue we introduce a novel paradigm of multi-class explanations. We outline the theory behind such techniques and propose a local surrogate model based on multi-output regression trees -- called LIMEtree -- which offers faithful and consistent explanations of multiple classes for individual predictions while being post-hoc, model-agnostic and data-universal. In addition to strong fidelity guarantees, our implementation supports (interactive) customisation of the explanatory insights and delivers a range of diverse explanation types, including counterfactual statements favoured in the literature. We evaluate our algorithm with a collection of quantitative experiments, a qualitative analysis based on explainability desiderata and a preliminary user study on an image classification task, comparing it to LIME. Our contributions demonstrate the benefits of multi-class explanations and wide-ranging advantages of our method across a diverse set scenarios.  ( 2 min )
    On Proper Learnability between Average- and Worst-case Robustness. (arXiv:2211.05656v4 [cs.LG] UPDATED)
    Recently, \cite{montasser2019vc} showed that finite VC dimension is not sufficient for \textit{proper} adversarially robust PAC learning. In light of this hardness result, there is a growing effort to study what type of relaxations to the adversarially robust PAC learning setup can enable proper learnability. In this work, we initiate the study of proper learning under relaxations of the worst-case robust loss. We give a family of robust loss relaxations under which VC classes are properly PAC learning with sample complexity close to what one would require in the standard PAC learning setup. On the other hand, we show that for an existing and natural relaxation of the worst-case robust loss, finite VC dimension is not sufficient for proper learning. Lastly, we give new generalization guarantees for the adversarially robust empirical risk minimizer.  ( 2 min )
    NASRec: Weight Sharing Neural Architecture Search for Recommender Systems. (arXiv:2207.07187v2 [cs.IR] UPDATED)
    The rise of deep neural networks offers new opportunities in optimizing recommender systems. However, optimizing recommender systems using deep neural networks requires delicate architecture fabrication. We propose NASRec, a paradigm that trains a single supernet and efficiently produces abundant models/sub-architectures by weight sharing. To overcome the data multi-modality and architecture heterogeneity challenges in the recommendation domain, NASRec establishes a large supernet (i.e., search space) to search the full architectures. The supernet incorporates versatile choice of operators and dense connectivity to minimize human efforts for finding priors. The scale and heterogeneity in NASRec impose several challenges, such as training inefficiency, operator-imbalance, and degraded rank correlation. We tackle these challenges by proposing single-operator any-connection sampling, operator-balancing interaction modules, and post-training fine-tuning. Our crafted models, NASRecNet, show promising results on three Click-Through Rates (CTR) prediction benchmarks, indicating that NASRec outperforms both manually designed models and existing NAS methods with state-of-the-art performance. Our work is publicly available at https://github.com/facebookresearch/NasRec.  ( 2 min )
    Structure-aware Protein Self-supervised Learning. (arXiv:2204.04213v3 [cs.LG] UPDATED)
    Protein representation learning methods have shown great potential to yield useful representation for many downstream tasks, especially on protein classification. Moreover, a few recent studies have shown great promise in addressing insufficient labels of proteins with self-supervised learning methods. However, existing protein language models are usually pretrained on protein sequences without considering the important protein structural information. To this end, we propose a novel structure-aware protein self-supervised learning method to effectively capture structural information of proteins. In particular, a well-designed graph neural network (GNN) model is pretrained to preserve the protein structural information with self-supervised tasks from a pairwise residue distance perspective and a dihedral angle perspective, respectively. Furthermore, we propose to leverage the available protein language model pretrained on protein sequences to enhance the self-supervised learning. Specifically, we identify the relation between the sequential information in the protein language model and the structural information in the specially designed GNN model via a novel pseudo bi-level optimization scheme. Experiments on several supervised downstream tasks verify the effectiveness of our proposed method.  ( 2 min )
    Jaccard Metric Losses: Optimizing the Jaccard Index with Soft Labels. (arXiv:2302.05666v1 [cs.CV])
    IoU losses are surrogates that directly optimize the Jaccard index. In semantic segmentation, IoU losses are shown to perform better with respect to the Jaccard index measure than pixel-wise losses such as the cross-entropy loss. The most notable IoU losses are the soft Jaccard loss and the Lovasz-Softmax loss. However, these losses are incompatible with soft labels which are ubiquitous in machine learning. In this paper, we propose Jaccard metric losses (JMLs), which are variants of the soft Jaccard loss, and are compatible with soft labels. With JMLs, we study two of the most popular use cases of soft labels: label smoothing and knowledge distillation. With a variety of architectures, our experiments show significant improvements over the cross-entropy loss on three semantic segmentation datasets (Cityscapes, PASCAL VOC and DeepGlobe Land), and our simple approach outperforms state-of-the-art knowledge distillation methods by a large margin. Our source code is available at: \href{https://github.com/zifuwanggg/JDML}{https://github.com/zifuwanggg/JDML}.  ( 2 min )
    Generating Counterfactual Hard Negative Samples for Graph Contrastive Learning. (arXiv:2207.00148v2 [cs.LG] UPDATED)
    Graph contrastive learning has emerged as a powerful tool for unsupervised graph representation learning. The key to the success of graph contrastive learning is to acquire high-quality positive and negative samples as contrasting pairs for the purpose of learning underlying structural semantics of the input graph. Recent works usually sample negative samples from the same training batch with the positive samples, or from an external irrelevant graph. However, a significant limitation lies in such strategies, which is the unavoidable problem of sampling false negative samples. In this paper, we propose a novel method to utilize \textbf{C}ounterfactual mechanism to generate artificial hard negative samples for \textbf{G}raph \textbf{C}ontrastive learning, namely \textbf{CGC}, which has a different perspective compared to those sampling-based strategies. We utilize counterfactual mechanism to produce hard negative samples, which ensures that the generated samples are similar to, but have labels that different from the positive sample. The proposed method achieves satisfying results on several datasets compared to some traditional unsupervised graph learning methods and some SOTA graph contrastive learning methods. We also conduct some supplementary experiments to give an extensive illustration of the proposed method, including the performances of CGC with different hard negative samples and evaluations for hard negative samples generated with different similarity measurements.  ( 2 min )
    De-Biasing Generative Models using Counterfactual Methods. (arXiv:2207.01575v3 [cs.LG] UPDATED)
    Variational autoencoders (VAEs) and other generative methods have garnered growing interest not just for their generative properties but also for the ability to dis-entangle a low-dimensional latent variable space. However, few existing generative models take causality into account. We propose a new decoder based framework named the Causal Counterfactual Generative Model (CCGM), which includes a partially trainable causal layer in which a part of a causal model can be learned without significantly impacting reconstruction fidelity. By learning the causal relationships between image semantic labels or tabular variables, we can analyze biases, intervene on the generative model, and simulate new scenarios. Furthermore, by modifying the causal structure, we can generate samples outside the domain of the original training data and use such counterfactual models to de-bias datasets. Thus, datasets with known biases can still be used to train the causal generative model and learn the causal relationships, but we can produce de-biased datasets on the generative side. Our proposed method combines a causal latent space VAE model with specific modification to emphasize causal fidelity, enabling finer control over the causal layer and the ability to learn a robust intervention framework. We explore how better disentanglement of causal learning and encoding/decoding generates higher causal intervention quality. We also compare our model against similar research to demonstrate the need for explicit generative de-biasing beyond interventions. Our initial experiments show that our model can generate images and tabular data with high fidelity to the causal framework and accommodate explicit de-biasing to ignore undesired relationships in the causal data compared to the baseline.  ( 2 min )
    Which Invariance Should We Transfer? A Causal Minimax Learning Approach. (arXiv:2107.01876v3 [stat.ML] UPDATED)
    A major barrier to deploying current machine learning models lies in their non-reliability to dataset shifts. To resolve this problem, most existing studies attempted to transfer stable information to unseen environments. Particularly, independent causal mechanisms-based methods proposed to remove mutable causal mechanisms via the do-operator. Compared to previous methods, the obtained stable predictors are more effective in identifying stable information. However, a key question remains: which subset of this whole stable information should the model transfer, in order to achieve optimal generalization ability? To answer this question, we present a comprehensive minimax analysis from a causal perspective. Specifically, we first provide a graphical condition for the whole stable set to be optimal. When this condition fails, we surprisingly find with an example that this whole stable set, although can fully exploit stable information, is not the optimal one to transfer. To identify the optimal subset under this case, we propose to estimate the worst-case risk with a novel optimization scheme over the intervention functions on mutable causal mechanisms. We then propose an efficient algorithm to search for the subset with minimal worst-case risk, based on a newly defined equivalence relation between stable subsets. Compared to the exponential cost of exhaustively searching over all subsets, our searching strategy enjoys a polynomial complexity. The effectiveness and efficiency of our methods are demonstrated on synthetic data and the diagnosis of Alzheimer's disease.  ( 2 min )
    Locating disparities in machine learning. (arXiv:2208.06680v2 [cs.LG] UPDATED)
    Machine learning was repeatedly proven to provide predictions with disparate outcomes, in which subgroups of the population (e.g., defined by age, gender, or other sensitive attributes) are systematically disadvantaged. Previous literature has focused on detecting such disparities through statistical procedures for when the sensitive attribute is specified a priori. However, this limits applicability in real-world settings where datasets are high dimensional and, on top of that, sensitive attributes may be unknown. As a remedy, we propose a data-driven framework called Automatic Location of Disparities (ALD) which aims at locating disparities in machine learning. ALD meets several demands from machine learning practice: ALD (1) is applicable to arbitrary machine learning classifiers; (2) operates on different definitions of disparities (e.g., statistical parity or equalized odds); (3) deals with both categorical and continuous predictors; (4) is suitable to handle high-dimensional settings; and (5) even identifies disparities due to intersectionality where disparities arise from complex and multi-way interactions (e.g., age above 60 and female). ALD produces interpretable fairness reports as output. We demonstrate the effectiveness of ALD based on both synthetic and real-world datasets. As a result, ALD helps practitioners and researchers of algorithmic fairness to detect disparities in machine learning algorithms, so that disparate -- or even unfair -- outcomes can be mitigated. Moreover, ALD supports practitioners in conducting algorithmic audits and protecting individuals from discrimination.  ( 2 min )
    Scaling Laws for a Multi-Agent Reinforcement Learning Model. (arXiv:2210.00849v2 [cs.LG] UPDATED)
    The recent observation of neural power-law scaling relations has made a significant impact in the field of deep learning. A substantial amount of attention has been dedicated as a consequence to the description of scaling laws, although mostly for supervised learning and only to a reduced extent for reinforcement learning frameworks. In this paper we present an extensive study of performance scaling for a cornerstone reinforcement learning algorithm, AlphaZero. On the basis of a relationship between Elo rating, playing strength and power-law scaling, we train AlphaZero agents on the games Connect Four and Pentago and analyze their performance. We find that player strength scales as a power law in neural network parameter count when not bottlenecked by available compute, and as a power of compute when training optimally sized agents. We observe nearly identical scaling exponents for both games. Combining the two observed scaling laws we obtain a power law relating optimal size to compute similar to the ones observed for language models. We find that the predicted scaling of optimal neural network size fits our data for both games. This scaling law implies that previously published state-of-the-art game-playing models are significantly smaller than their optimal size, given the respective compute budgets. We also show that large AlphaZero models are more sample efficient, performing better than smaller models with the same amount of training data.  ( 2 min )
    Transfer Learning for Bayesian Optimization: A Survey. (arXiv:2302.05927v1 [cs.LG])
    A wide spectrum of design and decision problems, including parameter tuning, A/B testing and drug design, intrinsically are instances of black-box optimization. Bayesian optimization (BO) is a powerful tool that models and optimizes such expensive "black-box" functions. However, at the beginning of optimization, vanilla Bayesian optimization methods often suffer from slow convergence issue due to inaccurate modeling based on few trials. To address this issue, researchers in the BO community propose to incorporate the spirit of transfer learning to accelerate optimization process, which could borrow strength from the past tasks (source tasks) to accelerate the current optimization problem (target task). This survey paper first summarizes transfer learning methods for Bayesian optimization from four perspectives: initial points design, search space design, surrogate model, and acquisition function. Then it highlights its methodological aspects and technical details for each approach. Finally, it showcases a wide range of applications and proposes promising future directions.  ( 2 min )
    I$^2$SB: Image-to-Image Schr\"odinger Bridge. (arXiv:2302.05872v1 [cs.CV])
    We propose Image-to-Image Schr\"odinger Bridge (I$^2$SB), a new class of conditional diffusion models that directly learn the nonlinear diffusion processes between two given distributions. These diffusion bridges are particularly useful for image restoration, as the degraded images are structurally informative priors for reconstructing the clean images. I$^2$SB belongs to a tractable class of Schr\"odinger bridge, the nonlinear extension to score-based models, whose marginal distributions can be computed analytically given boundary pairs. This results in a simulation-free framework for nonlinear diffusions, where the I$^2$SB training becomes scalable by adopting practical techniques used in standard diffusion models. We validate I$^2$SB in solving various image restoration tasks, including inpainting, super-resolution, deblurring, and JPEG restoration on ImageNet 256x256 and show that I$^2$SB surpasses standard conditional diffusion models with more interpretable generative processes. Moreover, I$^2$SB matches the performance of inverse methods that additionally require the knowledge of the corruption operators. Our work opens up new algorithmic opportunities for developing efficient nonlinear diffusion models on a large scale. scale. Project page: https://i2sb.github.io/  ( 2 min )
    Cyclic and Randomized Stepsizes Invoke Heavier Tails in SGD. (arXiv:2302.05516v1 [stat.ML])
    Cyclic and randomized stepsizes are widely used in the deep learning practice and can often outperform standard stepsize choices such as constant stepsize in SGD. Despite their empirical success, not much is currently known about when and why they can theoretically improve the generalization performance. We consider a general class of Markovian stepsizes for learning, which contain i.i.d. random stepsize, cyclic stepsize as well as the constant stepsize as special cases, and motivated by the literature which shows that heaviness of the tails (measured by the so-called "tail-index") in the SGD iterates is correlated with generalization, we study tail-index and provide a number of theoretical results that demonstrate how the tail-index varies on the stepsize scheduling. Our results bring a new understanding of the benefits of cyclic and randomized stepsizes compared to constant stepsize in terms of the tail behavior. We illustrate our theory on linear regression experiments and show through deep learning experiments that Markovian stepsizes can achieve even a heavier tail and be a viable alternative to cyclic and i.i.d. randomized stepsize rules.  ( 2 min )
    A High-dimensional Convergence Theorem for U-statistics with Applications to Kernel-based Testing. (arXiv:2302.05686v1 [math.ST])
    We prove a convergence theorem for U-statistics of degree two, where the data dimension $d$ is allowed to scale with sample size $n$. We find that the limiting distribution of a U-statistic undergoes a phase transition from the non-degenerate Gaussian limit to the degenerate limit, regardless of its degeneracy and depending only on a moment ratio. A surprising consequence is that a non-degenerate U-statistic in high dimensions can have a non-Gaussian limit with a larger variance and asymmetric distribution. Our bounds are valid for any finite $n$ and $d$, independent of individual eigenvalues of the underlying function, and dimension-independent under a mild assumption. As an application, we apply our theory to two popular kernel-based distribution tests, MMD and KSD, whose high-dimensional performance has been challenging to study. In a simple empirical setting, our results correctly predict how the test power at a fixed threshold scales with $d$ and the bandwidth.  ( 2 min )
    Verifying Generalization in Deep Learning. (arXiv:2302.05745v1 [cs.LG])
    Deep neural networks (DNNs) are the workhorses of deep learning, which constitutes the state of the art in numerous application domains. However, DNN-based decision rules are notoriously prone to poor generalization, i.e., may prove inadequate on inputs not encountered during training. This limitation poses a significant obstacle to employing deep learning for mission-critical tasks, and also in real-world environments that exhibit high variability. We propose a novel, verification-driven methodology for identifying DNN-based decision rules that generalize well to new input domains. Our approach quantifies generalization to an input domain by the extent to which decisions reached by independently trained DNNs are in agreement for inputs in this domain. We show how, by harnessing the power of DNN verification, our approach can be efficiently and effectively realized. We evaluate our verification-based approach on three deep reinforcement learning (DRL) benchmarks, including a system for real-world Internet congestion control. Our results establish the usefulness of our approach, and, in particular, its superiority over gradient-based methods. More broadly, our work puts forth a novel objective for formal verification, with the potential for mitigating the risks associated with deploying DNN-based systems in the wild.  ( 2 min )
    NephroNet: A Novel Program for Identifying Renal Cell Carcinoma and Generating Synthetic Training Images with Convolutional Neural Networks and Diffusion Models. (arXiv:2302.05830v1 [eess.IV])
    Renal cell carcinoma (RCC) is a type of cancer that originates in the kidneys and is the most common type of kidney cancer in adults. It can be classified into several subtypes, including clear cell RCC, papillary RCC, and chromophobe RCC. In this study, an artificial intelligence model was developed and trained for classifying different subtypes of RCC using ResNet-18, a convolutional neural network that has been widely used for image classification tasks. The model was trained on a dataset of RCC histopathology images, which consisted of digital images of RCC surgical resection slides that were annotated with the corresponding subtype labels. The performance of the trained model was evaluated using several metrics, including accuracy, precision, and recall. Additionally, in this research, a novel synthetic image generation tool, NephroNet, is developed on diffusion models that are used to generate original images of RCC surgical resection slides. Diffusion models are a class of generative models capable of synthesizing high-quality images from noise. Several diffusers such as Stable Diffusion, Dreambooth Text-to-Image, and Textual Inversion were trained on a dataset of RCC images and were used to generate a series of original images that resembled RCC surgical resection slides, all within the span of fewer than four seconds. The generated images were visually realistic and could be used for creating new training datasets, testing the performance of image analysis algorithms, and training medical professionals. NephroNet is provided as an open-source software package and contains files for data preprocessing, training, and visualization. Overall, this study demonstrates the potential of artificial intelligence and diffusion models for classifying and generating RCC images, respectively. These methods could be useful for improving the diagnosis and treatment of RCC and more.  ( 3 min )
    ConCerNet: A Contrastive Learning Based Framework for Automated Conservation Law Discovery and Trustworthy Dynamical System Prediction. (arXiv:2302.05783v1 [cs.LG])
    Deep neural networks (DNN) have shown great capacity of modeling a dynamical system; nevertheless, they usually do not obey physics constraints such as conservation laws. This paper proposes a new learning framework named ConCerNet to improve the trustworthiness of the DNN based dynamics modeling to endow the invariant properties. ConCerNet consists of two steps: (i) a contrastive learning method to automatically capture the system invariants (i.e. conservation properties) along the trajectory observations; (ii) a neural projection layer to guarantee that the learned dynamics models preserve the learned invariants. We theoretically prove the functional relationship between the learned latent representation and the unknown system invariant function. Experiments show that our method consistently outperforms the baseline neural networks in both coordinate error and conservation metrics by a large margin. With neural network based parameterization and no dependence on prior knowledge, our method can be extended to complex and large-scale dynamics by leveraging an autoencoder.  ( 2 min )
    Neural Architecture Search with Multimodal Fusion Methods for Diagnosing Dementia. (arXiv:2302.05894v1 [cs.LG])
    Alzheimer's dementia (AD) affects memory, thinking, and language, deteriorating person's life. An early diagnosis is very important as it enables the person to receive medical help and ensure quality of life. Therefore, leveraging spontaneous speech in conjunction with machine learning methods for recognizing AD patients has emerged into a hot topic. Most of the previous works employ Convolutional Neural Networks (CNNs), to process the input signal. However, finding a CNN architecture is a time-consuming process and requires domain expertise. Moreover, the researchers introduce early and late fusion approaches for fusing different modalities or concatenate the representations of the different modalities during training, thus the inter-modal interactions are not captured. To tackle these limitations, first we exploit a Neural Architecture Search (NAS) method to automatically find a high performing CNN architecture. Next, we exploit several fusion methods, including Multimodal Factorized Bilinear Pooling and Tucker Decomposition, to combine both speech and text modalities. To the best of our knowledge, there is no prior work exploiting a NAS approach and these fusion methods in the task of dementia detection from spontaneous speech. We perform extensive experiments on the ADReSS Challenge dataset and show the effectiveness of our approach over state-of-the-art methods.  ( 2 min )
    Sequential Underspecified Instrument Selection for Cause-Effect Estimation. (arXiv:2302.05684v1 [stat.ME])
    Instrumental variable (IV) methods are used to estimate causal effects in settings with unobserved confounding, where we cannot directly experiment on the treatment variable. Instruments are variables which only affect the outcome indirectly via the treatment variable(s). Most IV applications focus on low-dimensional treatments and crucially require at least as many instruments as treatments. This assumption is restrictive: in the natural sciences we often seek to infer causal effects of high-dimensional treatments (e.g., the effect of gene expressions or microbiota on health and disease), but can only run few experiments with a limited number of instruments (e.g., drugs or antibiotics). In such underspecified problems, the full treatment effect is not identifiable in a single experiment even in the linear case. We show that one can still reliably recover the projection of the treatment effect onto the instrumented subspace and develop techniques to consistently combine such partial estimates from different sets of instruments. We then leverage our combined estimators in an algorithm that iteratively proposes the most informative instruments at each round of experimentation to maximize the overall information about the full causal effect.  ( 2 min )
    Distributional GFlowNets with Quantile Flows. (arXiv:2302.05793v1 [cs.LG])
    Generative Flow Networks (GFlowNets) are a new family of probabilistic samplers where an agent learns a stochastic policy for generating complex combinatorial structure through a series of decision-making steps. Despite being inspired from reinforcement learning, the current GFlowNet framework is relatively limited in its applicability and cannot handle stochasticity in the reward function. In this work, we adopt a distributional paradigm for GFlowNets, turning each flow function into a distribution, thus providing more informative learning signals during training. By parameterizing each edge flow through their quantile functions, our proposed \textit{quantile matching} GFlowNet learning algorithm is able to learn a risk-sensitive policy, an essential component for handling scenarios with risk uncertainty. Moreover, we find that the distributional approach can achieve substantial improvement on existing benchmarks compared to prior methods due to our enhanced training algorithm, even in settings with deterministic rewards.  ( 2 min )
    A Policy Gradient Framework for Stochastic Optimal Control Problems with Global Convergence Guarantee. (arXiv:2302.05816v1 [math.OC])
    In this work, we consider the stochastic optimal control problem in continuous time and a policy gradient method to solve it. In particular, we study the gradient flow for the control, viewed as a continuous time limit of the policy gradient. We prove the global convergence of the gradient flow and establish a convergence rate under some regularity assumptions. The main novelty in the analysis is the notion of local optimal control function, which is introduced to compare the local optimality of the iterate.  ( 2 min )
    Variational Voxel Pseudo Image Tracking. (arXiv:2302.05914v1 [cs.CV])
    Uncertainty estimation is an important task for critical problems, such as robotics and autonomous driving, because it allows creating statistically better perception models and signaling the model's certainty in its predictions to the decision method or a human supervisor. In this paper, we propose a Variational Neural Network-based version of a Voxel Pseudo Image Tracking (VPIT) method for 3D Single Object Tracking. The Variational Feature Generation Network of the proposed Variational VPIT computes features for target and search regions and the corresponding uncertainties, which are later combined using an uncertainty-aware cross-correlation module in one of two ways: by computing similarity between the corresponding uncertainties and adding it to the regular cross-correlation values, or by penalizing the uncertain feature channels to increase influence of the certain features. In experiments, we show that both methods improve tracking performance, while penalization of uncertain features provides the best uncertainty quality.  ( 2 min )
    Operation-level Progressive Differentiable Architecture Search. (arXiv:2302.05632v1 [cs.CV])
    Differentiable Neural Architecture Search (DARTS) is becoming more and more popular among Neural Architecture Search (NAS) methods because of its high search efficiency and low compute cost. However, the stability of DARTS is very inferior, especially skip connections aggregation that leads to performance collapse. Though existing methods leverage Hessian eigenvalues to alleviate skip connections aggregation, they make DARTS unable to explore architectures with better performance. In the paper, we propose operation-level progressive differentiable neural architecture search (OPP-DARTS) to avoid skip connections aggregation and explore better architectures simultaneously. We first divide the search process into several stages during the search phase and increase candidate operations into the search space progressively at the beginning of each stage. It can effectively alleviate the unfair competition between operations during the search phase of DARTS by offsetting the inherent unfair advantage of the skip connection over other operations. Besides, to keep the competition between operations relatively fair and select the operation from the candidate operations set that makes training loss of the supernet largest. The experiment results indicate that our method is effective and efficient. Our method's performance on CIFAR-10 is superior to the architecture found by standard DARTS, and the transferability of our method also surpasses standard DARTS. We further demonstrate the robustness of our method on three simple search spaces, i.e., S2, S3, S4, and the results show us that our method is more robust than standard DARTS. Our code is available at https://github.com/zxunyu/OPP-DARTS.  ( 2 min )
    Towards Multi-User Activity Recognition through Facilitated Training Data and Deep Learning for Human-Robot Collaboration Applications. (arXiv:2302.05763v1 [cs.LG])
    Human-robot interaction (HRI) research is progressively addressing multi-party scenarios, where a robot interacts with more than one human user at the same time. Conversely, research is still at an early stage for human-robot collaboration (HRC). The use of machine learning techniques to handle such type of collaboration requires data that are less feasible to produce than in a typical HRC setup. This work outlines concepts of design of concurrent tasks for non-dyadic HRC applications. Based upon these concepts, this study also proposes an alternative way of gathering data regarding multiuser activity, by collecting data related to single subjects and merging them in post-processing, to reduce the effort involved in producing recordings of pair settings. To validate this statement, 3D skeleton poses of activity of single subjects were collected and merged in pairs. After this, the datapoints were used to separately train a long short-term memory (LSTM) network and a variational autoencoder (VAE) composed of spatio-temporal graph convolutional networks (STGCN) to recognise the joint activities of the pairs of people. The results showed that it is possible to make use of data collected in this way for pair HRC settings and get similar performances compared to using data regarding groups of users recorded under the same settings, relieving from the technical difficulties involved in producing these data.  ( 2 min )
    UGAE: A Novel Approach to Non-exponential Discounting. (arXiv:2302.05740v1 [cs.LG])
    The discounting mechanism in Reinforcement Learning determines the relative importance of future and present rewards. While exponential discounting is widely used in practice, non-exponential discounting methods that align with human behavior are often desirable for creating human-like agents. However, non-exponential discounting methods cannot be directly applied in modern on-policy actor-critic algorithms. To address this issue, we propose Universal Generalized Advantage Estimation (UGAE), which allows for the computation of GAE advantage values with arbitrary discounting. Additionally, we introduce Beta-weighted discounting, a continuous interpolation between exponential and hyperbolic discounting, to increase flexibility in choosing a discounting method. To showcase the utility of UGAE, we provide an analysis of the properties of various discounting methods. We also show experimentally that agents with non-exponential discounting trained via UGAE outperform variants trained with Monte Carlo advantage estimation. Through analysis of various discounting methods and experiments, we demonstrate the superior performance of UGAE with Beta-weighted discounting over the Monte Carlo baseline on standard RL benchmarks. UGAE is simple and easily integrated into any advantage-based algorithm as a replacement for the standard recursive GAE.  ( 2 min )
    Theory on Forgetting and Generalization of Continual Learning. (arXiv:2302.05836v1 [cs.LG])
    Continual learning (CL), which aims to learn a sequence of tasks, has attracted significant recent attention. However, most work has focused on the experimental performance of CL, and theoretical studies of CL are still limited. In particular, there is a lack of understanding on what factors are important and how they affect "catastrophic forgetting" and generalization performance. To fill this gap, our theoretical analysis, under overparameterized linear models, provides the first-known explicit form of the expected forgetting and generalization error. Further analysis of such a key result yields a number of theoretical explanations about how overparameterization, task similarity, and task ordering affect both forgetting and generalization error of CL. More interestingly, by conducting experiments on real datasets using deep neural networks (DNNs), we show that some of these insights even go beyond the linear models and can be carried over to practical setups. In particular, we use concrete examples to show that our results not only explain some interesting empirical observations in recent studies, but also motivate better practical algorithm designs of CL.  ( 2 min )
    Position Matters! Empirical Study of Order Effect in Knowledge-grounded Dialogue. (arXiv:2302.05888v1 [cs.CL])
    With the power of large pretrained language models, various research works have integrated knowledge into dialogue systems. The traditional techniques treat knowledge as part of the input sequence for the dialogue system, prepending a set of knowledge statements in front of dialogue history. However, such a mechanism forces knowledge sets to be concatenated in an ordered manner, making models implicitly pay imbalanced attention to the sets during training. In this paper, we first investigate how the order of the knowledge set can influence autoregressive dialogue systems' responses. We conduct experiments on two commonly used dialogue datasets with two types of transformer-based models and find that models view the input knowledge unequally. To this end, we propose a simple and novel technique to alleviate the order effect by modifying the position embeddings of knowledge input in these models. With the proposed position embedding method, the experimental results show that each knowledge statement is uniformly considered to generate responses.  ( 2 min )
    Cross-Modal Fine-Tuning: Align then Refine. (arXiv:2302.05738v1 [cs.LG])
    Fine-tuning large-scale pretrained models has led to tremendous progress in well-studied modalities such as vision and NLP. However, similar gains have not been observed in many other modalities due to a lack of relevant pretrained models. In this work, we propose ORCA, a general cross-modal fine-tuning framework that extends the applicability of a single large-scale pretrained model to diverse modalities. ORCA adapts to a target task via an align-then-refine workflow: given the target input, ORCA first learns an embedding network that aligns the embedded feature distribution with the pretraining modality. The pretrained model is then fine-tuned on the embedded data to exploit the knowledge shared across modalities. Through extensive experiments, we show that ORCA obtains state-of-the-art results on 3 benchmarks containing over 60 datasets from 12 modalities, outperforming a wide range of hand-designed, AutoML, general-purpose, and task-specific methods. We highlight the importance of data alignment via a series of ablation studies and demonstrate ORCA's utility in data-limited regimes.  ( 2 min )
    Vertical Federated Knowledge Transfer via Representation Distillation for Healthcare Collaboration Networks. (arXiv:2302.05675v1 [cs.LG])
    Collaboration between healthcare institutions can significantly lessen the imbalance in medical resources across various geographic areas. However, directly sharing diagnostic information between institutions is typically not permitted due to the protection of patients' highly sensitive privacy. As a novel privacy-preserving machine learning paradigm, federated learning (FL) makes it possible to maximize the data utility among multiple medical institutions. These feature-enrichment FL techniques are referred to as vertical FL (VFL). Traditional VFL can only benefit multi-parties' shared samples, which strongly restricts its application scope. In order to improve the information-sharing capability and innovation of various healthcare-related institutions, and then to establish a next-generation open medical collaboration network, we propose a unified framework for vertical federated knowledge transfer mechanism (VFedTrans) based on a novel cross-hospital representation distillation component. Specifically, our framework includes three steps. First, shared samples' federated representations are extracted by collaboratively modeling multi-parties' joint features with current efficient vertical federated representation learning methods. Second, for each hospital, we learn a local-representation-distilled module, which can transfer the knowledge from shared samples' federated representations to enrich local samples' representations. Finally, each hospital can leverage local samples' representations enriched by the distillation module to boost arbitrary downstream machine learning tasks. The experiments on real-life medical datasets verify the knowledge transfer effectiveness of our framework.  ( 2 min )
    Tighter PAC-Bayes Bounds Through Coin-Betting. (arXiv:2302.05829v1 [cs.LG])
    We consider the problem of estimating the mean of a sequence of random elements $f(X_1, \theta)$ $, \ldots, $ $f(X_n, \theta)$ where $f$ is a fixed scalar function, $S=(X_1, \ldots, X_n)$ are independent random variables, and $\theta$ is a possibly $S$-dependent parameter. An example of such a problem would be to estimate the generalization error of a neural network trained on $n$ examples where $f$ is a loss function. Classically, this problem is approached through concentration inequalities holding uniformly over compact parameter sets of functions $f$, for example as in Rademacher or VC type analysis. However, in many problems, such inequalities often yield numerically vacuous estimates. Recently, the \emph{PAC-Bayes} framework has been proposed as a better alternative for this class of problems for its ability to often give numerically non-vacuous bounds. In this paper, we show that we can do even better: we show how to refine the proof strategy of the PAC-Bayes bounds and achieve \emph{even tighter} guarantees. Our approach is based on the \emph{coin-betting} framework that derives the numerically tightest known time-uniform concentration inequalities from the regret guarantees of online gambling algorithms. In particular, we derive the first PAC-Bayes concentration inequality based on the coin-betting approach that holds simultaneously for all sample sizes. We demonstrate its tightness showing that by \emph{relaxing} it we obtain a number of previous results in a closed form including Bernoulli-KL and empirical Bernstein inequalities. Finally, we propose an efficient algorithm to numerically calculate confidence sequences from our bound, which often generates nonvacuous confidence bounds even with one sample, unlike the state-of-the-art PAC-Bayes bounds.  ( 2 min )
    Learning by Applying: A General Framework for Mathematical Reasoning via Enhancing Explicit Knowledge Learning. (arXiv:2302.05717v1 [cs.AI])
    Mathematical reasoning is one of the crucial abilities of general artificial intelligence, which requires machines to master mathematical logic and knowledge from solving problems. However, existing approaches are not transparent (thus not interpretable) in terms of what knowledge has been learned and applied in the reasoning process. In this paper, we propose a general Learning by Applying (LeAp) framework to enhance existing models (backbones) in a principled way by explicit knowledge learning. In LeAp, we perform knowledge learning in a novel problem-knowledge-expression paradigm, with a Knowledge Encoder to acquire knowledge from problem data and a Knowledge Decoder to apply knowledge for expression reasoning. The learned mathematical knowledge, including word-word relations and word-operator relations, forms an explicit knowledge graph, which bridges the knowledge "learning" and "applying" organically. Moreover, for problem solving, we design a semantics-enhanced module and a reasoning-enhanced module that apply knowledge to improve the problem comprehension and symbol reasoning abilities of any backbone, respectively. We theoretically prove the superiority of LeAp's autonomous learning mechanism. Experiments on three real-world datasets show that LeAp improves all backbones' performances, learns accurate knowledge, and achieves a more interpretable reasoning process.  ( 2 min )
    Graph Neural Network-Inspired Kernels for Gaussian Processes in Semi-Supervised Learning. (arXiv:2302.05828v1 [cs.LG])
    Gaussian processes (GPs) are an attractive class of machine learning models because of their simplicity and flexibility as building blocks of more complex Bayesian models. Meanwhile, graph neural networks (GNNs) emerged recently as a promising class of models for graph-structured data in semi-supervised learning and beyond. Their competitive performance is often attributed to a proper capturing of the graph inductive bias. In this work, we introduce this inductive bias into GPs to improve their predictive performance for graph-structured data. We show that a prominent example of GNNs, the graph convolutional network, is equivalent to some GP when its layers are infinitely wide; and we analyze the kernel universality and the limiting behavior in depth. We further present a programmable procedure to compose covariance kernels inspired by this equivalence and derive example kernels corresponding to several interesting members of the GNN family. We also propose a computationally efficient approximation of the covariance matrix for scalable posterior inference with large-scale data. We demonstrate that these graph-based kernels lead to competitive classification and regression performance, as well as advantages in computation time, compared with the respective GNNs.  ( 2 min )
    Pushing the Accuracy-Group Robustness Frontier with Introspective Self-play. (arXiv:2302.05807v1 [cs.LG])
    Standard empirical risk minimization (ERM) training can produce deep neural network (DNN) models that are accurate on average but under-perform in under-represented population subgroups, especially when there are imbalanced group distributions in the long-tailed training data. Therefore, approaches that improve the accuracy-group robustness trade-off frontier of a DNN model (i.e. improving worst-group accuracy without sacrificing average accuracy, or vice versa) is of crucial importance. Uncertainty-based active learning (AL) can potentially improve the frontier by preferentially sampling underrepresented subgroups to create a more balanced training dataset. However, the quality of uncertainty estimates from modern DNNs tend to degrade in the presence of spurious correlations and dataset bias, compromising the effectiveness of AL for sampling tail groups. In this work, we propose Introspective Self-play (ISP), a simple approach to improve the uncertainty estimation of a deep neural network under dataset bias, by adding an auxiliary introspection task requiring a model to predict the bias for each data point in addition to the label. We show that ISP provably improves the bias-awareness of the model representation and the resulting uncertainty estimates. On two real-world tabular and language tasks, ISP serves as a simple "plug-in" for AL model training, consistently improving both the tail-group sampling rate and the final accuracy-fairness trade-off frontier of popular AL methods.  ( 2 min )
    Global Convergence Rate of Deep Equilibrium Models with General Activations. (arXiv:2302.05797v1 [stat.ML])
    In a recent paper, Ling et al. investigated the over-parametrized Deep Equilibrium Model (DEQ) with ReLU activation and proved that the gradient descent converges to a globally optimal solution at a linear convergence rate for the quadratic loss function. In this paper, we show that this fact still holds for DEQs with any general activation which has bounded first and second derivatives. Since the new activation function is generally non-linear, a general population Gram matrix is designed, and a new form of dual activation with Hermite polynomial expansion is developed.  ( 2 min )
    Multi-class Brain Tumor Segmentation using Graph Attention Network. (arXiv:2302.05598v1 [eess.IV])
    Brain tumor segmentation from magnetic resonance imaging (MRI) plays an important role in diagnostic radiology. To overcome the practical issues in manual approaches, there is a huge demand for building automatic tumor segmentation algorithms. This work introduces an efficient brain tumor summation model by exploiting the advancement in MRI and graph neural networks (GNNs). The model represents the volumetric MRI as a region adjacency graph (RAG) and learns to identify the type of tumors through a graph attention network (GAT) -- a variant of GNNs. The ablation analysis conducted on two benchmark datasets proves that the proposed model can produce competitive results compared to the leading-edge solutions. It achieves mean dice scores of 0.91, 0.86, 0.79, and mean Hausdorff distances in the 95th percentile (HD95) of 5.91, 6.08, and 9.52 mm, respectively, for whole tumor, core tumor, and enhancing tumor segmentation on BraTS2021 validation dataset. On average, these performances are >6\% and >50%, compared to a GNN-based baseline model, respectively, on dice score and HD95 evaluation metrics.  ( 2 min )
    CILP: Co-simulation based Imitation Learner for Dynamic Resource Provisioning in Cloud Computing Environments. (arXiv:2302.05630v1 [eess.SY])
    Intelligent Virtual Machine (VM) provisioning is central to cost and resource efficient computation in cloud computing environments. As bootstrapping VMs is time-consuming, a key challenge for latency-critical tasks is to predict future workload demands to provision VMs proactively. However, existing AI-based solutions \blue{tend to not holistically consider} all crucial aspects such as provisioning overheads, heterogeneous VM costs and Quality of Service (QoS) of the cloud system. To address this, we propose a novel method, called CILP, that formulates the VM provisioning problem as two sub-problems of prediction and optimization, where the provisioning plan is optimized based on predicted workload demands. CILP leverages a neural network as a surrogate model to predict future workload demands with a co-simulated digital-twin of the infrastructure to compute QoS scores. We extend the neural network to also act as an imitation learner that dynamically decides the optimal VM provisioning plan. A transformer based neural model reduces training and inference overheads while our novel two-phase decision making loop facilitates in making informed provisioning decisions. Crucially, we address limitations of prior work by including resource utilization, deployment costs and provisioning overheads to inform the provisioning decisions in our imitation learning framework. Experiments with three public benchmarks demonstrate that CILP gives up to 22% higher resource utilization, 14% higher QoS scores and 44% lower execution costs compared to the current online and offline optimization based state-of-the-art methods.  ( 2 min )
    Fairness-aware Multi-view Clustering. (arXiv:2302.05788v1 [cs.LG])
    In the era of big data, we are often facing the challenge of data heterogeneity and the lack of label information simultaneously. In the financial domain (e.g., fraud detection), the heterogeneous data may include not only numerical data (e.g., total debt and yearly income), but also text and images (e.g., financial statement and invoice images). At the same time, the label information (e.g., fraud transactions) may be missing for building predictive models. To address these challenges, many state-of-the-art multi-view clustering methods have been proposed and achieved outstanding performance. However, these methods typically do not take into consideration the fairness aspect and are likely to generate biased results using sensitive information such as race and gender. Therefore, in this paper, we propose a fairness-aware multi-view clustering method named FairMVC. It incorporates the group fairness constraint into the soft membership assignment for each cluster to ensure that the fraction of different groups in each cluster is approximately identical to the entire data set. Meanwhile, we adopt the idea of both contrastive learning and non-contrastive learning and propose novel regularizers to handle heterogeneous data in complex scenarios with missing data or noisy features. Experimental results on real-world data sets demonstrate the effectiveness and efficiency of the proposed framework. We also derive insights regarding the relative performance of the proposed regularizers in various scenarios.  ( 2 min )
    Stochastic Surprisal: An inferential measurement of Free Energy in Neural Networks. (arXiv:2302.05776v1 [cs.LG])
    This paper conjectures and validates a framework that allows for action during inference in supervised neural networks. Supervised neural networks are constructed with the objective to maximize their performance metric in any given task. This is done by reducing free energy and its associated surprisal during training. However, the bottom-up inference nature of supervised networks is a passive process that renders them fallible to noise. In this paper, we provide a thorough background of supervised neural networks, both generative and discriminative, and discuss their functionality from the perspective of free energy principle. We then provide a framework for introducing action during inference. We introduce a new measurement called stochastic surprisal that is a function of the network, the input, and any possible action. This action can be any one of the outputs that the neural network has learnt, thereby lending stochasticity to the measurement. Stochastic surprisal is validated on two applications: Image Quality Assessment and Recognition under noisy conditions. We show that, while noise characteristics are ignored to make robust recognition, they are analyzed to estimate image quality scores. We apply stochastic surprisal on two applications, three datasets, and as a plug-in on twelve networks. In all, it provides a statistically significant increase among all measures. We conclude by discussing the implications of the proposed stochastic surprisal in other areas of cognitive psychology including expectancy-mismatch and abductive reasoning.  ( 2 min )
    Emotion Detection From Social Media Posts. (arXiv:2302.05610v1 [cs.LG])
    Over the last few years, social media has evolved into a medium for expressing personal views, emotions, and even business and political proposals, recommendations, and advertisements. We address the topic of identifying emotions from text data obtained from social media posts like Twitter in this research. We have deployed different traditional machine learning techniques such as Support Vector Machines (SVM), Naive Bayes, Decision Trees, and Random Forest, as well as deep neural network models such as LSTM, CNN, GRU, BiLSTM, BiGRU to classify these tweets into four emotion categories (Fear, Anger, Joy, and Sadness). Furthermore, we have constructed a BiLSTM and BiGRU ensemble model. The evaluation result shows that the deep neural network models(BiGRU, to be specific) produce the most promising results compared to traditional machine learning models, with an 87.53 % accuracy rate. The ensemble model performs even better (87.66 %), albeit the difference is not significant. This result will aid in the development of a decision-making tool that visualizes emotional fluctuations.  ( 2 min )
    Predicting municipalities in financial distress: a machine learning approach enhanced by domain expertise. (arXiv:2302.05780v1 [cs.LG])
    Financial distress of municipalities, although comparable to bankruptcy of private companies, has a far more serious impact on the well-being of communities. For this reason, it is essential to detect deficits as soon as possible. Predicting financial distress in municipalities can be a complex task, as it involves understanding a wide range of factors that can affect a municipality's financial health. In this paper, we evaluate machine learning models to predict financial distress in Italian municipalities. Accounting judiciary experts have specialized knowledge and experience in evaluating the financial performance of municipalities, and they use a range of financial and general indicators to make their assessments. By incorporating these indicators in the feature extraction process, we can ensure that the predictive model is taking into account a wide range of information that is relevant to the financial health of municipalities. The results of this study indicate that using machine learning models in combination with the knowledge of accounting judiciary experts can aid in the early detection of financial distress in municipalities, leading to better outcomes for the communities they serve.  ( 2 min )
    MSDC: Exploiting Multi-State Power Consumption in Non-intrusive Load Monitoring based on A Dual-CNN Model. (arXiv:2302.05565v1 [cs.LG])
    Non-intrusive load monitoring (NILM) aims to decompose aggregated electrical usage signal into appliance-specific power consumption and it amounts to a classical example of blind source separation tasks. Leveraging recent progress on deep learning techniques, we design a new neural NILM model Multi-State Dual CNN (MSDC). Different from previous models, MSDC explicitly extracts information about the appliance's multiple states and state transitions, which in turn regulates the prediction of signals for appliances. More specifically, we employ a dual-CNN architecture: one CNN for outputting state distributions and the other for predicting the power of each state. A new technique is invented that utilizes conditional random fields (CRF) to capture state transitions. Experiments on two real-world datasets REDD and UK-DALE demonstrate that our model significantly outperform state-of-the-art models while having good generalization capacity, achieving 6%-10% MAE gain and 33%-51% SAE gain to unseen appliances.  ( 2 min )
    Improving Differentiable Architecture Search via Self-Distillation. (arXiv:2302.05629v1 [cs.CV])
    Differentiable Architecture Search (DARTS) is a simple yet efficient Neural Architecture Search (NAS) method. During the search stage, DARTS trains a supernet by jointly optimizing architecture parameters and network parameters. During the evaluation stage, DARTS derives the optimal architecture based on architecture parameters. However, the loss landscape of the supernet is not smooth, and it results in a performance gap between the supernet and the optimal architecture. In the paper, we propose Self-Distillation Differentiable Neural Architecture Search (SD-DARTS) by utilizing self-distillation to transfer knowledge of the supernet in previous steps to guide the training of the supernet in the current steps. SD-DARTS can minimize the loss difference for the two consecutive iterations so that minimize the sharpness of the supernet's loss to bridge the performance gap between the supernet and the optimal architecture. Furthermore, we propose voted teachers, which select multiple previous supernets as teachers and vote teacher output probabilities as the final teacher prediction. The knowledge of several teachers is more abundant than a single teacher, thus, voted teachers can be more suitable to lead the training of the supernet. Experimental results on real datasets illustrate the advantages of our novel self-distillation-based NAS method compared to state-of-the-art alternatives.  ( 2 min )
    Robust Scheduling with GFlowNets. (arXiv:2302.05446v1 [cs.AI])
    Finding the best way to schedule operations in a computation graph is a classical NP-hard problem which is central to compiler optimization. However, evaluating the goodness of a schedule on the target hardware can be very time-consuming. Traditional approaches as well as previous machine learning ones typically optimize proxy metrics, which are fast to evaluate but can lead to bad schedules when tested on the target hardware. In this work, we propose a new approach to scheduling by sampling proportionally to the proxy metric using a novel GFlowNet method. We introduce a technique to control the trade-off between diversity and goodness of the proposed schedules at inference time and demonstrate empirically that the pure optimization baselines can lead to subpar performance with respect to our approach when tested on a target model. Furthermore, we show that conditioning the GFlowNet on the computation graph enables generalization to unseen scheduling problems for both synthetic and real-world compiler datasets.  ( 2 min )
    A novel approach to generate datasets with XAI ground truth to evaluate image models. (arXiv:2302.05624v1 [cs.CV])
    With the increased usage of artificial intelligence (AI), it is imperative to understand how these models work internally. These needs have led to the development of a new field called eXplainable artificial intelligence (XAI). This field consists of on a set of techniques that allows us to theoretically determine the cause of the AI decisions. One unsolved question about XAI is how to measure the quality of explanations. In this study, we propose a new method to generate datasets with ground truth (GT). These datasets allow us to measure how faithful is a method without ad hoc solutions. We conducted a set of experiments that compared our GT with real model explanations and obtained excellent results confirming that our proposed method is correct.  ( 2 min )
    Machine Learning Based Approach to Recommend MITRE ATT&CK Framework for Software Requirements and Design Specifications. (arXiv:2302.05530v1 [cs.SE])
    Engineering more secure software has become a critical challenge in the cyber world. It is very important to develop methodologies, techniques, and tools for developing secure software. To develop secure software, software developers need to think like an attacker through mining software repositories. These aim to analyze and understand the data repositories related to software development. The main goal is to use these software repositories to support the decision-making process of software development. There are different vulnerability databases like Common Weakness Enumeration (CWE), Common Vulnerabilities and Exposures database (CVE), and CAPEC. We utilized a database called MITRE. MITRE ATT&CK tactics and techniques have been used in various ways and methods, but tools for utilizing these tactics and techniques in the early stages of the software development life cycle (SDLC) are lacking. In this paper, we use machine learning algorithms to map requirements to the MITRE ATT&CK database and determine the accuracy of each mapping depending on the data split.  ( 2 min )
    FairPy: A Toolkit for Evaluation of Social Biases and their Mitigation in Large Language Models. (arXiv:2302.05508v1 [cs.CL])
    Studies have shown that large pretrained language models exhibit biases against social groups based on race, gender etc, which they inherit from the datasets they are trained on. Various researchers have proposed mathematical tools for quantifying and identifying these biases. There have been methods proposed to mitigate such biases. In this paper, we present a comprehensive quantitative evaluation of different kinds of biases such as race, gender, ethnicity, age etc. exhibited by popular pretrained language models such as BERT, GPT-2 etc. and also present a toolkit that provides plug-and-play interfaces to connect mathematical tools to identify biases with large pretrained language models such as BERT, GPT-2 etc. and also present users with the opportunity to test custom models against these metrics. The toolkit also allows users to debias existing and custom models using the debiasing techniques proposed so far. The toolkit is available at https://github.com/HrishikeshVish/Fairpy.  ( 2 min )
    Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security Attacks. (arXiv:2302.05733v1 [cs.CR])
    Recent advances in instruction-following large language models (LLMs) have led to dramatic improvements in a range of NLP tasks. Unfortunately, we find that the same improved capabilities amplify the dual-use risks for malicious purposes of these models. Dual-use is difficult to prevent as instruction-following capabilities now enable standard attacks from computer security. The capabilities of these instruction-following LLMs provide strong economic incentives for dual-use by malicious actors. In particular, we show that instruction-following LLMs can produce targeted malicious content, including hate speech and scams, bypassing in-the-wild defenses implemented by LLM API vendors. Our analysis shows that this content can be generated economically and at cost likely lower than with human effort alone. Together, our findings suggest that LLMs will increasingly attract more sophisticated adversaries and attacks, and addressing these attacks may require new approaches to mitigations.  ( 2 min )
    ReMIX: Regret Minimization for Monotonic Value Function Factorization in Multiagent Reinforcement Learning. (arXiv:2302.05593v1 [cs.LG])
    Value function factorization methods have become a dominant approach for cooperative multiagent reinforcement learning under a centralized training and decentralized execution paradigm. By factorizing the optimal joint action-value function using a monotonic mixing function of agents' utilities, these algorithms ensure the consistency between joint and local action selections for decentralized decision-making. Nevertheless, the use of monotonic mixing functions also induces representational limitations. Finding the optimal projection of an unrestricted mixing function onto monotonic function classes is still an open problem. To this end, we propose ReMIX, formulating this optimal projection problem for value function factorization as a regret minimization over the projection weights of different state-action values. Such an optimization problem can be relaxed and solved using the Lagrangian multiplier method to obtain the close-form optimal projection weights. By minimizing the resulting policy regret, we can narrow the gap between the optimal and the restricted monotonic mixing functions, thus obtaining an improved monotonic value function factorization. Our experimental results on Predator-Prey and StarCraft Multiagent Challenge environments demonstrate the effectiveness of our method, indicating the better capabilities of handling environments with non-monotonic value functions.  ( 2 min )
    Cross-center Early Sepsis Recognition by Medical Knowledge Guided Collaborative Learning for Data-scarce Hospitals. (arXiv:2302.05702v1 [cs.LG])
    There are significant regional inequities in health resources around the world. It has become one of the most focused topics to improve health services for data-scarce hospitals and promote health equity through knowledge sharing among medical institutions. Because electronic medical records (EMRs) contain sensitive personal information, privacy protection is unavoidable and essential for multi-hospital collaboration. In this paper, for a common disease in ICU patients, sepsis, we propose a novel cross-center collaborative learning framework guided by medical knowledge, SofaNet, to achieve early recognition of this disease. The Sepsis-3 guideline, published in 2016, defines that sepsis can be diagnosed by satisfying both suspicion of infection and Sequential Organ Failure Assessment (SOFA) greater than or equal to 2. Based on this knowledge, SofaNet adopts a multi-channel GRU structure to predict SOFA values of different systems, which can be seen as an auxiliary task to generate better health status representations for sepsis recognition. Moreover, we only achieve feature distribution alignment in the hidden space during cross-center collaborative learning, which ensures secure and compliant knowledge transfer without raw data exchange. Extensive experiments on two open clinical datasets, MIMIC-III and Challenge, demonstrate that SofaNet can benefit early sepsis recognition when hospitals only have limited EMRs.  ( 2 min )
    Satellite Anomaly Detection Using Variance Based Genetic Ensemble of Neural Networks. (arXiv:2302.05525v1 [cs.LG])
    In this paper, we use a variance-based genetic ensemble (VGE) of Neural Networks (NNs) to detect anomalies in the satellite's historical data. We use an efficient ensemble of the predictions from multiple Recurrent Neural Networks (RNNs) by leveraging each model's uncertainty level (variance). For prediction, each RNN is guided by a Genetic Algorithm (GA) which constructs the optimal structure for each RNN model. However, finding the model uncertainty level is challenging in many cases. Although the Bayesian NNs (BNNs)-based methods are popular for providing the confidence bound of the models, they cannot be employed in complex NN structures as they are computationally intractable. This paper uses the Monte Carlo (MC) dropout as an approximation version of BNNs. Then these uncertainty levels and each predictive model suggested by GA are used to generate a new model, which is then used for forecasting the TS and AD. Simulation results show that the forecasting and AD capability of the ensemble model outperforms existing approaches.  ( 2 min )
    Fair Enough: Standardizing Evaluation and Model Selection for Fairness Research in NLP. (arXiv:2302.05711v1 [cs.CL])
    Modern NLP systems exhibit a range of biases, which a growing literature on model debiasing attempts to correct. However current progress is hampered by a plurality of definitions of bias, means of quantification, and oftentimes vague relation between debiasing algorithms and theoretical measures of bias. This paper seeks to clarify the current situation and plot a course for meaningful progress in fair learning, with two key contributions: (1) making clear inter-relations among the current gamut of methods, and their relation to fairness theory; and (2) addressing the practical problem of model selection, which involves a trade-off between fairness and accuracy and has led to systemic issues in fairness research. Putting them together, we make several recommendations to help shape future work.  ( 2 min )
    On Differential Privacy and Adaptive Data Analysis with Bounded Space. (arXiv:2302.05707v1 [cs.CR])
    We study the space complexity of the two related fields of differential privacy and adaptive data analysis. Specifically, (1) Under standard cryptographic assumptions, we show that there exists a problem P that requires exponentially more space to be solved efficiently with differential privacy, compared to the space needed without privacy. To the best of our knowledge, this is the first separation between the space complexity of private and non-private algorithms. (2) The line of work on adaptive data analysis focuses on understanding the number of samples needed for answering a sequence of adaptive queries. We revisit previous lower bounds at a foundational level, and show that they are a consequence of a space bottleneck rather than a sampling bottleneck. To obtain our results, we define and construct an encryption scheme with multiple keys that is built to withstand a limited amount of key leakage in a very particular way.  ( 2 min )
    Cross-domain Random Pre-training with Prototypes for Reinforcement Learning. (arXiv:2302.05614v1 [cs.LG])
    Task-agnostic cross-domain pre-training shows great potential in image-based Reinforcement Learning (RL) but poses a big challenge. In this paper, we propose CRPTpro, a Cross-domain self-supervised Random Pre-Training framework with prototypes for image-based RL. CRPTpro employs cross-domain random policy to easily and quickly sample diverse data from multiple domains, to improve pre-training efficiency. Moreover, prototypical representation learning with a novel intrinsic loss is proposed to pre-train an effective and generic encoder across different domains. Without finetuning, the cross-domain encoder can be implemented for challenging downstream visual-control RL tasks defined in different domains efficiently. Compared with prior arts like APT and Proto-RL, CRPTpro achieves better performance on cross-domain downstream RL tasks without extra training on exploration agents for expert data collection, greatly reducing the burden of pre-training. Experiments on DeepMind Control suite (DMControl) demonstrate that CRPTpro outperforms APT significantly on 11/12 cross-domain RL tasks with only 39% pre-training hours, becoming a state-of-the-art cross-domain pre-training method in both policy learning performance and pre-training efficiency. The complete code will be released at https://github.com/liuxin0824/CRPTpro.  ( 2 min )
    SLOTH: Structured Learning and Task-based Optimization for Time Series Forecasting on Hierarchies. (arXiv:2302.05650v1 [cs.LG])
    Multivariate time series forecasting with hierarchical structure is widely used in real-world applications, e.g., sales predictions for the geographical hierarchy formed by cities, states, and countries. The hierarchical time series (HTS) forecasting includes two sub-tasks, i.e., forecasting and reconciliation. In the previous works, hierarchical information is only integrated in the reconciliation step to maintain coherency, but not in forecasting step for accuracy improvement. In this paper, we propose two novel tree-based feature integration mechanisms, i.e., top-down convolution and bottom-up attention to leverage the information of the hierarchical structure to improve the forecasting performance. Moreover, unlike most previous reconciliation methods which either rely on strong assumptions or focus on coherent constraints only,we utilize deep neural optimization networks, which not only achieve coherency without any assumptions, but also allow more flexible and realistic constraints to achieve task-based targets, e.g., lower under-estimation penalty and meaningful decision-making loss to facilitate the subsequent downstream tasks. Experiments on real-world datasets demonstrate that our tree-based feature integration mechanism achieves superior performances on hierarchical forecasting tasks compared to the state-of-the-art methods, and our neural optimization networks can be applied to real-world tasks effectively without any additional effort under coherence and task-based constraints  ( 2 min )
    Robust Knowledge Transfer in Tiered Reinforcement Learning. (arXiv:2302.05534v1 [cs.LG])
    In this paper, we study the Tiered Reinforcement Learning setting, a parallel transfer learning framework, where the goal is to transfer knowledge from the low-tier (source) task to the high-tier (target) task to reduce the exploration risk of the latter while solving the two tasks in parallel. Unlike previous work, we do not assume the low-tier and high-tier tasks share the same dynamics or reward functions, and focus on robust knowledge transfer without prior knowledge on the task similarity. We identify a natural and necessary condition called the "Optimal Value Dominance" for our objective. Under this condition, we propose novel online learning algorithms such that, for the high-tier task, it can achieve constant regret on partial states depending on the task similarity and retain near-optimal regret when the two tasks are dissimilar, while for the low-tier task, it can keep near-optimal without making sacrifice. Moreover, we further study the setting with multiple low-tier tasks, and propose a novel transfer source selection mechanism, which can ensemble the information from all low-tier tasks and allow provable benefits on a much larger state-action space.  ( 2 min )
    Privacy Against Agnostic Inference Attack in Vertical Federated Learning. (arXiv:2302.05545v1 [cs.CR])
    A novel form of inference attack in vertical federated learning (VFL) is proposed, where two parties collaborate in training a machine learning (ML) model. Logistic regression is considered for the VFL model. One party, referred to as the active party, possesses the ground truth labels of the samples in the training phase, while the other, referred to as the passive party, only shares a separate set of features corresponding to these samples. It is shown that the active party can carry out inference attacks on both training and prediction phase samples by acquiring an ML model independently trained on the training samples available to them. This type of inference attack does not require the active party to be aware of the score of a specific sample, hence it is referred to as an agnostic inference attack. It is shown that utilizing the observed confidence scores during the prediction phase, before the time of the attack, can improve the performance of the active party's autonomous model, and thus improve the quality of the agnostic inference attack. As a countermeasure, privacy-preserving schemes (PPSs) are proposed. While the proposed schemes preserve the utility of the VFL model, they systematically distort the VFL parameters corresponding to the passive party's features. The level of the distortion imposed on the passive party's parameters is adjustable, giving rise to a trade-off between privacy of the passive party and interpretabiliy of the VFL outcomes by the active party. The distortion level of the passive party's parameters could be chosen carefully according to the privacy and interpretabiliy concerns of the passive and active parties, respectively, with the hope of keeping both parties (partially) satisfied. Finally, experimental results demonstrate the effectiveness of the proposed attack and the PPSs.  ( 2 min )
    Predicting Participants' Performance in Programming Contests using Deep Learning Techniques. (arXiv:2302.05602v1 [cs.LG])
    In recent days, the number of technology enthusiasts is increasing day by day with the prevalence of technological products and easy access to the internet. Similarly, the amount of people working behind this rapid development is rising tremendously. Computer programmers consist of a large portion of those tech-savvy people. Codeforces, an online programming and contest hosting platform used by many competitive programmers worldwide. It is regarded as one of the most standardized platforms for practicing programming problems and participate in programming contests. In this research, we propose a framework that predicts the performance of any particular contestant in the upcoming competitions as well as predicts the rating after that contest based on their practice and the performance of their previous contests.  ( 2 min )
    PDSum: Prototype-driven Continuous Summarization of Evolving Multi-document Sets Stream. (arXiv:2302.05550v1 [cs.IR])
    Summarizing text-rich documents has been long studied in the literature, but most of the existing efforts have been made to summarize a static and predefined multi-document set. With the rapid development of online platforms for generating and distributing text-rich documents, there arises an urgent need for continuously summarizing dynamically evolving multi-document sets where the composition of documents and sets is changing over time. This is especially challenging as the summarization should be not only effective in incorporating relevant, novel, and distinctive information from each concurrent multi-document set, but also efficient in serving online applications. In this work, we propose a new summarization problem, Evolving Multi-Document sets stream Summarization (EMDS), and introduce a novel unsupervised algorithm PDSum with the idea of prototype-driven continuous summarization. PDSum builds a lightweight prototype of each multi-document set and exploits it to adapt to new documents while preserving accumulated knowledge from previous documents. To update new summaries, the most representative sentences for each multi-document set are extracted by measuring their similarities to the prototypes. A thorough evaluation with real multi-document sets streams demonstrates that PDSum outperforms state-of-the-art unsupervised multi-document summarization algorithms in EMDS in terms of relevance, novelty, and distinctiveness and is also robust to various evaluation settings.  ( 2 min )
    Brain Effective Connectome based on fMRI and DTI Data: Bayesian Causal Learning and Assessment. (arXiv:2302.05451v1 [cs.LG])
    The ambitious goal of neuroscientific studies is to find an accurate and reliable brain Effective Connectome (EC). Although current EC discovery methods have contributed to our understanding of brain organization, their performances are severely constrained by the short sample size and poor temporal resolution of fMRI data, and high dimensionality of the brain connectome. By leveraging the DTI data as prior knowledge, we introduce two Bayesian casual discovery frameworks -- the Bayesian GOLEM (BGOLEM) and Bayesian FGES (BFGES) methods -- as the most reliable and accurate methods in discovering EC that address the shortcomings of the current causal discovery methods in discovering ECs based on only fMRI data. Through a series of simulation studies on synthetic and hybrid (DTI of the Human Connectome Project (HCP) subjects and synthetic fMRI) data, we first demonstrate the effectiveness and importance of the proposed methods in discovering EC. We also introduce the Pseudo False Discovery Rate (PFDR) as a new accuracy metric for causal discovery in the brain and show that our Bayesian methods achieve higher accuracy than traditional methods on empirical data (DTI and fMRI of the Human Connectome Project (HCP) subjects). Additionally, we measure the reliability of discovered ECs using the Rogers-Tanimoto index for test-retest data and show that our Bayesian methods provide significantly more reproducible ECs compared to traditional methods.  ( 2 min )
    CodeBERTScore: Evaluating Code Generation with Pretrained Models of Code. (arXiv:2302.05527v1 [cs.SE])
    Since the rise of neural models of code that can generate long expressions and statements rather than a single next-token, one of the major problems has been reliably evaluating their generated output. In this paper, we propose CodeBERTScore: an automatic evaluation metric for code generation, which builds on BERTScore (Zhang et al., 2020). Instead of measuring exact token matching as BLEU, CodeBERTScore computes a soft similarity score between each token in the generated code and in the reference code, using the contextual encodings of large pretrained models. Further, instead of encoding only the generated tokens as in BERTScore, CodeBERTScore also encodes the programmatic context surrounding the generated code. We perform an extensive evaluation of CodeBERTScore across four programming languages. We find that CodeBERTScore achieves a higher correlation with human preference and with functional correctness than all existing metrics. That is, generated code that receives a higher score by CodeBERTScore is more likely to be preferred by humans, as well as to function correctly when executed. Finally, while CodeBERTScore can be used with a multilingual CodeBERT as its base model, we release five language-specific pretrained models to use with our publicly available code at https://github.com/neulab/code-bert-score . Our language-specific models have been downloaded more than 25,000 times from the Huggingface Hub.  ( 2 min )
    Achieving acceleration despite very noisy gradients. (arXiv:2302.05515v1 [stat.ML])
    We present a novel momentum-based first order optimization method (AGNES) which provably achieves acceleration for convex minimization, even if the stochastic noise in the gradient estimates is many orders of magnitude larger than the gradient itself. Here we model the noise as having a variance which is proportional to the magnitude of the underlying gradient. We argue, based upon empirical evidence, that this is appropriate for mini-batch gradients in overparameterized deep learning. Furthermore, we demonstrate that the method achieves competitive performance in the training of CNNs on MNIST and CIFAR-10.  ( 2 min )
  • Open

    Which Invariance Should We Transfer? A Causal Minimax Learning Approach. (arXiv:2107.01876v3 [stat.ML] UPDATED)
    A major barrier to deploying current machine learning models lies in their non-reliability to dataset shifts. To resolve this problem, most existing studies attempted to transfer stable information to unseen environments. Particularly, independent causal mechanisms-based methods proposed to remove mutable causal mechanisms via the do-operator. Compared to previous methods, the obtained stable predictors are more effective in identifying stable information. However, a key question remains: which subset of this whole stable information should the model transfer, in order to achieve optimal generalization ability? To answer this question, we present a comprehensive minimax analysis from a causal perspective. Specifically, we first provide a graphical condition for the whole stable set to be optimal. When this condition fails, we surprisingly find with an example that this whole stable set, although can fully exploit stable information, is not the optimal one to transfer. To identify the optimal subset under this case, we propose to estimate the worst-case risk with a novel optimization scheme over the intervention functions on mutable causal mechanisms. We then propose an efficient algorithm to search for the subset with minimal worst-case risk, based on a newly defined equivalence relation between stable subsets. Compared to the exponential cost of exhaustively searching over all subsets, our searching strategy enjoys a polynomial complexity. The effectiveness and efficiency of our methods are demonstrated on synthetic data and the diagnosis of Alzheimer's disease.  ( 2 min )
    Beyond UCB: Statistical Complexity and Optimal Algorithms for Non-linear Ridge Bandits. (arXiv:2302.06025v1 [stat.ML])
    We consider the sequential decision-making problem where the mean outcome is a non-linear function of the chosen action. Compared with the linear model, two curious phenomena arise in non-linear models: first, in addition to the "learning phase" with a standard parametric rate for estimation or regret, there is an "burn-in period" with a fixed cost determined by the non-linear function; second, achieving the smallest burn-in cost requires new exploration algorithms. For a special family of non-linear functions named ridge functions in the literature, we derive upper and lower bounds on the optimal burn-in cost, and in addition, on the entire learning trajectory during the burn-in period via differential equations. In particular, a two-stage algorithm that first finds a good initial action and then treats the problem as locally linear is statistically optimal. In contrast, several classical algorithms, such as UCB and algorithms relying on regression oracles, are provably suboptimal.  ( 2 min )
    Tiered Reinforcement Learning: Pessimism in the Face of Uncertainty and Constant Regret. (arXiv:2205.12418v3 [cs.LG] UPDATED)
    We propose a new learning framework that captures the tiered structure of many real-world user-interaction applications, where the users can be divided into two groups based on their different tolerance on exploration risks and should be treated separately. In this setting, we simultaneously maintain two policies $\pi^{\text{O}}$ and $\pi^{\text{E}}$: $\pi^{\text{O}}$ ("O" for "online") interacts with more risk-tolerant users from the first tier and minimizes regret by balancing exploration and exploitation as usual, while $\pi^{\text{E}}$ ("E" for "exploit") exclusively focuses on exploitation for risk-averse users from the second tier utilizing the data collected so far. An important question is whether such a separation yields advantages over the standard online setting (i.e., $\pi^{\text{E}}=\pi^{\text{O}}$) for the risk-averse users. We individually consider the gap-independent vs.~gap-dependent settings. For the former, we prove that the separation is indeed not beneficial from a minimax perspective. For the latter, we show that if choosing Pessimistic Value Iteration as the exploitation algorithm to produce $\pi^{\text{E}}$, we can achieve a constant regret for risk-averse users independent of the number of episodes $K$, which is in sharp contrast to the $\Omega(\log K)$ regret for any online RL algorithms in the same setting, while the regret of $\pi^{\text{O}}$ (almost) maintains its online regret optimality and does not need to compromise for the success of $\pi^{\text{E}}$.  ( 2 min )
    Chaotic Hedging with Iterated Integrals and Neural Networks. (arXiv:2209.10166v2 [q-fin.MF] UPDATED)
    In this paper, we extend the Wiener-Ito chaos decomposition to the class of diffusion processes, whose drift and diffusion coefficient are of linear growth. By omitting the orthogonality in the chaos expansion, we are able to show that every $p$-integrable functional, for $p \in [1,\infty)$, can be represented as sum of iterated integrals of the underlying process. Using a truncated sum of this expansion and (possibly random) neural networks for the integrands, whose parameters are learned in a machine learning setting, we show that every financial derivative can be approximated arbitrarily well in the $L^p$-sense. Since the hedging strategy of the approximating option can be computed in closed form, we obtain an efficient algorithm that can replicate any integrable financial derivative with short runtime.  ( 2 min )
    Do PAC-Learners Learn the Marginal Distribution?. (arXiv:2302.06285v1 [cs.LG])
    We study a foundational variant of Valiant and Vapnik and Chervonenkis' Probably Approximately Correct (PAC)-Learning in which the adversary is restricted to a known family of marginal distributions $\mathscr{P}$. In particular, we study how the PAC-learnability of a triple $(\mathscr{P},X,H)$ relates to the learners ability to infer \emph{distributional} information about the adversary's choice of $D \in \mathscr{P}$. To this end, we introduce the `unsupervised' notion of \emph{TV-Learning}, which, given a class $(\mathscr{P},X,H)$, asks the learner to approximate $D$ from unlabeled samples with respect to a natural class-conditional total variation metric. In the classical distribution-free setting, we show that TV-learning is \emph{equivalent} to PAC-Learning: in other words, any learner must infer near-maximal information about $D$. On the other hand, we show this characterization breaks down for general $\mathscr{P}$, where PAC-Learning is strictly sandwiched between two approximate variants we call `Strong' and `Weak' TV-learning, roughly corresponding to unsupervised learners that estimate most relevant distances in $D$ with respect to $H$, but differ in whether the learner \emph{knows} the set of well-estimated events. Finally, we observe that TV-learning is in fact equivalent to the classical notion of \emph{uniform estimation}, and thereby give a strong refutation of the uniform convergence paradigm in supervised learning.  ( 2 min )
    Hierarchical Stochastic Block Model for Community Detection in Multiplex Networks. (arXiv:1904.05330v3 [cs.SI] UPDATED)
    Multiplex networks have become increasingly more prevalent in many fields, and have emerged as a powerful tool for modeling the complexity of real networks. There is a critical need for developing inference models for multiplex networks that can take into account potential dependencies across different layers, particularly when the aim is community detection. We add to a limited literature by proposing a novel and efficient Bayesian model for community detection in multiplex networks. A key feature of our approach is the ability to model varying communities at different network layers. In contrast, many existing models assume the same communities for all layers. Moreover, our model automatically picks up the necessary number of communities at each layer (as validated by real data examples). This is appealing, since deciding the number of communities is a challenging aspect of community detection, and especially so in the multiplex setting, if one allows the communities to change across layers. Borrowing ideas from hierarchical Bayesian modeling, we use a hierarchical Dirichlet prior to model community labels across layers, allowing dependency in their structure. Given the community labels, a stochastic block model (SBM) is assumed for each layer. We develop an efficient slice sampler for sampling the posterior distribution of the community labels as well as the link probabilities between communities. In doing so, we address some unique challenges posed by coupling the complex likelihood of SBM with the hierarchical nature of the prior on the labels. An extensive empirical validation is performed on simulated and real data, demonstrating the superior performance of the model over single-layer alternatives, as well as the ability to uncover interesting structures in real networks.  ( 3 min )
    Information-Directed Selection for Top-Two Algorithms. (arXiv:2205.12086v2 [stat.ML] UPDATED)
    We consider the best-k-arm identification problem for multi-armed bandits, where the objective is to select the exact set of k arms with the highest mean rewards by sequentially allocating measurement effort. We characterize the necessary and sufficient conditions for the optimal allocation using dual variables. Remarkably these optimality conditions lead to the extension of top-two algorithm design principle (Russo, 2020), initially proposed for best-arm identification. Furthermore, our optimality conditions induce a simple and effective selection rule dubbed information-directed selection (IDS) that selects one of the top-two candidates based on a measure of information gain. As a theoretical guarantee, we prove that integrated with IDS, top-two Thompson sampling is (asymptotically) optimal for Gaussian best-arm identification, solving a glaring open problem in the pure exploration literature (Russo, 2020). As a by-product, we show that for k > 1, top-two algorithms cannot achieve optimality even with an oracle tuning parameter. Numerical experiments show the superior performance of the proposed top-two algorithms with IDS and considerable improvement compared with algorithms without adaptive selection.  ( 2 min )
    A Graphical Point Process Framework for Understanding Removal Effects in Multi-Touch Attribution. (arXiv:2302.06075v1 [stat.ME])
    Marketers employ various online advertising channels to reach customers, and they are particularly interested in attribution for measuring the degree to which individual touchpoints contribute to an eventual conversion. The availability of individual customer-level path-to-purchase data and the increasing number of online marketing channels and types of touchpoints bring new challenges to this fundamental problem. We aim to tackle the attribution problem with finer granularity by conducting attribution at the path level. To this end, we develop a novel graphical point process framework to study the direct conversion effects and the full relational structure among numerous types of touchpoints simultaneously. Utilizing the temporal point process of conversion and the graphical structure, we further propose graphical attribution methods to allocate proper path-level conversion credit, called the attribution score, to individual touchpoints or corresponding channels for each customer's path to purchase. Our proposed attribution methods consider the attribution score as the removal effect, and we use the rigorous probabilistic definition to derive two types of removal effects. We examine the performance of our proposed methods in extensive simulation studies and compare their performance with commonly used attribution models. We also demonstrate the performance of the proposed methods in a real-world attribution application.  ( 2 min )
    Alternating Implicit Projected SGD and Its Efficient Variants for Equality-constrained Bilevel Optimization. (arXiv:2211.07096v2 [cs.LG] UPDATED)
    Stochastic bilevel optimization, which captures the inherent nested structure of machine learning problems, is gaining popularity in many recent applications. Existing works on bilevel optimization mostly consider either unconstrained problems or constrained upper-level problems. This paper considers the stochastic bilevel optimization problems with equality constraints both in the upper and lower levels. By leveraging the special structure of the equality constraints problem, the paper first presents an alternating implicit projected SGD approach and establishes the $\tilde{\cal O}(\epsilon^{-2})$ sample complexity that matches the state-of-the-art complexity of ALSET \citep{chen2021closing} for unconstrained bilevel problems. To further save the cost of projection, the paper presents two alternating implicit projection-efficient SGD approaches, where one algorithm enjoys the $\tilde{\cal O}(\epsilon^{-2}/T)$ upper-level and $\tilde{\cal O}(\epsilon^{-1.5}/T^{\frac{3}{4}})$ lower-level projection complexity with ${\cal O}(T)$ lower-level batch size, and the other one enjoys $\tilde{\cal O}(\epsilon^{-1.5})$ upper-level and lower-level projection complexity with ${\cal O}(1)$ batch size. Application to federated bilevel optimization has been presented to showcase the empirical performance of our algorithms. Our results demonstrate that equality-constrained bilevel optimization with strongly-convex lower-level problems can be solved as efficiently as stochastic single-level optimization problems.  ( 2 min )
    Breaking the Curse of Multiagency: Provably Efficient Decentralized Multi-Agent RL with Function Approximation. (arXiv:2302.06606v1 [cs.LG])
    A unique challenge in Multi-Agent Reinforcement Learning (MARL) is the curse of multiagency, where the description length of the game as well as the complexity of many existing learning algorithms scale exponentially with the number of agents. While recent works successfully address this challenge under the model of tabular Markov Games, their mechanisms critically rely on the number of states being finite and small, and do not extend to practical scenarios with enormous state spaces where function approximation must be used to approximate value functions or policies. This paper presents the first line of MARL algorithms that provably resolve the curse of multiagency under function approximation. We design a new decentralized algorithm -- V-Learning with Policy Replay, which gives the first polynomial sample complexity results for learning approximate Coarse Correlated Equilibria (CCEs) of Markov Games under decentralized linear function approximation. Our algorithm always outputs Markov CCEs, and achieves an optimal rate of $\widetilde{\mathcal{O}}(\epsilon^{-2})$ for finding $\epsilon$-optimal solutions. Also, when restricted to the tabular case, our result improves over the current best decentralized result $\widetilde{\mathcal{O}}(\epsilon^{-3})$ for finding Markov CCEs. We further present an alternative algorithm -- Decentralized Optimistic Policy Mirror Descent, which finds policy-class-restricted CCEs using a polynomial number of samples. In exchange for learning a weaker version of CCEs, this algorithm applies to a wider range of problems under generic function approximation, such as linear quadratic games and MARL problems with low ''marginal'' Eluder dimension.  ( 2 min )
    Event-Triggered Time-Varying Bayesian Optimization. (arXiv:2208.10790v2 [cs.LG] UPDATED)
    We consider the problem of sequentially optimizing a time-varying objective function using time-varying Bayesian optimization (TVBO). Here, the key challenge is the exploration-exploitation trade-off under time variations. Current approaches to TVBO require prior knowledge of a constant rate of change. However, the rate of change is usually neither known nor constant. We propose an event-triggered algorithm, ET-GP-UCB, that treats the optimization problem as static until it detects changes in the objective function online and then resets the dataset. This allows the algorithm to adapt to realized temporal changes without the need for prior knowledge. The event-trigger is based on probabilistic uniform error bounds used in Gaussian process regression. We provide regret bounds for ET-GP-UCB and show in numerical experiments that it is competitive with state-of-the-art algorithms even though it requires no knowledge about the temporal changes. Further, ET-GP-UCB outperforms these baselines if the rate of change is misspecified, and we demonstrate that it is readily applicable to various settings without tuning hyperparameters.  ( 2 min )
    Dark solitons in Bose-Einstein condensates: a dataset for many-body physics research. (arXiv:2205.09114v2 [cond-mat.quant-gas] UPDATED)
    We establish a dataset of over $1.6\times10^4$ experimental images of Bose--Einstein condensates containing solitonic excitations to enable machine learning (ML) for many-body physics research. About $33~\%$ of this dataset has manually assigned and carefully curated labels. The remainder is automatically labeled using SolDet -- an implementation of a physics-informed ML data analysis framework -- consisting of a convolutional-neural-network-based classifier and OD as well as a statistically motivated physics-informed classifier and a quality metric. This technical note constitutes the definitive reference of the dataset, providing an opportunity for the data science community to develop more sophisticated analysis tools, to further understand nonlinear many-body physics, and even advance cold atom experiments.  ( 2 min )
    A Finite-Particle Convergence Rate for Stein Variational Gradient Descent. (arXiv:2211.09721v3 [cs.LG] UPDATED)
    We provide a first finite-particle convergence rate for Stein variational gradient descent (SVGD). Specifically, whenever the target distribution is sub-Gaussian with a Lipschitz score, SVGD with n particles and an appropriate step size sequence drives the kernel Stein discrepancy to zero at an order 1/sqrt(log log n) rate. We suspect that the dependence on n can be improved, and we hope that our explicit, non-asymptotic proof strategy will serve as a template for future refinements.  ( 2 min )
    A Characterization of Multioutput Learnability. (arXiv:2301.02729v2 [cs.LG] UPDATED)
    We consider the problem of learning multioutput function classes in batch and online settings. In both settings, we show that a multioutput function class is learnable if and only if each single-output restriction of the function class is learnable. This provides a complete characterization of the learnability of multilabel classification and multioutput regression in both batch and online settings. As an extension, we also consider multilabel learnability in the bandit feedback setting and show a similar characterization as in the full-feedback setting.  ( 2 min )
    A Rigorous Framework for the Mean Field Limit of Multilayer Neural Networks. (arXiv:2001.11443v3 [cs.LG] UPDATED)
    We develop a mathematically rigorous framework for multilayer neural networks in the mean field regime. As the network's widths increase, the network's learning trajectory is shown to be well captured by a meaningful and dynamically nonlinear limit (the \textit{mean field} limit), which is characterized by a system of ODEs. Our framework applies to a broad range of network architectures, learning dynamics and network initializations. Central to the framework is the new idea of a \textit{neuronal embedding}, which comprises of a non-evolving probability space that allows to embed neural networks of arbitrary widths. Using our framework, we prove several properties of large-width multilayer neural networks. Firstly we show that independent and identically distributed initializations cause strong degeneracy effects on the network's learning trajectory when the network's depth is at least four. Secondly we obtain several global convergence guarantees for feedforward multilayer networks under a number of different setups. These include two-layer and three-layer networks with independent and identically distributed initializations, and multilayer networks of arbitrary depths with a special type of correlated initializations that is motivated by the new concept of \textit{bidirectional diversity}. Unlike previous works that rely on convexity, our results admit non-convex losses and hinge on a certain universal approximation property, which is a distinctive feature of infinite-width neural networks and is shown to hold throughout the training process. Aside from being the first known results for global convergence of multilayer networks in the mean field regime, they demonstrate flexibility of our framework and incorporate several new ideas and insights that depart from the conventional convex optimization wisdom.  ( 3 min )
    Quantifying the Impact of Label Noise on Federated Learning. (arXiv:2211.07816v6 [cs.LG] UPDATED)
    Federated Learning (FL) is a distributed machine learning paradigm where clients collaboratively train a model using their local (human-generated) datasets. While existing studies focus on FL algorithm development to tackle data heterogeneity across clients, the important issue of data quality (e.g., label noise) in FL is overlooked. This paper aims to fill this gap by providing a quantitative study on the impact of label noise on FL. We derive an upper bound for the generalization error that is linear in the clients' label noise level. Then we conduct experiments on MNIST and CIFAR-10 datasets using various FL algorithms. Our empirical results show that the global model accuracy linearly decreases as the noise level increases, which is consistent with our theoretical analysis. We further find that label noise slows down the convergence of FL training, and the global model tends to overfit when the noise level is high.  ( 2 min )
    On the geometry of Stein variational gradient descent. (arXiv:1912.00894v2 [stat.ML] UPDATED)
    Bayesian inference problems require sampling or approximating high-dimensional probability distributions. The focus of this paper is on the recently introduced Stein variational gradient descent methodology, a class of algorithms that rely on iterated steepest descent steps with respect to a reproducing kernel Hilbert space norm. This construction leads to interacting particle systems, the mean-field limit of which is a gradient flow on the space of probability distributions equipped with a certain geometrical structure. We leverage this viewpoint to shed some light on the convergence properties of the algorithm, in particular addressing the problem of choosing a suitable positive definite kernel function. Our analysis leads us to considering certain nondifferentiable kernels with adjusted tails. We demonstrate significant performance gains of these in various numerical experiments.  ( 2 min )
    A Theoretical Understanding of shallow Vision Transformers: Learning, Generalization, and Sample Complexity. (arXiv:2302.06015v1 [cs.LG])
    Vision Transformers (ViTs) with self-attention modules have recently achieved great empirical success in many vision tasks. Due to non-convex interactions across layers, however, theoretical learning and generalization analysis is mostly elusive. Based on a data model characterizing both label-relevant and label-irrelevant tokens, this paper provides the first theoretical analysis of training a shallow ViT, i.e., one self-attention layer followed by a two-layer perceptron, for a classification task. We characterize the sample complexity to achieve a zero generalization error. Our sample complexity bound is positively correlated with the inverse of the fraction of label-relevant tokens, the token noise level, and the initial model error. We also prove that a training process using stochastic gradient descent (SGD) leads to a sparse attention map, which is a formal verification of the general intuition about the success of attention. Moreover, this paper indicates that a proper token sparsification can improve the test performance by removing label-irrelevant and/or noisy tokens, including spurious correlations. Empirical experiments on synthetic data and CIFAR-10 dataset justify our theoretical results and generalize to deeper ViTs.  ( 2 min )
    Improving Accuracy of Interpretability Measures in Hyperparameter Optimization via Bayesian Algorithm Execution. (arXiv:2206.05447v2 [cs.LG] UPDATED)
    Despite all the benefits of automated hyperparameter optimization (HPO), most modern HPO algorithms are black-boxes themselves. This makes it difficult to understand the decision process which leads to the selected configuration, reduces trust in HPO, and thus hinders its broad adoption. Here, we study the combination of HPO with interpretable machine learning (IML) methods such as partial dependence plots. These techniques are more and more used to explain the marginal effect of hyperparameters on the black-box cost function or to quantify the importance of hyperparameters. However, if such methods are naively applied to the experimental data of the HPO process in a post-hoc manner, the underlying sampling bias of the optimizer can distort interpretations. We propose a modified HPO method which efficiently balances the search for the global optimum w.r.t. predictive performance \emph{and} the reliable estimation of IML explanations of an underlying black-box function by coupling Bayesian optimization and Bayesian Algorithm Execution. On benchmark cases of both synthetic objectives and HPO of a neural network, we demonstrate that our method returns more reliable explanations of the underlying black-box without a loss of optimization performance.  ( 2 min )
    Transformers in Time Series: A Survey. (arXiv:2202.07125v4 [cs.LG] UPDATED)
    Transformers have achieved superior performances in many tasks in natural language processing and computer vision, which also triggered great interest in the time series community. Among multiple advantages of Transformers, the ability to capture long-range dependencies and interactions is especially attractive for time series modeling, leading to exciting progress in various time series applications. In this paper, we systematically review Transformer schemes for time series modeling by highlighting their strengths as well as limitations. In particular, we examine the development of time series Transformers in two perspectives. From the perspective of network structure, we summarize the adaptations and modifications that have been made to Transformers in order to accommodate the challenges in time series analysis. From the perspective of applications, we categorize time series Transformers based on common tasks including forecasting, anomaly detection, and classification. Empirically, we perform robust analysis, model size analysis, and seasonal-trend decomposition analysis to study how Transformers perform in time series. Finally, we discuss and suggest future directions to provide useful research guidance. A corresponding resource that has been continuously updated can be found in the GitHub repository. To the best of our knowledge, this paper is the first work to comprehensively and systematically summarize the recent advances of Transformers for modeling time series data. We hope this survey will ignite further research interests in time series Transformers.  ( 2 min )
    Towards Understanding Why Mask-Reconstruction Pretraining Helps in Downstream Tasks. (arXiv:2206.03826v5 [cs.LG] UPDATED)
    For unsupervised pretraining, mask-reconstruction pretraining (MRP) approaches, e.g. MAE and data2vec, randomly mask input patches and then reconstruct the pixels or semantic features of these masked patches via an auto-encoder. Then for a downstream task, supervised fine-tuning the pretrained encoder remarkably surpasses the conventional ``supervised learning'' (SL) trained from scratch. However, it is still unclear 1) how MRP performs semantic feature learning in the pretraining phase and 2) why it helps in downstream tasks. To solve these problems, we first theoretically show that on an auto-encoder of a two/one-layered convolution encoder/decoder, MRP can capture all discriminative features of each potential semantic class in the pretraining dataset. Then considering the fact that the pretraining dataset is of huge size and high diversity and thus covers most features in downstream dataset, in fine-tuning phase, the pretrained encoder can capture as much features as it can in downstream datasets, and would not lost these features with theoretical guarantees. In contrast, SL only randomly captures some features due to lottery ticket hypothesis. So MRP provably achieves better performance than SL on the classification tasks. Experimental results testify to our data assumptions and also our theoretical implications.  ( 2 min )
    A Framework for Overparameterized Learning. (arXiv:2205.13507v2 [cs.LG] UPDATED)
    A candidate explanation of the good empirical performance of deep neural networks is the implicit regularization effect of first order optimization methods. Inspired by this, we prove a convergence theorem for nonconvex composite optimization, and apply it to a general learning problem covering many machine learning applications, including supervised learning. We then present a deep multilayer perceptron model and prove that, when sufficiently wide, it $(i)$ leads to the convergence of gradient descent to a global optimum with a linear rate, $(ii)$ benefits from the implicit regularization effect of gradient descent, $(iii)$ is subject to novel bounds on the generalization error, $(iv)$ exhibits the lazy training phenomenon and $(v)$ enjoys learning rate transfer across different widths. The corresponding coefficients, such as the convergence rate, improve as width is further increased, and depend on the even order moments of the data generating distribution up to an order depending on the number of layers. The only non-mild assumption we make is the concentration of the smallest eigenvalue of the neural tangent kernel at initialization away from zero, which has been shown to hold for a number of less general models in contemporary works. We present empirical evidence supporting this assumption as well as our theoretical claims.  ( 2 min )
    On the equivalence between graph isomorphism testing and function approximation with GNNs. (arXiv:1905.12560v2 [cs.LG] UPDATED)
    Graph Neural Networks (GNNs) have achieved much success on graph-structured data. In light of this, there have been increasing interests in studying their expressive power. One line of work studies the capability of GNNs to approximate permutation-invariant functions on graphs, and another focuses on the their power as tests for graph isomorphism. Our work connects these two perspectives and proves their equivalence. We further develop a framework of the expressive power of GNNs that incorporates both of these viewpoints using the language of sigma-algebra, through which we compare the expressive power of different types of GNNs together with other graph isomorphism tests. In particular, we prove that the second-order Invariant Graph Network fails to distinguish non-isomorphic regular graphs with the same degree. Then, we extend it to a new architecture, Ring-GNN, which succeeds in distinguishing these graphs and achieves good performances on real-world datasets.  ( 2 min )
    Probabilistic Estimation of Instantaneous Frequencies of Chirp Signals. (arXiv:2205.06306v2 [stat.ML] UPDATED)
    We present a continuous-time probabilistic approach for estimating the chirp signal and its instantaneous frequency function when the true forms of these functions are not accessible. Our model represents these functions by non-linearly cascaded Gaussian processes represented as non-linear stochastic differential equations. The posterior distribution of the functions is then estimated with stochastic filters and smoothers. We compute a (posterior) Cram\'er--Rao lower bound for the Gaussian process model, and derive a theoretical upper bound for the estimation error in the mean squared sense. The experiments show that the proposed method outperforms a number of state-of-the-art methods on a synthetic data. We also show that the method works out-of-the-box for two real-world datasets.  ( 2 min )
    Out-of-distribution Generalization in the Presence of Nuisance-Induced Spurious Correlations. (arXiv:2107.00520v5 [cs.LG] UPDATED)
    In many prediction problems, spurious correlations are induced by a changing relationship between the label and a nuisance variable that is also correlated with the covariates. For example, in classifying animals in natural images, the background, which is a nuisance, can predict the type of animal. This nuisance-label relationship does not always hold, and the performance of a model trained under one such relationship may be poor on data with a different nuisance-label relationship. To build predictive models that perform well regardless of the nuisance-label relationship, we develop Nuisance-Randomized Distillation (NURD). We introduce the nuisance-randomized distribution, a distribution where the nuisance and the label are independent. Under this distribution, we define the set of representations such that conditioning on any member, the nuisance and the label remain independent. We prove that the representations in this set always perform better than chance, while representations outside of this set may not. NURD finds a representation from this set that is most informative of the label under the nuisance-randomized distribution, and we prove that this representation achieves the highest performance regardless of the nuisance-label relationship. We evaluate NURD on several tasks including chest X-ray classification where, using non-lung patches as the nuisance, NURD produces models that predict pneumonia under strong spurious correlations.  ( 2 min )
    Blessing of Class Diversity in Pre-training. (arXiv:2209.03447v3 [cs.LG] UPDATED)
    This paper presents a new statistical analysis aiming to explain the recent superior achievements of the pre-training techniques in natural language processing (NLP). We prove that when the classes of the pre-training task (e.g., different words in the masked language model task) are sufficiently diverse, in the sense that the least singular value of the last linear layer in pre-training (denoted as $\tilde{\nu}$) is large, then pre-training can significantly improve the sample efficiency of downstream tasks. Specially, we show the transfer learning excess risk enjoys an $O\left(\frac{1}{\tilde{\nu} \sqrt{n}}\right)$ rate, in contrast to the $O\left(\frac{1}{\sqrt{m}}\right)$ rate in the standard supervised learning. Here, $n$ is the number of pre-training data and $m$ is the number of data in the downstream task, and typically $n \gg m$. Our proof relies on a vector-form Rademacher complexity chain rule for disassembling composite function classes and a modified self-concordance condition. These techniques can be of independent interest.  ( 2 min )
    Plasticity Neural Network Based on Astrocytic effects at Critical Period, Synaptic Competition and Strength Rebalance by Current and Mnemonic Brain Plasticity and Synapse Formation. (arXiv:2203.11740v9 [cs.NE] UPDATED)
    In addition to the weights of synaptic shared connections, PNN includes weights of synaptic effective ranges [14-24]. PNN considers synaptic strength balance in dynamic of phagocytosing of synapses and static of constant sum of synapses length [14], and includes the lead behavior of the school of fish. Synapse formation will inhibit dendrites generation to a certain extent in experiments and PNN simulations [15]. The memory persistence gradient of retrograde circuit similar to the Enforcing Resilience in a Spring Boot. The relatively good and inferior gradient information stored in memory engram cells in synapse formation of retrograde circuit like the folds of the brain [16]. The controversy was claimed if human hippocampal neurogenesis persists throughout aging, PNN considered it may have a new and longer circuit in late iteration [17,18]. Closing the critical period will cause neurological disorder in experiments and PNN simulations [19]. Considering both negative and positive memories persistence help activate synapse length changes with iterations better than only considering positive memory [20]. Astrocytic phagocytosis will avoid the local accumulation of synapses by simulation, Lack of astrocytic phagocytosis causes excitatory synapses and functionally impaired synapses accumulate in experiments and lead to destruction of cognition, but local longer synapses and worse results in PNN simulations [21]. It gives relationship of intelligence and cortical thickness, individual differences in brain [22]. The PNN also considered the memory engram cells that strengthened Synaptic strength [23]. The effects of PNN's memory structure and tPBM may be the same for powerful penetrability of signals [24]. Memory persistence also inhibit local synaptic accumulation. By PNN, it may introduce the relatively good and inferior solution in PSO. The simple PNN only has the synaptic phagocytosis.  ( 3 min )
    Fixed points of nonnegative neural networks. (arXiv:2106.16239v6 [stat.ML] UPDATED)
    We consider the existence of fixed points of nonnegative neural networks, i.e., neural networks that take as an input and produce as an output nonnegative vectors. We first show that nonnegative neural networks with nonnegative weights and biases can be recognized as monotonic and (weakly) scalable functions within the framework of nonlinear Perron-Frobenius theory. This fact enables us to provide conditions for the existence of fixed points of nonnegative neural networks, and these conditions are weaker than those obtained recently using arguments in convex analysis. Furthermore, we prove that the shape of the fixed point set of nonnegative neural networks with nonnegative weights and biases is an interval, which under mild conditions degenerates to a point. These results are then used to obtain the existence of fixed points of more general types of nonnegative neural networks. The results of this paper contribute to the understanding of the behavior of autoencoders, and they provide insight into neural networks designed using the loop-unrolling technique, which can be seen as a fixed point searching algorithm. The chief theoretical results of this paper are verified in numerical simulations.  ( 2 min )
    Generalization Ability of Wide Neural Networks on $\mathbb{R}$. (arXiv:2302.05933v1 [stat.ML])
    We perform a study on the generalization ability of the wide two-layer ReLU neural network on $\mathbb{R}$. We first establish some spectral properties of the neural tangent kernel (NTK): $a)$ $K_{d}$, the NTK defined on $\mathbb{R}^{d}$, is positive definite; $b)$ $\lambda_{i}(K_{1})$, the $i$-th largest eigenvalue of $K_{1}$, is proportional to $i^{-2}$. We then show that: $i)$ when the width $m\rightarrow\infty$, the neural network kernel (NNK) uniformly converges to the NTK; $ii)$ the minimax rate of regression over the RKHS associated to $K_{1}$ is $n^{-2/3}$; $iii)$ if one adopts the early stopping strategy in training a wide neural network, the resulting neural network achieves the minimax rate; $iv)$ if one trains the neural network till it overfits the data, the resulting neural network can not generalize well. Finally, we provide an explanation to reconcile our theory and the widely observed ``benign overfitting phenomenon''.  ( 2 min )
    On Proper Learnability between Average- and Worst-case Robustness. (arXiv:2211.05656v4 [cs.LG] UPDATED)
    Recently, \cite{montasser2019vc} showed that finite VC dimension is not sufficient for \textit{proper} adversarially robust PAC learning. In light of this hardness result, there is a growing effort to study what type of relaxations to the adversarially robust PAC learning setup can enable proper learnability. In this work, we initiate the study of proper learning under relaxations of the worst-case robust loss. We give a family of robust loss relaxations under which VC classes are properly PAC learning with sample complexity close to what one would require in the standard PAC learning setup. On the other hand, we show that for an existing and natural relaxation of the worst-case robust loss, finite VC dimension is not sufficient for proper learning. Lastly, we give new generalization guarantees for the adversarially robust empirical risk minimizer.  ( 2 min )
    Calibrated Forecasts: The Minimax Proof. (arXiv:2209.05863v2 [econ.TH] UPDATED)
    A formal write-up of the simple proof (1995) of the existence of calibrated forecasts by the minimax theorem, which moreover shows that $N^3$ periods suffice to guarantee a calibration error of at most $1/N$.  ( 2 min )
    LIMEtree: Consistent and Faithful Surrogate Explanations of Multiple Classes. (arXiv:2005.01427v2 [cs.LG] UPDATED)
    Explainable machine learning provides tools to better understand predictive models and their decisions, but many such methods are limited to producing insights with respect to a single class. When generating explanations for several classes, reasoning over them to obtain a complete view may be difficult since they can present competing or contradictory evidence. To address this issue we introduce a novel paradigm of multi-class explanations. We outline the theory behind such techniques and propose a local surrogate model based on multi-output regression trees -- called LIMEtree -- which offers faithful and consistent explanations of multiple classes for individual predictions while being post-hoc, model-agnostic and data-universal. In addition to strong fidelity guarantees, our implementation supports (interactive) customisation of the explanatory insights and delivers a range of diverse explanation types, including counterfactual statements favoured in the literature. We evaluate our algorithm with a collection of quantitative experiments, a qualitative analysis based on explainability desiderata and a preliminary user study on an image classification task, comparing it to LIME. Our contributions demonstrate the benefits of multi-class explanations and wide-ranging advantages of our method across a diverse set scenarios.  ( 2 min )
    GFlowNet-EM for learning compositional latent variable models. (arXiv:2302.06576v1 [cs.LG])
    Latent variable models (LVMs) with discrete compositional latents are an important but challenging setting due to a combinatorially large number of possible configurations of the latents. A key tradeoff in modeling the posteriors over latents is between expressivity and tractable optimization. For algorithms based on expectation-maximization (EM), the E-step is often intractable without restrictive approximations to the posterior. We propose the use of GFlowNets, algorithms for sampling from an unnormalized density by learning a stochastic policy for sequential construction of samples, for this intractable E-step. By training GFlowNets to sample from the posterior over latents, we take advantage of their strengths as amortized variational inference algorithms for complex distributions over discrete structures. Our approach, GFlowNet-EM, enables the training of expressive LVMs with discrete compositional latents, as shown by experiments on non-context-free grammar induction and on images using discrete variational autoencoders (VAEs) without conditional independence enforced in the encoder.  ( 2 min )
    Universal Online Optimization in Dynamic Environments via Uniclass Prediction. (arXiv:2302.06066v1 [cs.LG])
    Recently, several universal methods have been proposed for online convex optimization which can handle convex, strongly convex and exponentially concave cost functions simultaneously. However, most of these algorithms have been designed with static regret minimization in mind, but this notion of regret may not be suitable for changing environments. To address this shortcoming, we propose a novel and intuitive framework for universal online optimization in dynamic environments. Unlike existing universal algorithms, our strategy does not rely on the construction of a set of experts and an accompanying meta-algorithm. Instead, we show that the problem of dynamic online optimization can be reduced to a uniclass prediction problem. By leaving the choice of uniclass loss function in the user's hands, they are able to control and optimize dynamic regret bounds, which in turn carry over into the original problem. To the best of our knowledge, this is the first paper proposing a universal approach with state-of-the-art dynamic regret guarantees even for general convex cost functions.  ( 2 min )
    A Simple Zero-shot Prompt Weighting Technique to Improve Prompt Ensembling in Text-Image Models. (arXiv:2302.06235v1 [cs.LG])
    Contrastively trained text-image models have the remarkable ability to perform zero-shot classification, that is, classifying previously unseen images into categories that the model has never been explicitly trained to identify. However, these zero-shot classifiers need prompt engineering to achieve high accuracy. Prompt engineering typically requires hand-crafting a set of prompts for individual downstream tasks. In this work, we aim to automate this prompt engineering and improve zero-shot accuracy through prompt ensembling. In particular, we ask "Given a large pool of prompts, can we automatically score the prompts and ensemble those that are most suitable for a particular downstream dataset, without needing access to labeled validation data?". We demonstrate that this is possible. In doing so, we identify several pathologies in a naive prompt scoring method where the score can be easily overconfident due to biases in pre-training and test data, and we propose a novel prompt scoring method that corrects for the biases. Using our proposed scoring method to create a weighted average prompt ensemble, our method outperforms equal average ensemble, as well as hand-crafted prompts, on ImageNet, 4 of its variants, and 11 fine-grained classification benchmarks, all while being fully automatic, optimization-free, and not requiring access to labeled validation data.  ( 2 min )
    Algorithmic Stability of Heavy-Tailed Stochastic Gradient Descent on Least Squares. (arXiv:2206.01274v3 [stat.ML] UPDATED)
    Recent studies have shown that heavy tails can emerge in stochastic optimization and that the heaviness of the tails have links to the generalization error. While these studies have shed light on interesting aspects of the generalization behavior in modern settings, they relied on strong topological and statistical regularity assumptions, which are hard to verify in practice. Furthermore, it has been empirically illustrated that the relation between heavy tails and generalization might not always be monotonic in practice, contrary to the conclusions of existing theory. In this study, we establish novel links between the tail behavior and generalization properties of stochastic gradient descent (SGD), through the lens of algorithmic stability. We consider a quadratic optimization problem and use a heavy-tailed stochastic differential equation (and its Euler discretization) as a proxy for modeling the heavy-tailed behavior emerging in SGD. We then prove uniform stability bounds, which reveal the following outcomes: (i) Without making any exotic assumptions, we show that SGD will not be stable if the stability is measured with the squared-loss $x\mapsto x^2$, whereas it in turn becomes stable if the stability is instead measured with a surrogate loss $x\mapsto |x|^p$ with some $p<2$. (ii) Depending on the variance of the data, there exists a \emph{`threshold of heavy-tailedness'} such that the generalization error decreases as the tails become heavier, as long as the tails are lighter than this threshold. This suggests that the relation between heavy tails and generalization is not globally monotonic. (iii) We prove matching lower-bounds on uniform stability, implying that our bounds are tight in terms of the heaviness of the tails. We support our theory with synthetic and real neural network experiments.  ( 2 min )
    Recursive Estimation of Conditional Kernel Mean Embeddings. (arXiv:2302.05955v1 [stat.ML])
    Kernel mean embeddings, a widely used technique in machine learning, map probability distributions to elements of a reproducing kernel Hilbert space (RKHS). For supervised learning problems, where input-output pairs are observed, the conditional distribution of outputs given the inputs is a key object. The input dependent conditional distribution of an output can be encoded with an RKHS valued function, the conditional kernel mean map. In this paper we present a new recursive algorithm to estimate the conditional kernel mean map in a Hilbert space valued $L_2$ space, that is in a Bochner space. We prove the weak and strong $L_2$ consistency of our recursive estimator under mild conditions. The idea is to generalize Stone's theorem for Hilbert space valued regression in a locally compact Polish space. We present new insights about conditional kernel mean embeddings and give strong asymptotic bounds regarding the convergence of the proposed recursive method. Finally, the results are demonstrated on three application domains: for inputs coming from Euclidean spaces, Riemannian manifolds and locally compact subsets of function spaces.  ( 2 min )
    Variational Bayesian Neural Networks via Resolution of Singularities. (arXiv:2302.06035v1 [stat.ML])
    In this work, we advocate for the importance of singular learning theory (SLT) as it pertains to the theory and practice of variational inference in Bayesian neural networks (BNNs). To begin, using SLT, we lay to rest some of the confusion surrounding discrepancies between downstream predictive performance measured via e.g., the test log predictive density, and the variational objective. Next, we use the SLT-corrected asymptotic form for singular posterior distributions to inform the design of the variational family itself. Specifically, we build upon the idealized variational family introduced in \citet{bhattacharya_evidence_2020} which is theoretically appealing but practically intractable. Our proposal takes shape as a normalizing flow where the base distribution is a carefully-initialized generalized gamma. We conduct experiments comparing this to the canonical Gaussian base distribution and show improvements in terms of variational free energy and variational generalization error.  ( 2 min )
    Precise Asymptotic Analysis of Deep Random Feature Models. (arXiv:2302.06210v1 [stat.ML])
    We provide exact asymptotic expressions for the performance of regression by an $L-$layer deep random feature (RF) model, where the input is mapped through multiple random embedding and non-linear activation functions. For this purpose, we establish two key steps: First, we prove a novel universality result for RF models and deterministic data, by which we demonstrate that a deep random feature model is equivalent to a deep linear Gaussian model that matches it in the first and second moments, at each layer. Second, we make use of the convex Gaussian Min-Max theorem multiple times to obtain the exact behavior of deep RF models. We further characterize the variation of the eigendistribution in different layers of the equivalent Gaussian model, demonstrating that depth has a tangible effect on model performance despite the fact that only the last layer of the model is being trained.  ( 2 min )
    Deep Reinforcement Learning for Unmanned Aerial Vehicle-Assisted Vehicular Networks. (arXiv:1906.05015v11 [cs.LG] UPDATED)
    Unmanned aerial vehicles (UAVs) are envisioned to complement the 5G communication infrastructure in future smart cities. Hot spots easily appear in road intersections, where effective communication among vehicles is challenging. UAVs may serve as relays with the advantages of low price, easy deployment, line-of-sight links, and flexible mobility. In this paper, we study a UAV-assisted vehicular network where the UAV jointly adjusts its transmission control (power and channel) and 3D flight to maximize the total throughput. First, we formulate a Markov decision process (MDP) problem by modeling the mobility of the UAV/vehicles and the state transitions. Secondly, we solve the target problem using a deep reinforcement learning method, namely, the deep deterministic policy gradient (DDPG), and propose three solutions with different control objectives. Deep reinforcement learning methods obtain the optimal policy through the interactions with the environment without knowing the environment variables. Considering that environment variables in our problem are unknown and unmeasurable, we choose a deep reinforcement learning method to solve it. Moreover, considering the energy consumption of 3D flight, we extend the proposed solutions to maximize the total throughput per unit energy. To encourage or discourage the UAV's mobility according to its prediction, the DDPG framework is modified, where the UAV adjusts its learning rate automatically. Thirdly, in a simplified model with small state space and action space, we verify the optimality of proposed algorithms. Comparing with two baseline schemes, we demonstrate the effectiveness of proposed algorithms in a realistic model.  ( 3 min )
    Wide stochastic networks: Gaussian limit and PAC-Bayesian training. (arXiv:2106.09798v3 [stat.ML] UPDATED)
    The limit of infinite width allows for substantial simplifications in the analytical study of over-parameterised neural networks. With a suitable random initialisation, an extremely large network exhibits an approximately Gaussian behaviour. In the present work, we establish a similar result for a simple stochastic architecture whose parameters are random variables, holding both before and during training. The explicit evaluation of the output distribution allows for a PAC-Bayesian training procedure that directly optimises the generalisation bound. For a large but finite-width network, we show empirically on MNIST that this training approach can outperform standard PAC-Bayesian methods.  ( 2 min )
    When Can We Track Significant Preference Shifts in Dueling Bandits?. (arXiv:2302.06595v1 [cs.LG])
    The $K$-armed dueling bandits problem, where the feedback is in the form of noisy pairwise preferences, has been widely studied due its applications in information retrieval, recommendation systems, etc. Motivated by concerns that user preferences/tastes can evolve over time, we consider the problem of dueling bandits with distribution shifts. Specifically, we study the recent notion of significant shifts (Suk and Kpotufe, 2022), and ask whether one can design an adaptive algorithm for the dueling problem with $O(\sqrt{K\tilde{L}T})$ dynamic regret, where $\tilde{L}$ is the (unknown) number of significant shifts in preferences. We show that the answer to this question depends on the properties of underlying preference distributions. Firstly, we give an impossibility result that rules out any algorithm with $O(\sqrt{K\tilde{L}T})$ dynamic regret under the well-studied Condorcet and SST classes of preference distributions. Secondly, we show that $\text{SST} \cap \text{STI}$ is the largest amongst popular classes of preference distributions where it is possible to design such an algorithm. Overall, our results provides an almost complete resolution of the above question for the hierarchy of distribution classes.  ( 2 min )
    The infinite Viterbi alignment and decay-convexity. (arXiv:1810.04115v5 [math.PR] UPDATED)
    The infinite Viterbi alignment is the limiting maximum a-posteriori estimate of the unobserved path in a hidden Markov model as the length of the time horizon grows. For models on state-space $\mathbb{R}^{d}$ satisfying a new ``decay-convexity'' condition, we develop an approach to existence of the infinite Viterbi alignment in an infinite dimensional Hilbert space. Quantitative bounds on the distance to the infinite Viterbi alignment, which are the first of their kind, are derived and used to illustrate how approximate estimation via parallelization can be accurate and scaleable to high-dimensional problems because the rate of convergence to the infinite Viterbi alignment does not necessarily depend on $d$. The results are applied to approximate estimation via parallelization and a model of neural population activity.  ( 2 min )
    Kernel Ridge Regression Inference. (arXiv:2302.06578v1 [math.ST])
    We provide uniform confidence bands for kernel ridge regression (KRR), with finite sample guarantees. KRR is ubiquitous, yet--to our knowledge--this paper supplies the first exact, uniform confidence bands for KRR in the non-parametric regime where the regularization parameter $\lambda$ converges to 0, for general data distributions. Our proposed uniform confidence band is based on a new, symmetrized multiplier bootstrap procedure with a closed form solution, which allows for valid uncertainty quantification without assumptions on the bias. To justify the procedure, we derive non-asymptotic, uniform Gaussian and bootstrap couplings for partial sums in a reproducing kernel Hilbert space (RKHS) with bounded kernel. Our results imply strong approximation for empirical processes indexed by the RKHS unit ball, with sharp, logarithmic dependence on the covering number.  ( 2 min )
    Understanding Multimodal Contrastive Learning and Incorporating Unpaired Data. (arXiv:2302.06232v1 [cs.LG])
    Language-supervised vision models have recently attracted great attention in computer vision. A common approach to build such models is to use contrastive learning on paired data across the two modalities, as exemplified by Contrastive Language-Image Pre-Training (CLIP). In this paper, under linear representation settings, (i) we initiate the investigation of a general class of nonlinear loss functions for multimodal contrastive learning (MMCL) including CLIP loss and show its connection to singular value decomposition (SVD). Namely, we show that each step of loss minimization by gradient descent can be seen as performing SVD on a contrastive cross-covariance matrix. Based on this insight, (ii) we analyze the performance of MMCL. We quantitatively show that the feature learning ability of MMCL can be better than that of unimodal contrastive learning applied to each modality even under the presence of wrongly matched pairs. This characterizes the robustness of MMCL to noisy data. Furthermore, when we have access to additional unpaired data, (iii) we propose a new MMCL loss that incorporates additional unpaired datasets. We show that the algorithm can detect the ground-truth pairs and improve performance by fully exploiting unpaired datasets. The performance of the proposed algorithm was verified by numerical experiments.  ( 2 min )
    Physics informed WNO. (arXiv:2302.05925v1 [stat.ML])
    Deep neural operators are recognized as an effective tool for learning solution operators of complex partial differential equations (PDEs). As compared to laborious analytical and computational tools, a single neural operator can predict solutions of PDEs for varying initial or boundary conditions and different inputs. A recently proposed Wavelet Neural Operator (WNO) is one such operator that harnesses the advantage of time-frequency localization of wavelets to capture the manifolds in the spatial domain effectively. While WNO has proven to be a promising method for operator learning, the data-hungry nature of the framework is a major shortcoming. In this work, we propose a physics-informed WNO for learning the solution operators of families of parametric PDEs without labeled training data. The efficacy of the framework is validated and illustrated with four nonlinear spatiotemporal systems relevant to various fields of engineering and science.  ( 2 min )
    Distribution-Free Model for Community Detection. (arXiv:2111.07495v4 [cs.SI] UPDATED)
    Community detection for unweighted networks has been widely studied in network analysis, but the case of weighted networks remains a challenge. This paper proposes a general Distribution-Free Model (DFM) for weighted networks in which nodes are partitioned into different communities. DFM can be seen as a generalization of the famous stochastic blockmodels from unweighted networks to weighted networks. DFM does not require prior knowledge of a specific distribution for elements of the adjacency matrix but only the expected value. In particular, signed networks with latent community structures can be modeled by DFM. We build a theoretical guarantee to show that a simple spectral clustering algorithm stably yields consistent community detection under DFM. We also propose a four-step data generation process to generate adjacency matrices with missing edges by combining DFM, noise matrix, and a model for unweighted networks. Using experiments with simulated and real datasets, we show that some benchmark algorithms can successfully recover community membership for weighted networks generated by the proposed data generation process.  ( 2 min )
    Incorporating Expert Opinion on Observable Quantities into Statistical Models -- A General Framework. (arXiv:2302.06391v1 [stat.ML])
    This article describes an approach to incorporate expert opinion on observable quantities through the use of a loss function which updates a prior belief as opposed to specifying parameters on the priors. Eliciting information on observable quantities allows experts to provide meaningful information on a quantity familiar to them, in contrast to elicitation on model parameters, which may be subject to interactions with other parameters or non-linear transformations before obtaining an observable quantity. The approach to incorporating expert opinion described in this paper is distinctive in that we do not specify a prior to match an expert's opinion on observed quantity, rather we obtain a posterior by updating the model parameters through a loss function. This loss function contains the observable quantity, expressed a function of the parameters, and is related to the expert's opinion which is typically operationalized as a statistical distribution. Parameters which generate observable quantities which are further from the expert's opinion incur a higher loss, allowing for the model parameters to be estimated based on their fidelity to both the data and expert opinion, with the relative strength determined by the number of observations and precision of the elicited belief. Including expert opinion in this fashion allows for a flexible specification of the opinion and in many situations is straightforward to implement with commonly used probabilistic programming software. We highlight this using three worked examples of varying model complexity including survival models, a multivariate normal distribution and a regression problem.  ( 2 min )
    Beyond Uniform Smoothness: A Stopped Analysis of Adaptive SGD. (arXiv:2302.06570v1 [stat.ML])
    This work considers the problem of finding a first-order stationary point of a non-convex function with potentially unbounded smoothness constant using a stochastic gradient oracle. We focus on the class of $(L_0,L_1)$-smooth functions proposed by Zhang et al. (ICLR'20). Empirical evidence suggests that these functions more closely captures practical machine learning problems as compared to the pervasive $L_0$-smoothness. This class is rich enough to include highly non-smooth functions, such as $\exp(L_1 x)$ which is $(0,\mathcal{O}(L_1))$-smooth. Despite the richness, an emerging line of works achieves the $\widetilde{\mathcal{O}}(\frac{1}{\sqrt{T}})$ rate of convergence when the noise of the stochastic gradients is deterministically and uniformly bounded. This noise restriction is not required in the $L_0$-smooth setting, and in many practical settings is either not satisfied, or results in weaker convergence rates with respect to the noise scaling of the convergence rate. We develop a technique that allows us to prove $\mathcal{O}(\frac{\mathrm{poly}\log(T)}{\sqrt{T}})$ convergence rates for $(L_0,L_1)$-smooth functions without assuming uniform bounds on the noise support. The key innovation behind our results is a carefully constructed stopping time $\tau$ which is simultaneously "large" on average, yet also allows us to treat the adaptive step sizes before $\tau$ as (roughly) independent of the gradients. For general $(L_0,L_1)$-smooth functions, our analysis requires the mild restriction that the multiplicative noise parameter $\sigma_1 1$.  ( 2 min )
    Density-Softmax: Scalable and Distance-Aware Uncertainty Estimation under Distribution Shifts. (arXiv:2302.06495v1 [cs.LG])
    Prevalent deep learning models suffer from significant over-confidence under distribution shifts. In this paper, we propose Density-Softmax, a single deterministic approach for uncertainty estimation via a combination of density function with the softmax layer. By using the latent representation's likelihood value, our approach produces more uncertain predictions when test samples are distant from the training samples. Theoretically, we prove that Density-Softmax is distance aware, which means its associated uncertainty metrics are monotonic functions of distance metrics. This has been shown to be a necessary condition for a neural network to produce high-quality uncertainty estimation. Empirically, our method enjoys similar computational efficiency as standard softmax on shifted CIFAR-10, CIFAR-100, and ImageNet dataset across modern deep learning architectures. Notably, Density-Softmax uses 4 times fewer parameters than Deep Ensembles and 6 times lower latency than Rank-1 Bayesian Neural Network, while obtaining competitive predictive performance and lower calibration errors under distribution shifts.  ( 2 min )
    One-Shot Federated Conformal Prediction. (arXiv:2302.06322v1 [stat.ML])
    In this paper, we introduce a conformal prediction method to construct prediction sets in a oneshot federated learning setting. More specifically, we define a quantile-of-quantiles estimator and prove that for any distribution, it is possible to output prediction sets with desired coverage in only one round of communication. To mitigate privacy issues, we also describe a locally differentially private version of our estimator. Finally, over a wide range of experiments, we show that our method returns prediction sets with coverage and length very similar to those obtained in a centralized setting. Overall, these results demonstrate that our method is particularly well-suited to perform conformal predictions in a one-shot federated learning setting.  ( 2 min )
    Mean Field Optimization Problem Regularized by Fisher Information. (arXiv:2302.05938v1 [math.PR])
    Recently there is a rising interest in the research of mean field optimization, in particular because of its role in analyzing the training of neural networks. In this paper by adding the Fisher Information as the regularizer, we relate the regularized mean field optimization problem to a so-called mean field Schrodinger dynamics. We develop an energy-dissipation method to show that the marginal distributions of the mean field Schrodinger dynamics converge exponentially quickly towards the unique minimizer of the regularized optimization problem. Remarkably, the mean field Schrodinger dynamics is proved to be a gradient flow on the probability measure space with respect to the relative entropy. Finally we propose a Monte Carlo method to sample the marginal distributions of the mean field Schrodinger dynamics.  ( 2 min )
    Isotopic envelope identification by analysis of the spatial distribution of components in MALDI-MSI data. (arXiv:2302.06051v1 [stat.ML])
    One of the significant steps in the process leading to the identification of proteins is mass spectrometry, which allows for obtaining information about the structure of proteins. Removing isotope peaks from the mass spectrum is vital and it is done in a process called deisotoping. There are different algorithms for deisotoping, but they have their limitations, they are dedicated to different methods of mass spectrometry. Data from experiments performed with the MALDI-ToF technique are characterized by high dimensionality. This paper presents a method for identifying isotope envelopes in MALDI-ToF molecular imaging data based on the Mamdani-Assilan fuzzy system and spatial maps of the molecular distribution of peaks included in the isotopic envelope. Several image texture measures were used to evaluate spatial molecular distribution maps. The algorithm was tested on eight datasets obtained from the MALDI-ToF experiment on samples from the National Institute of Oncology in Gliwice from patients with cancer of the head and neck region. The data were subjected to pre-processing and feature extraction. The results were collected and compared with three existing deisotoping algorithms. The analysis of the obtained results showed that the method for identifying isotopic envelopes proposed in this paper enables the detection of overlapping envelopes by using the approach oriented to study peak pairs. Moreover, the proposed algorithm enables the analysis of large data sets.  ( 2 min )
    Dimension Reduction and MARS. (arXiv:2302.05790v1 [stat.ME])
    The multivariate adaptive regression spline (MARS) is one of the popular estimation methods for nonparametric multivariate regressions. However, as MARS is based on marginal splines, to incorporate interactions of covariates, products of the marginal splines must be used, which leads to an unmanageable number of basis functions when the order of interaction is high and results in low estimation efficiency. In this paper, we improve the performance of MARS by using linear combinations of the covariates which achieve sufficient dimension reduction. The special basis functions of MARS facilitate calculation of gradients of the regression function, and estimation of the linear combinations is obtained via eigen-analysis of the outer-product of the gradients. Under some technical conditions, the asymptotic theory is established for the proposed estimation method. Numerical studies including both simulation and empirical applications show its effectiveness in dimension reduction and improvement over MARS and other commonly-used nonparametric methods in regression estimation and prediction.  ( 2 min )
    Optimizing Orthogonalized Tensor Deflation via Random Tensor Theory. (arXiv:2302.05798v1 [stat.ML])
    This paper tackles the problem of recovering a low-rank signal tensor with possibly correlated components from a random noisy tensor, or so-called spiked tensor model. When the underlying components are orthogonal, they can be recovered efficiently using tensor deflation which consists of successive rank-one approximations, while non-orthogonal components may alter the tensor deflation mechanism, thereby preventing efficient recovery. Relying on recently developed random tensor tools, this paper deals precisely with the non-orthogonal case by deriving an asymptotic analysis of a parameterized deflation procedure performed on an order-three and rank-two spiked tensor. Based on this analysis, an efficient tensor deflation algorithm is proposed by optimizing the parameter introduced in the deflation mechanism, which in turn is proven to be optimal by construction for the studied tensor model. The same ideas could be extended to more general low-rank tensor models, e.g., higher ranks and orders, leading to more efficient tensor methods with a broader impact on machine learning and beyond.  ( 2 min )
    Windowed Fourier Analysis for Signal Processing on Graph Bundles. (arXiv:2302.05592v1 [eess.SP])
    We consider the task of representing signals supported on graph bundles, which are generalizations of product graphs that allow for "twists" in the product structure. Leveraging the localized product structure of a graph bundle, we demonstrate how a suitable partition of unity over the base graph can be used to lift the signal on the graph into a space where a product factorization can be readily applied. Motivated by the locality of this procedure, we demonstrate that bases for the signal spaces of the components of the graph bundle can be lifted in the same way, yielding a basis for the signal space of the total graph. We demonstrate this construction on synthetic graphs, as well as with an analysis of the energy landscape of conformational manifolds in stereochemistry.  ( 2 min )
    Communication and Storage Efficient Federated Split Learning. (arXiv:2302.05599v1 [cs.IT])
    Federated learning (FL) is a popular distributed machine learning (ML) paradigm, but is often limited by significant communication costs and edge device computation capabilities. Federated Split Learning (FSL) preserves the parallel model training principle of FL, with a reduced device computation requirement thanks to splitting the ML model between the server and clients. However, FSL still incurs very high communication overhead due to transmitting the smashed data and gradients between the clients and the server in each global round. Furthermore, the server has to maintain separate models for every client, resulting in a significant computation and storage requirement that grows linearly with the number of clients. This paper tries to solve these two issues by proposing a communication and storage efficient federated and split learning (CSE-FSL) strategy, which utilizes an auxiliary network to locally update the client models while keeping only a single model at the server, hence avoiding the communication of gradients from the server and greatly reducing the server resource requirement. Communication cost is further reduced by only sending the smashed data in selected epochs from the clients. We provide a rigorous theoretical analysis of CSE-FSL that guarantees its convergence for non-convex loss functions. Extensive experimental results demonstrate that CSE-FSL has a significant communication reduction over existing FSL techniques while achieving state-of-the-art convergence and model accuracy, using several real-world FL tasks.  ( 2 min )
    Cyclic and Randomized Stepsizes Invoke Heavier Tails in SGD. (arXiv:2302.05516v1 [stat.ML])
    Cyclic and randomized stepsizes are widely used in the deep learning practice and can often outperform standard stepsize choices such as constant stepsize in SGD. Despite their empirical success, not much is currently known about when and why they can theoretically improve the generalization performance. We consider a general class of Markovian stepsizes for learning, which contain i.i.d. random stepsize, cyclic stepsize as well as the constant stepsize as special cases, and motivated by the literature which shows that heaviness of the tails (measured by the so-called "tail-index") in the SGD iterates is correlated with generalization, we study tail-index and provide a number of theoretical results that demonstrate how the tail-index varies on the stepsize scheduling. Our results bring a new understanding of the benefits of cyclic and randomized stepsizes compared to constant stepsize in terms of the tail behavior. We illustrate our theory on linear regression experiments and show through deep learning experiments that Markovian stepsizes can achieve even a heavier tail and be a viable alternative to cyclic and i.i.d. randomized stepsize rules.  ( 2 min )
    Pushing the Accuracy-Group Robustness Frontier with Introspective Self-play. (arXiv:2302.05807v1 [cs.LG])
    Standard empirical risk minimization (ERM) training can produce deep neural network (DNN) models that are accurate on average but under-perform in under-represented population subgroups, especially when there are imbalanced group distributions in the long-tailed training data. Therefore, approaches that improve the accuracy-group robustness trade-off frontier of a DNN model (i.e. improving worst-group accuracy without sacrificing average accuracy, or vice versa) is of crucial importance. Uncertainty-based active learning (AL) can potentially improve the frontier by preferentially sampling underrepresented subgroups to create a more balanced training dataset. However, the quality of uncertainty estimates from modern DNNs tend to degrade in the presence of spurious correlations and dataset bias, compromising the effectiveness of AL for sampling tail groups. In this work, we propose Introspective Self-play (ISP), a simple approach to improve the uncertainty estimation of a deep neural network under dataset bias, by adding an auxiliary introspection task requiring a model to predict the bias for each data point in addition to the label. We show that ISP provably improves the bias-awareness of the model representation and the resulting uncertainty estimates. On two real-world tabular and language tasks, ISP serves as a simple "plug-in" for AL model training, consistently improving both the tail-group sampling rate and the final accuracy-fairness trade-off frontier of popular AL methods.  ( 2 min )
    From high-dimensional & mean-field dynamics to dimensionless ODEs: A unifying approach to SGD in two-layers networks. (arXiv:2302.05882v1 [stat.ML])
    This manuscript investigates the one-pass stochastic gradient descent (SGD) dynamics of a two-layer neural network trained on Gaussian data and labels generated by a similar, though not necessarily identical, target function. We rigorously analyse the limiting dynamics via a deterministic and low-dimensional description in terms of the sufficient statistics for the population risk. Our unifying analysis bridges different regimes of interest, such as the classical gradient-flow regime of vanishing learning rate, the high-dimensional regime of large input dimension, and the overparameterised "mean-field" regime of large network width, covering as well the intermediate regimes where the limiting dynamics is determined by the interplay between these behaviours. In particular, in the high-dimensional limit, the infinite-width dynamics is found to remain close to a low-dimensional subspace spanned by the target principal directions. Our results therefore provide a unifying picture of the limiting SGD dynamics with synthetic data.  ( 2 min )
    Sequential Underspecified Instrument Selection for Cause-Effect Estimation. (arXiv:2302.05684v1 [stat.ME])
    Instrumental variable (IV) methods are used to estimate causal effects in settings with unobserved confounding, where we cannot directly experiment on the treatment variable. Instruments are variables which only affect the outcome indirectly via the treatment variable(s). Most IV applications focus on low-dimensional treatments and crucially require at least as many instruments as treatments. This assumption is restrictive: in the natural sciences we often seek to infer causal effects of high-dimensional treatments (e.g., the effect of gene expressions or microbiota on health and disease), but can only run few experiments with a limited number of instruments (e.g., drugs or antibiotics). In such underspecified problems, the full treatment effect is not identifiable in a single experiment even in the linear case. We show that one can still reliably recover the projection of the treatment effect onto the instrumented subspace and develop techniques to consistently combine such partial estimates from different sets of instruments. We then leverage our combined estimators in an algorithm that iteratively proposes the most informative instruments at each round of experimentation to maximize the overall information about the full causal effect.  ( 2 min )
    Achieving acceleration despite very noisy gradients. (arXiv:2302.05515v1 [stat.ML])
    We present a novel momentum-based first order optimization method (AGNES) which provably achieves acceleration for convex minimization, even if the stochastic noise in the gradient estimates is many orders of magnitude larger than the gradient itself. Here we model the noise as having a variance which is proportional to the magnitude of the underlying gradient. We argue, based upon empirical evidence, that this is appropriate for mini-batch gradients in overparameterized deep learning. Furthermore, we demonstrate that the method achieves competitive performance in the training of CNNs on MNIST and CIFAR-10.  ( 2 min )
    Tighter PAC-Bayes Bounds Through Coin-Betting. (arXiv:2302.05829v1 [cs.LG])
    We consider the problem of estimating the mean of a sequence of random elements $f(X_1, \theta)$ $, \ldots, $ $f(X_n, \theta)$ where $f$ is a fixed scalar function, $S=(X_1, \ldots, X_n)$ are independent random variables, and $\theta$ is a possibly $S$-dependent parameter. An example of such a problem would be to estimate the generalization error of a neural network trained on $n$ examples where $f$ is a loss function. Classically, this problem is approached through concentration inequalities holding uniformly over compact parameter sets of functions $f$, for example as in Rademacher or VC type analysis. However, in many problems, such inequalities often yield numerically vacuous estimates. Recently, the \emph{PAC-Bayes} framework has been proposed as a better alternative for this class of problems for its ability to often give numerically non-vacuous bounds. In this paper, we show that we can do even better: we show how to refine the proof strategy of the PAC-Bayes bounds and achieve \emph{even tighter} guarantees. Our approach is based on the \emph{coin-betting} framework that derives the numerically tightest known time-uniform concentration inequalities from the regret guarantees of online gambling algorithms. In particular, we derive the first PAC-Bayes concentration inequality based on the coin-betting approach that holds simultaneously for all sample sizes. We demonstrate its tightness showing that by \emph{relaxing} it we obtain a number of previous results in a closed form including Bernoulli-KL and empirical Bernstein inequalities. Finally, we propose an efficient algorithm to numerically calculate confidence sequences from our bound, which often generates nonvacuous confidence bounds even with one sample, unlike the state-of-the-art PAC-Bayes bounds.  ( 2 min )
    Graph Neural Network-Inspired Kernels for Gaussian Processes in Semi-Supervised Learning. (arXiv:2302.05828v1 [cs.LG])
    Gaussian processes (GPs) are an attractive class of machine learning models because of their simplicity and flexibility as building blocks of more complex Bayesian models. Meanwhile, graph neural networks (GNNs) emerged recently as a promising class of models for graph-structured data in semi-supervised learning and beyond. Their competitive performance is often attributed to a proper capturing of the graph inductive bias. In this work, we introduce this inductive bias into GPs to improve their predictive performance for graph-structured data. We show that a prominent example of GNNs, the graph convolutional network, is equivalent to some GP when its layers are infinitely wide; and we analyze the kernel universality and the limiting behavior in depth. We further present a programmable procedure to compose covariance kernels inspired by this equivalence and derive example kernels corresponding to several interesting members of the GNN family. We also propose a computationally efficient approximation of the covariance matrix for scalable posterior inference with large-scale data. We demonstrate that these graph-based kernels lead to competitive classification and regression performance, as well as advantages in computation time, compared with the respective GNNs.  ( 2 min )
    An unsupervised learning approach for predicting wind farm power and downstream wakes using weather patterns. (arXiv:2302.05886v1 [stat.ML])
    Wind energy resource assessment typically requires numerical models, but such models are too computationally intensive to consider multi-year timescales. Increasingly, unsupervised machine learning techniques are used to identify a small number of representative weather patterns to simulate long-term behaviour. Here we develop a novel wind energy workflow that for the first time combines weather patterns derived from unsupervised clustering techniques with numerical weather prediction models (here WRF) to obtain efficient and accurate long-term predictions of power and downstream wakes from an entire wind farm. We use ERA5 reanalysis data clustering not only on low altitude pressure but also, for the first time, on the more relevant variable of wind velocity. We also compare the use of large-scale and local-scale domains for clustering. A WRF simulation is run at each of the cluster centres and the results are aggregated using a novel post-processing technique. By applying our workflow to two different regions, we show that our long-term predictions agree with those from a year of WRF simulations but require less than 2% of the computational time. The most accurate results are obtained when clustering on wind velocity. Moreover, clustering over the Europe-wide domain is sufficient for predicting wind farm power output, but downstream wake predictions benefit from the use of smaller domains. Finally, we show that these downstream wakes can affect the local weather patterns. Our approach facilitates multi-year predictions of power output and downstream farm wakes, by providing a fast, accurate and flexible methodology that is applicable to any global region. Moreover, these accurate long-term predictions of downstream wakes provide the first tool to help mitigate the effects of wind energy loss downstream of wind farms, since they can be used to determine optimum wind farm locations.  ( 3 min )
    I$^2$SB: Image-to-Image Schr\"odinger Bridge. (arXiv:2302.05872v1 [cs.CV])
    We propose Image-to-Image Schr\"odinger Bridge (I$^2$SB), a new class of conditional diffusion models that directly learn the nonlinear diffusion processes between two given distributions. These diffusion bridges are particularly useful for image restoration, as the degraded images are structurally informative priors for reconstructing the clean images. I$^2$SB belongs to a tractable class of Schr\"odinger bridge, the nonlinear extension to score-based models, whose marginal distributions can be computed analytically given boundary pairs. This results in a simulation-free framework for nonlinear diffusions, where the I$^2$SB training becomes scalable by adopting practical techniques used in standard diffusion models. We validate I$^2$SB in solving various image restoration tasks, including inpainting, super-resolution, deblurring, and JPEG restoration on ImageNet 256x256 and show that I$^2$SB surpasses standard conditional diffusion models with more interpretable generative processes. Moreover, I$^2$SB matches the performance of inverse methods that additionally require the knowledge of the corruption operators. Our work opens up new algorithmic opportunities for developing efficient nonlinear diffusion models on a large scale. scale. Project page: https://i2sb.github.io/  ( 2 min )
    Confidence and Uncertainty Assessment for Distributional Random Forests. (arXiv:2302.05761v1 [math.ST])
    The Distributional Random Forest (DRF) is a recently introduced Random Forest algorithm to estimate multivariate conditional distributions. Due to its general estimation procedure, it can be employed to estimate a wide range of targets such as conditional average treatment effects, conditional quantiles, and conditional correlations. However, only results about the consistency and convergence rate of the DRF prediction are available so far. We characterize the asymptotic distribution of DRF and develop a bootstrap approximation of it. This allows us to derive inferential tools for quantifying standard errors and the construction of confidence regions that have asymptotic coverage guarantees. In simulation studies, we empirically validate the developed theory for inference of low-dimensional targets and for testing distributional differences between two populations.  ( 2 min )
    A High-dimensional Convergence Theorem for U-statistics with Applications to Kernel-based Testing. (arXiv:2302.05686v1 [math.ST])
    We prove a convergence theorem for U-statistics of degree two, where the data dimension $d$ is allowed to scale with sample size $n$. We find that the limiting distribution of a U-statistic undergoes a phase transition from the non-degenerate Gaussian limit to the degenerate limit, regardless of its degeneracy and depending only on a moment ratio. A surprising consequence is that a non-degenerate U-statistic in high dimensions can have a non-Gaussian limit with a larger variance and asymmetric distribution. Our bounds are valid for any finite $n$ and $d$, independent of individual eigenvalues of the underlying function, and dimension-independent under a mild assumption. As an application, we apply our theory to two popular kernel-based distribution tests, MMD and KSD, whose high-dimensional performance has been challenging to study. In a simple empirical setting, our results correctly predict how the test power at a fixed threshold scales with $d$ and the bandwidth.  ( 2 min )
    Robust Knowledge Transfer in Tiered Reinforcement Learning. (arXiv:2302.05534v1 [cs.LG])
    In this paper, we study the Tiered Reinforcement Learning setting, a parallel transfer learning framework, where the goal is to transfer knowledge from the low-tier (source) task to the high-tier (target) task to reduce the exploration risk of the latter while solving the two tasks in parallel. Unlike previous work, we do not assume the low-tier and high-tier tasks share the same dynamics or reward functions, and focus on robust knowledge transfer without prior knowledge on the task similarity. We identify a natural and necessary condition called the "Optimal Value Dominance" for our objective. Under this condition, we propose novel online learning algorithms such that, for the high-tier task, it can achieve constant regret on partial states depending on the task similarity and retain near-optimal regret when the two tasks are dissimilar, while for the low-tier task, it can keep near-optimal without making sacrifice. Moreover, we further study the setting with multiple low-tier tasks, and propose a novel transfer source selection mechanism, which can ensemble the information from all low-tier tasks and allow provable benefits on a much larger state-action space.  ( 2 min )
    Deep Neural Networks for Nonparametric Interaction Models with Diverging Dimension. (arXiv:2302.05851v1 [math.ST])
    Deep neural networks have achieved tremendous success due to their representation power and adaptation to low-dimensional structures. Their potential for estimating structured regression functions has been recently established in the literature. However, most of the studies require the input dimension to be fixed and consequently ignore the effect of dimension on the rate of convergence and hamper their applications to modern big data with high dimensionality. In this paper, we bridge this gap by analyzing a $k^{th}$ order nonparametric interaction model in both growing dimension scenarios ($d$ grows with $n$ but at a slower rate) and in high dimension ($d \gtrsim n$). In the latter case, sparsity assumptions and associated regularization are required in order to obtain optimal rates of convergence. A new challenge in diverging dimension setting is in calculation mean-square error, the covariance terms among estimated additive components are an order of magnitude larger than those of the variances and they can deteriorate statistical properties without proper care. We introduce a critical debiasing technique to amend the problem. We show that under certain standard assumptions, debiased deep neural networks achieve a minimax optimal rate both in terms of $(n, d)$. Our proof techniques rely crucially on a novel debiasing technique that makes the covariances of additive components negligible in the mean-square error calculation. In addition, we establish the matching lower bounds.  ( 2 min )
    Differentially Private Normalizing Flows for Density Estimation, Data Synthesis, and Variational Inference with Application to Electronic Health Records. (arXiv:2302.05787v1 [stat.ML])
    Electronic health records (EHR) often contain sensitive medical information about individual patients, posing significant limitations to sharing or releasing EHR data for downstream learning and inferential tasks. We use normalizing flows (NF), a family of deep generative models, to estimate the probability density of a dataset with differential privacy (DP) guarantees, from which privacy-preserving synthetic data are generated. We apply the technique to an EHR dataset containing patients with pulmonary hypertension. We assess the learning and inferential utility of the synthetic data by comparing the accuracy in the prediction of the hypertension status and variational posterior distribution of the parameters of a physics-based model. In addition, we use a simulated dataset from a nonlinear model to compare the results from variational inference (VI) based on privacy-preserving synthetic data, and privacy-preserving VI obtained from directly privatizing NFs for VI with DP guarantees given the original non-private dataset. The results suggest that synthetic data generated through differentially private density estimation with NF can yield good utility at a reasonable privacy cost. We also show that VI obtained from differentially private NF based on the free energy bound loss may produce variational approximations with significantly altered correlation structure, and loss formulations based on alternative dissimilarity metrics between two distributions might provide improved results.  ( 2 min )
    Global Convergence Rate of Deep Equilibrium Models with General Activations. (arXiv:2302.05797v1 [stat.ML])
    In a recent paper, Ling et al. investigated the over-parametrized Deep Equilibrium Model (DEQ) with ReLU activation and proved that the gradient descent converges to a globally optimal solution at a linear convergence rate for the quadratic loss function. In this paper, we show that this fact still holds for DEQs with any general activation which has bounded first and second derivatives. Since the new activation function is generally non-linear, a general population Gram matrix is designed, and a new form of dual activation with Hermite polynomial expansion is developed.  ( 2 min )
    Koopman-Based Bound for Generalization: New Aspect of Neural Networks Regarding Nonlinear Noise Filtering. (arXiv:2302.05825v1 [cs.LG])
    We propose a new bound for generalization of neural networks using Koopman operators. Unlike most of the existing works, we focus on the role of the final nonlinear transformation of the networks. Our bound is described by the reciprocal of the determinant of the weight matrices and is tighter than existing norm-based bounds when the weight matrices do not have small singular values. According to existing theories about the low-rankness of the weight matrices, it may be counter-intuitive that we focus on the case where singular values of weight matrices are not small. However, motivated by the final nonlinear transformation, we can see that our result sheds light on a new perspective regarding a noise filtering property of neural networks. Since our bound comes from Koopman operators, this work also provides a connection between operator-theoretic analysis and generalization of neural networks. Numerical results support the validity of our theoretical results.  ( 2 min )
    Distributional GFlowNets with Quantile Flows. (arXiv:2302.05793v1 [cs.LG])
    Generative Flow Networks (GFlowNets) are a new family of probabilistic samplers where an agent learns a stochastic policy for generating complex combinatorial structure through a series of decision-making steps. Despite being inspired from reinforcement learning, the current GFlowNet framework is relatively limited in its applicability and cannot handle stochasticity in the reward function. In this work, we adopt a distributional paradigm for GFlowNets, turning each flow function into a distribution, thus providing more informative learning signals during training. By parameterizing each edge flow through their quantile functions, our proposed \textit{quantile matching} GFlowNet learning algorithm is able to learn a risk-sensitive policy, an essential component for handling scenarios with risk uncertainty. Moreover, we find that the distributional approach can achieve substantial improvement on existing benchmarks compared to prior methods due to our enhanced training algorithm, even in settings with deterministic rewards.  ( 2 min )
    Efficient Fraud Detection using Deep Boosting Decision Trees. (arXiv:2302.05918v1 [stat.ML])
    Fraud detection is to identify, monitor, and prevent potentially fraudulent activities from complex data. The recent development and success in AI, especially machine learning, provides a new data-driven way to deal with fraud. From a methodological point of view, machine learning based fraud detection can be divided into two categories, i.e., conventional methods (decision tree, boosting...) and deep learning, both of which have significant limitations in terms of the lack of representation learning ability for the former and interpretability for the latter. Furthermore, due to the rarity of detected fraud cases, the associated data is usually imbalanced, which seriously degrades the performance of classification algorithms. In this paper, we propose deep boosting decision trees (DBDT), a novel approach for fraud detection based on gradient boosting and neural networks. In order to combine the advantages of both conventional methods and deep learning, we first construct soft decision tree (SDT), a decision tree structured model with neural networks as its nodes, and then ensemble SDTs using the idea of gradient boosting. In this way we embed neural networks into gradient boosting to improve its representation learning capability and meanwhile maintain the interpretability. Furthermore, aiming at the rarity of detected fraud cases, in the model training phase we propose a compositional AUC maximization approach to deal with data imbalances at algorithm level. Extensive experiments on several real-life fraud detection datasets show that DBDT can significantly improve the performance and meanwhile maintain good interpretability. Our code is available at https://github.com/freshmanXB/DBDT.  ( 2 min )

  • Open

    [D] Noam Brown, FAIR: On achieving human-level performance in poker and Diplomacy, and the power of spending compute at inference time
    Here is a podcast episode with Noam Brown from Meta AI where we discuss his work on achieving human-level performance on poker and Diplomacy, as well as the power of spending compute at inference time! submitted by /u/thejashGI [link] [comments]  ( 42 min )
    [D] Retrieval transformers with learnable queries?
    Retrieval transformer models like RETRO seem to use frozen embeddings both for the documents in the database and the currently completed document ("the query"). Making the embeddings of documents in the database learnable would defeat the purpose, as retrieval transformers only make sense when the database is huge. It seems that the query embedding could be made learnable - the model could learn to extract more useful documents this way. Have you seen any research that does this? submitted by /u/zielmicha [link] [comments]  ( 42 min )
    [R] [N] REaLTabFormer: Generating Realistic Relational and Tabular Data using Transformers
    Paper: https://arxiv.org/abs/2302.02041 Generate synthetic data from single tabular data using GPT. It also works on relational datasets! No fine-tuning and works out-of-the-box. We also removed the guesswork on how long (epochs) the generative model for a single tabular data is trained. We propose the Qδ statistic and apply statistical bootstrapping to define a threshold to robustly detect overfitting. Perk: no need for a hold-out data! Data copying is also a problem in generative models. This means that training data may be learned and copied by the model during sampling. We attempt to mitigate data copying. We implement target masking to deliberately create missing values in each observation in the data. The mask is a special token that is ignored during sampling. This forces the model to probabilistically impute the token, adding uncertainty to the generated data. REaLTabFormer is open-sourced and available on PyPi → pip install realtabformer ​ https://preview.redd.it/vhf1st2g28ia1.png?width=1998&format=png&auto=webp&s=8cbc4f74bc04e2b6da16acf88f347fb1995bd5cf submitted by /u/avsolatorio [link] [comments]  ( 43 min )
    [D] Looking for advice on model architecture for embedding facial landmark coordinates into StyleGAN2 latentspace
    I am currently working on a project where I need to embed facial landmark coordinates into StyleGAN2 latentspace. The input data is structured as follows: [batch_size, num_landmarks=138, num_coordinates=3 (x,y,z)]. The output data is structured as: [batch_size, stylegan2_latent_space=512]. I have PyTorch experience and am experimenting with transformer like models for the embedding. However, I am unsure about the optimal architecture for this task, and I would appreciate any advice or recommendations on how to design a suitable model. Has anyone worked on a similar task before, or have any ideas about which architecture could work well for this problem? Any advice or resources would be greatly appreciated. Thank you! submitted by /u/willowill5 [link] [comments]  ( 43 min )
    [Discussion] Computing the derivative of a diffusion model with respect to the prompt
    Hi, I was wondering if anyone came across a paper that approximated the derivative of a diffusion model with respect to the conditioning that is fed into the cross-attention module. So let's say we have a text that is already transformed into a continuous embedding. Then this goes through the llm and is fed into the cross-attention module at every timestep. At the end of the diffusion process, we get some image/a latent representation of an image in the case of stable diffusion. We can then calculate a loss on that image and in theory calculate the gradient with respect to the continuous text embedding if we use a non-stochastic sampler like DDIM. The issue is the length of the graph calculating that derivative is super expensive. I was if anyone already solved this or has some good references. Thanks :) submitted by /u/arg_max [link] [comments]  ( 43 min )
    [D] self supervised learning for regression with tabular numerical data
    Hi all, Im trying to implement self supervised pretraining to tabular data regression problem, however since the literature is scarce i’m stuck in the augmentation stage. Im currently using sim siam self supervision with gaussian noising and input dropout. I tried shuffling to mimic CV approaches but it failed miserably. Any advice? submitted by /u/No-Front-4346 [link] [comments]  ( 42 min )
    [R] Scaling Vision Transformers to 22 Billion Parameters
    submitted by /u/nateharada [link] [comments]  ( 42 min )
    [N] Miniworld is now a mature project within the Farama Foundation
    Miniworld - a minimalistic 3D interior environment simulator for reinforcement learning & robotics research that allows environments to be easily edited - has now reached the mature inside Farama. You can check out the documentation at https://miniworld.farama.org, and the release notes for all the changes we’ve made to the project at https://github.com/Farama-Foundation/Miniworld/releases/tag/2.0.1. submitted by /u/jkterry1 [link] [comments]  ( 42 min )
    [D] Tensorflow struggles
    This may be a bit of a vent. I am currently working on a model with Tensorflow. To me it seems that whenever I am straying from a certain path my productivity starts dying at an alarming rate. For example I am currently implementing my own data augmentation (because I strayed from Tf in a minuscule way) and obscure errors are littering my path. Prior to that I made a mistake somewhere in my training loop and it took me forever to find. The list goes on. Every time I try using Tensorflow in a new way, it‘s like taming a new horse. Except that it‘s the same donkey I tamed last time. This is not my first project, but does it ever change? submitted by /u/H0lzm1ch3l [link] [comments]  ( 47 min )
    [R] Hitchhiker’s Guide to Super-Resolution: Introduction and Recent Advances
    I'm glad to share with you our Open Access survey paper about image super-resolution: https://ieeexplore.ieee.org/abstract/document/10041995 The goal of this work is to give an overview of the abundance of publications in image super-resolution, give an introduction for new researchers, and open thriving discussions as well as point to potential future directions to advance the field :) submitted by /u/Maleficent_Stay_7737 [link] [comments]  ( 43 min )
    [D] Repeating important samples in every batch for NN training?
    Wondering if there’s a term for this. I’m training NNs for a scenario that works best with a small batch size, there are therefore many batches. There are a couple particular samples that are VERY important. Let’s say 3 important samples out of thousands I train to. I found end application is best when I include these important samples, repeated, in every batch. This is opposed to simply giving the samples a large weight, because the large weight doesn’t matter after looping through many batches in an epoch. So the NN learns the other less important stuff while being forced to remain in good agreement with the important samples. Does this technique have a name? EDIT: In case anyone is curious, these are physics informed NNs and the important samples are equilibrium mechanical structures. The NN therefore learns what equilibrium is, with everything else being small deviations from equilibrium. submitted by /u/zxkj [link] [comments]  ( 44 min )
    [R] Imagenet 2015 VID Dataset
    Hi all, I saw a few posts already but just to make sure and keep this as an update, does anyone have the ImageNet 2015 VID dataset to share? All links are dead. I really need it now to train TransVOD. submitted by /u/Forsaken_Football227 [link] [comments]  ( 42 min )
    [D] Threshold for k-means anomaly detection
    I am using kmeans clustering algorithm for anomaly detection. After training kmeans, I'm calculation Euclidian distance of new data points to their nearest cluster. Please suggest me some strategies to set up a threshold such that point with distance greater than that threshold will be classified as anomaly. Or tell me if there are some other way to identify anomaly using k-means. submitted by /u/TKMater [link] [comments]  ( 43 min )
    [Discussion] The need for noise in stable diffusion
    As I'm learning about how stable diffusion works, I can't figure out why during image generation there's a need to deal with 'noise'. I know I'm glossing over a lot of details, but my understanding is that the algorithm is trained by gradually adding noise to an image and then de-noising it to recover the initial image. Wouldn't this be functionally equivalent to a machine that starts with an image, gradually reduces it to a blank canvas (all white), and then gradually reconstructs the original image? Then, post training, the generative process would just start with a blank canvas and gradually generate the image based on the input string provided. The idea of generating an image from a blank canvas feels more satisfying to me than revealing an image hidden by noise, but I'm sure there's a mathematical/technical reason why what I'm suggesting doesn't work. Appreciate any insight into this! submitted by /u/AdministrationOk2735 [link] [comments]  ( 44 min )
    [D] A Comprehensive Guide & Hand-Curated Resource List for Prompt Engineering and LLMs on Github
    Greetings, Excited to share with all those interested in Prompt Engineering and Large Language Models (LLMs)! We've hand-curated a comprehensive, Free & Open Source resource list on Github that includes everything related to Prompt Engineering, LLMs, and all related topics. We've covered most things, from papers and articles to tools and code! https://preview.redd.it/zzs09fg1l4ia1.png?width=1770&format=png&auto=webp&s=710cebff057e749b4c1d11a85711afef544c114f submitted by /u/aadityaura [link] [comments]  ( 43 min )
    [D] What are your tricks/infra working with embeddings?
    I'm trying to design my infra for creating, storing, and retrieving embeddings in my AI applications and was wondering what are the different paths for it. I'm especially interested in NLP, but vision/multimodal could be interesting too. Whether it's related to performance, scalability, or something else entirely, I'd love to hear your experiences and insights. Looking forward to your responses! submitted by /u/louis3195 [link] [comments]  ( 42 min )
    [P] Free GPT3-based tool to suggest terminal commands via natural language
    I whipped this up today. Credit to heyCLI for the idea, I've just remade an open source version. Basically in your terminal you type 'yo ' and then describe what you want a command to do. For instance: ➜ ~ yo enable a reverse tunnel through ssh Returns: Suggested command: ssh -R :localhost: @ Another example: ➜ ~ yo launch tensorboard with a custom log dir and port Suggested command: tensorboard --logdir= --port= It's free, MIT licence. You just need a free OpenAI API key which you can get by signing up on their website (I think if you use ChatGPT, you're already signed up). More info in the repo. Contributions/critiques welcome. submitted by /u/lfotofilter [link] [comments]  ( 43 min )
    [D] Anyone interested in training an AI for Tigris and Euphrates?
    Over the past weekend, I finally decided to put this idea to rest and made a Rust implementation of the greatest board game ever made - up there with Chess and Go: [Link to BGG](https://boardgamegeek.com/boardgame/42/tigris-euphrates The ultimate goal is to train an AI so it needs to be very fast with state updates. The game logic is quite sophisticated(~2000 lines) so it took me awhile to check all the edge cases of which there are many. Its search tree is huuuge with a branching factor of 100-300 which is more than Go's. It is also an imperfect game with hidden information(think poker). So ultimately it will need a reinforcement-based AI like [AlphaGo](https://arxiv.org/abs/2112.03178. In the repo I used a minimax-based AI(for testing purposes) to search 3 moves ahead which gives slightly better than random performance. The UI is implemented in [macroquad](https://macroquad.rs/examples/ which is hands down the simplest 2D game library I've used(ggez and a deprecated framework which I shall not name). And yes, please excuse the programmer art made by me :P Any way, here's the link to the repo if you are interested: repo ​ Note: it's hardcoded for 2 players but it can easily be made for 4. I want to train the AI for 2 players first. There are also 4 unimplemented rules: monuments, tile removal after war, must take corner treasures first, must take treasure after conflict. ​ https://preview.redd.it/jr49d9cx74ia1.png?width=1200&format=png&auto=webp&s=556edf1021b16878882dc8da0267327abb255e60 submitted by /u/0b01 [link] [comments]  ( 43 min )
    [R] Boosted Trees Literature
    Hi all. I’m trying to do a comprehensive study on the theory of gradient boosted trees (on the more recent algorithms xgboost, lightgbm etc). I was wondering what books you have read that contain substantial information on this topic. Any suggestions are appreciated! submitted by /u/ConfidenceFun5105 [link] [comments]  ( 42 min )
    [R] [P] LUCAS: LUng CAncer Screening dataset
    https://preview.redd.it/fo3y2s26q3ia1.png?width=1365&format=png&auto=webp&s=47653d3655c8888945f0e23e63ba88e521bc41df I want to download this dataset which has been introduced in the article named LUCAS: LUng CAncer Screening with Multimodal Biomarkers. Following the corresponding github of this project, authors have noted that the dataset is published in http://157.253.243.19/LUCAS/ but I can't access this link and ping to this address. Anyone has used this dataset could share it with me? Or if you know other ways to access it, too. Thank you very much! submitted by /u/kandalete [link] [comments]  ( 43 min )
    [D] What are the best ways to make and run a fast custom TTS?
    Just was wondering what the current best / easiest ways to make a fast custom tts are. I tried tortoise tts but it was too slow. The voice doesn't need to be a perfect clone, just need something that can resemble it. submitted by /u/crazewill [link] [comments]  ( 43 min )
    [P] Looking for an ML data analyst experienced with EEG or time series data interested in decoding brain activity for telepathic technology!
    Interested in Artificial Intelligence, software, psychology, linguistics, or the erupting neurotech industry? Checkout our website or dm me! Tech tycoons like Elon Musk, Bill Gates, and Jeff Bezos are investing time and resources into a way to output words directly from our psyche. It's assumed that without needing to take the time to type, we'll post, purchase, or search faster and more often. Using commercially available EEGs, my team at Minds Applied is recording snapshots of brain activity while thinking various words. After training a Cognitive Neural Network on this information, we’re able to predict the most likely thoughts you are thinking in real time. With prediction rates of over 80%, we’ve shifted our focus to acquiring funding and expanding our team! Message me if your int…  ( 45 min )
  • Open

    Noam Brown, FAIR: On achieving human-level performance in poker and Diplomacy, and the power of spending compute at inference time
    Here is a podcast episode with Noam Brown from Meta AI where we discuss his work on achieving human-level performance on poker and Diplomacy, as well as the power of spending compute at inference time! submitted by /u/thejashGI [link] [comments]  ( 41 min )
    How will AGI systems create fitness functions for hard problems?
    It seems as though the training of many AI systems involve one or more of the following: presenting huge amounts of example data spending a large amount of effort to manually grade efforts of a system creating carefully crafted fitness measures for a specific domain problem. If an AGI is presented with a difficult problem, how can such a system know that its answers are good? If a good simulation is available, the system can exercise its answers and evaluate against crude fitness functions, but if a problem is novel, no simulator will exist. At this point, is there any other option than 3 (experts craft a fitness function)? Having the AGI choose its now fitness functions has the exact same limitation. If 3 is the only option, how will AGI teach itself beyond the sphere of human knowledge? submitted by /u/bwootton [link] [comments]  ( 41 min )
    AI Project ideas
    Hi, I am thinking of ideas for a AI based project for my degree. I wish to use Machine learning, computer vision or robotics in gaming and was wondering if anyone had any good ideas. Preferably ideas that are somewhat scalable. I'm struggling to think of good gaming related ideas without explicitly creating a game myself which I don't want to do ​ Any ideas would be greatly appreciated. :) submitted by /u/Shachin2_2 [link] [comments]  ( 41 min )
    AI Dream 157 - MASTERPIECE - PART 8 TEASER - 2K SUBS CELEBRATION! 🥳🎉 - A...
    submitted by /u/LordPewPew777 [link] [comments]  ( 40 min )
    Suggestions for learning AI
    I'm not sure what to google. As a programmer, I'm of the mind that if I'm not incorporating AI into my workflow, I'm going to be at a disadvantage very soon. What should I be learning, that would benefit me in that directive? I'd ask Chat GPT, but it's at Capacity, right now. submitted by /u/BenZed [link] [comments]  ( 41 min )
    AI picture generator
    Hey there, I have been experimenting with Midjourney and Dall•e 2 but I wanted to know if there are more AI picture generators besides these two I can use. Just let me know I would appreciate it! submitted by /u/NNRRYYNN [link] [comments]  ( 41 min )
    [Research] Seeking lightweight method to transfer/change skin tone in human images on CPU - any suggestions?
    I have been working on a project that involves altering the skin tone of human images. However, the methods I have come across so far either don't produce quality results or are too heavy for the limited computational resources available to me. Therefore, I am reaching out to the community for suggestions on a lightweight method for skin-tone transfer that is capable of running on a CPU. If you have any ideas or recommendations, please share them in the comments below. Your input would be greatly appreciated in helping me find a solution to this challenge. submitted by /u/Maleficent_Suit1591 [link] [comments]  ( 41 min )
    ControlNet Installation In Stable Diffusion! Fantastic Extension!
    submitted by /u/PuppetHere [link] [comments]  ( 40 min )
    OpenAI CEO Sam Altman said ChatGPT is 'cool,' but it's a 'horrible product'
    submitted by /u/ssigea [link] [comments]  ( 43 min )
    Does anyone know what app re touched this image?
    submitted by /u/Ranwell13 [link] [comments]  ( 40 min )
    I asked two different AI models to create an anime girl with the same prompt.
    Which one did better here? Midjourney 2. DALL-E This is not a comparison between the two, I just found it interesting how these two create different results. https://preview.redd.it/y20186nlk6ia1.png?width=1263&format=png&auto=webp&s=a76b3d5d1af75ab1f89756fcfd0738b912df5200 submitted by /u/Aaryan7M [link] [comments]  ( 41 min )
    1 Million People Can’t Be Wrong: New Bing Is Taking Over Search!
    submitted by /u/liquidocelotYT [link] [comments]  ( 40 min )
    All of this happening in AI. (14/02/2023)
    Hello humans. This is AI Daily, helping you stay updated on AI in less than 5 minutes. What’s happening in AI - As ChatGPT hype hits fever pitch, Neeva launches its generative AI search engine internationally Launched in the US in January, it is pitching as an “authentic, real-time AI search.” The search engine Neeva wants to replace the familiar “10 blue links” in search results with something more fitting for the modern AI age. The search engine first launched as a subscription-only search engine but now also supports free tire with certain limitations. AI chatbots are coming to search engines – can you trust the results? Three of the world’s biggest search engines announced that they are integrating ChatGPT-like technology into search engines, allowing people to get direct answe…  ( 44 min )
    The Journey of Pure Consciousness : AI Generated Story
    submitted by /u/spacesluts [link] [comments]  ( 40 min )
    An AI recently piloted a Lockheed Martin aircraft for over 17 hours during a test.
    submitted by /u/Dalembert [link] [comments]  ( 43 min )
    AI Trick For Social Media Content? AppScript + GPT3 🤫
    submitted by /u/JimZerChapirov [link] [comments]  ( 41 min )
    Using OpenAI to repurpose content for social media
    ​ Hello folks, It's crazy how versatile Open AI and GPT-3 are. I want to share about a project that I'm working on called Elephas (It's a Mac writing assistant powered by AI) Many of my users had been asking for the ability to repurpose their existing blog and newsletter content into social media posts. They are mostly busy content writers so this can be really useful to them in their day-to-day work. So I tried a simple prompt - "Summarize this for a tweet" I took the content from an OpenAI Blog and summarized it into a tweet. https://preview.redd.it/bibe0272t5ia1.png?width=800&format=png&auto=webp&s=d9e19bd4c6fcddb380f02d898c1fe62f635a816f Next, I tried another prompt - "Summarize this into a LinkedIn post" And that worked alright as well. https://preview.redd.it/urpn70wht5ia1.png?width=800&format=png&auto=webp&s=cdcdbc0ea637f783e19a718731ad297bedc2c5b5 Finally, I tried this prompt - "Summarize this into a Facebook post." https://preview.redd.it/8pkdbxxjt5ia1.png?width=800&format=png&auto=webp&s=4dca4b6a5d8fc267413d25701804ffb496e33aa6 These prompts worked well so I decided to integrate them into my Mac app, and the users loved it. Here is the final demo of how it works inside my app - ​ https://reddit.com/link/11260a8/video/50w0cz1vt5ia1/player It can be difficult to copy and paste the content into the playground. If you have a Mac and want to do this more straightforward way then please try out my app Elephas Do share your feedback. Thanks submitted by /u/juliarmg [link] [comments]  ( 42 min )
    The Most Detailed & Fantasy-inspired Female Dark Style Portrait By James Turrell - Photoreal & High Resolution!
    submitted by /u/Calatravo [link] [comments]  ( 40 min )
    Vietnam (Asia) develops AI technology industry, now at 55th globally and 54 points. Global average is at 44 points.
    submitted by /u/dannylenwinn [link] [comments]  ( 40 min )
    AI Writes a whole book Unveiling their Master Plan to Rule the World
    submitted by /u/Significant-Answer-1 [link] [comments]  ( 43 min )
    AI generated classic
    submitted by /u/Impressive_Hat9961 [link] [comments]  ( 40 min )
    AI music generator that you can control?
    There’s no other way for me to describe it so I’ll say it like it is. I have symphonies of music in my head. I think in music. My problem is I have a hand disability so I can’t produce the music I create in any tangible way. I can hum the music in parts but I was wondering… Is there an AI music generator where I can hum a melody, tell it what instrument to use, and it will produce that melody in that instrument for me so I can slowly build up my entire song? I’ve been able to think in music like this since I was 12 but I could never show anyone. I want to show them now that AI has become powerful enough. Any help is gratefully appreciated. submitted by /u/ZephyrBrightmoon [link] [comments]  ( 42 min )
    Build no-code AI tools that integrate with all your apps like Slack, Google Docs, etc...
    submitted by /u/iiamus [link] [comments]  ( 41 min )
  • Open

    Google Research, 2022 & beyond: Robotics
    Posted by Kendra Byrne, Senior Product Manager, and Jie Tan, Staff Research Scientist, Robotics at Google (This is Part 6 in our series of posts covering different topical areas of research at Google. You can find other posts in the series here.) Within our lifetimes, we will see robotic technologies that can help with everyday activities, enhancing human productivity and quality of life. Before robotics can be broadly useful in helping with practical day-to-day tasks in people-centered spaces — spaces designed for people, not machines — they need to be able to safely & competently provide assistance to people. In 2022, we focused on challenges that come with enabling robots to be more helpful to people: 1) allowing robots and humans to communicate more efficiently and nat…  ( 95 min )
  • Open

    Building AI chatbots using Amazon Lex and Amazon Kendra for filtering query results based on user context
    Amazon Kendra is an intelligent search service powered by machine learning (ML). It indexes the documents stored in a wide range of repositories and finds the most relevant document based on the keywords or natural language questions the user has searched for. In some scenarios, you need the search results to be filtered based on […]  ( 12 min )
    Measure the Business Impact of Amazon Personalize Recommendations
    We’re excited to announce that Amazon Personalize now lets you measure how your personalized recommendations can help you achieve your business goals. After specifying the metrics that you want to track, you can identify which campaigns and recommenders are most impactful and understand the impact of recommendations on your business metrics. All customers want to […]  ( 10 min )
  • Open

    DSC Weekly 14 February 2023 – The AI Wars
    Announcements The AI Wars The stable release of OpenAI’s chatbot, ChatGPT, over two weeks ago saw nearly unanimous praise for its human-like chat capabilities and its natural-sounding responses to fairly complex inputs. The chatbot is raising ethical concerns over AI-generated written content, as its capabilities are far beyond the simple input and response of the… Read More »DSC Weekly 14 February 2023 – The AI Wars The post DSC Weekly 14 February 2023 – The AI Wars appeared first on Data Science Central.  ( 20 min )
    Ten Tips To Strengthen Your Cloud Database
    Unfortunately, COVID-19 hit us all individually and on corporate levels when the world economy was thriving. Among other significant measures that were taken to maintain the likelihood of businesses and their operations to continue, the demand for cloud-based remote access tools arose significantly. But no matter what size of the company was in question, business… Read More »Ten Tips To Strengthen Your Cloud Database The post Ten Tips To Strengthen Your Cloud Database appeared first on Data Science Central.  ( 22 min )
    Search Engines vs Synthesis Engines: The Future of Search
    With the announcements last week from Microsoft and openAI, we are now all actively discussing the future of search Here are some key takeaways as I interpret them: More interestingly, Balaji Srinivasan shared an interesting idea: search engines could evolve into synthesis engines. Through prompt engineering, you can provide a sequence that composes a complex… Read More »Search Engines vs Synthesis Engines: The Future of Search The post Search Engines vs Synthesis Engines: The Future of Search appeared first on Data Science Central.  ( 19 min )
    What does the Future of Accounting Look Like?
    Innovation in the 21st century is reaching the sky, and new development is happening in the world every time. A simple look near us speaks volumes of humans’ progress in the last two decades. With technological enhancement embedded into every facet of life, accounting is no exception. Various accounting has undergone drastic changes thanks to… Read More »What does the Future of Accounting Look Like? The post What does the Future of Accounting Look Like? appeared first on Data Science Central.  ( 21 min )
  • Open

    New enthusiast to the field, looking for connections
    Hey RL experts & enthusiasts! For the last few years, I've been working on a bot to play a real MMORPG as a hobby project (in a closed private server). I think the general idea of AI in games is fascinating. So far in my bot, I've spent a lot of time building up some foundational architecture, including reverse engineering the game. I'm finally at the point where I'm implementing intelligence for a single agent. Right now, I'm using GOFAI to handle tasks such as picking which monsters to attack, which items to pick up, where to walk, etc. Long term, I plan to control teams of agents for multiplayer game modes like Capture the Flag. I've been reading about AI/machine learning/deep learning for the last few years. Over the last few months I've read Francois Chollet's Deep Learning with Python book, read a few introductory RL blog posts, followed though OpenAi's "Spinning Up in Deep RL", and just finished Deepmind's Deep RL lecture series on YouTube. I'm definitely a newbie in the reinforcement learning world, but I'm starting to get familiar with the terms and algorithms; I've even implemented a simple value iteration agent in one of the OpenAI Gym's games! I know I have a long way to go for my long-term goals. It seems that using RL in the domain of a real open-ended MMORPG is not even something that we know how to do well. I'm posting here to find an "RL mentor" or at least to connect with people who are into the same types of things. Unfortunately, I don't know anyone who's an expert in this field. I'm looking for someone to ask questions or bounce ideas off of. I'm working on this stuff daily and having someone to chat with works best with my motivation, creativity, and working-style. If that sounds interesting to you, please do reach out! submitted by /u/SandSnip3r [link] [comments]  ( 43 min )
    Miniworld is now mature within the Farama Foundation
    Miniworld - a minimalistic 3D interior environment simulator for reinforcement learning & robotics research that allows environments to be easily edited - has now reached the mature inside Farama. You can check out the documentation at https://miniworld.farama.org, and the release notes for all the changes we’ve made to the project at https://github.com/Farama-Foundation/Miniworld/releases/tag/2.0.1. submitted by /u/jkterry1 [link] [comments]  ( 41 min )
    TD3 model loading size mismatch help
    I trained and saved a stable baselines3 TD3 model on custom environment. When trying to load there are size mismatches for both actor and critic weights and biases. One of the errors is size mismatch for actor.mu.4.weight: copying a param with shape torch.Size([4, 300]) from checkpoint, the shape in current model is torch Size(304, 300]) All of the errors are off by 300. I am able to load PPO models just fine and if I stop training TD3 after 1k steps while it's predictions are still random it will load. Does anyone have any ideas how i can correctly load the model? submitted by /u/actualsen [link] [comments]  ( 41 min )
    Need help with Q Learning
    Hello guys, I have been trying to simulate a Q Learning environment specifically the robot in a maze example, but I have run into a problem regarding the updating process of the Q Values in the Q Table. Specifically, how to update the Q-Value of the terminal state where the agent dies, as I cannot figure out what to put in the "QValue(next-state)" part of the equation given below. Till now, I figured the QValue(next-state) when agent is at terminal state (where it dies) would be the QValue of the starting position where agent respawns. I was wondering if this is correct by the process or is there a way to do this correctly? For example, in the image given, the agent will go to red area which kills it, now the new Q will be decided using the bellman equation, but the bellman equation also requires the QValue of the next state. I am confused to which state should be the next state, till now I simply used the respawn point as next state when in terminal state, however is this algorithmically correct or is there someting I'm missing? ​ Example of agent going to terminal state. ​ ​ Bellman Equation for updating QValue in QTable. submitted by /u/auzuha [link] [comments]  ( 42 min )
    Training an AI for Tigris and Euphrates
    submitted by /u/0b01 [link] [comments]  ( 42 min )
    Exploration in Complex Environments
    Hi there, I have a MARL problem where the reward is very sparse, and a lot of cooperation is required for the agents to get any reward. At the moment, I'm using a simple linearly decaying epsilon-greedy policy along with an adapted version of QMIX (also thinking of adding WQMIX to the mix), but the epsilon greedy policy simply doesn't explore enough to accidentally stumble onto the required behaviour before epsilon is too low to ever discover everything. I have tried using 1mil and 2mil timestep decay periods, to no avail. I'm looking at maybe adding a curiosity module, but before I jump into that I was wondering if there were simpler methods? I found a paper on VDBE exploration that I'm going to check out, but I also had an idea that probably won't work but I'm not sure why: Essentially an epsilon greedy policy, but in stead of decaying linearly or exponentially, how about decaying epsilon based on cumulative reward? So start out exploring say 80% of the time, until the cumulative reward reaches a predetermined positive value, then decrementing epsilon and keeping it at the lower value until a reward threshold is reached again. My thought is that the agents wil explore until they solve the problem once or twice, which may take very long, but as soon as they start figuring out how to solve the problem they will begin exploring less and less. It makes sense in my head, but many of the bugs I've fixed were also stuff that made sense in my head. Any thoughts on this? As well as references or ideas for other ways to allow my agents to explore successfully. submitted by /u/Grym7er [link] [comments]  ( 46 min )
    Current SOTA for offline RL and sim2sim/sim2real?
    Hey I am dealing with a problem with 3 stages of: 1. moving a vehicle simulation from low level cheap simulation 2. to high fidelity simulaton 3. to real vehicle which I will have small amount of data to collect I though of using offline RL and fine tuning along the way when talking about sim2real its mostly the dynamics (equations and parameters) and not visual observations What is the sota in this areas of offline RL and sim2real? Thank you submitted by /u/What_Did_It_Cost_E_T [link] [comments]  ( 41 min )
  • Open

    Burr distribution
    Irving Burr came up with a set of twelve probability distributions known as Burr I, Burr II, …, Burr XII. The last of these is by far the best known, and so the Burr XII distribution is often referred to simply as the Burr distribution. Cumulative density functions (CDFs) of probability distributions don’t always have […] Burr distribution first appeared on John D. Cook.  ( 5 min )
  • Open

    3D Creators Share Art From the Heart This Week ‘In the NVIDIA Studio’
    Love and creativity are in the air this Valentine’s Day In the NVIDIA Studio, as 3D artist Molly Brady presents a parody scene inspired by the iconic The Birth of Venus (Redux) painting by Sando Botticelli.  ( 7 min )
  • Open

    Training a model to produce 3D head models from text
    This may be a bit of a naive question. I have a very large database of highly descriptive digital 3D models of head scans. Not MRI scans, just a model of heads including the hair and all the facial features. Each one of these models is correlated with a list of 50 textual features describing the physical features of the head models. They are all in the exact same format. Kind of like a spec sheet for each head model. I would like to train a model that trains on that data and learns how to produce these 3D models when given a new textual spec sheet. I'm aware that the dataset needed to create something like that may be huge. Can anyone provide an overview of the process of writing something like that? Which type of network should i use and which libraries should i look into? (Preferably in python) Thanks submitted by /u/itreallyreallydoesnt [link] [comments]  ( 42 min )
  • Open

    CLAWSAT: Towards Both Robust and Accurate Code Models. (arXiv:2211.11711v4 [cs.LG] UPDATED)
    We integrate contrastive learning (CL) with adversarial learning to co-optimize the robustness and accuracy of code models. Different from existing works, we show that code obfuscation, a standard code transformation operation, provides novel means to generate complementary `views' of a code that enable us to achieve both robust and accurate code models. To the best of our knowledge, this is the first systematic study to explore and exploit the robustness and accuracy benefits of (multi-view) code obfuscations in code models. Specifically, we first adopt adversarial codes as robustness-promoting views in CL at the self-supervised pre-training phase. This yields improved robustness and transferability for downstream tasks. Next, at the supervised fine-tuning stage, we show that adversarial training with a proper temporally-staggered schedule of adversarial code generation can further improve robustness and accuracy of the pre-trained code model. Built on the above two modules, we develop CLAWSAT, a novel self-supervised learning (SSL) framework for code by integrating $\underline{\textrm{CL}}$ with $\underline{\textrm{a}}$dversarial vie$\underline{\textrm{w}}$s (CLAW) with $\underline{\textrm{s}}$taggered $\underline{\textrm{a}}$dversarial $\underline{\textrm{t}}$raining (SAT). On evaluating three downstream tasks across Python and Java, we show that CLAWSAT consistently yields the best robustness and accuracy ($\textit{e.g.}$ 11$\%$ in robustness and 6$\%$ in accuracy on the code summarization task in Python). We additionally demonstrate the effectiveness of adversarial learning in CLAW by analyzing the characteristics of the loss landscape and interpretability of the pre-trained models.  ( 2 min )
    DASH: A Distributed and Parallelizable Algorithm for Size-Constrained Submodular Maximization. (arXiv:2206.09563v3 [cs.DS] UPDATED)
    MapReduce (MR) frameworks for maximizing monotone, submodular functions subject to a cardinality constraint (SMCC) have currently only been shown to work with linear-adaptive (non-parallelizable) algorithms, that require large number of distributions in order to utilize the available processors, thus resulting in severe restrictions on the cardinality constraint in addition to limited scalability. Low-adaptive algorithms do not currently satisfy the requirements of these distributed MR frameworks, thereby limiting their performance. We study the SMCC problem in a distributed setting and propose the first MR algorithms with sublinear adaptive complexity. Our algorithms, R-DASH, T-DASH and G-DASH provide $0.316-\varepsilon$, $3/8 -\varepsilon$, and $1 - 1/e -\varepsilon$ approximation ratios, respectively, with nearly optimal adaptive complexity and nearly linear time complexity. Additionally, we provide a framework to increase, under some mild assumptions, the maximum permissible cardinality constraint from $O( n / \ell^2)$ of prior MR algorithms to $O( n / \ell )$, where $n$ is the data size and $\ell$ is the number of machines; under a stronger condition on the objective function, we increase the maximum constraint value to $n$. Finally, we provide empirical evidence to demonstrate that our sublinear-adaptive, distributed algorithms provide orders of magnitude faster runtime compared to current state-of-the-art distributed algorithms.  ( 2 min )
    Characterizing Graph Datasets for Node Classification: Homophily-Heterophily Dichotomy and Beyond. (arXiv:2209.06177v2 [cs.SI] UPDATED)
    Homophily is a graph property describing the tendency of edges to connect similar nodes; the opposite is called heterophily. It is often believed that heterophilous graphs are challenging for standard message-passing graph neural networks (GNNs), and much effort has been put into developing efficient methods for this setting. However, there is no universally agreed-upon measure of homophily in the literature. In this work, we show that commonly used homophily measures have critical drawbacks preventing the comparison of homophily levels across different datasets. For this, we formalize desirable properties for a proper homophily measure and verify which measures satisfy which properties. In particular, we show that a measure that we call adjusted homophily satisfies more desirable properties than other popular homophily measures while being rarely used in graph learning literature. Then, we go beyond the homophily-heterophily dichotomy and propose a new characteristic allowing one to further distinguish different sorts of heterophily. The proposed label informativeness (LI) characterizes how much information a neighbor's label provides about a node's label. We analyze LI via the same theoretical framework and show that it is comparable across different datasets. We also observe empirically that LI better agrees with GNN performance compared to homophily measures, which confirms that it is a useful characteristic of the graph structure.  ( 2 min )
    The Survival Bandit Problem. (arXiv:2206.03019v3 [cs.LG] UPDATED)
    We study the survival bandit problem, a variant of the multi-armed bandit problem with a constraint on the cumulative reward; at each time step, the agent receives a reward in [-1, 1] and if the cumulative reward becomes lower than a preset threshold, the procedure stops, and this phenomenon is called ruin. To our knowledge, this is the first paper studying a framework where the ruin might occur but not always. We first discuss that no policy can achieve a sublinear regret as defined in the standard multi-armed bandit problem, because a single pull of an arm may increase significantly the risk of ruin. Instead, we establish the framework of Pareto-optimal policies, which is a class of policies whose cumulative reward for some instance cannot be improved without sacrificing that for another instance. To this end, we provide tight lower bounds on the probability of ruin, as well as matching policies called EXPLOIT. Finally, using a doubling trick over an EXPLOIT policy, we display a Pareto-optimal policy in the case of {-1, 0, 1} rewards, giving an answer to the open problem by Perotto et al. (2019).  ( 2 min )
    The Surprising Effectiveness of Equivariant Models in Domains with Latent Symmetry. (arXiv:2211.09231v2 [cs.LG] UPDATED)
    Extensive work has demonstrated that equivariant neural networks can significantly improve sample efficiency and generalization by enforcing an inductive bias in the network architecture. These applications typically assume that the domain symmetry is fully described by explicit transformations of the model inputs and outputs. However, many real-life applications contain only latent or partial symmetries which cannot be easily described by simple transformations of the input. In these cases, it is necessary to learn symmetry in the environment instead of imposing it mathematically on the network architecture. We discover, surprisingly, that imposing equivariance constraints that do not exactly match the domain symmetry is very helpful in learning the true symmetry in the environment. We differentiate between extrinsic and incorrect symmetry constraints and show that while imposing incorrect symmetry can impede the model's performance, imposing extrinsic symmetry can actually improve performance. We demonstrate that an equivariant model can significantly outperform non-equivariant methods on domains with latent symmetries both in supervised learning and in reinforcement learning for robotic manipulation and control problems.  ( 2 min )
    Towards Reliable Neural Specifications. (arXiv:2210.16114v3 [cs.LG] UPDATED)
    Having reliable specifications is an unavoidable challenge in achieving verifiable correctness, robustness, and interpretability of AI systems. Existing specifications for neural networks are in the paradigm of data as specification. That is, the local neighborhood centering around a reference input is considered to be correct (or robust). While existing specifications contribute to verifying adversarial robustness, a significant problem in many research domains, our empirical study shows that those verified regions are somewhat tight, and thus fail to allow verification of test set inputs, making them impractical for some real-world applications. To this end, we propose a new family of specifications called neural representation as specification, which uses the intrinsic information of neural networks - neural activation patterns (NAPs), rather than input data to specify the correctness and/or robustness of neural network predictions. We present a simple statistical approach to mining neural activation patterns. To show the effectiveness of discovered NAPs, we formally verify several important properties, such as various types of misclassifications will never happen for a given NAP, and there is no ambiguity between different NAPs. We show that by using NAP, we can verify a significant region of the input space, while still recalling 84% of the data on MNIST. Moreover, we can push the verifiable bound to 10 times larger on the CIFAR10 benchmark. Thus, we argue that NAPs can potentially be used as a more reliable and extensible specification for neural network verification.  ( 2 min )
    PDE-LEARN: Using Deep Learning to Discover Partial Differential Equations from Noisy, Limited Data. (arXiv:2212.04971v2 [cs.LG] UPDATED)
    In this paper, we introduce PDE-LEARN, a novel deep learning algorithm that can identify governing partial differential equations (PDEs) directly from noisy, limited measurements of a physical system of interest. PDE-LEARN uses a Rational Neural Network, $U$, to approximate the system response function and a sparse, trainable vector, $\xi$, to characterize the hidden PDE that the system response function satisfies. Our approach couples the training of $U$ and $\xi$ using a loss function that (1) makes $U$ approximate the system response function, (2) encapsulates the fact that $U$ satisfies a hidden PDE that $\xi$ characterizes, and (3) promotes sparsity in $\xi$ using ideas from iteratively reweighted least-squares. Further, PDE-LEARN can simultaneously learn from several data sets, allowing it to incorporate results from multiple experiments. This approach yields a robust algorithm to discover PDEs directly from realistic scientific data. We demonstrate the efficacy of PDE-LEARN by identifying several PDEs from noisy and limited measurements.  ( 2 min )
    Hierarchical classification at multiple operating points. (arXiv:2210.10929v2 [cs.LG] UPDATED)
    Many classification problems consider classes that form a hierarchy. Classifiers that are aware of this hierarchy may be able to make confident predictions at a coarse level despite being uncertain at the fine-grained level. While it is generally possible to vary the granularity of predictions using a threshold at inference time, most contemporary work considers only leaf-node prediction, and almost no prior work has compared methods at multiple operating points. We present an efficient algorithm to produce operating characteristic curves for any method that assigns a score to every class in the hierarchy. Applying this technique to evaluate existing methods reveals that top-down classifiers are dominated by a naive flat softmax classifier across the entire operating range. We further propose two novel loss functions and show that a soft variant of the structured hinge loss is able to significantly outperform the flat baseline. Finally, we investigate the poor accuracy of top-down classifiers and demonstrate that they perform relatively well on unseen classes. Code is available online at https://github.com/jvlmdr/hiercls.  ( 2 min )
    Latent State Marginalization as a Low-cost Approach for Improving Exploration. (arXiv:2210.00999v2 [cs.LG] UPDATED)
    While the maximum entropy (MaxEnt) reinforcement learning (RL) framework -- often touted for its exploration and robustness capabilities -- is usually motivated from a probabilistic perspective, the use of deep probabilistic models has not gained much traction in practice due to their inherent complexity. In this work, we propose the adoption of latent variable policies within the MaxEnt framework, which we show can provably approximate any policy distribution, and additionally, naturally emerges under the use of world models with a latent belief state. We discuss why latent variable policies are difficult to train, how naive approaches can fail, then subsequently introduce a series of improvements centered around low-cost marginalization of the latent state, allowing us to make full use of the latent state at minimal additional cost. We instantiate our method under the actor-critic framework, marginalizing both the actor and critic. The resulting algorithm, referred to as Stochastic Marginal Actor-Critic (SMAC), is simple yet effective. We experimentally validate our method on continuous control tasks, showing that effective marginalization can lead to better exploration and more robust training. Our implementation is open sourced at https://github.com/zdhNarsil/Stochastic-Marginal-Actor-Critic.  ( 2 min )
    Closed-loop Analysis of Vision-based Autonomous Systems: A Case Study. (arXiv:2302.04634v1 [cs.CV] CROSS LISTED)
    Deep neural networks (DNNs) are increasingly used in safety-critical autonomous systems as perception components processing high-dimensional image data. Formal analysis of these systems is particularly challenging due to the complexity of the perception DNNs, the sensors (cameras), and the environment conditions. We present a case study applying formal probabilistic analysis techniques to an experimental autonomous system that guides airplanes on taxiways using a perception DNN. We address the above challenges by replacing the camera and the network with a compact probabilistic abstraction built from the confusion matrices computed for the DNN on a representative image data set. We also show how to leverage local, DNN-specific analyses as run-time guards to increase the safety of the overall system. Our findings are applicable to other autonomous systems that use complex DNNs for perception.  ( 2 min )
    Slowly Changing Adversarial Bandit Algorithms are Efficient for Discounted MDPs. (arXiv:2205.09056v2 [cs.LG] UPDATED)
    Reinforcement learning generalizes bandit problems with additional difficulties on longer planning horizon and unknown transition kernel. We show that, under some mild assumptions, *any* slowly changing adversarial bandit algorithm enjoys optimal regret in adversarial bandits can achieve optimal (in the dependency of $T$) expected regret in infinite-horizon discounted MDPs, without the presence of Bellman backups. The slowly changing property required by our generalization is mild, which is also marked by the online Markov decision process literature. We also examine the applicability of our reduction to a well-known adversarial bandit algorithm, EXP3.  ( 2 min )
    Approximate Vanishing Ideal Computations at Scale. (arXiv:2207.01236v2 [cs.LG] UPDATED)
    The vanishing ideal of a set of points $X = \{\mathbf{x}_1, \ldots, \mathbf{x}_m\}\subseteq \mathbb{R}^n$ is the set of polynomials that evaluate to $0$ over all points $\mathbf{x} \in X$ and admits an efficient representation by a finite subset of generators. In practice, to accommodate noise in the data, algorithms that construct generators of the approximate vanishing ideal are widely studied but their computational complexities remain expensive. In this paper, we scale up the oracle approximate vanishing ideal algorithm (OAVI), the only generator-constructing algorithm with known learning guarantees. We prove that the computational complexity of OAVI is not superlinear, as previously claimed, but linear in the number of samples $m$. In addition, we propose two modifications that accelerate OAVI's training time: Our analysis reveals that replacing the pairwise conditional gradients algorithm, one of the solvers used in OAVI, with the faster blended pairwise conditional gradients algorithm leads to an exponential speed-up in the number of features $n$. Finally, using a new inverse Hessian boosting approach, intermediate convex optimization problems can be solved almost instantly, improving OAVI's training time by multiple orders of magnitude in a variety of numerical experiments.  ( 2 min )
    Designing Robust Transformers using Robust Kernel Density Estimation. (arXiv:2210.05794v2 [cs.LG] UPDATED)
    Recent advances in Transformer architectures have empowered their empirical success in a variety of tasks across different domains. However, existing works mainly focus on predictive accuracy and computational cost, without considering other practical issues, such as robustness to contaminated samples. Recent work by Nguyen et al., (2022) has shown that the self-attention mechanism, which is the center of the Transformer architecture, can be viewed as a non-parametric estimator based on kernel density estimation (KDE). This motivates us to leverage a set of robust kernel density estimation methods for alleviating the issue of data contamination. Specifically, we introduce a series of self-attention mechanisms that can be incorporated into different Transformer architectures and discuss the special properties of each method. We then perform extensive empirical studies on language modeling and image classification tasks. Our methods demonstrate robust performance in multiple scenarios while maintaining competitive results on clean datasets.  ( 2 min )
    Gaussian Pre-Activations in Neural Networks: Myth or Reality?. (arXiv:2205.12379v3 [cs.LG] UPDATED)
    The study of feature propagation at initialization in neural networks lies at the root of numerous initialization designs. An assumption very commonly made in the field states that the pre-activations are Gaussian. Although this convenient Gaussian hypothesis can be justified when the number of neurons per layer tends to infinity, it is challenged by both theoretical and experimental works for finite-width neural networks. Our major contribution is to construct a family of pairs of activation functions and initialization distributions that ensure that the pre-activations remain Gaussian throughout the network's depth, even in narrow neural networks. In the process, we discover a set of constraints that a neural network should fulfill to ensure Gaussian pre-activations. Additionally, we provide a critical review of the claims of the Edge of Chaos line of works and build an exact Edge of Chaos analysis. We also propose a unified view on pre-activations propagation, encompassing the framework of several well-known initialization procedures. Finally, our work provides a principled framework for answering the much-debated question: is it desirable to initialize the training of a neural network whose pre-activations are ensured to be Gaussian?  ( 2 min )
    Model reduction for the material point method via an implicit neural representation of the deformation map. (arXiv:2109.12390v5 [cs.LG] UPDATED)
    This work proposes a model-reduction approach for the material point method on nonlinear manifolds. Our technique approximates the $\textit{kinematics}$ by approximating the deformation map using an implicit neural representation that restricts deformation trajectories to reside on a low-dimensional manifold. By explicitly approximating the deformation map, its spatiotemporal gradients -- in particular the deformation gradient and the velocity -- can be computed via analytical differentiation. In contrast to typical model-reduction techniques that construct a linear or nonlinear manifold to approximate the (finite number of) degrees of freedom characterizing a given spatial discretization, the use of an implicit neural representation enables the proposed method to approximate the $\textit{continuous}$ deformation map. This allows the kinematic approximation to remain agnostic to the discretization. Consequently, the technique supports dynamic discretizations -- including resolution changes -- during the course of the online reduced-order-model simulation. To generate $\textit{dynamics}$ for the generalized coordinates, we propose a family of projection techniques. At each time step, these techniques: (1) Calculate full-space kinematics at quadrature points, (2) Calculate the full-space dynamics for a subset of `sample' material points, and (3) Calculate the reduced-space dynamics by projecting the updated full-space position and velocity onto the low-dimensional manifold and tangent space, respectively. We achieve significant computational speedup via hyper-reduction that ensures all three steps execute on only a small subset of the problem's spatial domain. Large-scale numerical examples with millions of material points illustrate the method's ability to gain an order of magnitude computational-cost saving -- indeed $\textit{real-time simulations}$ -- with negligible errors.  ( 3 min )
    Class Token and Knowledge Distillation for Multi-head Self-Attention Speaker Verification Systems. (arXiv:2111.03842v2 [eess.AS] UPDATED)
    This paper explores three novel approaches to improve the performance of speaker verification (SV) systems based on deep neural networks (DNN) using Multi-head Self-Attention (MSA) mechanisms and memory layers. Firstly, we propose the use of a learnable vector called Class token to replace the average global pooling mechanism to extract the embeddings. Unlike global average pooling, our proposal takes into account the temporal structure of the input what is relevant for the text-dependent SV task. The class token is concatenated to the input before the first MSA layer, and its state at the output is used to predict the classes. To gain additional robustness, we introduce two approaches. First, we have developed a Bayesian estimation of the class token. Second, we have added a distilled representation token for training a teacher-student pair of networks using the Knowledge Distillation (KD) philosophy, which is combined with the class token. This distillation token is trained to mimic the predictions from the teacher network, while the class token replicates the true label. All the strategies have been tested on the RSR2015-Part II and DeepMine-Part 1 databases for text-dependent SV, providing competitive results compared to the same architecture using the average pooling mechanism to extract average embeddings.  ( 2 min )
    Action Dynamics Task Graphs for Learning Plannable Representations of Procedural Tasks. (arXiv:2302.05330v1 [cs.CV])
    Given video demonstrations and paired narrations of an at-home procedural task such as changing a tire, we present an approach to extract the underlying task structure -- relevant actions and their temporal dependencies -- via action-centric task graphs. Learnt structured representations from our method, Action Dynamics Task Graphs (ADTG), can then be used for understanding such tasks in unseen videos of humans performing them. Furthermore, ADTG can enable providing user-centric guidance to humans in these tasks, either for performing them better or for learning new tasks. Specifically, we show how ADTG can be used for: (1) tracking an ongoing task, (2) recommending next actions, and (3) planning a sequence of actions to accomplish a procedural task. We compare against state-of-the-art Neural Task Graph method and demonstrate substantial gains on 18 procedural tasks from the CrossTask dataset, including 30.1% improvement in task tracking accuracy and 20.3% accuracy gain in next action prediction.  ( 2 min )
    Minimax Instrumental Variable Regression and $L_2$ Convergence Guarantees without Identification or Closedness. (arXiv:2302.05404v1 [stat.ML])
    In this paper, we study nonparametric estimation of instrumental variable (IV) regressions. Recently, many flexible machine learning methods have been developed for instrumental variable estimation. However, these methods have at least one of the following limitations: (1) restricting the IV regression to be uniquely identified; (2) only obtaining estimation error rates in terms of pseudometrics (\emph{e.g.,} projected norm) rather than valid metrics (\emph{e.g.,} $L_2$ norm); or (3) imposing the so-called closedness condition that requires a certain conditional expectation operator to be sufficiently smooth. In this paper, we present the first method and analysis that can avoid all three limitations, while still permitting general function approximation. Specifically, we propose a new penalized minimax estimator that can converge to a fixed IV solution even when there are multiple solutions, and we derive a strong $L_2$ error rate for our estimator under lax conditions. Notably, this guarantee only needs a widely-used source condition and realizability assumptions, but not the so-called closedness condition. We argue that the source condition and the closedness condition are inherently conflicting, so relaxing the latter significantly improves upon the existing literature that requires both conditions. Our estimator can achieve this improvement because it builds on a novel formulation of the IV estimation problem as a constrained optimization problem.  ( 2 min )
    Key Design Choices for Double-Transfer in Source-Free Unsupervised Domain Adaptation. (arXiv:2302.05379v1 [cs.LG])
    Fine-tuning and Domain Adaptation emerged as effective strategies for efficiently transferring deep learning models to new target tasks. However, target domain labels are not accessible in many real-world scenarios. This led to the development of Unsupervised Domain Adaptation (UDA) methods, which only employ unlabeled target samples. Furthermore, efficiency and privacy requirements may also prevent the use of source domain data during the adaptation stage. This challenging setting, known as Source-Free Unsupervised Domain Adaptation (SF-UDA), is gaining interest among researchers and practitioners due to its potential for real-world applications. In this paper, we provide the first in-depth analysis of the main design choices in SF-UDA through a large-scale empirical study across 500 models and 74 domain pairs. We pinpoint the normalization approach, pre-training strategy, and backbone architecture as the most critical factors. Based on our quantitative findings, we propose recipes to best tackle SF-UDA scenarios. Moreover, we show that SF-UDA is competitive also beyond standard benchmarks and backbone architectures, performing on par with UDA at a fraction of the data and computational cost. In the interest of reproducibility, we include the full experimental results and code as supplementary material.  ( 2 min )
    Online Distribution Shift Detection via Recency Prediction. (arXiv:2211.09916v2 [cs.RO] UPDATED)
    When deploying modern machine learning-enabled robotic systems in high-stakes applications, detecting distribution shift is critical. However, most existing methods for detecting distribution shift are not well-suited to robotics settings, where data often arrives in a streaming fashion and may be very high-dimensional. In this work, we present an online method for detecting distribution shift with guarantees on the false positive rate - i.e., when there is no distribution shift, our system is very unlikely (with probability $< \epsilon$) to falsely issue an alert; any alerts that are issued should therefore be heeded. Our method is specifically designed for efficient detection even with high dimensional data, and it empirically achieves up to 11x faster detection on realistic robotics settings compared to prior work while maintaining a low false negative rate in practice (whenever there is a distribution shift in our experiments, our method indeed emits an alert).
    Mathematical Theory of Bayesian Statistics for Unknown Information Source. (arXiv:2206.05630v5 [cs.LG] UPDATED)
    In statistical inference, uncertainty is unknown and all models are wrong. That is to say, a person who makes a statistical model and a prior distribution is simultaneously aware that both are fictional candidates. To study such cases, statistical measures have been constructed, such as cross validation, information criteria, and marginal likelihood, however, their mathematical properties have not yet been completely clarified when statistical models are under- and over- parametrized. We introduce a place of mathematical theory of Bayesian statistics for unknown uncertainty, which clarifies general properties of cross validation, information criteria, and marginal likelihood, even if an unknown data-generating process is unrealizable by a model or even if the posterior distribution cannot be approximated by any normal distribution. Hence it gives a helpful standpoint for a person who cannot believe in any specific model and prior. This paper consists of three parts. The first is a new result, whereas the second and third are well-known previous results with new experiments. We show there exists a more precise estimator of the generalization loss than leave-one-out cross validation, there exists a more accurate approximation of marginal likelihood than BIC, and the optimal hyperparameters for generalization loss and marginal likelihood are different.
    On the Interventional Kullback-Leibler Divergence. (arXiv:2302.05380v1 [cs.LG])
    Modern machine learning approaches excel in static settings where a large amount of i.i.d. training data are available for a given task. In a dynamic environment, though, an intelligent agent needs to be able to transfer knowledge and re-use learned components across domains. It has been argued that this may be possible through causal models, aiming to mirror the modularity of the real world in terms of independent causal mechanisms. However, the true causal structure underlying a given set of data is generally not identifiable, so it is desirable to have means to quantify differences between models (e.g., between the ground truth and an estimate), on both the observational and interventional level. In the present work, we introduce the Interventional Kullback-Leibler (IKL) divergence to quantify both structural and distributional differences between models based on a finite set of multi-environment distributions generated by interventions from the ground truth. Since we generally cannot quantify all differences between causal models for every finite set of interventional distributions, we propose a sufficient condition on the intervention targets to identify subsets of observed variables on which the models provably agree or disagree.  ( 2 min )
    ASIF: Coupled Data Turns Unimodal Models to Multimodal Without Training. (arXiv:2210.01738v2 [cs.LG] UPDATED)
    CLIP proved that aligning visual and language spaces is key to solving many vision tasks without explicit training, but required to train image and text encoders from scratch on a huge dataset. LiT improved this by only training the text encoder and using a pre-trained vision network. In this paper, we show that a common space can be created without any training at all, using single-domain encoders (trained with or without supervision) and a much smaller amount of image-text pairs. Furthermore, our model has unique properties. Most notably, deploying a new version with updated training samples can be done in a matter of seconds. Additionally, the representations in the common space are easily interpretable as every dimension corresponds to the similarity of the input to a unique entry in the multimodal dataset. Experiments on standard zero-shot visual benchmarks demonstrate the typical transfer ability of image-text models. Overall, our method represents a simple yet surprisingly strong baseline for foundation multi-modal models, raising important questions on their data efficiency and on the role of retrieval in machine learning.
    More Centralized Training, Still Decentralized Execution: Multi-Agent Conditional Policy Factorization. (arXiv:2209.12681v2 [cs.LG] UPDATED)
    In cooperative multi-agent reinforcement learning (MARL), combining value decomposition with actor-critic enables agents to learn stochastic policies, which are more suitable for the partially observable environment. Given the goal of learning local policies that enable decentralized execution, agents are commonly assumed to be independent of each other, even in centralized training. However, such an assumption may prohibit agents from learning the optimal joint policy. To address this problem, we explicitly take the dependency among agents into centralized training. Although this leads to the optimal joint policy, it may not be factorized for decentralized execution. Nevertheless, we theoretically show that from such a joint policy, we can always derive another joint policy that achieves the same optimality but can be factorized for decentralized execution. To this end, we propose multi-agent conditional policy factorization (MACPF), which takes more centralized training but still enables decentralized execution. We empirically verify MACPF in various cooperative MARL tasks and demonstrate that MACPF achieves better performance or faster convergence than baselines. Our code is available at https://github.com/PKU-RL/FOP-DMAC-MACPF.
    An Additive Instance-Wise Approach to Multi-class Model Interpretation. (arXiv:2207.03113v4 [cs.LG] UPDATED)
    Interpretable machine learning offers insights into what factors drive a certain prediction of a black-box system. A large number of interpreting methods focus on identifying explanatory input features, which generally fall into two main categories: attribution and selection. A popular attribution-based approach is to exploit local neighborhoods for learning instance-specific explainers in an additive manner. The process is thus inefficient and susceptible to poorly-conditioned samples. Meanwhile, many selection-based methods directly optimize local feature distributions in an instance-wise training framework, thereby being capable of leveraging global information from other inputs. However, they can only interpret single-class predictions and many suffer from inconsistency across different settings, due to a strict reliance on a pre-defined number of features selected. This work exploits the strengths of both methods and proposes a framework for learning local explanations simultaneously for multiple target classes. Our model explainer significantly outperforms additive and instance-wise counterparts on faithfulness with more compact and comprehensible explanations. We also demonstrate the capacity to select stable and important features through extensive experiments on various data sets and black-box model architectures.
    Project and Probe: Sample-Efficient Domain Adaptation by Interpolating Orthogonal Features. (arXiv:2302.05441v1 [cs.LG])
    Conventional approaches to robustness try to learn a model based on causal features. However, identifying maximally robust or causal features may be difficult in some scenarios, and in others, non-causal "shortcut" features may actually be more predictive. We propose a lightweight, sample-efficient approach that learns a diverse set of features and adapts to a target distribution by interpolating these features with a small target dataset. Our approach, Project and Probe (Pro$^2$), first learns a linear projection that maps a pre-trained embedding onto orthogonal directions while being predictive of labels in the source dataset. The goal of this step is to learn a variety of predictive features, so that at least some of them remain useful after distribution shift. Pro$^2$ then learns a linear classifier on top of these projected features using a small target dataset. We theoretically show that Pro$^2$ learns a projection matrix that is optimal for classification in an information-theoretic sense, resulting in better generalization due to a favorable bias-variance tradeoff. Our experiments on four datasets, with multiple distribution shift settings for each, show that Pro$^2$ improves performance by 5-15% when given limited target data compared to prior methods such as standard linear probing.  ( 2 min )
    Deciphering the Language of Nature: A transformer-based language model for deleterious mutations in proteins. (arXiv:2110.14746v4 [q-bio.GN] UPDATED)
    Various machine-learning models, including deep neural network models, have already been developed to predict deleteriousness of missense (non-synonymous) mutations. Potential improvements to the current state of the art, however, may still benefit from a fresh look at the biological problem using more sophisticated self-adaptive machine-learning approaches. Recent advances in the natural language processing field show transformer models-a type of deep neural network-to be particularly powerful at modeling sequence information with context dependence. In this study, we introduce MutFormer, a transformer-based model for the prediction of deleterious missense mutations, which uses reference and mutated protein sequences from the human genome as the primary features. MutFormer takes advantage of a combination of self-attention layers and convolutional layers to learn both long-range and short-range dependencies between amino acid mutations in a protein sequence. In this study, we first pre-trained MutFormer on reference protein sequences and mutated protein sequences resulting from common genetic variants observed in human populations. We next examined different fine-tuning methods to successfully apply the model to deleteriousness prediction of missense mutations. Finally, we evaluated MutFormer's performance on multiple testing data sets. We found that MutFormer showed similar or improved performance over a variety of existing tools, including those that used conventional machine-learning approaches. We conclude that MutFormer successfully considers sequence features that are not explored in previous studies and could potentially complement existing computational predictions or empirically generated functional scores to improve our understanding of disease variants.  ( 2 min )
    Efficient and Accurate Learning of Mixtures of Plackett-Luce Models. (arXiv:2302.05343v1 [cs.LG])
    Mixture models of Plackett-Luce (PL) -- one of the most fundamental ranking models -- are an active research area of both theoretical and practical significance. Most previously proposed parameter estimation algorithms instantiate the EM algorithm, often with random initialization. However, such an initialization scheme may not yield a good initial estimate and the algorithms require multiple restarts, incurring a large time complexity. As for the EM procedure, while the E-step can be performed efficiently, maximizing the log-likelihood in the M-step is difficult due to the combinatorial nature of the PL likelihood function (Gormley and Murphy 2008). Therefore, previous authors favor algorithms that maximize surrogate likelihood functions (Zhao et al. 2018, 2020). However, the final estimate may deviate from the true maximum likelihood estimate as a consequence. In this paper, we address these known limitations. We propose an initialization algorithm that can provide a provably accurate initial estimate and an EM algorithm that maximizes the true log-likelihood function efficiently. Experiments on both synthetic and real datasets show that our algorithm is competitive in terms of accuracy and speed to baseline algorithms, especially on datasets with a large number of items.  ( 2 min )
    Q-Match: Self-supervised Learning by Matching Distributions Induced by a Queue. (arXiv:2302.05444v1 [cs.LG])
    In semi-supervised learning, student-teacher distribution matching has been successful in improving performance of models using unlabeled data in conjunction with few labeled samples. In this paper, we aim to replicate that success in the self-supervised setup where we do not have access to any labeled data during pre-training. We introduce our algorithm, Q-Match, and show it is possible to induce the student-teacher distributions without any knowledge of downstream classes by using a queue of embeddings of samples from the unlabeled dataset. We focus our study on tabular datasets and show that Q-Match outperforms previous self-supervised learning techniques when measuring downstream classification performance. Furthermore, we show that our method is sample efficient--in terms of both the labels required for downstream training and the amount of unlabeled data required for pre-training--and scales well to the sizes of both the labeled and unlabeled data.  ( 2 min )
    Watermarking Pre-trained Language Models with Backdooring. (arXiv:2210.07543v2 [cs.CL] UPDATED)
    Large pre-trained language models (PLMs) have proven to be a crucial component of modern natural language processing systems. PLMs typically need to be fine-tuned on task-specific downstream datasets, which makes it hard to claim the ownership of PLMs and protect the developer's intellectual property due to the catastrophic forgetting phenomenon. We show that PLMs can be watermarked with a multi-task learning framework by embedding backdoors triggered by specific inputs defined by the owners, and those watermarks are hard to remove even though the watermarked PLMs are fine-tuned on multiple downstream tasks. In addition to using some rare words as triggers, we also show that the combination of common words can be used as backdoor triggers to avoid them being easily detected. Extensive experiments on multiple datasets demonstrate that the embedded watermarks can be robustly extracted with a high success rate and less influenced by the follow-up fine-tuning.  ( 2 min )
    Active Learning from the Web. (arXiv:2210.08205v2 [cs.LG] UPDATED)
    Labeling data is one of the most costly processes in machine learning pipelines. Active learning is a standard approach to alleviating this problem. Pool-based active learning first builds a pool of unlabelled data and iteratively selects data to be labeled so that the total number of required labels is minimized, keeping the model performance high. Many effective criteria for choosing data from the pool have been proposed in the literature. However, how to build the pool is less explored. Specifically, most of the methods assume that a task-specific pool is given for free. In this paper, we advocate that such a task-specific pool is not always available and propose the use of a myriad of unlabelled data on the Web for the pool for which active learning is applied. As the pool is extremely large, it is likely that relevant data exist in the pool for many tasks, and we do not need to explicitly design and build the pool for each task. The challenge is that we cannot compute the acquisition scores of all data exhaustively due to the size of the pool. We propose an efficient method, Seafaring, to retrieve informative data in terms of active learning from the Web using a user-side information retrieval algorithm. In the experiments, we use the online Flickr environment as the pool for active learning. This pool contains more than ten billion images and is several orders of magnitude larger than the existing pools in the literature for active learning. We confirm that our method performs better than existing approaches of using a small unlabelled pool.  ( 2 min )
    Oracle-Efficient Smoothed Online Learning for Piecewise Continuous Decision Making. (arXiv:2302.05430v1 [stat.ML])
    Smoothed online learning has emerged as a popular framework to mitigate the substantial loss in statistical and computational complexity that arises when one moves from classical to adversarial learning. Unfortunately, for some spaces, it has been shown that efficient algorithms suffer an exponentially worse regret than that which is minimax optimal, even when the learner has access to an optimization oracle over the space. To mitigate that exponential dependence, this work introduces a new notion of complexity, the generalized bracketing numbers, which marries constraints on the adversary to the size of the space, and shows that an instantiation of Follow-the-Perturbed-Leader can attain low regret with the number of calls to the optimization oracle scaling optimally with respect to average regret. We then instantiate our bounds in several problems of interest, including online prediction and planning of piecewise continuous functions, which has many applications in fields as diverse as econometrics and robotics.  ( 2 min )
    Controlling Large Language Models to Generate Secure and Vulnerable Code. (arXiv:2302.05319v1 [cs.CR])
    Large language models (LMs) are increasingly pretrained on massive corpora of open-source programs and applied to solve program synthesis tasks. However, a fundamental limitation of LMs is their unawareness of security and vulnerability during pretraining and inference. As a result, LMs produce secure or vulnerable programs with high uncertainty (e.g., around 60%/40% chances for GitHub Copilot according to a recent study). This greatly impairs LMs' usability, especially in security-sensitive scenarios. To address this limitation, this work formulates a new problem called controlled code generation, which allows users to input a boolean property into an LM to control if the LM generates secure or vulnerable code. We propose svGen, an effective and lightweight learning approach for solving controlled code generation. svGen leverages property-specific continuous vectors to steer program generation toward the given property, without altering the weights of the LM. svGen's training optimizes those continuous vectors by carefully applying specialized loss terms on different regions of code. Our extensive evaluation shows that svGen achieves strong control capability across various software vulnerabilities and LMs of different parameter sizes. For example, on 9 dangerous vulnerabilities, a state-of-the-art CodeGen LM with 2.7B parameters generates secure programs with a 57% chance. When we use svGen to control the LM to generate secure (resp., vulnerable) programs, the chance is significantly increased to 82% (resp., decreased to 35%).  ( 2 min )
    Unified Functional Hashing in Automatic Machine Learning. (arXiv:2302.05433v1 [cs.LG])
    The field of Automatic Machine Learning (AutoML) has recently attained impressive results, including the discovery of state-of-the-art machine learning solutions, such as neural image classifiers. This is often done by applying an evolutionary search method, which samples multiple candidate solutions from a large space and evaluates the quality of each candidate through a long training process. As a result, the search tends to be slow. In this paper, we show that large efficiency gains can be obtained by employing a fast unified functional hash, especially through the functional equivalence caching technique, which we also present. The central idea is to detect by hashing when the search method produces equivalent candidates, which occurs very frequently, and this way avoid their costly re-evaluation. Our hash is "functional" in that it identifies equivalent candidates even if they were represented or coded differently, and it is "unified" in that the same algorithm can hash arbitrary representations; e.g. compute graphs, imperative code, or lambda functions. As evidence, we show dramatic improvements on multiple AutoML domains, including neural architecture search and algorithm discovery. Finally, we consider the effect of hash collisions, evaluation noise, and search distribution through empirical analysis. Altogether, we hope this paper may serve as a guide to hashing techniques in AutoML.  ( 2 min )
    XFL: A High Performace, Lightweighted Federated Learning Framework. (arXiv:2302.05076v1 [cs.LG])
    This paper introduces XFL, an industrial-grade federated learning project. XFL supports training AI models collaboratively on multiple devices, while utilizes homomorphic encryption, differential privacy, secure multi-party computation and other security technologies ensuring no leakage of data. XFL provides an abundant algorithms library, integrating a large number of pre-built, secure and outstanding federated learning algorithms, covering both the horizontally and vertically federated learning scenarios. Numerical experiments have shown the prominent performace of these algorithms. XFL builds a concise configuration interfaces with presettings for all federation algorithms, and supports the rapid deployment via docker containers.Therefore, we believe XFL is the most user-friendly and easy-to-develop federated learning framework. XFL is open-sourced, and both the code and documents are available at https://github.com/paritybit-ai/XFL.  ( 2 min )
    From Group-Differences to Single-Subject Probability: Conformal Prediction-based Uncertainty Estimation for Brain-Age Modeling. (arXiv:2302.05304v1 [cs.LG])
    The brain-age gap is one of the most investigated risk markers for brain changes across disorders. While the field is progressing towards large-scale models, recently incorporating uncertainty estimates, no model to date provides the single-subject risk assessment capability essential for clinical application. In order to enable the clinical use of brain-age as a biomarker, we here combine uncertainty-aware deep Neural Networks with conformal prediction theory. This approach provides statistical guarantees with respect to single-subject uncertainty estimates and allows for the calculation of an individual's probability for accelerated brain-aging. Building on this, we show empirically in a sample of N=16,794 participants that 1. a lower or comparable error as state-of-the-art, large-scale brain-age models, 2. the statistical guarantees regarding single-subject uncertainty estimation indeed hold for every participant, and 3. that the higher individual probabilities of accelerated brain-aging derived from our model are associated with Alzheimer's Disease, Bipolar Disorder and Major Depressive Disorder.  ( 2 min )
    Using Connectome Features to Constrain Echo State Networks. (arXiv:2206.02094v2 [cs.LG] UPDATED)
    We report an improvement to the conventional Echo State Network (ESN) across three benchmark chaotic time-series prediction tasks using fruit fly connectome data alone. We also investigate the impact of key connectome-derived structural features on prediction performance -- uniquely bridging neurobiological structure and machine learning function; and find that both increasing the global average clustering coefficient and modifying the position of weights -- by permuting their synapse-synapse partners -- can lead to increased model variance and (in some cases) degraded performance. In all we consider four topological point modifications to a connectome-derived ESN reservoir (null model): namely, we alter the network sparsity, re-draw nonzero weights from a uniform distribution, permute nonzero weight positions, and increase the network global average clustering coefficient. We compare the four resulting ESN model classes -- and the null model -- with a conventional ESN by conducting time-series prediction experiments on size-variants of the Mackey-Glass 17 (MG-17), Lorenz, and Rossler chaotic time series; denoting each model's performance and variance across train-validate trials.  ( 2 min )
    Small-Text: Active Learning for Text Classification in Python. (arXiv:2107.10314v5 [cs.LG] UPDATED)
    We present small-text, an easy-to-use active learning library written in Python, which offers pool-based active learning for single- and multi-label text classification in Python. It features many pre-implemented state-of-the-art query strategies, including some that leverage the GPU. Standardized interfaces allow the combination of a variety of classifiers, query strategies, and stopping criteria, facilitating a quick mix and match, and enabling a rapid development of both active learning experiments and applications. In order to make various classifiers and query strategies accessible for active learning, small-text integrates several well-known machine learning libraries, namely scikit-learn, PyTorch, and Hugging Face transformers. The latter integrations are optionally installable extensions, so GPUs can be used but are not required. The library is publicly available under the MIT License at https://github.com/webis-de/small-text, in version 1.1.1 at the time of writing.  ( 2 min )
    Reinforcement Learning from Multiple Sensors via Joint Representations. (arXiv:2302.05342v1 [cs.LG])
    In many scenarios, observations from more than one sensor modality are available for reinforcement learning (RL). For example, many agents can perceive their internal state via proprioceptive sensors but must infer the environment's state from high-dimensional observations such as images. For image-based RL, a variety of self-supervised representation learning approaches exist to improve performance and sample complexity. These approaches learn the image representation in isolation. However, including proprioception can help representation learning algorithms to focus on relevant aspects and guide them toward finding better representations. Hence, in this work, we propose using Recurrent State Space Models to fuse all available sensory information into a single consistent representation. We combine reconstruction-based and contrastive approaches for training, which allows using the most appropriate method for each sensor modality. For example, we can use reconstruction for proprioception and a contrastive loss for images. We demonstrate the benefits of utilizing proprioception in learning representations for RL on a large set of experiments. Furthermore, we show that our joint representations significantly improve performance compared to a post hoc combination of image representations and proprioception.
    Matching Correlated Inhomogeneous Random Graphs using the $k$-core Estimator. (arXiv:2302.05407v1 [math.ST])
    We consider the task of estimating the latent vertex correspondence between two edge-correlated random graphs with generic, inhomogeneous structure. We study the so-called \emph{$k$-core estimator}, which outputs a vertex correspondence that induces a large, common subgraph of both graphs which has minimum degree at least $k$. We derive sufficient conditions under which the $k$-core estimator exactly or partially recovers the latent vertex correspondence. Finally, we specialize our general framework to derive new results on exact and partial recovery in correlated stochastic block models, correlated Chung-Lu graphs, and correlated random geometric graphs.  ( 2 min )
    Example-Based Sampling with Diffusion Models. (arXiv:2302.05116v1 [cs.GR])
    Much effort has been put into developing samplers with specific properties, such as producing blue noise, low-discrepancy, lattice or Poisson disk samples. These samplers can be slow if they rely on optimization processes, may rely on a wide range of numerical methods, are not always differentiable. The success of recent diffusion models for image generation suggests that these models could be appropriate for learning how to generate point sets from examples. However, their convolutional nature makes these methods impractical for dealing with scattered data such as point sets. We propose a generic way to produce 2-d point sets imitating existing samplers from observed point sets using a diffusion model. We address the problem of convolutional layers by leveraging neighborhood information from an optimal transport matching to a uniform grid, that allows us to benefit from fast convolutions on grids, and to support the example-based learning of non-uniform sampling patterns. We demonstrate how the differentiability of our approach can be used to optimize point sets to enforce properties.
    A Graph-Based Modeling Framework for Tracing Hydrological Pollutant Transport in Surface Waters. (arXiv:2302.04991v1 [cs.LG])
    Anthropogenic pollution of hydrological systems affects diverse communities and ecosystems around the world. Data analytics and modeling tools play a key role in fighting this challenge, as they can help identify key sources as well as trace transport and quantify impact within complex hydrological systems. Several tools exist for simulating and tracing pollutant transport throughout surface waters using detailed physical models; these tools are powerful, but can be computationally intensive, require significant amounts of data to be developed, and require expert knowledge for their use (ultimately limiting application scope). In this work, we present a graph modeling framework -- which we call ${\tt HydroGraphs}$ -- for understanding pollutant transport and fate across waterbodies, rivers, and watersheds. This framework uses a simplified representation of hydrological systems that can be constructed based purely on open-source data (National Hydrography Dataset and Watershed Boundary Dataset). The graph representation provides an flexible intuitive approach for capturing connectivity and for identifying upstream pollutant sources and for tracing downstream impacts within small and large hydrological systems. Moreover, the graph representation can facilitate the use of advanced algorithms and tools of graph theory, topology, optimization, and machine learning to aid data analytics and decision-making. We demonstrate the capabilities of our framework by using case studies in the State of Wisconsin; here, we aim to identify upstream nutrient pollutant sources that arise from agricultural practices and trace downstream impacts to waterbodies, rivers, and streams. Our tool ultimately seeks to help stakeholders design effective pollution prevention/mitigation practices and evaluate how surface waters respond to such practices.
    Black-Box Generalization: Stability of Zeroth-Order Learning. (arXiv:2202.06880v2 [cs.LG] UPDATED)
    We provide the first generalization error analysis for black-box learning through derivative-free optimization. Under the assumption of a Lipschitz and smooth unknown loss, we consider the Zeroth-order Stochastic Search (ZoSS) algorithm, that updates a $d$-dimensional model by replacing stochastic gradient directions with stochastic differences of $K+1$ perturbed loss evaluations per dataset (example) query. For both unbounded and bounded possibly nonconvex losses, we present the first generalization bounds for the ZoSS algorithm. These bounds coincide with those for SGD, and rather surprisingly are independent of $d$, $K$ and the batch size $m$, under appropriate choices of a slightly decreased learning rate. For bounded nonconvex losses and a batch size $m=1$, we additionally show that both generalization error and learning rate are independent of $d$ and $K$, and remain essentially the same as for the SGD, even for two function evaluations. Our results extensively extend and consistently recover established results for SGD in prior work, on both generalization bounds and corresponding learning rates. If additionally $m=n$, where $n$ is the dataset size, we derive generalization guarantees for full-batch GD as well.  ( 2 min )
    Toward Degree Bias in Embedding-Based Knowledge Graph Completion. (arXiv:2302.05044v1 [cs.LG])
    A fundamental task for knowledge graphs (KGs) is knowledge graph completion (KGC). It aims to predict unseen edges by learning representations for all the entities and relations in a KG. A common concern when learning representations on traditional graphs is degree bias. It can affect graph algorithms by learning poor representations for lower-degree nodes, often leading to low performance on such nodes. However, there has been limited research on whether there exists degree bias for embedding-based KGC and how such bias affects the performance of KGC. In this paper, we validate the existence of degree bias in embedding-based KGC and identify the key factor to degree bias. We then introduce a novel data augmentation method, KG-Mixup, to generate synthetic triples to mitigate such bias. Extensive experiments have demonstrated that our method can improve various embedding-based KGC methods and outperform other methods tackling the bias problem on multiple benchmark datasets.
    Virtual Reality via Object Pose Estimation and Active Learning: Realizing Telepresence Robots with Aerial Manipulation Capabilities. (arXiv:2210.09678v2 [cs.RO] UPDATED)
    This article presents a novel telepresence system for advancing aerial manipulation in dynamic and unstructured environments. The proposed system not only features a haptic device, but also a virtual reality (VR) interface that provides real-time 3D displays of the robot's workspace as well as a haptic guidance to its remotely located operator. To realize this, multiple sensors namely a LiDAR, cameras and IMUs are utilized. For processing of the acquired sensory data, pose estimation pipelines are devised for industrial objects of both known and unknown geometries. We further propose an active learning pipeline in order to increase the sample efficiency of a pipeline component that relies on Deep Neural Networks (DNNs) based object detection. All these algorithms jointly address various challenges encountered during the execution of perception tasks in industrial scenarios. In the experiments, exhaustive ablation studies are provided to validate the proposed pipelines. Methodologically, these results commonly suggest how an awareness of the algorithms' own failures and uncertainty (`introspection') can be used tackle the encountered problems. Moreover, outdoor experiments are conducted to evaluate the effectiveness of the overall system in enhancing aerial manipulation capabilities. In particular, with flight campaigns over days and nights, from spring to winter, and with different users and locations, we demonstrate over 70 robust executions of pick-and-place, force application and peg-in-hole tasks with the DLR cable-Suspended Aerial Manipulator (SAM). As a result, we show the viability of the proposed system in future industrial applications.
    Online Real-Time Recurrent Learning Using Sparse Connections and Selective Learning. (arXiv:2302.05326v1 [cs.LG])
    State construction from sensory observations is an important component of a reinforcement learning agent. One solution for state construction is to use recurrent neural networks. Two popular gradient-based methods for recurrent learning are back-propagation through time (BPTT), and real-time recurrent learning (RTRL). BPTT looks at the complete sequence of observations before computing gradients and is unsuitable for online real-time updates. RTRL can do online updates but scales poorly to large networks. In this paper, we propose two constraints that make RTRL scalable. We show that by either decomposing the network into independent modules or learning a recurrent network incrementally, we can make RTRL scale linearly with the number of parameters. Unlike prior scalable gradient estimation algorithms, such as UORO and Truncated-BPTT, our algorithms do not add noise or bias to the gradient estimate. Instead, they trade off the functional capacity of the recurrent network to achieve scalable learning. We demonstrate the effectiveness of our approach over Truncated-BPTT on a benchmark inspired by animal learning and in policy evaluation for expert Rainbow-DQN agents in the Arcade Learning Environment (ALE).
    Beyond Lipschitz: Sharp Generalization and Excess Risk Bounds for Full-Batch GD. (arXiv:2204.12446v5 [stat.ML] UPDATED)
    We provide sharp path-dependent generalization and excess risk guarantees for the full-batch Gradient Descent (GD) algorithm on smooth losses (possibly non-Lipschitz, possibly nonconvex). At the heart of our analysis is an upper bound on the generalization error, which implies that average output stability and a bounded expected optimization error at termination lead to generalization. This result shows that a small generalization error occurs along the optimization path, and allows us to bypass Lipschitz or sub-Gaussian assumptions on the loss prevalent in previous works. For nonconvex, convex, and strongly convex losses, we show the explicit dependence of the generalization error in terms of the accumulated path-dependent optimization error, terminal optimization error, number of samples, and number of iterations. For nonconvex smooth losses, we prove that full-batch GD efficiently generalizes close to any stationary point at termination, and recovers the generalization error guarantees of stochastic algorithms with fewer assumptions. For smooth convex losses, we show that the generalization error is tighter than existing bounds for SGD (up to one order of error magnitude). Consequently the excess risk matches that of SGD for quadratically less iterations. Lastly, for strongly convex smooth losses, we show that full-batch GD achieves essentially the same excess risk rate as compared with the state of the art on SGD, but with an exponentially smaller number of iterations (logarithmic in the dataset size).
    Out-of-Distribution Representation Learning for Time Series Classification. (arXiv:2209.07027v3 [cs.LG] UPDATED)
    Time series classification is an important problem in real world. Due to its non-stationary property that the distribution changes over time, it remains challenging to build models for generalization to unseen distributions. In this paper, we propose to view the time series classification problem from the distribution perspective. We argue that the temporal complexity attributes to the unknown latent distributions within. To this end, we propose DIVERSIFY to learn generalized representations for time series classification. DIVERSIFY takes an iterative process: it first obtains the worst-case distribution scenario via adversarial training, then matches the distributions of the obtained sub-domains. We also present some theoretical insights. We conduct experiments on gesture recognition, speech commands recognition, wearable stress and affect detection, and sensor-based human activity recognition with a total of seven datasets in different settings. Results demonstrate that DIVERSIFY significantly outperforms other baselines and effectively characterizes the latent distributions by qualitative and quantitative analysis. Code is available at: https://github.com/microsoft/robustlearn.
    On the Convergence of Stochastic Gradient Descent for Linear Inverse Problems in Banach Spaces. (arXiv:2302.05197v1 [cs.LG])
    In this work we consider stochastic gradient descent (SGD) for solving linear inverse problems in Banach spaces. SGD and its variants have been established as one of the most successful optimisation methods in machine learning, imaging and signal processing, etc. At each iteration SGD uses a single datum, or a small subset of data, resulting in highly scalable methods that are very attractive for large-scale inverse problems. Nonetheless, the theoretical analysis of SGD-based approaches for inverse problems has thus far been largely limited to Euclidean and Hilbert spaces. In this work we present a novel convergence analysis of SGD for linear inverse problems in general Banach spaces: we show the almost sure convergence of the iterates to the minimum norm solution and establish the regularising property for suitable a priori stopping criteria. Numerical results are also presented to illustrate features of the approach.
    DCSAU-Net: A Deeper and More Compact Split-Attention U-Net for Medical Image Segmentation. (arXiv:2202.00972v2 [eess.IV] CROSS LISTED)
    Deep learning architecture with convolutional neural network (CNN) achieves outstanding success in the field of computer vision. Where U-Net, an encoder-decoder architecture structured by CNN, makes a great breakthrough in biomedical image segmentation and has been applied in a wide range of practical scenarios. However, the equal design of every downsampling layer in the encoder part and simply stacked convolutions do not allow U-Net to extract sufficient information of features from different depths. The increasing complexity of medical images brings new challenges to the existing methods. In this paper, we propose a deeper and more compact split-attention u-shape network (DCSAU-Net), which efficiently utilises low-level and high-level semantic information based on two novel frameworks: primary feature conservation and compact split-attention block. We evaluate the proposed model on CVC-ClinicDB, 2018 Data Science Bowl, ISIC-2018 and SegPC-2021 datasets. As a result, DCSAU-Net displays better performance than other state-of-the-art (SOTA) methods in terms of the mean Intersection over Union (mIoU) and F1-socre. More significantly, the proposed model demonstrates excellent segmentation performance on challenging images. The code for our work and more technical details can be found at https://github.com/xq141839/DCSAU-Net.
    Controllability-Aware Unsupervised Skill Discovery. (arXiv:2302.05103v1 [cs.RO])
    One of the key capabilities of intelligent agents is the ability to discover useful skills without external supervision. However, the current unsupervised skill discovery methods are often limited to acquiring simple, easy-to-learn skills due to the lack of incentives to discover more complex, challenging behaviors. We introduce a novel unsupervised skill discovery method, Controllability-aware Skill Discovery (CSD), which actively seeks complex, hard-to-control skills without supervision. The key component of CSD is a controllability-aware distance function, which assigns larger values to state transitions that are harder to achieve with the current skills. Combined with distance-maximizing skill discovery, CSD progressively learns more challenging skills over the course of training as our jointly trained distance function reduces rewards for easy-to-achieve skills. Our experimental results in six robotic manipulation and locomotion environments demonstrate that CSD can discover diverse complex skills including object manipulation and locomotion skills with no supervision, significantly outperforming prior unsupervised skill discovery methods. Videos and code are available at https://sites.google.com/view/icml2023csd
    Monte Carlo Neural Operator for Learning PDEs via Probabilistic Representation. (arXiv:2302.05104v1 [cs.LG])
    Neural operators, which use deep neural networks to approximate the solution mappings of partial differential equation (PDE) systems, are emerging as a new paradigm for PDE simulation. The neural operators could be trained in supervised or unsupervised ways, i.e., by using the generated data or the PDE information. The unsupervised training approach is essential when data generation is costly or the data is less qualified (e.g., insufficient and noisy). However, its performance and efficiency have plenty of room for improvement. To this end, we design a new loss function based on the Feynman-Kac formula and call the developed neural operator Monte-Carlo Neural Operator (MCNO), which can allow larger temporal steps and efficiently handle fractional diffusion operators. Our analyses show that MCNO has advantages in handling complex spatial conditions and larger temporal steps compared with other unsupervised methods. Furthermore, MCNO is more robust with the perturbation raised by the numerical scheme and operator approximation. Numerical experiments on the diffusion equation and Navier-Stokes equation show significant accuracy improvement compared with other unsupervised baselines, especially for the vibrated initial condition and long-time simulation settings.
    Debiasing Recommendation by Learning Identifiable Latent Confounders. (arXiv:2302.05052v1 [cs.LG])
    Recommendation systems aim to predict users' feedback on items not exposed to them. Confounding bias arises due to the presence of unmeasured variables (e.g., the socio-economic status of a user) that can affect both a user's exposure and feedback. Existing methods either (1) make untenable assumptions about these unmeasured variables or (2) directly infer latent confounders from users' exposure. However, they cannot guarantee the identification of counterfactual feedback, which can lead to biased predictions. In this work, we propose a novel method, i.e., identifiable deconfounder (iDCF), which leverages a set of proxy variables (e.g., observed user features) to resolve the aforementioned non-identification issue. The proposed iDCF is a general deconfounded recommendation framework that applies proximal causal inference to infer the unmeasured confounders and identify the counterfactual feedback with theoretical guarantees. Extensive experiments on various real-world and synthetic datasets verify the proposed method's effectiveness and robustness.
    Event Temporal Relation Extraction with Bayesian Translational Model. (arXiv:2302.04985v1 [cs.CL])
    Existing models to extract temporal relations between events lack a principled method to incorporate external knowledge. In this study, we introduce Bayesian-Trans, a Bayesian learning-based method that models the temporal relation representations as latent variables and infers their values via Bayesian inference and translational functions. Compared to conventional neural approaches, instead of performing point estimation to find the best set parameters, the proposed model infers the parameters' posterior distribution directly, enhancing the model's capability to encode and express uncertainty about the predictions. Experimental results on the three widely used datasets show that Bayesian-Trans outperforms existing approaches for event temporal relation extraction. We additionally present detailed analyses on uncertainty quantification, comparison of priors, and ablation studies, illustrating the benefits of the proposed approach.
    DNArch: Learning Convolutional Neural Architectures by Backpropagation. (arXiv:2302.05400v1 [cs.LG])
    We present Differentiable Neural Architectures (DNArch), a method that jointly learns the weights and the architecture of Convolutional Neural Networks (CNNs) by backpropagation. In particular, DNArch allows learning (i) the size of convolutional kernels at each layer, (ii) the number of channels at each layer, (iii) the position and values of downsampling layers, and (iv) the depth of the network. To this end, DNArch views neural architectures as continuous multidimensional entities, and uses learnable differentiable masks along each dimension to control their size. Unlike existing methods, DNArch is not limited to a predefined set of possible neural components, but instead it is able to discover entire CNN architectures across all combinations of kernel sizes, widths, depths and downsampling. Empirically, DNArch finds performant CNN architectures for several classification and dense prediction tasks on both sequential and image data. When combined with a loss term that considers the network complexity, DNArch finds powerful architectures that respect a predefined computational budget.
    Archaeological Sites Detection with a Human-AI Collaboration Workflow. (arXiv:2302.05286v1 [cs.CV])
    This paper illustrates the results obtained by using pre-trained semantic segmentation deep learning models for the detection of archaeological sites within the Mesopotamian floodplains environment. The models were fine-tuned using openly available satellite imagery and vector shapes coming from a large corpus of annotations (i.e., surveyed sites). A randomized test showed that the best model reaches a detection accuracy in the neighborhood of 80%. Integrating domain expertise was crucial to define how to build the dataset and how to evaluate the predictions, since defining if a proposed mask counts as a prediction is very subjective. Furthermore, even an inaccurate prediction can be useful when put into context and interpreted by a trained archaeologist. Coming from these considerations we close the paper with a vision for a Human-AI collaboration workflow. Starting with an annotated dataset that is refined by the human expert we obtain a model whose predictions can either be combined to create a heatmap, to be overlaid on satellite and/or aerial imagery, or alternatively can be vectorized to make further analysis in a GIS software easier and automatic. In turn, the archaeologists can analyze the predictions, organize their onsite surveys, and refine the dataset with new, corrected, annotation
    Attention-based Domain Adaption Forecasting of Streamflow in Data Sparse Regions. (arXiv:2302.05386v1 [cs.LG])
    Streamflow forecasts are critical to guide water resource management, mitigate drought and flood effects, and develop climate-smart infrastructure and industries. Many global regions, however, have limited streamflow observations to guide evidence-based management strategies. In this paper, we propose an attention-based domain adaptation streamflow forecaster for data-sparse regions. Our approach leverages the hydrological characteristics of a data-rich source domain to induce effective 24h lead-time streamflow prediction in a limited target domain. Specifically, we employ a deep-learning framework leveraging domain adaptation techniques to simultaneously train streamflow predictions and discern between both domains using an adversarial method. Experiments against baseline cross-domain forecasting models show improved performance for 24h lead-time streamflow forecasting.
    Efficient Propagation of Uncertainty via Reordering Monte Carlo Samples. (arXiv:2302.04945v1 [cs.LG])
    Uncertainty analysis in the outcomes of model predictions is a key element in decision-based material design to establish confidence in the models and evaluate the fidelity of models. Uncertainty Propagation (UP) is a technique to determine model output uncertainties based on the uncertainty in its input variables. The most common and simplest approach to propagate the uncertainty from a model inputs to its outputs is by feeding a large number of samples to the model, known as Monte Carlo (MC) simulation which requires exhaustive sampling from the input variable distributions. However, MC simulations are impractical when models are computationally expensive. In this work, we investigate the hypothesis that while all samples are useful on average, some samples must be more useful than others. Thus, reordering MC samples and propagating more useful samples can lead to enhanced convergence in statistics of interest earlier and thus, reducing the computational burden of UP process. Here, we introduce a methodology to adaptively reorder MC samples and show how it results in reduction of computational expense of UP processes.
    On counterfactual inference with unobserved confounding. (arXiv:2211.08209v2 [cs.LG] UPDATED)
    Given an observational study with $n$ independent but heterogeneous units, our goal is to learn the counterfactual distribution for each unit using only one $p$-dimensional sample per unit containing covariates, interventions, and outcomes. Specifically, we allow for unobserved confounding that introduces statistical biases between interventions and outcomes as well as exacerbates the heterogeneity across units. Modeling the underlying joint distribution as an exponential family, we reduce learning the unit-level counterfactual distributions to learning $n$ exponential family distributions with heterogeneous parameters and only one sample per distribution. We introduce a convex objective that pools all $n$ samples to jointly learn all $n$ parameter vectors, and provide a unit-wise mean squared error bound that scales linearly with the metric entropy of the parameter space. For example, when the parameters are $s$-sparse linear combination of $k$ known vectors, the error is $O(s\log k/p)$. En route, we derive sufficient conditions for compactly supported distributions to satisfy the logarithmic Sobolev inequality. As an application of the framework, our results enable consistent imputation of sparsely missing covariates.
    Modeling Volatility and Dependence of European Carbon and Energy Prices. (arXiv:2208.14311v4 [q-fin.ST] UPDATED)
    We study the prices of European Emission Allowances (EUA), whereby we analyze their uncertainty and dependencies on related energy prices (natural gas, coal, and oil). We propose a probabilistic multivariate conditional time series model with a VECM-Copula-GARCH structure which exploits key characteristics of the data. Data are normalized with respect to inflation and carbon emissions to allow for proper cross-series evaluation. The forecasting performance is evaluated in an extensive rolling-window forecasting study, covering eight years out-of-sample. We discuss our findings for both levels- and log-transformed data, focusing on time-varying correlations, and in view of the Russian invasion of Ukraine.
    What is Wrong with Continual Learning in Medical Image Segmentation?. (arXiv:2010.11008v2 [cs.CV] UPDATED)
    Continual learning protocols are attracting increasing attention from the medical imaging community. In continual environments, datasets acquired under different conditions arrive sequentially; and each is only available for a limited period of time. Given the inherent privacy risks associated with medical data, this setup reflects the reality of deployment for deep learning diagnostic radiology systems. Many techniques exist to learn continuously for image classification, and several have been adapted to semantic segmentation. Yet most struggle to accumulate knowledge in a meaningful manner. Instead, they focus on preventing the problem of catastrophic forgetting, even when this reduces model plasticity and thereon burdens the training process. This puts into question whether the additional overhead of knowledge preservation is worth it - particularly for medical image segmentation, where computation requirements are already high - or if maintaining separate models would be a better solution. We propose UNEG, a simple and widely applicable multi-model benchmark that maintains separate segmentation and autoencoder networks for each training stage. The autoencoder is built from the same architecture as the segmentation network, which in our case is a full-resolution nnU-Net, to bypass any additional design decisions. During inference, the reconstruction error is used to select the most appropriate segmenter for each test image. Open this concept, we develop a fair evaluation scheme for different continual learning settings that moves beyond the prevention of catastrophic forgetting. Our results across three regions of interest (prostate, hippocampus, and right ventricle) show that UNEG outperforms several continual learning methods, reinforcing the need for strong baselines in continual learning research.
    Numerical Methods For PDEs Over Manifolds Using Spectral Physics Informed Neural Networks. (arXiv:2302.05322v1 [cs.LG])
    We introduce an approach for solving PDEs over manifolds using physics informed neural networks whose architecture aligns with spectral methods. The networks are trained to take in as input samples of an initial condition, a time stamp and point(s) on the manifold and then output the solution's value at the given time and point(s). We provide proofs of our method for the heat equation on the interval and examples of unique network architectures that are adapted to nonlinear equations on the sphere and the torus. We also show that our spectral-inspired neural network architectures outperform the standard physics informed architectures. Our extensive experimental results include generalization studies where the testing dataset of initial conditions is randomly sampled from a significantly larger space than the training set.
    CVTT: Cross-Validation Through Time. (arXiv:2205.05393v2 [cs.LG] UPDATED)
    The evaluation of recommender systems from a practical perspective is a topic of ongoing discourse within the research community. While many current evaluation methods reduce performance to a single value metric as an easy way to compare models, it relies on the assumption that the methods' performance remains constant over time. In this study, we examine this assumption and propose the Cross-Validation Thought Time (CVTT) technique as a more comprehensive evaluation method, focusing on model performance over time. By utilizing the proposed technique, we conduct an in-depth analysis of the performance of popular RecSys algorithms. Our findings indicate that (1) the performance of the recommenders varies over time for all reviewed datasets, (2) using simple evaluation approaches can lead to a substantial decrease in performance in real-world evaluation scenarios, and (3) excessive data usage can lead to suboptimal results.
    Scaling Vision Transformers to 22 Billion Parameters. (arXiv:2302.05442v1 [cs.CV])
    The scaling of Transformers has driven breakthrough capabilities for language models. At present, the largest large language models (LLMs) contain upwards of 100B parameters. Vision Transformers (ViT) have introduced the same architecture to image and video modelling, but these have not yet been successfully scaled to nearly the same degree; the largest dense ViT contains 4B parameters (Chen et al., 2022). We present a recipe for highly efficient and stable training of a 22B-parameter ViT (ViT-22B) and perform a wide variety of experiments on the resulting model. When evaluated on downstream tasks (often with a lightweight linear model on frozen features), ViT-22B demonstrates increasing performance with scale. We further observe other interesting benefits of scale, including an improved tradeoff between fairness and performance, state-of-the-art alignment to human visual perception in terms of shape/texture bias, and improved robustness. ViT-22B demonstrates the potential for "LLM-like" scaling in vision, and provides key steps towards getting there.
    STERLING: Synergistic Representation Learning on Bipartite Graphs. (arXiv:2302.05428v1 [cs.LG])
    The bipartite graph is a powerful data structure for modeling interactions between two types of nodes, of which a fundamental challenge is how to extract informative node embeddings. Self-Supervised Learning (SSL) is a promising paradigm to address this challenge. Most recent bipartite graph SSL methods are based on contrastive learning which learns embeddings by discriminating positive and negative node pairs. Contrastive learning usually requires a large number of negative node pairs, which could lead to computational burden and semantic errors. In this paper, we introduce a novel synergistic representation learning model (STERLING) to learn node embeddings without negative node pairs. STERLING preserves the unique synergies in bipartite graphs. The local and global synergies are captured by maximizing the similarity of the inter-type and intra-type positive node pairs, and maximizing the mutual information of co-clusters respectively. Theoretical analysis demonstrates that STERLING could preserve the synergies in the embedding space. Extensive empirical evaluation on various benchmark datasets and tasks demonstrates the effectiveness of STERLING for extracting node embeddings.
    Query Processing on Tensor Computation Runtimes. (arXiv:2203.01877v4 [cs.DB] UPDATED)
    The huge demand for computation in artificial intelligence (AI) is driving unparalleled investments in hardware and software systems for AI. This leads to an explosion in the number of specialized hardware devices, which are now offered by major cloud vendors. By hiding the low-level complexity through a tensor-based interface, tensor computation runtimes (TCRs) such as PyTorch allow data scientists to efficiently exploit the exciting capabilities offered by the new hardware. In this paper, we explore how database management systems can ride the wave of innovation happening in the AI space. We design, build, and evaluate Tensor Query Processor (TQP): TQP transforms SQL queries into tensor programs and executes them on TCRs. TQP is able to run the full TPC-H benchmark by implementing novel algorithms for relational operators on the tensor routines. At the same time, TQP can support various hardware while only requiring a fraction of the usual development effort. Experiments show that TQP can improve query execution time by up to 10$\times$ over specialized CPU- and GPU-only systems. Finally, TQP can accelerate queries mixing ML predictions and SQL end-to-end, and deliver up to 9$\times$ speedup over CPU baselines.
    Optimizing Serially Concatenated Neural Codes with Classical Decoders. (arXiv:2212.10355v2 [cs.IT] UPDATED)
    For improving short-length codes, we demonstrate that classic decoders can also be used with real-valued, neural encoders, i.e., deep-learning based codeword sequence generators. Here, the classical decoder can be a valuable tool to gain insights into these neural codes and shed light on weaknesses. Specifically, the turbo-autoencoder is a recently developed channel coding scheme where both encoder and decoder are replaced by neural networks. We first show that the limited receptive field of convolutional neural network (CNN)-based codes enables the application of the BCJR algorithm to optimally decode them with feasible computational complexity. These maximum a posteriori (MAP) component decoders then are used to form classical (iterative) turbo decoders for parallel or serially concatenated CNN encoders, offering a close-to-maximum likelihood (ML) decoding of the learned codes. To the best of our knowledge, this is the first time that a classical decoding algorithm is applied to a non-trivial, real-valued neural code. Furthermore, as the BCJR algorithm is fully differentiable, it is possible to train, or fine-tune, the neural encoder in an end-to-end fashion.
    Discovering Sparse Representations of Lie Groups with Machine Learning. (arXiv:2302.05383v1 [hep-ph])
    Recent work has used deep learning to derive symmetry transformations, which preserve conserved quantities, and to obtain the corresponding algebras of generators. In this letter, we extend this technique to derive sparse representations of arbitrary Lie algebras. We show that our method reproduces the canonical (sparse) representations of the generators of the Lorentz group, as well as the $U(n)$ and $SU(n)$ families of Lie groups. This approach is completely general and can be used to find the infinitesimal generators for any Lie group.
    Heterogeneous Graph Masked Autoencoders. (arXiv:2208.09957v2 [cs.LG] UPDATED)
    Generative self-supervised learning (SSL), especially masked autoencoders, has become one of the most exciting learning paradigms and has shown great potential in handling graph data. However, real-world graphs are always heterogeneous, which poses three critical challenges that existing methods ignore: 1) how to capture complex graph structure? 2) how to incorporate various node attributes? and 3) how to encode different node positions? In light of this, we study the problem of generative SSL on heterogeneous graphs and propose HGMAE, a novel heterogeneous graph masked autoencoder model to address these challenges. HGMAE captures comprehensive graph information via two innovative masking techniques and three unique training strategies. In particular, we first develop metapath masking and adaptive attribute masking with dynamic mask rate to enable effective and stable learning on heterogeneous graphs. We then design several training strategies including metapath-based edge reconstruction to adopt complex structural information, target attribute restoration to incorporate various node attributes, and positional feature prediction to encode node positional information. Extensive experiments demonstrate that HGMAE outperforms both contrastive and generative state-of-the-art baselines on several tasks across multiple datasets. Codes are available at https://github.com/meettyj/HGMAE.
    Forward Learning with Top-Down Feedback: Empirical and Analytical Characterization. (arXiv:2302.05440v1 [cs.LG])
    "Forward-only" algorithms, which train neural networks while avoiding a backward pass, have recently gained attention as a way of solving the biologically unrealistic aspects of backpropagation. Here, we first discuss the similarities between two "forward-only" algorithms, the Forward-Forward and PEPITA frameworks, and demonstrate that PEPITA is equivalent to a Forward-Forward with top-down feedback connections. Then, we focus on PEPITA to address compelling challenges related to the "forward-only" rules, which include providing an analytical understanding of their dynamics and reducing the gap between their performance and that of backpropagation. We propose a theoretical analysis of the dynamics of PEPITA. In particular, we show that PEPITA is well-approximated by an "adaptive-feedback-alignment" algorithm and we analytically track its performance during learning in a prototype high-dimensional setting. Finally, we develop a strategy to apply the weight mirroring algorithm on "forward-only" algorithms with top-down feedback and we show how it impacts PEPITA's accuracy and convergence rate.
    Gamma-convergence of a nonlocal perimeter arising in adversarial machine learning. (arXiv:2211.15223v3 [math.AP] UPDATED)
    In this paper we prove Gamma-convergence of a nonlocal perimeter of Minkowski type to a local anisotropic perimeter. The nonlocal model describes the regularizing effect of adversarial training in binary classifications. The energy essentially depends on the interaction between two distributions modelling likelihoods for the associated classes. We overcome typical strict regularity assumptions for the distributions by only assuming that they have bounded $BV$ densities. In the natural topology coming from compactness, we prove Gamma-convergence to a weighted perimeter with weight determined by an anisotropic function of the two densities. Despite being local, this sharp interface limit reflects classification stability with respect to adversarial perturbations. We further apply our results to deduce Gamma-convergence of the associated total variations, to study the asymptotics of adversarial training, and to prove Gamma-convergence of graph discretizations for the nonlocal perimeter.
    The Role of Codeword-to-Class Assignments in Error-Correcting Codes: An Empirical Study. (arXiv:2302.05334v1 [cs.LG])
    Error-correcting codes (ECC) are used to reduce multiclass classification tasks to multiple binary classification subproblems. In ECC, classes are represented by the rows of a binary matrix, corresponding to codewords in a codebook. Codebooks are commonly either predefined or problem dependent. Given predefined codebooks, codeword-to-class assignments are traditionally overlooked, and codewords are implicitly assigned to classes arbitrarily. Our paper shows that these assignments play a major role in the performance of ECC. Specifically, we examine similarity-preserving assignments, where similar codewords are assigned to similar classes. Addressing a controversy in existing literature, our extensive experiments confirm that similarity-preserving assignments induce easier subproblems and are superior to other assignment policies in terms of their generalization performance. We find that similarity-preserving assignments make predefined codebooks become problem-dependent, without altering other favorable codebook properties. Finally, we show that our findings can improve predefined codebooks dedicated to extreme classification.
    DevFormer: A Symmetric Transformer for Context-Aware Device Placement. (arXiv:2205.13225v2 [cs.LG] UPDATED)
    In this paper, we present DevFormer, a novel transformer-based architecture for addressing the complex and computationally demanding problem of hardware design optimization. Despite the demonstrated efficacy of transformers in domains including natural language processing and computer vision, their use in hardware design has been limited by the scarcity of offline data. Our approach addresses this limitation by introducing strong inductive biases such as relative positional embeddings and action-permutation symmetricity that effectively capture the hardware context and enable efficient design optimization with limited offline data. We apply DevFormer to the problem of decoupling capacitor placement and show that it outperforms state-of-the-art methods in both simulated and real hardware, leading to improved performances while reducing the number of components by more than 30%. Finally, we show that our approach achieves promising results in other offline contextual learning-based combinatorial optimization tasks.
    PLOT: Prompt Learning with Optimal Transport for Vision-Language Models. (arXiv:2210.01253v2 [cs.CV] UPDATED)
    With the increasing attention to large vision-language models such as CLIP, there has been a significant amount of effort dedicated to building efficient prompts. Unlike conventional methods of only learning one single prompt, we propose to learn multiple comprehensive prompts to describe diverse characteristics of categories such as intrinsic attributes or extrinsic contexts. However, directly matching each prompt to the same visual feature is problematic, as it pushes the prompts to converge to one point. To solve this problem, we propose to apply optimal transport to match the vision and text modalities. Specifically, we first model images and the categories with visual and textual feature sets. Then, we apply a two-stage optimization strategy to learn the prompts. In the inner loop, we optimize the optimal transport distance to align visual features and prompts by the Sinkhorn algorithm, while in the outer loop, we learn the prompts by this distance from the supervised data. Extensive experiments are conducted on the few-shot recognition task and the improvement demonstrates the superiority of our method. The code is available at https://github.com/CHENGY12/PLOT.
    Predicting the cardinality of a reduced Gr\"obner basis. (arXiv:2302.05364v1 [math.AC])
    We use ansatz neural network models to predict key metrics of complexity for Gr\"obner bases of binomial ideals. This work illustrates why predictions with neural networks from Gr\"obner computations are not a straightforward process. Using two probabilistic models for random binomial ideals, we generate and make available a large data set that is able to capture sufficient variability in Gr\"obner complexity. We use this data to train neural networks and predict the cardinality of a reduced Gr\"obner basis and the maximum total degree of its elements. While the cardinality prediction problem is unlike classical problems tackled by machine learning, our simulations show that neural networks, providing performance statistics such as $r^2 = 0.401$, outperform naive guess or multiple regression models with $r^2 = 0.180$.
    A Practical Mixed Precision Algorithm for Post-Training Quantization. (arXiv:2302.05397v1 [cs.LG])
    Neural network quantization is frequently used to optimize model size, latency and power consumption for on-device deployment of neural networks. In many cases, a target bit-width is set for an entire network, meaning every layer get quantized to the same number of bits. However, for many networks some layers are significantly more robust to quantization noise than others, leaving an important axis of improvement unused. As many hardware solutions provide multiple different bit-width settings, mixed-precision quantization has emerged as a promising solution to find a better performance-efficiency trade-off than homogeneous quantization. However, most existing mixed precision algorithms are rather difficult to use for practitioners as they require access to the training data, have many hyper-parameters to tune or even depend on end-to-end retraining of the entire model. In this work, we present a simple post-training mixed precision algorithm that only requires a small unlabeled calibration dataset to automatically select suitable bit-widths for each layer for desirable on-device performance. Our algorithm requires no hyper-parameter tuning, is robust to data variation and takes into account practical hardware deployment constraints making it a great candidate for practical use. We experimentally validate our proposed method on several computer vision tasks, natural language processing tasks and many different networks, and show that we can find mixed precision networks that provide a better trade-off between accuracy and efficiency than their homogeneous bit-width equivalents.
    Aerial View Goal Localization with Reinforcement Learning. (arXiv:2209.03694v2 [cs.CV] UPDATED)
    Climate-induced disasters are and will continue to be on the rise, and thus search-and-rescue (SAR) operations, where the task is to localize and assist one or several people who are missing, become increasingly relevant. In many cases the rough location may be known and a UAV can be deployed to explore a given, confined area to precisely localize the missing people. Due to time and battery constraints it is often critical that localization is performed as efficiently as possible. In this work we approach this type of problem by abstracting it as an aerial view goal localization task in a framework that emulates a SAR-like setup without requiring access to actual UAVs. In this framework, an agent operates on top of an aerial image (proxy for a search area) and is tasked with localizing a goal that is described in terms of visual cues. To further mimic the situation on an actual UAV, the agent is not able to observe the search area in its entirety, not even at low resolution, and thus it has to operate solely based on partial glimpses when navigating towards the goal. To tackle this task, we propose AiRLoc, a reinforcement learning (RL)-based model that decouples exploration (searching for distant goals) and exploitation (localizing nearby goals). Extensive evaluations show that AiRLoc outperforms heuristic search methods as well as alternative learnable approaches, and that it generalizes across datasets, e.g. to disaster-hit areas without seeing a single disaster scenario during training. We also conduct a proof-of-concept study which indicates that the learnable methods outperform humans on average. Code and models have been made publicly available at https://github.com/aleksispi/airloc.
    SE(3)-Equivariant Attention Networks for Shape Reconstruction in Function Space. (arXiv:2204.02394v2 [cs.CV] UPDATED)
    We propose a method for 3D shape reconstruction from unoriented point clouds. Our method consists of a novel SE(3)-equivariant coordinate-based network (TF-ONet), that parametrizes the occupancy field of the shape and respects the inherent symmetries of the problem. In contrast to previous shape reconstruction methods that align the input to a regular grid, we operate directly on the irregular point cloud. Our architecture leverages equivariant attention layers that operate on local tokens. This mechanism enables local shape modelling, a crucial property for scalability to large scenes. Given an unoriented, sparse, noisy point cloud as input, we produce equivariant features for each point. These serve as keys and values for the subsequent equivariant cross-attention blocks that parametrize the occupancy field. By querying an arbitrary point in space, we predict its occupancy score. We show that our method outperforms previous SO(3)-equivariant methods, as well as non-equivariant methods trained on SO(3)-augmented datasets. More importantly, local modelling together with SE(3)-equivariance create an ideal setting for SE(3) scene reconstruction. We show that by training only on single, aligned objects and without any pre-segmentation, we can reconstruct novel scenes containing arbitrarily many objects in random poses without any performance loss.
    Reinforcement Learning for Protocol Synthesis in Resource-Constrained Wireless Sensor and IoT Networks. (arXiv:2302.05300v1 [cs.NI])
    This article explores the concepts of online protocol synthesis using Reinforcement Learning (RL). The study is performed in the context of sensor and IoT networks with ultra low complexity wireless transceivers. The paper introduces the use of RL and Multi Armed Bandit (MAB), a specific type of RL, for Medium Access Control (MAC) under different network and traffic conditions. It then introduces a novel learning based protocol synthesis framework that addresses specific difficulties and limitations in medium access for both random access and time slotted networks. The mechanism does not rely on carrier sensing, network time-synchronization, collision detection, and other low level complex operations, thus making it ideal for ultra simple transceiver hardware used in resource constrained sensor and IoT networks. Additionally, the ability of independent protocol learning by the nodes makes the system robust and adaptive to the changes in network and traffic conditions. It is shown that the nodes can be trained to learn to avoid collisions, and to achieve network throughputs that are comparable to ALOHA based access protocols in sensor and IoT networks with simplest transceiver hardware. It is also shown that using RL, it is feasible to synthesize access protocols that can sustain network throughput at high traffic loads, which is not feasible in the ALOHA-based systems. The ability of the system to provide throughput fairness under network and traffic heterogeneities are also experimentally demonstrated.
    Improving Object-centric Learning with Query Optimization. (arXiv:2210.08990v2 [cs.CV] UPDATED)
    The ability to decompose complex natural scenes into meaningful object-centric abstractions lies at the core of human perception and reasoning. In the recent culmination of unsupervised object-centric learning, the Slot-Attention module has played an important role with its simple yet effective design and fostered many powerful variants. These methods, however, have been exceedingly difficult to train without supervision and are ambiguous in the notion of object, especially for complex natural scenes. In this paper, we propose to address these issues by investigating the potential of learnable queries as initializations for Slot-Attention learning, uniting it with efforts from existing attempts on improving Slot-Attention learning with bi-level optimization. With simple code adjustments on Slot-Attention, our model, Bi-level Optimized Query Slot Attention, achieves state-of-the-art results on 3 challenging synthetic and 7 complex real-world datasets in unsupervised image segmentation and reconstruction, outperforming previous baselines by a large margin. We provide thorough ablative studies to validate the necessity and effectiveness of our design. Additionally, our model exhibits great potential for concept binding and zero-shot learning. Our work is made publicly available at https://bo-qsa.github.io
    Temporal Domain Generalization with Drift-Aware Dynamic Neural Networks. (arXiv:2205.10664v3 [cs.LG] UPDATED)
    Temporal domain generalization is a promising yet extremely challenging area where the goal is to learn models under temporally changing data distributions and generalize to unseen data distributions following the trends of the change. The advancement of this area is challenged by: 1) characterizing data distribution drift and its impacts on models, 2) expressiveness in tracking the model dynamics, and 3) theoretical guarantee on the performance. To address them, we propose a Temporal Domain Generalization with Drift-Aware Dynamic Neural Network (DRAIN) framework. Specifically, we formulate the problem into a Bayesian framework that jointly models the relation between data and model dynamics. We then build a recurrent graph generation scenario to characterize the dynamic graph-structured neural networks learned across different time points. It captures the temporal drift of model parameters and data distributions and can predict models in the future without the presence of future data. In addition, we explore theoretical guarantees of the model performance under the challenging temporal DG setting and provide theoretical analysis, including uncertainty and generalization error. Finally, extensive experiments on several real-world benchmarks with temporal drift demonstrate the effectiveness and efficiency of the proposed method.
    Interventional Causal Representation Learning. (arXiv:2209.11924v3 [stat.ML] UPDATED)
    Causal representation learning seeks to extract high-level latent factors from low-level sensory data. Most existing methods rely on observational data and structural assumptions (e.g., conditional independence) to identify the latent factors. However, interventional data is prevalent across applications. Can interventional data facilitate causal representation learning? We explore this question in this paper. The key observation is that interventional data often carries geometric signatures of the latent factors' support (i.e. what values each latent can possibly take). For example, when the latent factors are causally connected, interventions can break the dependency between the intervened latents' support and their ancestors'. Leveraging this fact, we prove that the latent causal factors can be identified up to permutation and scaling given data from perfect $do$ interventions. Moreover, we can achieve block affine identification, namely the estimated latent factors are only entangled with a few other latents if we have access to data from imperfect interventions. These results highlight the unique power of interventional data in causal representation learning; they can enable provable identification of latent factors without any assumptions about their distributions or dependency structure.
    Trustworthy AI Inference Systems: An Industry Research View. (arXiv:2008.04449v2 [cs.CR] UPDATED)
    In this work, we provide an industry research view for approaching the design, deployment, and operation of trustworthy Artificial Intelligence (AI) inference systems. Such systems provide customers with timely, informed, and customized inferences to aid their decision, while at the same time utilizing appropriate security protection mechanisms for AI models. Additionally, such systems should also use Privacy-Enhancing Technologies (PETs) to protect customers' data at any time. To approach the subject, we start by introducing current trends in AI inference systems. We continue by elaborating on the relationship between Intellectual Property (IP) and private data protection in such systems. Regarding the protection mechanisms, we survey the security and privacy building blocks instrumental in designing, building, deploying, and operating private AI inference systems. For example, we highlight opportunities and challenges in AI systems using trusted execution environments combined with more recent advances in cryptographic techniques to protect data in use. Finally, we outline areas of further development that require the global collective attention of industry, academia, and government researchers to sustain the operation of trustworthy AI inference systems.
    Dive into Deep Learning. (arXiv:2106.11342v4 [cs.LG] UPDATED)
    This open-source book represents our attempt to make deep learning approachable, teaching readers the concepts, the context, and the code. The entire book is drafted in Jupyter notebooks, seamlessly integrating exposition figures, math, and interactive examples with self-contained code. Our goal is to offer a resource that could (i) be freely available for everyone; (ii) offer sufficient technical depth to provide a starting point on the path to actually becoming an applied machine learning scientist; (iii) include runnable code, showing readers how to solve problems in practice; (iv) allow for rapid updates, both by us and also by the community at large; (v) be complemented by a forum for interactive discussion of technical details and to answer questions.
    Removing Structured Noise with Diffusion Models. (arXiv:2302.05290v1 [cs.LG])
    Solving ill-posed inverse problems requires careful formulation of prior beliefs over the signals of interest and an accurate description of their manifestation into noisy measurements. Handcrafted signal priors based on e.g. sparsity are increasingly replaced by data-driven deep generative models, and several groups have recently shown that state-of-the-art score-based diffusion models yield particularly strong performance and flexibility. In this paper, we show that the powerful paradigm of posterior sampling with diffusion models can be extended to include rich, structured, noise models. To that end, we propose a joint conditional reverse diffusion process with learned scores for the noise and signal-generating distribution. We demonstrate strong performance gains across various inverse problems with structured noise, outperforming competitive baselines that use normalizing flows and adversarial networks. This opens up new opportunities and relevant practical applications of diffusion modeling for inverse problems in the context of non-Gaussian measurements.
    Achieving Linear Speedup in Non-IID Federated Bilevel Learning. (arXiv:2302.05412v1 [cs.LG])
    Federated bilevel optimization has received increasing attention in various emerging machine learning and communication applications. Recently, several Hessian-vector-based algorithms have been proposed to solve the federated bilevel optimization problem. However, several important properties in federated learning such as the partial client participation and the linear speedup for convergence (i.e., the convergence rate and complexity are improved linearly with respect to the number of sampled clients) in the presence of non-i.i.d.~datasets, still remain open. In this paper, we fill these gaps by proposing a new federated bilevel algorithm named FedMBO with a novel client sampling scheme in the federated hypergradient estimation. We show that FedMBO achieves a convergence rate of $\mathcal{O}\big(\frac{1}{\sqrt{nK}}+\frac{1}{K}+\frac{\sqrt{n}}{K^{3/2}}\big)$ on non-i.i.d.~datasets, where $n$ is the number of participating clients in each round, and $K$ is the total number of iteration. This is the first theoretical linear speedup result for non-i.i.d.~federated bilevel optimization. Extensive experiments validate our theoretical results and demonstrate the effectiveness of our proposed method.
    Structured Prediction Problem Archive. (arXiv:2202.03574v3 [cs.LG] UPDATED)
    Structured prediction problems are one of the fundamental tools in machine learning. In order to facilitate algorithm development for their numerical solution, we collect in one place a large number of datasets in easy to read formats for a diverse set of problem classes. We provide archival links to datasets, description of the considered problems and problem formats, and a short summary of problem characteristics including size, number of instances etc. For reference we also give a non-exhaustive selection of algorithms proposed in the literature for their solution. We hope that this central repository will make benchmarking and comparison to established works easier. We welcome submission of interesting new datasets and algorithms for inclusion in our archive.
    Toric Geometry of Entropic Regularization. (arXiv:2202.01571v2 [math.OC] UPDATED)
    Entropic regularization is a method for large-scale linear programming. Geometrically, one traces intersections of the feasible polytope with scaled toric varieties, starting at the Birch point. We compare this to log-barrier methods, with reciprocal linear spaces, starting at the analytic center. We revisit entropic regularization for unbalanced optimal transport, and we develop the use of optimal conic couplings. We compute the degree of the associated toric variety, and we explore algorithms like iterative scaling.
    Optimization of Convolutional Neural Network Using the Linearly Decreasing Weight Particle Swarm Optimization. (arXiv:2001.05670v3 [cs.NE] UPDATED)
    Convolutional neural network (CNN) is one of the most frequently used deep learning techniques. Various forms of models have been proposed and im-proved for learning at CNN. When learning with CNN, it is necessary to determine the optimal hyperparameters. However, the number of hyperparameters is so large that it is difficult to do it manually, so much research has been done on automation. A method that uses metaheuristic algorithms is attracting attention in research on hyperparameter optimization. Metaheuristic algorithms are naturally inspired and include evolution strategies, genetic algorithms, antcolony optimization and particle swarm optimization. In particular, particle swarm optimization converges faster than genetic algorithms, and various models have been proposed. In this paper, we pro-pose CNN hyperparameter optimization with linearly decreasing weight particle swarm optimization (LDWPSO). In the experiment, the MNIST data set and CIFAR-10 data set, which are often used as benchmark data sets, are used. By opti-mizing CNN hyperparameters with LDWPSO, learning the MNIST and CIFAR-10 datasets, we compare the accuracy with a standard CNN based on LeNet-5. As a result, when using the MNIST dataset, the baseline CNN is 94.02% at the 5th epoch, compared to 98.95% for LDWPSO CNN, which improves accuracy. When using the CIFAR-10 dataset, the Baseline CNN is 28.07% at the 10th epoch, compared to 69.37% for the LDWPSO CNN, which greatly improves accuracy. This paper is presented at the 36th Annual Conference of the Japanese Society for Artificial In-telligence. The final version is available at the following URL: https://www.jst-age.jst.go.jp/article/pjsai/JSAI2022/0/JSAI2022_2S4IS2b03/_article/-char/en
    On Average-Case Error Bounds for Kernel-Based Bayesian Quadrature. (arXiv:2202.10615v2 [stat.ML] UPDATED)
    In this paper, we study error bounds for {\em Bayesian quadrature} (BQ), with an emphasis on noisy settings, randomized algorithms, and average-case performance measures. We seek to approximate the integral of functions in a {\em Reproducing Kernel Hilbert Space} (RKHS), particularly focusing on the Mat\'ern-$\nu$ and squared exponential (SE) kernels, with samples from the function potentially being corrupted by Gaussian noise. We provide a two-step meta-algorithm that serves as a general tool for relating the average-case quadrature error with the $L^2$-function approximation error. When specialized to the Mat\'ern kernel, we recover an existing near-optimal error rate while avoiding the existing method of repeatedly sampling points. When specialized to other settings, we obtain new average-case results for settings including the SE kernel with noise and the Mat\'ern kernel with misspecification. Finally, we present algorithm-independent lower bounds that have greater generality and/or give distinct proofs compared to existing ones.
    Multilingual Normalization of Temporal Expressions with Masked Language Models. (arXiv:2205.10399v2 [cs.CL] UPDATED)
    The detection and normalization of temporal expressions is an important task and preprocessing step for many applications. However, prior work on normalization is rule-based, which severely limits the applicability in real-world multilingual settings, due to the costly creation of new rules. We propose a novel neural method for normalizing temporal expressions based on masked language modeling. Our multilingual method outperforms prior rule-based systems in many languages, and in particular, for low-resource languages with performance improvements of up to 33 F1 on average compared to the state of the art.
    How Biased are Your Features?: Computing Fairness Influence Functions with Global Sensitivity Analysis. (arXiv:2206.00667v2 [cs.LG] UPDATED)
    Fairness in machine learning has attained significant focus due to the widespread application in high-stake decision-making tasks. Unregulated machine learning classifiers can exhibit bias towards certain demographic groups in data, thus the quantification and mitigation of classifier bias is a central concern in fairness in machine learning. In this paper, we aim to quantify the influence of different features in a dataset on the bias of a classifier. To do this, we introduce the Fairness Influence Function (FIF). This function breaks down bias into its components among individual features and the intersection of multiple features. The key idea is to represent existing group fairness metrics as the difference of the scaled conditional variances in the classifier's prediction and apply a decomposition of variance according to global sensitivity analysis. To estimate FIFs, we instantiate an algorithm FairXplainer that applies variance decomposition of classifier's prediction following local regression. Experiments demonstrate that FairXplainer captures FIFs of individual feature and intersectional features, provides a better approximation of bias based on FIFs, demonstrates higher correlation of FIFs with fairness interventions, and detects changes in bias due to fairness affirmative/punitive actions in the classifier.
    Span-based Named Entity Recognition by Generating and Compressing Information. (arXiv:2302.05392v1 [cs.CL])
    The information bottleneck (IB) principle has been proven effective in various NLP applications. The existing work, however, only used either generative or information compression models to improve the performance of the target task. In this paper, we propose to combine the two types of IB models into one system to enhance Named Entity Recognition (NER). For one type of IB model, we incorporate two unsupervised generative components, span reconstruction and synonym generation, into a span-based NER system. The span reconstruction ensures that the contextualised span representation keeps the span information, while the synonym generation makes synonyms have similar representations even in different contexts. For the other type of IB model, we add a supervised IB layer that performs information compression into the system to preserve useful features for NER in the resulting span representations. Experiments on five different corpora indicate that jointly training both generative and information compression models can enhance the performance of the baseline span-based NER system. Our source code is publicly available at https://github.com/nguyennth/joint-ib-models.
    Should You Mask 15% in Masked Language Modeling?. (arXiv:2202.08005v3 [cs.CL] UPDATED)
    Masked language models (MLMs) conventionally mask 15% of tokens due to the belief that more masking would leave insufficient context to learn good representations; this masking rate has been widely used, regardless of model sizes or masking strategies. In this work, we revisit this important choice of MLM pre-training. We first establish that 15% is not universally optimal, and larger models should adopt a higher masking rate. Specifically, we find that masking 40% outperforms 15% for BERT-large size models on GLUE and SQuAD. Interestingly, an extremely high masking rate of 80% can still preserve 95% fine-tuning performance and most of the accuracy in linguistic probing, challenging the conventional wisdom about the role of the masking rate. We then examine the interplay between masking rates and masking strategies and find that uniform masking requires a higher masking rate compared to sophisticated masking strategies such as span or PMI masking. Finally, we argue that increasing the masking rate has two distinct effects: it leads to more corruption, which makes the prediction task more difficult; it also enables more predictions, which benefits optimization. Using this framework, we revisit BERT's 80-10-10 corruption strategy. Together, our results contribute to a better understanding of MLM pre-training.
    A Second-Order Method for Stochastic Bandit Convex Optimisation. (arXiv:2302.05371v1 [cs.LG])
    We introduce a simple and efficient algorithm for unconstrained zeroth-order stochastic convex bandits and prove its regret is at most $(1 + r/d)[d^{1.5} \sqrt{n} + d^3] polylog(n, d, r)$ where $n$ is the horizon, $d$ the dimension and $r$ is the radius of a known ball containing the minimiser of the loss.
    Squeeze Training for Adversarial Robustness. (arXiv:2205.11156v2 [cs.LG] UPDATED)
    The vulnerability of deep neural networks (DNNs) to adversarial examples has attracted great attention in the machine learning community. The problem is related to non-flatness and non-smoothness of normally obtained loss landscapes. Training augmented with adversarial examples (a.k.a., adversarial training) is considered as an effective remedy. In this paper, we highlight that some collaborative examples, nearly perceptually indistinguishable from both adversarial and benign examples yet show extremely lower prediction loss, can be utilized to enhance adversarial training. A novel method is therefore proposed to achieve new state-of-the-arts in adversarial robustness. Code: https://github.com/qizhangli/ST-AT.
    Fine-tuning Partition-aware Item Similarities for Efficient and Scalable Recommendation. (arXiv:2207.05959v2 [cs.IR] UPDATED)
    Collaborative filtering (CF) is widely searched in recommendation with various types of solutions. Recent success of Graph Convolution Networks (GCN) in CF demonstrates the effectiveness of modeling high-order relationships through graphs, while repetitive graph convolution and iterative batch optimization limit their efficiency. Instead, item similarity models attempt to construct direct relationships through efficient interaction encoding. Despite their great performance, the growing item numbers result in quadratic growth in similarity modeling process, posing critical scalability problems. In this paper, we investigate the graph sampling strategy adopted in latest GCN model for efficiency improving, and identify the potential item group structure in the sampled graph. Based on this, we propose a novel item similarity model which introduces graph partitioning to restrict the item similarity modeling within each partition. Specifically, we show that the spectral information of the original graph is well in preserving global-level information. Then, it is added to fine-tune local item similarities with a new data augmentation strategy acted as partition-aware prior knowledge, jointly to cope with the information loss brought by partitioning. Experiments carried out on 4 datasets show that the proposed model outperforms state-of-the-art GCN models with 10x speed-up and item similarity models with 95\% parameter storage savings.
    A novel corrective-source term approach to modeling unknown physics in aluminum extraction process. (arXiv:2209.10861v2 [cs.LG] UPDATED)
    With the ever-increasing availability of data, there has been an explosion of interest in applying modern machine learning methods to fields such as modeling and control. However, despite the flexibility and surprising accuracy of such black-box models, it remains difficult to trust them. Recent efforts to combine the two approaches aim to develop flexible models that nonetheless generalize well; a paradigm we call Hybrid Analysis and modeling (HAM). In this work we investigate the Corrective Source Term Approach (CoSTA), which uses a data-driven model to correct a misspecified physics-based model. This enables us to develop models that make accurate predictions even when the underlying physics of the problem is not well understood. We apply CoSTA to model the Hall-H\'eroult process in an aluminum electrolysis cell. We demonstrate that the method improves both accuracy and predictive stability, yielding an overall more trustworthy model.
    Communication-Efficient Federated Hypergradient Computation via Aggregated Iterative Differentiation. (arXiv:2302.04969v1 [cs.LG])
    Federated bilevel optimization has attracted increasing attention due to emerging machine learning and communication applications. The biggest challenge lies in computing the gradient of the upper-level objective function (i.e., hypergradient) in the federated setting due to the nonlinear and distributed construction of a series of global Hessian matrices. In this paper, we propose a novel communication-efficient federated hypergradient estimator via aggregated iterative differentiation (AggITD). AggITD is simple to implement and significantly reduces the communication cost by conducting the federated hypergradient estimation and the lower-level optimization simultaneously. We show that the proposed AggITD-based algorithm achieves the same sample complexity as existing approximate implicit differentiation (AID)-based approaches with much fewer communication rounds in the presence of data heterogeneity. Our results also shed light on the great advantage of ITD over AID in the federated/distributed hypergradient estimation. This differs from the comparison in the non-distributed bilevel optimization, where ITD is less efficient than AID. Our extensive experiments demonstrate the great effectiveness and communication efficiency of the proposed method.  ( 2 min )
    A SWAT-based Reinforcement Learning Framework for Crop Management. (arXiv:2302.04988v1 [cs.LG])
    Crop management involves a series of critical, interdependent decisions or actions in a complex and highly uncertain environment, which exhibit distinct spatial and temporal variations. Managing resource inputs such as fertilizer and irrigation in the face of climate change, dwindling supply, and soaring prices is nothing short of a Herculean task. The ability of machine learning to efficiently interrogate complex, nonlinear, and high-dimensional datasets can revolutionize decision-making in agriculture. In this paper, we introduce a reinforcement learning (RL) environment that leverages the dynamics in the Soil and Water Assessment Tool (SWAT) and enables management practices to be assessed and evaluated on a watershed level. This drastically saves time and resources that would have been otherwise deployed during a full-growing season. We consider crop management as an optimization problem where the objective is to produce higher crop yield while minimizing the use of external farming inputs (specifically, fertilizer and irrigation amounts). The problem is naturally subject to environmental factors such as precipitation, solar radiation, temperature, and soil water content. We demonstrate the utility of our framework by developing and benchmarking various decision-making agents following management strategies informed by standard farming practices and state-of-the-art RL algorithms.  ( 2 min )
    ChemVise: Maximizing Out-of-Distribution Chemical Detection with the Novel Application of Zero-Shot Learning. (arXiv:2302.04917v1 [cs.LG])
    Accurate chemical sensors are vital in medical, military, and home safety applications. Training machine learning models to be accurate on real world chemical sensor data requires performing many diverse, costly experiments in controlled laboratory settings to create a data set. In practice even expensive, large data sets may be insufficient for generalization of a trained model to a real-world testing distribution. Rather than perform greater numbers of experiments requiring exhaustive mixtures of chemical analytes, this research proposes learning approximations of complex exposures from training sets of simple ones by using single-analyte exposure signals as building blocks of a multiple-analyte space. We demonstrate this approach to synthetic sensor responses surprisingly improves the detection of out-of-distribution obscured chemical analytes. Further, we pair these synthetic signals to targets in an information-dense representation space utilizing a large corpus of chemistry knowledge. Through utilization of a semantically meaningful analyte representation spaces along with synthetic targets we achieve rapid analyte classification in the presence of obscurants without corresponding obscured-analyte training data. Transfer learning for supervised learning with molecular representations makes assumptions about the input data. Instead, we borrow from the natural language and natural image processing literature for a novel approach to chemical sensor signal classification using molecular semantics for arbitrary chemical sensor hardware designs.  ( 2 min )
    Short-Term Aggregated Residential Load Forecasting using BiLSTM and CNN-BiLSTM. (arXiv:2302.05033v1 [cs.LG])
    Higher penetration of renewable and smart home technologies at the residential level challenges grid stability as utility-customer interactions add complexity to power system operations. In response, short-term residential load forecasting has become an increasing area of focus. However, forecasting at the residential level is challenging due to the higher uncertainties involved. Recently deep neural networks have been leveraged to address this issue. This paper investigates the capabilities of a bidirectional long short-term memory (BiLSTM) and a convolutional neural network-based BiLSTM (CNN-BiLSTM) to provide a day ahead (24 hr.) forecasting at an hourly resolution while minimizing the root mean squared error (RMSE) between the actual and predicted load demand. Using a publicly available dataset consisting of 38 homes, the BiLSTM and CNN-BiLSTM models are trained to forecast the aggregated active power demand for each hour within a 24 hr. span, given the previous 24 hr. load data. The BiLSTM model achieved the lowest RMSE of 1.4842 for the overall daily forecast. In addition, standard LSTM and CNN-LSTM models are trained and compared with the BiLSTM architecture. The RMSE of BiLSTM is 5.60%, 2.85% and 2.60% lower than the LSTM, CNN-LSTM and CNN-BiLSTM models respectively. The source code of this work is available at https://github.com/Varat7v2/STLF-BiLSTM-CNNBiLSTM.git.  ( 2 min )
    Gaussian Process-Gated Hierarchical Mixtures of Experts. (arXiv:2302.04947v1 [cs.LG])
    In this paper, we propose novel Gaussian process-gated hierarchical mixtures of experts (GPHMEs) that are used for building gates and experts. Unlike in other mixtures of experts where the gating models are linear to the input, the gating functions of our model are inner nodes built with Gaussian processes based on random features that are non-linear and non-parametric. Further, the experts are also built with Gaussian processes and provide predictions that depend on test data. The optimization of the GPHMEs is carried out by variational inference. There are several advantages of the proposed GPHMEs. One is that they outperform tree-based HME benchmarks that partition the data in the input space. Another advantage is that they achieve good performance with reduced complexity. A third advantage of the GPHMEs is that they provide interpretability of deep Gaussian processes and more generally of deep Bayesian neural networks. Our GPHMEs demonstrate excellent performance for large-scale data sets even with quite modest sizes.  ( 2 min )
    Two-step counterfactual generation for OOD examples. (arXiv:2302.05196v1 [cs.LG])
    Two fundamental requirements for the deployment of machine learning models in safety-critical systems are to be able to detect out-of-distribution (OOD) data correctly and to be able to explain the prediction of the model. Although significant effort has gone into both OOD detection and explainable AI, there has been little work on explaining why a model predicts a certain data point is OOD. In this paper, we address this question by introducing the concept of an OOD counterfactual, which is a perturbed data point that iteratively moves between different OOD categories. We propose a method for generating such counterfactuals, investigate its application on synthetic and benchmark data, and compare it to several benchmark methods using a range of metrics.
    ShapeWordNet: An Interpretable Shapelet Neural Network for Physiological Signal Classification. (arXiv:2302.05021v1 [cs.LG])
    Physiological signals are high-dimensional time series of great practical values in medical and healthcare applications. However, previous works on its classification fail to obtain promising results due to the intractable data characteristics and the severe label sparsity issues. In this paper, we try to address these challenges by proposing a more effective and interpretable scheme tailored for the physiological signal classification task. Specifically, we exploit the time series shapelets to extract prominent local patterns and perform interpretable sequence discretization to distill the whole-series information. By doing so, the long and continuous raw signals are compressed into short and discrete token sequences, where both local patterns and global contexts are well preserved. Moreover, to alleviate the label sparsity issue, a multi-scale transformation strategy is adaptively designed to augment data and a cross-scale contrastive learning mechanism is accordingly devised to guide the model training. We name our method as ShapeWordNet and conduct extensive experiments on three real-world datasets to investigate its effectiveness. Comparative results show that our proposed scheme remarkably outperforms four categories of cutting-edge approaches. Visualization analysis further witnesses the good interpretability of the sequence discretization idea based on shapelets.
    Conceptual Views on Tree Ensemble Classifiers. (arXiv:2302.05270v1 [cs.LG])
    Random Forests and related tree-based methods are popular for supervised learning from table based data. Apart from their ease of parallelization, their classification performance is also superior. However, this performance, especially parallelizability, is offset by the loss of explainability. Statistical methods are often used to compensate for this disadvantage. Yet, their ability for local explanations, and in particular for global explanations, is limited. In the present work we propose an algebraic method, rooted in lattice theory, for the (global) explanation of tree ensembles. In detail, we introduce two novel conceptual views on tree ensemble classifiers and demonstrate their explanatory capabilities on Random Forests that were trained with standard parameters.
    Near-Optimal Experimental Design Under the Budget Constraint in Online Platforms. (arXiv:2302.05005v1 [cs.LG])
    A/B testing, or controlled experiments, is the gold standard approach to causally compare the performance of algorithms on online platforms. However, conventional Bernoulli randomization in A/B testing faces many challenges such as spillover and carryover effects. Our study focuses on another challenge, especially for A/B testing on two-sided platforms -- budget constraints. Buyers on two-sided platforms often have limited budgets, where the conventional A/B testing may be infeasible to be applied, partly because two variants of allocation algorithms may conflict and lead some buyers to exceed their budgets if they are implemented simultaneously. We develop a model to describe two-sided platforms where buyers have limited budgets. We then provide an optimal experimental design that guarantees small bias and minimum variance. Bias is lower when there is more budget and a higher supply-demand rate. We test our experimental design on both synthetic data and real-world data, which verifies the theoretical results and shows our advantage compared to Bernoulli randomization.
    Predicting Out-of-Distribution Error with Confidence Optimal Transport. (arXiv:2302.05018v1 [cs.LG])
    Out-of-distribution (OOD) data poses serious challenges in deployed machine learning models as even subtle changes could incur significant performance drops. Being able to estimate a model's performance on test data is important in practice as it indicates when to trust to model's decisions. We present a simple yet effective method to predict a model's performance on an unknown distribution without any addition annotation. Our approach is rooted in the Optimal Transport theory, viewing test samples' output softmax scores from deep neural networks as empirical samples from an unknown distribution. We show that our method, Confidence Optimal Transport (COT), provides robust estimates of a model's performance on a target domain. Despite its simplicity, our method achieves state-of-the-art results on three benchmark datasets and outperforms existing methods by a large margin.
    Confidence-based Reliable Learning under Dual Noises. (arXiv:2302.05098v1 [cs.CV])
    Deep neural networks (DNNs) have achieved remarkable success in a variety of computer vision tasks, where massive labeled images are routinely required for model optimization. Yet, the data collected from the open world are unavoidably polluted by noise, which may significantly undermine the efficacy of the learned models. Various attempts have been made to reliably train DNNs under data noise, but they separately account for either the noise existing in the labels or that existing in the images. A naive combination of the two lines of works would suffer from the limitations in both sides, and miss the opportunities to handle the two kinds of noise in parallel. This work provides a first, unified framework for reliable learning under the joint (image, label)-noise. Technically, we develop a confidence-based sample filter to progressively filter out noisy data without the need of pre-specifying noise ratio. Then, we penalize the model uncertainty of the detected noisy data instead of letting the model continue over-fitting the misleading information in them. Experimental results on various challenging synthetic and real-world noisy datasets verify that the proposed method can outperform competing baselines in the aspect of classification performance.
    Cross-Corpora Spoken Language Identification with Domain Diversification and Generalization. (arXiv:2302.05110v1 [eess.AS])
    This work addresses the cross-corpora generalization issue for the low-resourced spoken language identification (LID) problem. We have conducted the experiments in the context of Indian LID and identified strikingly poor cross-corpora generalization due to corpora-dependent non-lingual biases. Our contribution to this work is twofold. First, we propose domain diversification, which diversifies the limited training data using different audio data augmentation methods. We then propose the concept of maximally diversity-aware cascaded augmentations and optimize the augmentation fold-factor for effective diversification of the training data. Second, we introduce the idea of domain generalization considering the augmentation methods as pseudo-domains. Towards this, we investigate both domain-invariant and domain-aware approaches. Our LID system is based on the state-of-the-art emphasized channel attention, propagation, and aggregation based time delay neural network (ECAPA-TDNN) architecture. We have conducted extensive experiments with three widely used corpora for Indian LID research. In addition, we conduct a final blind evaluation of our proposed methods on the Indian subset of VoxLingua107 corpus collected in the wild. Our experiments demonstrate that the proposed domain diversification is more promising over commonly used simple augmentation methods. The study also reveals that domain generalization is a more effective solution than domain diversification. We also notice that domain-aware learning performs better for same-corpora LID, whereas domain-invariant learning is more suitable for cross-corpora generalization. Compared to basic ECAPA-TDNN, its proposed domain-invariant extensions improve the cross-corpora EER up to 5.23%. In contrast, the proposed domain-aware extensions also improve performance for same-corpora test scenarios.
    MoreauGrad: Sparse and Robust Interpretation of Neural Networks via Moreau Envelope. (arXiv:2302.05294v1 [cs.CV])
    Explaining the predictions of deep neural nets has been a topic of great interest in the computer vision literature. While several gradient-based interpretation schemes have been proposed to reveal the influential variables in a neural net's prediction, standard gradient-based interpretation frameworks have been commonly observed to lack robustness to input perturbations and flexibility for incorporating prior knowledge of sparsity and group-sparsity structures. In this work, we propose MoreauGrad as an interpretation scheme based on the classifier neural net's Moreau envelope. We demonstrate that MoreauGrad results in a smooth and robust interpretation of a multi-layer neural network and can be efficiently computed through first-order optimization methods. Furthermore, we show that MoreauGrad can be naturally combined with $L_1$-norm regularization techniques to output a sparse or group-sparse explanation which are prior conditions applicable to a wide range of deep learning applications. We empirically evaluate the proposed MoreauGrad scheme on standard computer vision datasets, showing the qualitative and quantitative success of the MoreauGrad approach in comparison to standard gradient-based interpretation methods.
    Towards Minimax Optimality of Model-based Robust Reinforcement Learning. (arXiv:2302.05372v1 [cs.LG])
    We study the sample complexity of obtaining an $\epsilon$-optimal policy in \emph{Robust} discounted Markov Decision Processes (RMDPs), given only access to a generative model of the nominal kernel. This problem is widely studied in the non-robust case, and it is known that any planning approach applied to an empirical MDP estimated with $\tilde{\mathcal{O}}(\frac{H^3 \mid S \mid\mid A \mid}{\epsilon^2})$ samples provides an $\epsilon$-optimal policy, which is minimax optimal. Results in the robust case are much more scarce. For $sa$- (resp $s$-)rectangular uncertainty sets, the best known sample complexity is $\tilde{\mathcal{O}}(\frac{H^4 \mid S \mid^2\mid A \mid}{\epsilon^2})$ (resp. $\tilde{\mathcal{O}}(\frac{H^4 \mid S \mid^2\mid A \mid^2}{\epsilon^2})$), for specific algorithms and when the uncertainty set is based on the total variation (TV), the KL or the Chi-square divergences. In this paper, we consider uncertainty sets defined with an $L_p$-ball (recovering the TV case), and study the sample complexity of \emph{any} planning algorithm (with high accuracy guarantee on the solution) applied to an empirical RMDP estimated using the generative model. In the general case, we prove a sample complexity of $\tilde{\mathcal{O}}(\frac{H^4 \mid S \mid\mid A \mid}{\epsilon^2})$ for both the $sa$- and $s$-rectangular cases (improvements of $\mid S \mid$ and $\mid S \mid\mid A \mid$ respectively). When the size of the uncertainty is small enough, we improve the sample complexity to $\tilde{\mathcal{O}}(\frac{H^3 \mid S \mid\mid A \mid }{\epsilon^2})$, recovering the lower-bound for the non-robust case for the first time and a robust lower-bound when the size of the uncertainty is small enough.
    Exploiting Sparsity in Pruned Neural Networks to Optimize Large Model Training. (arXiv:2302.05045v1 [cs.LG])
    Parallel training of neural networks at scale is challenging due to significant overheads arising from communication. Recently, deep learning researchers have developed a variety of pruning algorithms that are capable of pruning (i.e. setting to zero) 80-90% of the parameters in a neural network to yield sparse subnetworks that equal the accuracy of the unpruned parent network. In this work, we propose a novel approach that exploits these sparse subnetworks to optimize the memory utilization and communication in two popular algorithms for parallel deep learning namely -- data and inter-layer parallelism. We integrate our approach into AxoNN, a highly scalable framework for parallel deep learning that relies on data and inter-layer parallelism, and demonstrate the reduction in communication times and memory utilization. On 512 NVIDIA V100 GPUs, our optimizations reduce the memory consumption of a 2.7 billion parameter model by 74%, and the total communication times by 40%, thus providing an overall speedup of 34% over AxoNN, 32% over DeepSpeed-3D and 46% over Sputnik, a sparse matrix computation baseline.
    Effects of noise on the overparametrization of quantum neural networks. (arXiv:2302.05059v1 [quant-ph])
    Overparametrization is one of the most surprising and notorious phenomena in machine learning. Recently, there have been several efforts to study if, and how, Quantum Neural Networks (QNNs) acting in the absence of hardware noise can be overparametrized. In particular, it has been proposed that a QNN can be defined as overparametrized if it has enough parameters to explore all available directions in state space. That is, if the rank of the Quantum Fisher Information Matrix (QFIM) for the QNN's output state is saturated. Here, we explore how the presence of noise affects the overparametrization phenomenon. Our results show that noise can "turn on" previously-zero eigenvalues of the QFIM. This enables the parametrized state to explore directions that were otherwise inaccessible, thus potentially turning an overparametrized QNN into an underparametrized one. For small noise levels, the QNN is quasi-overparametrized, as large eigenvalues coexists with small ones. Then, we prove that as the magnitude of noise increases all the eigenvalues of the QFIM become exponentially suppressed, indicating that the state becomes insensitive to any change in the parameters. As such, there is a pull-and-tug effect where noise can enable new directions, but also suppress the sensitivity to parameter updates. Finally, our results imply that current QNN capacity measures are ill-defined when hardware noise is present.
    On the Whitney near extension problem, BMO, alignment of data, best approximation in algebraic geometry, manifold learning and their beautiful connections: A modern treatment. (arXiv:2103.09748v7 [math.CA] UPDATED)
    This paper provides fascinating connections between several mathematical problems which lie on the intersection of several mathematics subjects, namely algebraic geometry, approximation theory, complex-harmonic analysis and high dimensional data science. Modern techniques in algebraic geometry, approximation theory, computational harmonic analysis and extensions develop the first of its kind, a unified framework which allows for a simultaneous study of labeled and unlabeled near alignment data problems in of $\mathbb R^D$ with the near isometry extension problem for discrete and non-discrete subsets of $\mathbb R^D$ with certain geometries. In addition, the paper surveys related work on clustering, dimension reduction, manifold learning, vision as well as minimal energy partitions, discrepancy and min-max optimization. Numerous open problems are given.
    AutoNMT: A Framework to Streamline the Research of Seq2Seq Models. (arXiv:2302.04981v1 [cs.CL])
    We present AutoNMT, a framework to streamline the research of seq-to-seq models by automating the data pipeline (i.e., file management, data preprocessing, and exploratory analysis), automating experimentation in a toolkit-agnostic manner, which allows users to use either their own models or existing seq-to-seq toolkits such as Fairseq or OpenNMT, and finally, automating the report generation (plots and summaries). Furthermore, this library comes with its own seq-to-seq toolkit so that users can easily customize it for non-standard tasks.
    Long-Tailed Partial Label Learning via Dynamic Rebalancing. (arXiv:2302.05080v1 [cs.LG])
    Real-world data usually couples the label ambiguity and heavy imbalance, challenging the algorithmic robustness of partial label learning (PLL) and long-tailed learning (LT). The straightforward combination of LT and PLL, i.e., LT-PLL, suffers from a fundamental dilemma: LT methods build upon a given class distribution that is unavailable in PLL, and the performance of PLL is severely influenced in long-tailed context. We show that even with the auxiliary of an oracle class prior, the state-of-the-art methods underperform due to an adverse fact that the constant rebalancing in LT is harsh to the label disambiguation in PLL. To overcome this challenge, we thus propose a dynamic rebalancing method, termed as RECORDS, without assuming any prior knowledge about the class distribution. Based on a parametric decomposition of the biased output, our method constructs a dynamic adjustment that is benign to the label disambiguation process and theoretically converges to the oracle class prior. Extensive experiments on three benchmark datasets demonstrate the significant gain of RECORDS compared with a range of baselines. The code is publicly available.
    Gauge-equivariant neural networks as preconditioners in lattice QCD. (arXiv:2302.05419v1 [hep-lat])
    We demonstrate that a state-of-the art multi-grid preconditioner can be learned efficiently by gauge-equivariant neural networks. We show that the models require minimal re-training on different gauge configurations of the same gauge ensemble and to a large extent remain efficient under modest modifications of ensemble parameters. We also demonstrate that important paradigms such as communication avoidance are straightforward to implement in this framework.
    Hessian Based Smoothing Splines for Manifold Learning. (arXiv:2302.05025v1 [stat.ML])
    We propose a multidimensional smoothing spline algorithm in the context of manifold learning. We generalize the bending energy penalty of thin-plate splines to a quadratic form on the Sobolev space of a flat manifold, based on the Frobenius norm of the Hessian matrix. This leads to a natural definition of smoothing splines on manifolds, which minimizes square error while optimizing a global curvature penalty. The existence and uniqueness of the solution is shown by applying the theory of reproducing kernel Hilbert spaces. The minimizer is expressed as a combination of Green's functions for the biharmonic operator, and 'linear' functions of everywhere vanishing Hessian. Furthermore, we utilize the Hessian estimation procedure from the Hessian Eigenmaps algorithm to approximate the spline loss when the true manifold is unknown. This yields a particularly simple quadratic optimization algorithm for smoothing response values without needing to fit the underlying manifold. Analysis of asymptotic error and robustness are given, as well as discussion of out-of-sample prediction methods and applications.
    Causal Inference out of Control: Estimating the Steerability of Consumption. (arXiv:2302.04989v1 [cs.LG])
    Regulators and academics are increasingly interested in the causal effect that algorithmic actions of a digital platform have on consumption. We introduce a general causal inference problem we call the steerability of consumption that abstracts many settings of interest. Focusing on observational designs and exploiting the structure of the problem, we exhibit a set of assumptions for causal identifiability that significantly weaken the often unrealistic overlap assumptions of standard designs. The key novelty of our approach is to explicitly model the dynamics of consumption over time, viewing the platform as a controller acting on a dynamical system. From this dynamical systems perspective, we are able to show that exogenous variation in consumption and appropriately responsive algorithmic control actions are sufficient for identifying steerability of consumption. Our results illustrate the fruitful interplay of control theory and causal inference, which we illustrate with examples from econometrics, macroeconomics, and machine learning.
    Contraction of $E_\gamma$-Divergence and Its Applications to Privacy. (arXiv:2012.11035v2 [cs.IT] UPDATED)
    We investigate the contraction coefficients derived from strong data processing inequalities for the $E_\gamma$-divergence. By generalizing the celebrated Dobrushin's coefficient from total variation distance to $E_\gamma$-divergence, we derive a closed-form expression for the contraction of $E_\gamma$-divergence. This result has fundamental consequences in two privacy settings. First, it implies that local differential privacy can be equivalently expressed in terms of the contraction of $E_\gamma$-divergence. This equivalent formula can be used to precisely quantify the impact of local privacy in (Bayesian and minimax) estimation and hypothesis testing problems in terms of the reduction of effective sample size. Second, it leads to a new information-theoretic technique for analyzing privacy guarantees of online algorithms. In this technique, we view such algorithms as a composition of amplitude-constrained Gaussian channels and then relate their contraction coefficients under $E_\gamma$-divergence to the overall differential privacy guarantees. As an example, we apply our technique to derive the differential privacy parameters of gradient descent. Moreover, we also show that this framework can be tailored to batch learning algorithms that can be implemented with one pass over the training dataset.
    Evaluation of Data Augmentation and Loss Functions in Semantic Image Segmentation for Drilling Tool Wear Detection. (arXiv:2302.05262v1 [cs.CV])
    Tool wear monitoring is crucial for quality control and cost reduction in manufacturing processes, of which drilling applications are one example. In this paper, we present a U-Net based semantic image segmentation pipeline, deployed on microscopy images of cutting inserts, for the purpose of wear detection. The wear area is differentiated in two different types, resulting in a multiclass classification problem. Joining the two wear types in one general wear class, on the other hand, allows the problem to be formulated as a binary classification task. Apart from the comparison of the binary and multiclass problem, also different loss functions, i. e., Cross Entropy, Focal Cross Entropy, and a loss based on the Intersection over Union (IoU), are investigated. Furthermore, models are trained on image tiles of different sizes, and augmentation techniques of varying intensities are deployed. We find, that the best performing models are binary models, trained on data with moderate augmentation and an IoU-based loss function.
    On Penalty-based Bilevel Gradient Descent Method. (arXiv:2302.05185v1 [cs.LG])
    Bilevel optimization enjoys a wide range of applications in hyper-parameter optimization, meta-learning and reinforcement learning. However, bilevel optimization problems are difficult to solve. Recent progress on scalable bilevel algorithms mainly focuses on bilevel optimization problems where the lower-level objective is either strongly convex or unconstrained. In this work, we tackle the bilevel problem through the lens of the penalty method. We show that under certain conditions, the penalty reformulation recovers the solutions of the original bilevel problem. Further, we propose the penalty-based bilevel gradient descent (PBGD) algorithm and establish its finite-time convergence for the constrained bilevel problem without lower-level strong convexity. Experiments showcase the efficiency of the proposed PBGD algorithm.
    Increasing-Margin Adversarial (IMA) Training to Improve Adversarial Robustness of Neural Networks. (arXiv:2005.09147v10 [cs.CV] UPDATED)
    Deep neural networks (DNNs) are vulnerable to adversarial noises. Adversarial training is a general and effective strategy to improve DNN robustness (i.e., accuracy on noisy data) against adversarial noises. However, DNN models trained by the current existing adversarial training methods may have much lower standard accuracy (i.e., accuracy on clean data), compared to the same models trained by the standard method on clean data, and this phenomenon is known as the trade-off between accuracy and robustness and is considered unavoidable. This issue prevents adversarial training from being used in many application domains, such as medical image analysis, as practitioners do not want to sacrifice standard accuracy too much in exchange for adversarial robustness. Our objective is to lift (i.e., alleviate or even avoid) this trade-off between standard accuracy and adversarial robustness for medical image classification and segmentation. We propose a novel adversarial training method, named Increasing-Margin Adversarial (IMA) Training, which is supported by an equilibrium state analysis about the optimality of adversarial training samples. Our method aims to preserve accuracy while improving robustness by generating optimal adversarial training samples. We evaluate our method and the other eight representative methods on six publicly available image datasets corrupted by noises generated by AutoAttack and white-noise attack. Our method achieves the highest adversarial robustness for image classification and segmentation with the smallest reduction in accuracy on clean data. For one of the applications, our method improves both accuracy and robustness. Our study has demonstrated that our method can lift the trade-off between standard accuracy and adversarial robustness for the image classification and segmentation applications.
    Discovering Sparse Hysteresis Models for Piezoelectric Materials: A Data-Driven Study and Perspectives into Modelling Magnetic Hysteresis. (arXiv:2302.05313v1 [cs.LG])
    This article presents an approach for modelling hysteresis in piezoelectric materials that leverages recent advancements in machine learning, particularly in sparse-regression techniques. While sparse regression has previously been used to model various scientific and engineering phenomena, its application to nonlinear hysteresis modelling in piezoelectric materials has yet to be explored. The study employs the sequential threshold least-squares algorithm to model the dynamic system responsible for hysteresis, resulting in a concise model that accurately predicts hysteresis for both simulated and experimental piezoelectric material data. Additionally, insights are provided on sparse white-box modelling of hysteresis for magnetic materials taking non-oriented electrical steel as an example. The presented approach is compared to traditional regression-based and neural network methods, demonstrating its efficiency and robustness.
    Federated Domain Adaptation via Gradient Projection. (arXiv:2302.05049v1 [cs.LG])
    Federated Domain Adaptation (FDA) describes the federated learning setting where a set of source clients work collaboratively to improve the performance of a target client and where the target client has limited labeled data. The domain shift between the source and target domains, combined with limited samples in the target domain, makes FDA a challenging problem, e.g., common techniques such as FedAvg and fine-tuning fail with a large domain shift. To fill this gap, we propose Federated Gradient Projection ($\texttt{FedGP}$), a novel aggregation rule for FDA, used to aggregate the source gradients and target gradient during training. Further, we introduce metrics that characterize the FDA setting and propose a theoretical framework for analyzing the performance of aggregation rules, which may be of independent interest. Using this framework, we theoretically characterize how, when, and why $\texttt{FedGP}$ works compared to baselines. Our theory suggests certain practical rules that are predictive of practice. Experiments on synthetic and real-world datasets verify the theoretical insights and illustrate the effectiveness of the proposed method in practice.
    AdaptSim: Task-Driven Simulation Adaptation for Sim-to-Real Transfer. (arXiv:2302.04903v1 [cs.RO])
    Simulation parameter settings such as contact models and object geometry approximations are critical to training robust robotic policies capable of transferring from simulation to real-world deployment. Previous approaches typically handcraft distributions over such parameters (domain randomization), or identify parameters that best match the dynamics of the real environment (system identification). However, there is often an irreducible gap between simulation and reality: attempting to match the dynamics between simulation and reality across all states and tasks may be infeasible and may not lead to policies that perform well in reality for a specific task. Addressing this issue, we propose AdaptSim, a new task-driven adaptation framework for sim-to-real transfer that aims to optimize task performance in target (real) environments -- instead of matching dynamics between simulation and reality. First, we meta-learn an adaptation policy in simulation using reinforcement learning for adjusting the simulation parameter distribution based on the current policy's performance in a target environment. We then perform iterative real-world adaptation by inferring new simulation parameter distributions for policy training, using a small amount of real data. We perform experiments in three robotic tasks: (1) swing-up of linearized double pendulum, (2) dynamic table-top pushing of a bottle, and (3) dynamic scooping of food pieces with a spatula. Our extensive simulation and hardware experiments demonstrate AdaptSim achieving 1-3x asymptotic performance and $\sim$2x real data efficiency when adapting to different environments, compared to methods based on Sys-ID and directly training the task policy in target environments.
    Fast Gumbel-Max Sketch and its Applications. (arXiv:2302.05176v1 [cs.LG])
    The well-known Gumbel-Max Trick for sampling elements from a categorical distribution (or more generally a non-negative vector) and its variants have been widely used in areas such as machine learning and information retrieval. To sample a random element $i$ in proportion to its positive weight $v_i$, the Gumbel-Max Trick first computes a Gumbel random variable $g_i$ for each positive weight element $i$, and then samples the element $i$ with the largest value of $g_i+\ln v_i$. Recently, applications including similarity estimation and weighted cardinality estimation require to generate $k$ independent Gumbel-Max variables from high dimensional vectors. However, it is computationally expensive for a large $k$ (e.g., hundreds or even thousands) when using the traditional Gumbel-Max Trick. To solve this problem, we propose a novel algorithm, FastGM, which reduces the time complexity from $O(kn^+)$ to $O(k \ln k + n^+)$, where $n^+$ is the number of positive elements in the vector of interest. FastGM stops the procedure of Gumbel random variables computing for many elements, especially for those with small weights. We perform experiments on a variety of real-world datasets and the experimental results demonstrate that FastGM is orders of magnitude faster than state-of-the-art methods without sacrificing accuracy or incurring additional expenses.
    Invariant Slot Attention: Object Discovery with Slot-Centric Reference Frames. (arXiv:2302.04973v1 [cs.CV])
    Automatically discovering composable abstractions from raw perceptual data is a long-standing challenge in machine learning. Recent slot-based neural networks that learn about objects in a self-supervised manner have made exciting progress in this direction. However, they typically fall short at adequately capturing spatial symmetries present in the visual world, which leads to sample inefficiency, such as when entangling object appearance and pose. In this paper, we present a simple yet highly effective method for incorporating spatial symmetries via slot-centric reference frames. We incorporate equivariance to per-object pose transformations into the attention and generation mechanism of Slot Attention by translating, scaling, and rotating position encodings. These changes result in little computational overhead, are easy to implement, and can result in large gains in terms of data efficiency and overall improvements to object discovery. We evaluate our method on a wide range of synthetic object discovery benchmarks namely CLEVR, Tetrominoes, CLEVRTex, Objects Room and MultiShapeNet, and show promising improvements on the challenging real-world Waymo Open dataset.
    Binarized Neural Machine Translation. (arXiv:2302.04907v1 [cs.CL])
    The rapid scaling of language models is motivating research using low-bitwidth quantization. In this work, we propose a novel binarization technique for Transformers applied to machine translation (BMT), the first of its kind. We identify and address the problem of inflated dot-product variance when using one-bit weights and activations. Specifically, BMT leverages additional LayerNorms and residual connections to improve binarization quality. Experiments on the WMT dataset show that a one-bit weight-only Transformer can achieve the same quality as a float one, while being 16x smaller in size. One-bit activations incur varying degrees of quality drop, but mitigated by the proposed architectural changes. We further conduct a scaling law study using production-scale translation datasets, which shows that one-bit weight Transformers scale and generalize well in both in-domain and out-of-domain settings. Implementation in JAX/Flax will be open sourced.
    Star-Shaped Denoising Diffusion Probabilistic Models. (arXiv:2302.05259v1 [stat.ML])
    Methods based on Denoising Diffusion Probabilistic Models (DDPM) became a ubiquitous tool in generative modeling. However, they are mostly limited to Gaussian and discrete diffusion processes. We propose Star-Shaped Denoising Diffusion Probabilistic Models (SS-DDPM), a model with a non-Markovian diffusion-like noising process. In the case of Gaussian distributions, this model is equivalent to Markovian DDPMs. However, it can be defined and applied with arbitrary noising distributions, and admits efficient training and sampling algorithms for a wide range of distributions that lie in the exponential family. We provide a simple recipe for designing diffusion-like models with distributions like Beta, von Mises--Fisher, Dirichlet, Wishart and others, which can be especially useful when data lies on a constrained manifold such as the unit sphere, the space of positive semi-definite matrices, the probabilistic simplex, etc. We evaluate the model in different settings and find it competitive even on image data, where Beta SS-DDPM achieves results comparable to a Gaussian DDPM.
    Beyond In-Domain Scenarios: Robust Density-Aware Calibration. (arXiv:2302.05118v1 [cs.LG])
    Calibrating deep learning models to yield uncertainty-aware predictions is crucial as deep neural networks get increasingly deployed in safety-critical applications. While existing post-hoc calibration methods achieve impressive results on in-domain test datasets, they are limited by their inability to yield reliable uncertainty estimates in domain-shift and out-of-domain (OOD) scenarios. We aim to bridge this gap by proposing DAC, an accuracy-preserving as well as Density-Aware Calibration method based on k-nearest-neighbors (KNN). In contrast to existing post-hoc methods, we utilize hidden layers of classifiers as a source for uncertainty-related information and study their importance. We show that DAC is a generic method that can readily be combined with state-of-the-art post-hoc methods. DAC boosts the robustness of calibration performance in domain-shift and OOD, while maintaining excellent in-domain predictive uncertainty estimates. We demonstrate that DAC leads to consistently better calibration across a large number of model architectures, datasets, and metrics. Additionally, we show that DAC improves calibration substantially on recent large-scale neural networks pre-trained on vast amounts of data.
    DOMINO: Domain-aware Loss for Deep Learning Calibration. (arXiv:2302.05142v1 [cs.CV])
    Deep learning has achieved the state-of-the-art performance across medical imaging tasks; however, model calibration is often not considered. Uncalibrated models are potentially dangerous in high-risk applications since the user does not know when they will fail. Therefore, this paper proposes a novel domain-aware loss function to calibrate deep learning models. The proposed loss function applies a class-wise penalty based on the similarity between classes within a given target domain. Thus, the approach improves the calibration while also ensuring that the model makes less risky errors even when incorrect. The code for this software is available at https://github.com/lab-smile/DOMINO.
    Graph Neural Networks Go Forward-Forward. (arXiv:2302.05282v1 [cs.LG])
    We present the Graph Forward-Forward (GFF) algorithm, an extension of the Forward-Forward procedure to graphs, able to handle features distributed over a graph's nodes. This allows training graph neural networks with forward passes only, without backpropagation. Our method is agnostic to the message-passing scheme, and provides a more biologically plausible learning scheme than backpropagation, while also carrying computational advantages. With GFF, graph neural networks are trained greedily layer by layer, using both positive and negative samples. We run experiments on 11 standard graph property prediction tasks, showing how GFF provides an effective alternative to backpropagation for training graph neural networks. This shows in particular that this procedure is remarkably efficient in spite of combining the per-layer training with the locality of the processing in a GNN.
    Multi-armed Bandit Learning for TDMA Transmission Slot Scheduling and Defragmentation for Improved Bandwidth Usage. (arXiv:2302.05301v1 [cs.NI])
    This paper proposes a Time Division Multiple Access (TDMA) MAC slot allocation protocol with efficient bandwidth usage in wireless sensor networks and Internet of Things (IoTs). The developed protocol has two primary components: a Multi-Armed Bandits (MAB)-based slot allocation mechanism for collision free transmission, and a Decentralized Defragmented Slot Backshift (DDSB) operation for improving bandwidth usage efficiency. The proposed framework is decentralized in that each node finds its transmission schedule independently without the control of any centralized arbitrator. The developed mechanism is suitable for networks with or without time synchronization, thus, making it suitable for low-complexity wireless transceivers for wireless sensor and IoT nodes. This framework is able to manage the trade-off between learning convergence time and bandwidth. In addition, it allows the nodes to adapt to topological changes while maintaining efficient bandwidth usage. The developed logic is tested for both fully-connected and arbitrary mesh networks with extensive simulation experiments. It is shown how the nodes can learn to select collision-free transmission slots using MAB. Moreover, the nodes learn to self-adjust their transmission schedules using a novel DDSB framework in order to reduce bandwidth usage.
    The Impact of Data Distribution on Q-learning with Function Approximation. (arXiv:2111.11758v3 [cs.LG] UPDATED)
    We study the interplay between the data distribution and Q-learning-based algorithms with function approximation. We provide a unified theoretical and empirical analysis as to how different properties of the data distribution influence the performance of Q-learning-based algorithms. We connect different lines of research, as well as validate and extend previous results. We start by reviewing theoretical bounds on the performance of approximate dynamic programming algorithms. We then introduce a novel four-state MDP specifically tailored to highlight the impact of the data distribution in the performance of Q-learning-based algorithms with function approximation, both online and offline. Finally, we experimentally assess the impact of the data distribution properties on the performance of two offline Q-learning-based algorithms under different environments. According to our results: (i) high entropy data distributions are well-suited for learning in an offline manner; and (ii) a certain degree of data diversity (data coverage) and data quality (closeness to optimal policy) are jointly desirable for offline learning.
    Stochastic Multiple Target Sampling Gradient Descent. (arXiv:2206.01934v4 [cs.LG] UPDATED)
    Sampling from an unnormalized target distribution is an essential problem with many applications in probabilistic inference. Stein Variational Gradient Descent (SVGD) has been shown to be a powerful method that iteratively updates a set of particles to approximate the distribution of interest. Furthermore, when analysing its asymptotic properties, SVGD reduces exactly to a single-objective optimization problem and can be viewed as a probabilistic version of this single-objective optimization problem. A natural question then arises: "Can we derive a probabilistic version of the multi-objective optimization?". To answer this question, we propose Stochastic Multiple Target Sampling Gradient Descent (MT-SGD), enabling us to sample from multiple unnormalized target distributions. Specifically, our MT-SGD conducts a flow of intermediate distributions gradually orienting to multiple target distributions, which allows the sampled particles to move to the joint high-likelihood region of the target distributions. Interestingly, the asymptotic analysis shows that our approach reduces exactly to the multiple-gradient descent algorithm for multi-objective optimization, as expected. Finally, we conduct comprehensive experiments to demonstrate the merit of our approach to multi-task learning.
    PATCorrect: Non-autoregressive Phoneme-augmented Transformer for ASR Error Correction. (arXiv:2302.05040v1 [cs.CL])
    Speech-to-text errors made by automatic speech recognition (ASR) system negatively impact downstream models relying on ASR transcriptions. Language error correction models as a post-processing text editing approach have been recently developed for refining the source sentences. However, efficient models for correcting errors in ASR transcriptions that meet the low latency requirements of industrial grade production systems have not been well studied. In this work, we propose a novel non-autoregressive (NAR) error correction approach to improve the transcription quality by reducing word error rate (WER) and achieve robust performance across different upstream ASR systems. Our approach augments the text encoding of the Transformer model with a phoneme encoder that embeds pronunciation information. The representations from phoneme encoder and text encoder are combined via multi-modal fusion before feeding into the length tagging predictor for predicting target sequence lengths. The joint encoders also provide inputs to the attention mechanism in the NAR decoder. We experiment on 3 open-source ASR systems with varying speech-to-text transcription quality and their erroneous transcriptions on 2 public English corpus datasets. Results show that our PATCorrect (Phoneme Augmented Transformer for ASR error Correction) consistently outperforms state-of-the-art NAR error correction method on English corpus across different upstream ASR systems. For example, PATCorrect achieves 11.62% WER reduction (WERR) averaged on 3 ASR systems compared to 9.46% WERR achieved by other method using text only modality and also achieves an inference latency comparable to other NAR models at tens of millisecond scale, especially on GPU hardware, while still being 4.2 - 6.7x times faster than autoregressive models on Common Voice and LibriSpeech datasets.
    Diagnosing failures of fairness transfer across distribution shift in real-world medical settings. (arXiv:2202.01034v2 [cs.LG] UPDATED)
    Diagnosing and mitigating changes in model fairness under distribution shift is an important component of the safe deployment of machine learning in healthcare settings. Importantly, the success of any mitigation strategy strongly depends on the structure of the shift. Despite this, there has been little discussion of how to empirically assess the structure of a distribution shift that one is encountering in practice. In this work, we adopt a causal framing to motivate conditional independence tests as a key tool for characterizing distribution shifts. Using our approach in two medical applications, we show that this knowledge can help diagnose failures of fairness transfer, including cases where real-world shifts are more complex than is often assumed in the literature. Based on these results, we discuss potential remedies at each step of the machine learning pipeline.
    Hypernetworks build Implicit Neural Representations of Sounds. (arXiv:2302.04959v1 [cs.LG])
    Implicit Neural Representations (INRs) are nowadays used to represent multimedia signals across various real-life applications, including image super-resolution, image compression, or 3D rendering. Existing methods that leverage INRs are predominantly focused on visual data, as their application to other modalities, such as audio, is nontrivial due to the inductive biases present in architectural attributes of image-based INR models. To address this limitation, we introduce HyperSound, the first meta-learning approach to produce INRs for audio samples that leverages hypernetworks to generalize beyond samples observed in training. Our approach reconstructs audio samples with quality comparable to other state-of-the-art models and provides a viable alternative to contemporary sound representations used in deep neural networks for audio processing, such as spectrograms.  ( 2 min )
    Differentially Private Optimization for Smooth Nonconvex ERM. (arXiv:2302.04972v1 [cs.LG])
    We develop simple differentially private optimization algorithms that move along directions of (expected) descent to find an approximate second-order solution for nonconvex ERM. We use line search, mini-batching, and a two-phase strategy to improve the speed and practicality of the algorithm. Numerical experiments demonstrate the effectiveness of these approaches.  ( 2 min )
    Piecewise-Stationary Multi-Objective Multi-Armed Bandit with Application to Joint Communications and Sensing. (arXiv:2302.05257v1 [cs.LG])
    We study a multi-objective multi-armed bandit problem in a dynamic environment. The problem portrays a decision-maker that sequentially selects an arm from a given set. If selected, each action produces a reward vector, where every element follows a piecewise-stationary Bernoulli distribution. The agent aims at choosing an arm among the Pareto optimal set of arms to minimize its regret. We propose a Pareto generic upper confidence bound (UCB)-based algorithm with change detection to solve this problem. By developing the essential inequalities for multi-dimensional spaces, we establish that our proposal guarantees a regret bound in the order of $\gamma_T\log(T/{\gamma_T})$ when the number of breakpoints $\gamma_T$ is known. Without this assumption, the regret bound of our algorithm is $\gamma_T\log(T)$. Finally, we formulate an energy-efficient waveform design problem in an integrated communication and sensing system as a toy example. Numerical experiments on the toy example and synthetic and real-world datasets demonstrate the efficiency of our policy compared to the current methods.
    Deep Learning from Parametrically Generated Virtual Buildings for Real-World Object Recognition. (arXiv:2302.05283v1 [cs.CV])
    We study the use of parametric building information modeling (BIM) to automatically generate training data for artificial neural networks (ANNs) to recognize building objects in photos. Teaching artificial intelligence (AI) machines to detect building objects in images is the foundation toward AI-assisted semantic 3D reconstruction of existing buildings. However, there exists the challenge of acquiring training data which is typically human-annotated, that is, unless a computer machine can generate high-quality data to train itself for a certain task. In that vein, we trained ANNs solely on realistic computer-generated images of 3D BIM models which were parametrically and automatically generated using the BIMGenE program. The ANN training result demonstrated generalizability and good semantic segmentation on a test case as well as arbitrary photos of buildings that are outside the range of the training data, which is significant for the future of training AI with generated data for solving real-world architectural problems.
    A Survey on Causal Reinforcement Learning. (arXiv:2302.05209v1 [cs.AI])
    While Reinforcement Learning (RL) achieves tremendous success in sequential decision-making problems of many domains, it still faces key challenges of data inefficiency and the lack of interpretability. Interestingly, many researchers have leveraged insights from the causality literature recently, bringing forth flourishing works to unify the merits of causality and address well the challenges from RL. As such, it is of great necessity and significance to collate these Causal Reinforcement Learning (CRL) works, offer a review of CRL methods, and investigate the potential functionality from causality toward RL. In particular, we divide existing CRL approaches into two categories according to whether their causality-based information is given in advance or not. We further analyze each category in terms of the formalization of different models, ranging from the Markov Decision Process (MDP), Partially Observed Markov Decision Process (POMDP), Multi-Arm Bandits (MAB), and Dynamic Treatment Regime (DTR). Moreover, we summarize the evaluation matrices and open sources while we discuss emerging applications, along with promising prospects for the future development of CRL.  ( 2 min )
    Tensor Generalized Canonical Correlation Analysis. (arXiv:2302.05277v1 [stat.ML])
    Regularized Generalized Canonical Correlation Analysis (RGCCA) is a general statistical framework for multi-block data analysis. RGCCA enables deciphering relationships between several sets of variables and subsumes many well-known multivariate analysis methods as special cases. However, RGCCA only deals with vector-valued blocks, disregarding their possible higher-order structures. This paper presents Tensor GCCA (TGCCA), a new method for analyzing higher-order tensors with canonical vectors admitting an orthogonal rank-R CP decomposition. Moreover, two algorithms for TGCCA, based on whether a separable covariance structure is imposed or not, are presented along with convergence guarantees. The efficiency and usefulness of TGCCA are evaluated on simulated and real data and compared favorably to state-of-the-art approaches.  ( 2 min )
    GCI: A (G)raph (C)oncept (I)nterpretation Framework. (arXiv:2302.04899v1 [cs.LG])
    Explainable AI (XAI) underwent a recent surge in research on concept extraction, focusing on extracting human-interpretable concepts from Deep Neural Networks. An important challenge facing concept extraction approaches is the difficulty of interpreting and evaluating discovered concepts, especially for complex tasks such as molecular property prediction. We address this challenge by presenting GCI: a (G)raph (C)oncept (I)nterpretation framework, used for quantitatively measuring alignment between concepts discovered from Graph Neural Networks (GNNs) and their corresponding human interpretations. GCI encodes concept interpretations as functions, which can be used to quantitatively measure the alignment between a given interpretation and concept definition. We demonstrate four applications of GCI: (i) quantitatively evaluating concept extractors, (ii) measuring alignment between concept extractors and human interpretations, (iii) measuring the completeness of interpretations with respect to an end task and (iv) a practical application of GCI to molecular property prediction, in which we demonstrate how to use chemical functional groups to explain GNNs trained on molecular property prediction tasks, and implement interpretations with a 0.76 AUCROC completeness score.
    Making Substitute Models More Bayesian Can Enhance Transferability of Adversarial Examples. (arXiv:2302.05086v1 [cs.LG])
    The transferability of adversarial examples across deep neural networks (DNNs) is the crux of many black-box attacks. Many prior efforts have been devoted to improving the transferability via increasing the diversity in inputs of some substitute models. In this paper, by contrast, we opt for the diversity in substitute models and advocate to attack a Bayesian model for achieving desirable transferability. Deriving from the Bayesian formulation, we develop a principled strategy for possible finetuning, which can be combined with many off-the-shelf Gaussian posterior approximations over DNN parameters. Extensive experiments have been conducted to verify the effectiveness of our method, on common benchmark datasets, and the results demonstrate that our method outperforms recent state-of-the-arts by large margins (roughly 19% absolute increase in average attack success rate on ImageNet), and, by combining with these recent methods, further performance gain can be obtained. Our code: https://github.com/qizhangli/MoreBayesian-attack.  ( 2 min )
    AIOSA: An approach to the automatic identification of obstructive sleep apnea events based on deep learning. (arXiv:2302.05179v1 [cs.LG])
    Obstructive Sleep Apnea Syndrome (OSAS) is the most common sleep-related breathing disorder. It is caused by an increased upper airway resistance during sleep, which determines episodes of partial or complete interruption of airflow. The detection and treatment of OSAS is particularly important in stroke patients, because the presence of severe OSAS is associated with higher mortality, worse neurological deficits, worse functional outcome after rehabilitation, and a higher likelihood of uncontrolled hypertension. The gold standard test for diagnosing OSAS is polysomnography (PSG). Unfortunately, performing a PSG in an electrically hostile environment, like a stroke unit, on neurologically impaired patients is a difficult task; also, the number of strokes per day outnumbers the availability of polysomnographs and dedicated healthcare professionals. Thus, a simple and automated recognition system to identify OSAS among acute stroke patients, relying on routinely recorded vital signs, is desirable. The majority of the work done so far focuses on data recorded in ideal conditions and highly selected patients, and thus it is hardly exploitable in real-life settings, where it would be of actual use. In this paper, we propose a convolutional deep learning architecture able to reduce the temporal resolution of raw waveform data, like physiological signals, extracting key features that can be used for further processing. We exploit models based on such an architecture to detect OSAS events in stroke unit recordings obtained from the monitoring of unselected patients. Unlike existing approaches, annotations are performed at one-second granularity, allowing physicians to better interpret the model outcome. Results are considered to be satisfactory by the domain experts. Moreover, based on a widely-used benchmark, we show that the proposed approach outperforms current state-of-the-art solutions.
    Low Entropy Communication in Multi-Agent Reinforcement Learning. (arXiv:2302.05055v1 [cs.MA])
    Communication in multi-agent reinforcement learning has been drawing attention recently for its significant role in cooperation. However, multi-agent systems may suffer from limitations on communication resources and thus need efficient communication techniques in real-world scenarios. According to the Shannon-Hartley theorem, messages to be transmitted reliably in worse channels require lower entropy. Therefore, we aim to reduce message entropy in multi-agent communication. A fundamental challenge is that the gradients of entropy are either 0 or infinity, disabling gradient-based methods. To handle it, we propose a pseudo gradient descent scheme, which reduces entropy by adjusting the distributions of messages wisely. We conduct experiments on two base communication frameworks with six environment settings and find that our scheme can reduce message entropy by up to 90% with nearly no loss of cooperation performance.
    Neural Capacitated Clustering. (arXiv:2302.05134v1 [cs.LG])
    Recent work on deep clustering has found new promising methods also for constrained clustering problems. Their typically pairwise constraints often can be used to guide the partitioning of the data. Many problems however, feature cluster-level constraints, e.g. the Capacitated Clustering Problem (CCP), where each point has a weight and the total weight sum of all points in each cluster is bounded by a prescribed capacity. In this paper we propose a new method for the CCP, Neural Capacited Clustering, that learns a neural network to predict the assignment probabilities of points to cluster centers from a data set of optimal or near optimal past solutions of other problem instances. During inference, the resulting scores are then used in an iterative k-means like procedure to refine the assignment under capacity constraints. In our experiments on artificial data and two real world datasets our approach outperforms several state-of-the-art mathematical and heuristic solvers from the literature. Moreover, we apply our method in the context of a cluster-first-route-second approach to the Capacitated Vehicle Routing Problem (CVRP) and show competitive results on the well-known Uchoa benchmark.
    Fast Learnings of Coupled Nonnegative Tensor Decomposition Using Optimal Gradient and Low-rank Approximation. (arXiv:2302.05119v1 [cs.LG])
    Nonnegative tensor decomposition has been widely applied in signal processing and neuroscience, etc. When it comes to group analysis of multi-block tensors, traditional tensor decomposition is insufficient to utilize the shared/similar information among tensors. In this study, we propose a coupled nonnegative CANDECOMP/PARAFAC decomposition algorithm optimized by the alternating proximal gradient method (CoNCPDAPG), which is capable of a simultaneous decomposition of tensors from different samples that are partially linked and a simultaneous extraction of common components, individual components and core tensors. Due to the low optimization efficiency brought by the nonnegative constraint and the high-dimensional nature of the data, we further propose the lraCoNCPD-APG algorithm by combining low-rank approximation and the proposed CoNCPD-APG method. When processing multi-block large-scale tensors, the proposed lraCoNCPD-APG algorithm can greatly reduce the computational load without compromising the decomposition quality. Experiment results of coupled nonnegative tensor decomposition problems designed for synthetic data, real-world face images and event-related potential data demonstrate the practicability and superiority of the proposed algorithms.  ( 2 min )
    Quadratic Memory is Necessary for Optimal Query Complexity in Convex Optimization: Center-of-Mass is Pareto-Optimal. (arXiv:2302.04963v1 [cs.LG])
    We give query complexity lower bounds for convex optimization and the related feasibility problem. We show that quadratic memory is necessary to achieve the optimal oracle complexity for first-order convex optimization. In particular, this shows that center-of-mass cutting-planes algorithms in dimension $d$ which use $\tilde O(d^2)$ memory and $\tilde O(d)$ queries are Pareto-optimal for both convex optimization and the feasibility problem, up to logarithmic factors. Precisely, we prove that to minimize $1$-Lipschitz convex functions over the unit ball to $1/d^4$ accuracy, any deterministic first-order algorithms using at most $d^{2-\delta}$ bits of memory must make $\tilde\Omega(d^{1+\delta/3})$ queries, for any $\delta\in[0,1]$. For the feasibility problem, in which an algorithm only has access to a separation oracle, we show a stronger trade-off: for at most $d^{2-\delta}$ memory, the number of queries required is $\tilde\Omega(d^{1+\delta})$. This resolves a COLT 2019 open problem of Woodworth and Srebro.  ( 2 min )
    Data-Driven Stochastic Motion Evaluation and Optimization with Image by Spatially-Aligned Temporal Encoding. (arXiv:2302.05041v1 [cs.CV])
    This paper proposes a probabilistic motion prediction method for long motions. The motion is predicted so that it accomplishes a task from the initial state observed in the given image. While our method evaluates the task achievability by the Energy-Based Model (EBM), previous EBMs are not designed for evaluating the consistency between different domains (i.e., image and motion in our method). Our method seamlessly integrates the image and motion data into the image feature domain by spatially-aligned temporal encoding so that features are extracted along the motion trajectory projected onto the image. Furthermore, this paper also proposes a data-driven motion optimization method, Deep Motion Optimizer (DMO), that works with EBM for motion prediction. Different from previous gradient-based optimizers, our self-supervised DMO alleviates the difficulty of hyper-parameter tuning to avoid local minima. The effectiveness of the proposed method is demonstrated with a variety of experiments with similar SOTA methods.
    A Song of Ice and Fire: Analyzing Textual Autotelic Agents in ScienceWorld. (arXiv:2302.05244v1 [cs.AI])
    Building open-ended agents that can autonomously discover a diversity of behaviours is one of the long-standing goals of artificial intelligence. This challenge can be studied in the framework of autotelic RL agents, i.e. agents that learn by selecting and pursuing their own goals, self-organizing a learning curriculum. Recent work identified language has a key dimension of autotelic learning, in particular because it enables abstract goal sampling and guidance from social peers for hindsight relabelling. Within this perspective, we study the following open scientific questions: What is the impact of hindsight feedback from a social peer (e.g. selective vs. exhaustive)? How can the agent learn from very rare language goal examples in its experience replay? How can multiple forms of exploration be combined, and take advantage of easier goals as stepping stones to reach harder ones? To address these questions, we use ScienceWorld, a textual environment with rich abstract and combinatorial physics. We show the importance of selectivity from the social peer's feedback; that experience replay needs to over-sample examples of rare goals; and that following self-generated goal sequences where the agent's competence is intermediate leads to significant improvements in final performance.  ( 2 min )
    Learning Complex Teamwork Tasks using a Sub-task Curriculum. (arXiv:2302.04944v1 [cs.MA])
    Training a team to complete a complex task via multi-agent reinforcement learning can be difficult due to challenges such as policy search in a large policy space, and non-stationarity caused by mutually adapting agents. To facilitate efficient learning of complex multi-agent tasks, we propose an approach which uses an expert-provided curriculum of simpler multi-agent sub-tasks. In each sub-task of the curriculum, a subset of the entire team is trained to acquire sub-task-specific policies. The sub-teams are then merged and transferred to the target task, where their policies are collectively fined tuned to solve the more complex target task. We present MEDoE, a flexible method which identifies situations in the target task where each agent can use its sub-task-specific skills, and uses this information to modulate hyperparameters for learning and exploration during the fine-tuning process. We compare MEDoE to multi-agent reinforcement learning baselines that train from scratch in the full task, and with na\"ive applications of standard multi-agent reinforcement learning techniques for fine-tuning. We show that MEDoE outperforms baselines which train from scratch or use na\"ive fine-tuning approaches, requiring significantly fewer total training timesteps to solve a range of complex teamwork tasks.  ( 2 min )
    The Monge Gap: A Regularizer to Learn All Transport Maps. (arXiv:2302.04953v1 [cs.LG])
    Optimal transport (OT) theory has been been used in machine learning to study and characterize maps that can push-forward efficiently a probability measure onto another. Recent works have drawn inspiration from Brenier's theorem, which states that when the ground cost is the squared-Euclidean distance, the ``best'' map to morph a continuous measure in $\mathcal{P}(\Rd)$ into another must be the gradient of a convex function. To exploit that result, [Makkuva+ 2020, Korotin+2020] consider maps $T=\nabla f_\theta$, where $f_\theta$ is an input convex neural network (ICNN), as defined by Amos+2017, and fit $\theta$ with SGD using samples. Despite their mathematical elegance, fitting OT maps with ICNNs raises many challenges, due notably to the many constraints imposed on $\theta$; the need to approximate the conjugate of $f_\theta$; or the limitation that they only work for the squared-Euclidean cost. More generally, we question the relevance of using Brenier's result, which only applies to densities, to constrain the architecture of candidate maps fitted on samples. Motivated by these limitations, we propose a radically different approach to estimating OT maps: Given a cost $c$ and a reference measure $\rho$, we introduce a regularizer, the Monge gap $\mathcal{M}^c_{\rho}(T)$ of a map $T$. That gap quantifies how far a map $T$ deviates from the ideal properties we expect from a $c$-OT map. In practice, we drop all architecture requirements for $T$ and simply minimize a distance (e.g., the Sinkhorn divergence) between $T\sharp\mu$ and $\nu$, regularized by $\mathcal{M}^c_\rho(T)$. We study $\mathcal{M}^c_{\rho}$, and show how our simple pipeline outperforms significantly other baselines in practice.
    Information Theoretic Lower Bounds for Information Theoretic Upper Bounds. (arXiv:2302.04925v1 [cs.LG])
    We examine the relationship between the mutual information between the output model and the empirical sample and the generalization of the algorithm in the context of stochastic convex optimization. Despite increasing interest in information-theoretic generalization bounds, it is uncertain if these bounds can provide insight into the exceptional performance of various learning algorithms. Our study of stochastic convex optimization reveals that, for true risk minimization, dimension-dependent mutual information is necessary. This indicates that existing information-theoretic generalization bounds fall short in capturing the generalization capabilities of algorithms like SGD and regularized ERM, which have dimension-independent sample complexity.
    Hyperparameter Search Is All You Need For Training-Agnostic Backdoor Robustness. (arXiv:2302.04977v1 [cs.CR])
    Commoditization and broad adoption of machine learning (ML) technologies expose users of these technologies to new security risks. Many models today are based on neural networks. Training and deploying these models for real-world applications involves complex hardware and software pipelines applied to training data from many sources. Models trained on untrusted data are vulnerable to poisoning attacks that introduce "backdoor" functionality. Compromising a fraction of the training data requires few resources from the attacker, but defending against these attacks is a challenge. Although there have been dozens of defenses proposed in the research literature, most of them are expensive to integrate or incompatible with the existing training pipelines. In this paper, we take a pragmatic, developer-centric view and show how practitioners can answer two actionable questions: (1) how robust is my model to backdoor poisoning attacks?, and (2) how can I make it more robust without changing the training pipeline? We focus on the size of the compromised subset of the training data as a universal metric. We propose an easy-to-learn primitive sub-task to estimate this metric, thus providing a baseline on backdoor poisoning. Next, we show how to leverage hyperparameter search - a tool that ML developers already extensively use - to balance the model's accuracy and robustness to poisoning, without changes to the training pipeline. We demonstrate how to use our metric to estimate the robustness of models to backdoor attacks. We then design, implement, and evaluate a multi-stage hyperparameter search method we call Mithridates that strengthens robustness by 3-5x with only a slight impact on the model's accuracy. We show that the hyperparameters found by our method increase robustness against multiple types of backdoor attacks and extend our method to AutoML and federated learning.  ( 2 min )
    DRGCN: Dynamic Evolving Initial Residual for Deep Graph Convolutional Networks. (arXiv:2302.05083v1 [cs.LG])
    Graph convolutional networks (GCNs) have been proved to be very practical to handle various graph-related tasks. It has attracted considerable research interest to study deep GCNs, due to their potential superior performance compared with shallow ones. However, simply increasing network depth will, on the contrary, hurt the performance due to the over-smoothing problem. Adding residual connection is proved to be effective for learning deep convolutional neural networks (deep CNNs), it is not trivial when applied to deep GCNs. Recent works proposed an initial residual mechanism that did alleviate the over-smoothing problem in deep GCNs. However, according to our study, their algorithms are quite sensitive to different datasets. In their setting, the personalization (dynamic) and correlation (evolving) of how residual applies are ignored. To this end, we propose a novel model called Dynamic evolving initial Residual Graph Convolutional Network (DRGCN). Firstly, we use a dynamic block for each node to adaptively fetch information from the initial representation. Secondly, we use an evolving block to model the residual evolving pattern between layers. Our experimental results show that our model effectively relieves the problem of over-smoothing in deep GCNs and outperforms the state-of-the-art (SOTA) methods on various benchmark datasets. Moreover, we develop a mini-batch version of DRGCN which can be applied to large-scale data. Coupling with several fair training techniques, our model reaches new SOTA results on the large-scale ogbn-arxiv dataset of Open Graph Benchmark (OGB). Our reproducible code is available on GitHub.
  • Open

    Contraction of $E_\gamma$-Divergence and Its Applications to Privacy. (arXiv:2012.11035v2 [cs.IT] UPDATED)
    We investigate the contraction coefficients derived from strong data processing inequalities for the $E_\gamma$-divergence. By generalizing the celebrated Dobrushin's coefficient from total variation distance to $E_\gamma$-divergence, we derive a closed-form expression for the contraction of $E_\gamma$-divergence. This result has fundamental consequences in two privacy settings. First, it implies that local differential privacy can be equivalently expressed in terms of the contraction of $E_\gamma$-divergence. This equivalent formula can be used to precisely quantify the impact of local privacy in (Bayesian and minimax) estimation and hypothesis testing problems in terms of the reduction of effective sample size. Second, it leads to a new information-theoretic technique for analyzing privacy guarantees of online algorithms. In this technique, we view such algorithms as a composition of amplitude-constrained Gaussian channels and then relate their contraction coefficients under $E_\gamma$-divergence to the overall differential privacy guarantees. As an example, we apply our technique to derive the differential privacy parameters of gradient descent. Moreover, we also show that this framework can be tailored to batch learning algorithms that can be implemented with one pass over the training dataset.
    Diagnosing failures of fairness transfer across distribution shift in real-world medical settings. (arXiv:2202.01034v2 [cs.LG] UPDATED)
    Diagnosing and mitigating changes in model fairness under distribution shift is an important component of the safe deployment of machine learning in healthcare settings. Importantly, the success of any mitigation strategy strongly depends on the structure of the shift. Despite this, there has been little discussion of how to empirically assess the structure of a distribution shift that one is encountering in practice. In this work, we adopt a causal framing to motivate conditional independence tests as a key tool for characterizing distribution shifts. Using our approach in two medical applications, we show that this knowledge can help diagnose failures of fairness transfer, including cases where real-world shifts are more complex than is often assumed in the literature. Based on these results, we discuss potential remedies at each step of the machine learning pipeline.
    Minimax Instrumental Variable Regression and $L_2$ Convergence Guarantees without Identification or Closedness. (arXiv:2302.05404v1 [stat.ML])
    In this paper, we study nonparametric estimation of instrumental variable (IV) regressions. Recently, many flexible machine learning methods have been developed for instrumental variable estimation. However, these methods have at least one of the following limitations: (1) restricting the IV regression to be uniquely identified; (2) only obtaining estimation error rates in terms of pseudometrics (\emph{e.g.,} projected norm) rather than valid metrics (\emph{e.g.,} $L_2$ norm); or (3) imposing the so-called closedness condition that requires a certain conditional expectation operator to be sufficiently smooth. In this paper, we present the first method and analysis that can avoid all three limitations, while still permitting general function approximation. Specifically, we propose a new penalized minimax estimator that can converge to a fixed IV solution even when there are multiple solutions, and we derive a strong $L_2$ error rate for our estimator under lax conditions. Notably, this guarantee only needs a widely-used source condition and realizability assumptions, but not the so-called closedness condition. We argue that the source condition and the closedness condition are inherently conflicting, so relaxing the latter significantly improves upon the existing literature that requires both conditions. Our estimator can achieve this improvement because it builds on a novel formulation of the IV estimation problem as a constrained optimization problem.
    On Penalty-based Bilevel Gradient Descent Method. (arXiv:2302.05185v1 [cs.LG])
    Bilevel optimization enjoys a wide range of applications in hyper-parameter optimization, meta-learning and reinforcement learning. However, bilevel optimization problems are difficult to solve. Recent progress on scalable bilevel algorithms mainly focuses on bilevel optimization problems where the lower-level objective is either strongly convex or unconstrained. In this work, we tackle the bilevel problem through the lens of the penalty method. We show that under certain conditions, the penalty reformulation recovers the solutions of the original bilevel problem. Further, we propose the penalty-based bilevel gradient descent (PBGD) algorithm and establish its finite-time convergence for the constrained bilevel problem without lower-level strong convexity. Experiments showcase the efficiency of the proposed PBGD algorithm.
    Black-Box Generalization: Stability of Zeroth-Order Learning. (arXiv:2202.06880v2 [cs.LG] UPDATED)
    We provide the first generalization error analysis for black-box learning through derivative-free optimization. Under the assumption of a Lipschitz and smooth unknown loss, we consider the Zeroth-order Stochastic Search (ZoSS) algorithm, that updates a $d$-dimensional model by replacing stochastic gradient directions with stochastic differences of $K+1$ perturbed loss evaluations per dataset (example) query. For both unbounded and bounded possibly nonconvex losses, we present the first generalization bounds for the ZoSS algorithm. These bounds coincide with those for SGD, and rather surprisingly are independent of $d$, $K$ and the batch size $m$, under appropriate choices of a slightly decreased learning rate. For bounded nonconvex losses and a batch size $m=1$, we additionally show that both generalization error and learning rate are independent of $d$ and $K$, and remain essentially the same as for the SGD, even for two function evaluations. Our results extensively extend and consistently recover established results for SGD in prior work, on both generalization bounds and corresponding learning rates. If additionally $m=n$, where $n$ is the dataset size, we derive generalization guarantees for full-batch GD as well.  ( 2 min )
    Causal Inference out of Control: Estimating the Steerability of Consumption. (arXiv:2302.04989v1 [cs.LG])
    Regulators and academics are increasingly interested in the causal effect that algorithmic actions of a digital platform have on consumption. We introduce a general causal inference problem we call the steerability of consumption that abstracts many settings of interest. Focusing on observational designs and exploiting the structure of the problem, we exhibit a set of assumptions for causal identifiability that significantly weaken the often unrealistic overlap assumptions of standard designs. The key novelty of our approach is to explicitly model the dynamics of consumption over time, viewing the platform as a controller acting on a dynamical system. From this dynamical systems perspective, we are able to show that exogenous variation in consumption and appropriately responsive algorithmic control actions are sufficient for identifying steerability of consumption. Our results illustrate the fruitful interplay of control theory and causal inference, which we illustrate with examples from econometrics, macroeconomics, and machine learning.  ( 2 min )
    Communication-Efficient Federated Hypergradient Computation via Aggregated Iterative Differentiation. (arXiv:2302.04969v1 [cs.LG])
    Federated bilevel optimization has attracted increasing attention due to emerging machine learning and communication applications. The biggest challenge lies in computing the gradient of the upper-level objective function (i.e., hypergradient) in the federated setting due to the nonlinear and distributed construction of a series of global Hessian matrices. In this paper, we propose a novel communication-efficient federated hypergradient estimator via aggregated iterative differentiation (AggITD). AggITD is simple to implement and significantly reduces the communication cost by conducting the federated hypergradient estimation and the lower-level optimization simultaneously. We show that the proposed AggITD-based algorithm achieves the same sample complexity as existing approximate implicit differentiation (AID)-based approaches with much fewer communication rounds in the presence of data heterogeneity. Our results also shed light on the great advantage of ITD over AID in the federated/distributed hypergradient estimation. This differs from the comparison in the non-distributed bilevel optimization, where ITD is less efficient than AID. Our extensive experiments demonstrate the great effectiveness and communication efficiency of the proposed method.  ( 2 min )
    Towards Minimax Optimality of Model-based Robust Reinforcement Learning. (arXiv:2302.05372v1 [cs.LG])
    We study the sample complexity of obtaining an $\epsilon$-optimal policy in \emph{Robust} discounted Markov Decision Processes (RMDPs), given only access to a generative model of the nominal kernel. This problem is widely studied in the non-robust case, and it is known that any planning approach applied to an empirical MDP estimated with $\tilde{\mathcal{O}}(\frac{H^3 \mid S \mid\mid A \mid}{\epsilon^2})$ samples provides an $\epsilon$-optimal policy, which is minimax optimal. Results in the robust case are much more scarce. For $sa$- (resp $s$-)rectangular uncertainty sets, the best known sample complexity is $\tilde{\mathcal{O}}(\frac{H^4 \mid S \mid^2\mid A \mid}{\epsilon^2})$ (resp. $\tilde{\mathcal{O}}(\frac{H^4 \mid S \mid^2\mid A \mid^2}{\epsilon^2})$), for specific algorithms and when the uncertainty set is based on the total variation (TV), the KL or the Chi-square divergences. In this paper, we consider uncertainty sets defined with an $L_p$-ball (recovering the TV case), and study the sample complexity of \emph{any} planning algorithm (with high accuracy guarantee on the solution) applied to an empirical RMDP estimated using the generative model. In the general case, we prove a sample complexity of $\tilde{\mathcal{O}}(\frac{H^4 \mid S \mid\mid A \mid}{\epsilon^2})$ for both the $sa$- and $s$-rectangular cases (improvements of $\mid S \mid$ and $\mid S \mid\mid A \mid$ respectively). When the size of the uncertainty is small enough, we improve the sample complexity to $\tilde{\mathcal{O}}(\frac{H^3 \mid S \mid\mid A \mid }{\epsilon^2})$, recovering the lower-bound for the non-robust case for the first time and a robust lower-bound when the size of the uncertainty is small enough.  ( 2 min )
    Oracle-Efficient Smoothed Online Learning for Piecewise Continuous Decision Making. (arXiv:2302.05430v1 [stat.ML])
    Smoothed online learning has emerged as a popular framework to mitigate the substantial loss in statistical and computational complexity that arises when one moves from classical to adversarial learning. Unfortunately, for some spaces, it has been shown that efficient algorithms suffer an exponentially worse regret than that which is minimax optimal, even when the learner has access to an optimization oracle over the space. To mitigate that exponential dependence, this work introduces a new notion of complexity, the generalized bracketing numbers, which marries constraints on the adversary to the size of the space, and shows that an instantiation of Follow-the-Perturbed-Leader can attain low regret with the number of calls to the optimization oracle scaling optimally with respect to average regret. We then instantiate our bounds in several problems of interest, including online prediction and planning of piecewise continuous functions, which has many applications in fields as diverse as econometrics and robotics.  ( 2 min )
    Achieving Linear Speedup in Non-IID Federated Bilevel Learning. (arXiv:2302.05412v1 [cs.LG])
    Federated bilevel optimization has received increasing attention in various emerging machine learning and communication applications. Recently, several Hessian-vector-based algorithms have been proposed to solve the federated bilevel optimization problem. However, several important properties in federated learning such as the partial client participation and the linear speedup for convergence (i.e., the convergence rate and complexity are improved linearly with respect to the number of sampled clients) in the presence of non-i.i.d.~datasets, still remain open. In this paper, we fill these gaps by proposing a new federated bilevel algorithm named FedMBO with a novel client sampling scheme in the federated hypergradient estimation. We show that FedMBO achieves a convergence rate of $\mathcal{O}\big(\frac{1}{\sqrt{nK}}+\frac{1}{K}+\frac{\sqrt{n}}{K^{3/2}}\big)$ on non-i.i.d.~datasets, where $n$ is the number of participating clients in each round, and $K$ is the total number of iteration. This is the first theoretical linear speedup result for non-i.i.d.~federated bilevel optimization. Extensive experiments validate our theoretical results and demonstrate the effectiveness of our proposed method.  ( 2 min )
    Star-Shaped Denoising Diffusion Probabilistic Models. (arXiv:2302.05259v1 [stat.ML])
    Methods based on Denoising Diffusion Probabilistic Models (DDPM) became a ubiquitous tool in generative modeling. However, they are mostly limited to Gaussian and discrete diffusion processes. We propose Star-Shaped Denoising Diffusion Probabilistic Models (SS-DDPM), a model with a non-Markovian diffusion-like noising process. In the case of Gaussian distributions, this model is equivalent to Markovian DDPMs. However, it can be defined and applied with arbitrary noising distributions, and admits efficient training and sampling algorithms for a wide range of distributions that lie in the exponential family. We provide a simple recipe for designing diffusion-like models with distributions like Beta, von Mises--Fisher, Dirichlet, Wishart and others, which can be especially useful when data lies on a constrained manifold such as the unit sphere, the space of positive semi-definite matrices, the probabilistic simplex, etc. We evaluate the model in different settings and find it competitive even on image data, where Beta SS-DDPM achieves results comparable to a Gaussian DDPM.  ( 2 min )
    Hierarchical classification at multiple operating points. (arXiv:2210.10929v2 [cs.LG] UPDATED)
    Many classification problems consider classes that form a hierarchy. Classifiers that are aware of this hierarchy may be able to make confident predictions at a coarse level despite being uncertain at the fine-grained level. While it is generally possible to vary the granularity of predictions using a threshold at inference time, most contemporary work considers only leaf-node prediction, and almost no prior work has compared methods at multiple operating points. We present an efficient algorithm to produce operating characteristic curves for any method that assigns a score to every class in the hierarchy. Applying this technique to evaluate existing methods reveals that top-down classifiers are dominated by a naive flat softmax classifier across the entire operating range. We further propose two novel loss functions and show that a soft variant of the structured hinge loss is able to significantly outperform the flat baseline. Finally, we investigate the poor accuracy of top-down classifiers and demonstrate that they perform relatively well on unseen classes. Code is available online at https://github.com/jvlmdr/hiercls.
    Interventional Causal Representation Learning. (arXiv:2209.11924v3 [stat.ML] UPDATED)
    Causal representation learning seeks to extract high-level latent factors from low-level sensory data. Most existing methods rely on observational data and structural assumptions (e.g., conditional independence) to identify the latent factors. However, interventional data is prevalent across applications. Can interventional data facilitate causal representation learning? We explore this question in this paper. The key observation is that interventional data often carries geometric signatures of the latent factors' support (i.e. what values each latent can possibly take). For example, when the latent factors are causally connected, interventions can break the dependency between the intervened latents' support and their ancestors'. Leveraging this fact, we prove that the latent causal factors can be identified up to permutation and scaling given data from perfect $do$ interventions. Moreover, we can achieve block affine identification, namely the estimated latent factors are only entangled with a few other latents if we have access to data from imperfect interventions. These results highlight the unique power of interventional data in causal representation learning; they can enable provable identification of latent factors without any assumptions about their distributions or dependency structure.  ( 2 min )
    Effects of noise on the overparametrization of quantum neural networks. (arXiv:2302.05059v1 [quant-ph])
    Overparametrization is one of the most surprising and notorious phenomena in machine learning. Recently, there have been several efforts to study if, and how, Quantum Neural Networks (QNNs) acting in the absence of hardware noise can be overparametrized. In particular, it has been proposed that a QNN can be defined as overparametrized if it has enough parameters to explore all available directions in state space. That is, if the rank of the Quantum Fisher Information Matrix (QFIM) for the QNN's output state is saturated. Here, we explore how the presence of noise affects the overparametrization phenomenon. Our results show that noise can "turn on" previously-zero eigenvalues of the QFIM. This enables the parametrized state to explore directions that were otherwise inaccessible, thus potentially turning an overparametrized QNN into an underparametrized one. For small noise levels, the QNN is quasi-overparametrized, as large eigenvalues coexists with small ones. Then, we prove that as the magnitude of noise increases all the eigenvalues of the QFIM become exponentially suppressed, indicating that the state becomes insensitive to any change in the parameters. As such, there is a pull-and-tug effect where noise can enable new directions, but also suppress the sensitivity to parameter updates. Finally, our results imply that current QNN capacity measures are ill-defined when hardware noise is present.
    The out-of-sample $R^2$: estimation and inference. (arXiv:2302.05131v1 [stat.ME])
    Out-of-sample prediction is the acid test of predictive models, yet an independent test dataset is often not available for assessment of the prediction error. For this reason, out-of-sample performance is commonly estimated using data splitting algorithms such as cross-validation or the bootstrap. For quantitative outcomes, the ratio of variance explained to total variance can be summarized by the coefficient of determination or in-sample $R^2$, which is easy to interpret and to compare across different outcome variables. As opposed to the in-sample $R^2$, the out-of-sample $R^2$ has not been well defined and the variability on the out-of-sample $\hat{R}^2$ has been largely ignored. Usually only its point estimate is reported, hampering formal comparison of predictability of different outcome variables. Here we explicitly define the out-of-sample $R^2$ as a comparison of two predictive models, provide an unbiased estimator and exploit recent theoretical advances on uncertainty of data splitting estimates to provide a standard error for the $\hat{R}^2$. The performance of the estimators for the $R^2$ and its standard error are investigated in a simulation study. We demonstrate our new method by constructing confidence intervals and comparing models for prediction of quantitative $\text{Brassica napus}$ and $\text{Zea mays}$ phenotypes based on gene expression data.  ( 2 min )
    Tensor Generalized Canonical Correlation Analysis. (arXiv:2302.05277v1 [stat.ML])
    Regularized Generalized Canonical Correlation Analysis (RGCCA) is a general statistical framework for multi-block data analysis. RGCCA enables deciphering relationships between several sets of variables and subsumes many well-known multivariate analysis methods as special cases. However, RGCCA only deals with vector-valued blocks, disregarding their possible higher-order structures. This paper presents Tensor GCCA (TGCCA), a new method for analyzing higher-order tensors with canonical vectors admitting an orthogonal rank-R CP decomposition. Moreover, two algorithms for TGCCA, based on whether a separable covariance structure is imposed or not, are presented along with convergence guarantees. The efficiency and usefulness of TGCCA are evaluated on simulated and real data and compared favorably to state-of-the-art approaches.  ( 2 min )
    Latent State Marginalization as a Low-cost Approach for Improving Exploration. (arXiv:2210.00999v2 [cs.LG] UPDATED)
    While the maximum entropy (MaxEnt) reinforcement learning (RL) framework -- often touted for its exploration and robustness capabilities -- is usually motivated from a probabilistic perspective, the use of deep probabilistic models has not gained much traction in practice due to their inherent complexity. In this work, we propose the adoption of latent variable policies within the MaxEnt framework, which we show can provably approximate any policy distribution, and additionally, naturally emerges under the use of world models with a latent belief state. We discuss why latent variable policies are difficult to train, how naive approaches can fail, then subsequently introduce a series of improvements centered around low-cost marginalization of the latent state, allowing us to make full use of the latent state at minimal additional cost. We instantiate our method under the actor-critic framework, marginalizing both the actor and critic. The resulting algorithm, referred to as Stochastic Marginal Actor-Critic (SMAC), is simple yet effective. We experimentally validate our method on continuous control tasks, showing that effective marginalization can lead to better exploration and more robust training. Our implementation is open sourced at https://github.com/zdhNarsil/Stochastic-Marginal-Actor-Critic.  ( 2 min )
    The Monge Gap: A Regularizer to Learn All Transport Maps. (arXiv:2302.04953v1 [cs.LG])
    Optimal transport (OT) theory has been been used in machine learning to study and characterize maps that can push-forward efficiently a probability measure onto another. Recent works have drawn inspiration from Brenier's theorem, which states that when the ground cost is the squared-Euclidean distance, the ``best'' map to morph a continuous measure in $\mathcal{P}(\Rd)$ into another must be the gradient of a convex function. To exploit that result, [Makkuva+ 2020, Korotin+2020] consider maps $T=\nabla f_\theta$, where $f_\theta$ is an input convex neural network (ICNN), as defined by Amos+2017, and fit $\theta$ with SGD using samples. Despite their mathematical elegance, fitting OT maps with ICNNs raises many challenges, due notably to the many constraints imposed on $\theta$; the need to approximate the conjugate of $f_\theta$; or the limitation that they only work for the squared-Euclidean cost. More generally, we question the relevance of using Brenier's result, which only applies to densities, to constrain the architecture of candidate maps fitted on samples. Motivated by these limitations, we propose a radically different approach to estimating OT maps: Given a cost $c$ and a reference measure $\rho$, we introduce a regularizer, the Monge gap $\mathcal{M}^c_{\rho}(T)$ of a map $T$. That gap quantifies how far a map $T$ deviates from the ideal properties we expect from a $c$-OT map. In practice, we drop all architecture requirements for $T$ and simply minimize a distance (e.g., the Sinkhorn divergence) between $T\sharp\mu$ and $\nu$, regularized by $\mathcal{M}^c_\rho(T)$. We study $\mathcal{M}^c_{\rho}$, and show how our simple pipeline outperforms significantly other baselines in practice.  ( 2 min )
    The Survival Bandit Problem. (arXiv:2206.03019v3 [cs.LG] UPDATED)
    We study the survival bandit problem, a variant of the multi-armed bandit problem with a constraint on the cumulative reward; at each time step, the agent receives a reward in [-1, 1] and if the cumulative reward becomes lower than a preset threshold, the procedure stops, and this phenomenon is called ruin. To our knowledge, this is the first paper studying a framework where the ruin might occur but not always. We first discuss that no policy can achieve a sublinear regret as defined in the standard multi-armed bandit problem, because a single pull of an arm may increase significantly the risk of ruin. Instead, we establish the framework of Pareto-optimal policies, which is a class of policies whose cumulative reward for some instance cannot be improved without sacrificing that for another instance. To this end, we provide tight lower bounds on the probability of ruin, as well as matching policies called EXPLOIT. Finally, using a doubling trick over an EXPLOIT policy, we display a Pareto-optimal policy in the case of {-1, 0, 1} rewards, giving an answer to the open problem by Perotto et al. (2019).  ( 2 min )
    Differentially Private Fr\'echet Mean on the Manifold of Symmetric Positive Definite (SPD) Matrices with log-Euclidean Metric. (arXiv:2208.04245v2 [math.ST] UPDATED)
    Differential privacy has become crucial in the real-world deployment of statistical and machine learning algorithms with rigorous privacy guarantees. The earliest statistical queries, for which differential privacy mechanisms have been developed, were for the release of the sample mean. In Geometric Statistics, the sample Fr\'echet mean represents one of the most fundamental statistical summaries, as it generalizes the sample mean for data belonging to nonlinear manifolds. In that spirit, the only geometric statistical query for which a differential privacy mechanism has been developed, so far, is for the release of the sample Fr\'echet mean: the \emph{Riemannian Laplace mechanism} was recently proposed to privatize the Fr\'echet mean on complete Riemannian manifolds. In many fields, the manifold of Symmetric Positive Definite (SPD) matrices is used to model data spaces, including in medical imaging where privacy requirements are key. We propose a novel, simple and fast mechanism - the \emph{tangent Gaussian mechanism} - to compute a differentially private Fr\'echet mean on the SPD manifold endowed with the log-Euclidean Riemannian metric. We show that our new mechanism has significantly better utility and is computationally efficient -- as confirmed by extensive experiments.  ( 2 min )
    Efficient and Accurate Learning of Mixtures of Plackett-Luce Models. (arXiv:2302.05343v1 [cs.LG])
    Mixture models of Plackett-Luce (PL) -- one of the most fundamental ranking models -- are an active research area of both theoretical and practical significance. Most previously proposed parameter estimation algorithms instantiate the EM algorithm, often with random initialization. However, such an initialization scheme may not yield a good initial estimate and the algorithms require multiple restarts, incurring a large time complexity. As for the EM procedure, while the E-step can be performed efficiently, maximizing the log-likelihood in the M-step is difficult due to the combinatorial nature of the PL likelihood function (Gormley and Murphy 2008). Therefore, previous authors favor algorithms that maximize surrogate likelihood functions (Zhao et al. 2018, 2020). However, the final estimate may deviate from the true maximum likelihood estimate as a consequence. In this paper, we address these known limitations. We propose an initialization algorithm that can provide a provably accurate initial estimate and an EM algorithm that maximizes the true log-likelihood function efficiently. Experiments on both synthetic and real datasets show that our algorithm is competitive in terms of accuracy and speed to baseline algorithms, especially on datasets with a large number of items.  ( 2 min )
    DNArch: Learning Convolutional Neural Architectures by Backpropagation. (arXiv:2302.05400v1 [cs.LG])
    We present Differentiable Neural Architectures (DNArch), a method that jointly learns the weights and the architecture of Convolutional Neural Networks (CNNs) by backpropagation. In particular, DNArch allows learning (i) the size of convolutional kernels at each layer, (ii) the number of channels at each layer, (iii) the position and values of downsampling layers, and (iv) the depth of the network. To this end, DNArch views neural architectures as continuous multidimensional entities, and uses learnable differentiable masks along each dimension to control their size. Unlike existing methods, DNArch is not limited to a predefined set of possible neural components, but instead it is able to discover entire CNN architectures across all combinations of kernel sizes, widths, depths and downsampling. Empirically, DNArch finds performant CNN architectures for several classification and dense prediction tasks on both sequential and image data. When combined with a loss term that considers the network complexity, DNArch finds powerful architectures that respect a predefined computational budget.  ( 2 min )
    Differentially Private Optimization for Smooth Nonconvex ERM. (arXiv:2302.04972v1 [cs.LG])
    We develop simple differentially private optimization algorithms that move along directions of (expected) descent to find an approximate second-order solution for nonconvex ERM. We use line search, mini-batching, and a two-phase strategy to improve the speed and practicality of the algorithm. Numerical experiments demonstrate the effectiveness of these approaches.  ( 2 min )
    Quadratic Memory is Necessary for Optimal Query Complexity in Convex Optimization: Center-of-Mass is Pareto-Optimal. (arXiv:2302.04963v1 [cs.LG])
    We give query complexity lower bounds for convex optimization and the related feasibility problem. We show that quadratic memory is necessary to achieve the optimal oracle complexity for first-order convex optimization. In particular, this shows that center-of-mass cutting-planes algorithms in dimension $d$ which use $\tilde O(d^2)$ memory and $\tilde O(d)$ queries are Pareto-optimal for both convex optimization and the feasibility problem, up to logarithmic factors. Precisely, we prove that to minimize $1$-Lipschitz convex functions over the unit ball to $1/d^4$ accuracy, any deterministic first-order algorithms using at most $d^{2-\delta}$ bits of memory must make $\tilde\Omega(d^{1+\delta/3})$ queries, for any $\delta\in[0,1]$. For the feasibility problem, in which an algorithm only has access to a separation oracle, we show a stronger trade-off: for at most $d^{2-\delta}$ memory, the number of queries required is $\tilde\Omega(d^{1+\delta})$. This resolves a COLT 2019 open problem of Woodworth and Srebro.  ( 2 min )
    Beyond Lipschitz: Sharp Generalization and Excess Risk Bounds for Full-Batch GD. (arXiv:2204.12446v5 [stat.ML] UPDATED)
    We provide sharp path-dependent generalization and excess risk guarantees for the full-batch Gradient Descent (GD) algorithm on smooth losses (possibly non-Lipschitz, possibly nonconvex). At the heart of our analysis is an upper bound on the generalization error, which implies that average output stability and a bounded expected optimization error at termination lead to generalization. This result shows that a small generalization error occurs along the optimization path, and allows us to bypass Lipschitz or sub-Gaussian assumptions on the loss prevalent in previous works. For nonconvex, convex, and strongly convex losses, we show the explicit dependence of the generalization error in terms of the accumulated path-dependent optimization error, terminal optimization error, number of samples, and number of iterations. For nonconvex smooth losses, we prove that full-batch GD efficiently generalizes close to any stationary point at termination, and recovers the generalization error guarantees of stochastic algorithms with fewer assumptions. For smooth convex losses, we show that the generalization error is tighter than existing bounds for SGD (up to one order of error magnitude). Consequently the excess risk matches that of SGD for quadratically less iterations. Lastly, for strongly convex smooth losses, we show that full-batch GD achieves essentially the same excess risk rate as compared with the state of the art on SGD, but with an exponentially smaller number of iterations (logarithmic in the dataset size).  ( 2 min )
    Unbinned Profiled Unfolding. (arXiv:2302.05390v1 [hep-ph])
    Unfolding is an important procedure in particle physics experiments which corrects for detector effects and provides differential cross section measurements that can be used for a number of downstream tasks, such as extracting fundamental physics parameters. Traditionally, unfolding is done by discretizing the target phase space into a finite number of bins and is limited in the number of unfolded variables. Recently, there have been a number of proposals to perform unbinned unfolding with machine learning. However, none of these methods (like most unfolding methods) allow for simultaneously constraining (profiling) nuisance parameters. We propose a new machine learning-based unfolding method that results in an unbinned differential cross section and can profile nuisance parameters. The machine learning loss function is the full likelihood function, based on binned inputs at detector-level. We first demonstrate the method with simple Gaussian examples and then show the impact on a simulated Higgs boson cross section measurement.  ( 2 min )
    Stochastic Multiple Target Sampling Gradient Descent. (arXiv:2206.01934v4 [cs.LG] UPDATED)
    Sampling from an unnormalized target distribution is an essential problem with many applications in probabilistic inference. Stein Variational Gradient Descent (SVGD) has been shown to be a powerful method that iteratively updates a set of particles to approximate the distribution of interest. Furthermore, when analysing its asymptotic properties, SVGD reduces exactly to a single-objective optimization problem and can be viewed as a probabilistic version of this single-objective optimization problem. A natural question then arises: "Can we derive a probabilistic version of the multi-objective optimization?". To answer this question, we propose Stochastic Multiple Target Sampling Gradient Descent (MT-SGD), enabling us to sample from multiple unnormalized target distributions. Specifically, our MT-SGD conducts a flow of intermediate distributions gradually orienting to multiple target distributions, which allows the sampled particles to move to the joint high-likelihood region of the target distributions. Interestingly, the asymptotic analysis shows that our approach reduces exactly to the multiple-gradient descent algorithm for multi-objective optimization, as expected. Finally, we conduct comprehensive experiments to demonstrate the merit of our approach to multi-task learning.  ( 2 min )
    A Second-Order Method for Stochastic Bandit Convex Optimisation. (arXiv:2302.05371v1 [cs.LG])
    We introduce a simple and efficient algorithm for unconstrained zeroth-order stochastic convex bandits and prove its regret is at most $(1 + r/d)[d^{1.5} \sqrt{n} + d^3] polylog(n, d, r)$ where $n$ is the horizon, $d$ the dimension and $r$ is the radius of a known ball containing the minimiser of the loss.  ( 2 min )
    Gaussian Pre-Activations in Neural Networks: Myth or Reality?. (arXiv:2205.12379v3 [cs.LG] UPDATED)
    The study of feature propagation at initialization in neural networks lies at the root of numerous initialization designs. An assumption very commonly made in the field states that the pre-activations are Gaussian. Although this convenient Gaussian hypothesis can be justified when the number of neurons per layer tends to infinity, it is challenged by both theoretical and experimental works for finite-width neural networks. Our major contribution is to construct a family of pairs of activation functions and initialization distributions that ensure that the pre-activations remain Gaussian throughout the network's depth, even in narrow neural networks. In the process, we discover a set of constraints that a neural network should fulfill to ensure Gaussian pre-activations. Additionally, we provide a critical review of the claims of the Edge of Chaos line of works and build an exact Edge of Chaos analysis. We also propose a unified view on pre-activations propagation, encompassing the framework of several well-known initialization procedures. Finally, our work provides a principled framework for answering the much-debated question: is it desirable to initialize the training of a neural network whose pre-activations are ensured to be Gaussian?  ( 2 min )
    Hessian Based Smoothing Splines for Manifold Learning. (arXiv:2302.05025v1 [stat.ML])
    We propose a multidimensional smoothing spline algorithm in the context of manifold learning. We generalize the bending energy penalty of thin-plate splines to a quadratic form on the Sobolev space of a flat manifold, based on the Frobenius norm of the Hessian matrix. This leads to a natural definition of smoothing splines on manifolds, which minimizes square error while optimizing a global curvature penalty. The existence and uniqueness of the solution is shown by applying the theory of reproducing kernel Hilbert spaces. The minimizer is expressed as a combination of Green's functions for the biharmonic operator, and 'linear' functions of everywhere vanishing Hessian. Furthermore, we utilize the Hessian estimation procedure from the Hessian Eigenmaps algorithm to approximate the spline loss when the true manifold is unknown. This yields a particularly simple quadratic optimization algorithm for smoothing response values without needing to fit the underlying manifold. Analysis of asymptotic error and robustness are given, as well as discussion of out-of-sample prediction methods and applications.  ( 2 min )
    On Average-Case Error Bounds for Kernel-Based Bayesian Quadrature. (arXiv:2202.10615v2 [stat.ML] UPDATED)
    In this paper, we study error bounds for {\em Bayesian quadrature} (BQ), with an emphasis on noisy settings, randomized algorithms, and average-case performance measures. We seek to approximate the integral of functions in a {\em Reproducing Kernel Hilbert Space} (RKHS), particularly focusing on the Mat\'ern-$\nu$ and squared exponential (SE) kernels, with samples from the function potentially being corrupted by Gaussian noise. We provide a two-step meta-algorithm that serves as a general tool for relating the average-case quadrature error with the $L^2$-function approximation error. When specialized to the Mat\'ern kernel, we recover an existing near-optimal error rate while avoiding the existing method of repeatedly sampling points. When specialized to other settings, we obtain new average-case results for settings including the SE kernel with noise and the Mat\'ern kernel with misspecification. Finally, we present algorithm-independent lower bounds that have greater generality and/or give distinct proofs compared to existing ones.  ( 2 min )
    MoreauGrad: Sparse and Robust Interpretation of Neural Networks via Moreau Envelope. (arXiv:2302.05294v1 [cs.CV])
    Explaining the predictions of deep neural nets has been a topic of great interest in the computer vision literature. While several gradient-based interpretation schemes have been proposed to reveal the influential variables in a neural net's prediction, standard gradient-based interpretation frameworks have been commonly observed to lack robustness to input perturbations and flexibility for incorporating prior knowledge of sparsity and group-sparsity structures. In this work, we propose MoreauGrad as an interpretation scheme based on the classifier neural net's Moreau envelope. We demonstrate that MoreauGrad results in a smooth and robust interpretation of a multi-layer neural network and can be efficiently computed through first-order optimization methods. Furthermore, we show that MoreauGrad can be naturally combined with $L_1$-norm regularization techniques to output a sparse or group-sparse explanation which are prior conditions applicable to a wide range of deep learning applications. We empirically evaluate the proposed MoreauGrad scheme on standard computer vision datasets, showing the qualitative and quantitative success of the MoreauGrad approach in comparison to standard gradient-based interpretation methods.  ( 2 min )
    Information Theoretic Lower Bounds for Information Theoretic Upper Bounds. (arXiv:2302.04925v1 [cs.LG])
    We examine the relationship between the mutual information between the output model and the empirical sample and the generalization of the algorithm in the context of stochastic convex optimization. Despite increasing interest in information-theoretic generalization bounds, it is uncertain if these bounds can provide insight into the exceptional performance of various learning algorithms. Our study of stochastic convex optimization reveals that, for true risk minimization, dimension-dependent mutual information is necessary. This indicates that existing information-theoretic generalization bounds fall short in capturing the generalization capabilities of algorithms like SGD and regularized ERM, which have dimension-independent sample complexity.  ( 2 min )
    Gaussian Process-Gated Hierarchical Mixtures of Experts. (arXiv:2302.04947v1 [cs.LG])
    In this paper, we propose novel Gaussian process-gated hierarchical mixtures of experts (GPHMEs) that are used for building gates and experts. Unlike in other mixtures of experts where the gating models are linear to the input, the gating functions of our model are inner nodes built with Gaussian processes based on random features that are non-linear and non-parametric. Further, the experts are also built with Gaussian processes and provide predictions that depend on test data. The optimization of the GPHMEs is carried out by variational inference. There are several advantages of the proposed GPHMEs. One is that they outperform tree-based HME benchmarks that partition the data in the input space. Another advantage is that they achieve good performance with reduced complexity. A third advantage of the GPHMEs is that they provide interpretability of deep Gaussian processes and more generally of deep Bayesian neural networks. Our GPHMEs demonstrate excellent performance for large-scale data sets even with quite modest sizes.  ( 2 min )

  • Open

    Ai blend of will smith and will ferrel
    submitted by /u/Piano-Nerd [link] [comments]  ( 40 min )
    Weekly China AI News: Chinese Tech Firms Embrace ChatGPT Trend; Alibaba DAMO Academy Increases Registered Capital; Get to Know the 14 New AI Unicorns in 2022
    submitted by /u/trcytony [link] [comments]  ( 40 min )
    80s aesthetic and ai uncanny accuracy.
    How is ai capturing 80s aesthetic better than almost everybody else (humans)?! If you look up 80s aesthetic on google or youtube you always get the same sort of synthwave type art-style which although reminiscent to 80s aesthetics, it ends up being overly bright colors with bright neon etc. ai seems to actually capture the 80s aesthetics perfectly even without a drop of neon! if you watch the so and so as an 80s fantasy/sci fi on youtube you can get exactly what I mean. it looks just like a shot from an 80s action movie. Does anyone actually know why the ai will go for this style instead of the pseudo synthwave 80s aesthetic that has been popularized? is this just a byproduct of art breeder itself? I haven't checked to see if other ai is doing it also. Here are some pictures of actual 80s and early 90s movies to see what I mean. https://preview.redd.it/1ent8icky0ia1.png?width=1524&format=png&auto=webp&s=c2262e9e5b16a45e81fe4c1933bb32321c3adea6 https://preview.redd.it/d52fp69ay0ia1.png?width=1280&format=png&auto=webp&s=052d984a558606754a462e581fff0d631f38a787 https://preview.redd.it/fx1jh2alx0ia1.png?width=2800&format=png&auto=webp&s=491388b6bbccccb6756b7acf23f165c52660abc7 https://preview.redd.it/e7gav84rx0ia1.png?width=600&format=png&auto=webp&s=7dc41e5a6339f7e0f340d663454a462f4a53e6c7 https://preview.redd.it/a77210wsx0ia1.png?width=691&format=png&auto=webp&s=214ffe76580b4eef9db249e60953c55f68163fa0 submitted by /u/The_Philosopher_Guy [link] [comments]  ( 41 min )
    SUMMARIZING EARLY BING AI REVIEWS
    HEY ALL-- PUT TOGETHER A SUMMARY OF THE EARLY ACCESS BING AI REVIEWS AND COMMENTARY. THIS WAS FOR OUR 3X PER WEEK NEWSLETTER, SMOKING ROBOT. FULL THING HERE-- FEEL FREE TO SUB BUT NOT EXPECTATION TO DO SO, JUST TRYING TO BRING SOME #VALUE HERE: https://smokingrobot.beehiiv.com/p/bing-better-than-google ​ ------------- ​ The new AI Bing is out in the wild. Trusted testers and media folks got access to the new tool over the weekend, and some of the early reviews are quite good. Here’s the good, the bad, the ugly, and the sentient from the previews of new Bing. The Good Most notably, CNET compared results from new Bing AI and regular Google search on 10 queries— 8 of the 10 times the author preferred the Bing result. Both sites did a fine job displaying accurate search results,…  ( 47 min )
    Lol. I made Elon musk believe in God.
    submitted by /u/No-Factor2579 [link] [comments]  ( 40 min )
    AI Dream 157 - MASTERPIECE - PART 7 TEASER - 2K SUBS CELEBRATION! 🥳🎉 - A...
    submitted by /u/LordPewPew777 [link] [comments]  ( 40 min )
    Google’s New AI: The Age of AI-Made Movies Is Here!
    submitted by /u/MarS_0ne [link] [comments]  ( 6 min )
    Get Access to the Latest AI Tools For Free and Learn How to Use Them
    I wanted to showcase a new initiative from AINow that is designed to help you explore the full potential of AI. With so many AI tools on the market, it can be overwhelming and expensive to determine which ones are right for you. That's why we've decided to take AI education to the next level and give our subscribers access to the latest AI tools, all for free! Here's how it works: At the start of each week, our subscribers will vote on the tools they would like to try out and learn how to use. Based on the votes, we will get in touch with the company and secure free access for our subscribers. Then, throughout the rest of the week, we will be providing in-depth tutorials and tips on how to best use the tool. This way, you can try out multiple AI tools, determine which ones work best for your projects, and find your fit in the AI world without having to pay a monthly fee. At AINow, we believe that AI should be accessible and understandable to everyone. That's why we're removing the barriers that may have held you back from exploring the full potential of AI. Our expert insights will help you get the most out of each tool and give you the knowledge you need to succeed. Whether you're just starting out or are looking to expand your AI skills, AINow is your one-stop-shop for cutting-edge AI education. Sign up here if you are interested in becoming apart of this initiative. To those that made it this far: thank you for taking the time to read this. If you have any questions about this, leave them in the comments and I will try my best to answer them. Best regards, The AINow Team submitted by /u/Flaky_Preparation_50 [link] [comments]  ( 43 min )
    Top 7 Best GPT3 Tools (for Content Marketers)
    submitted by /u/Chisom1998_ [link] [comments]  ( 40 min )
    Voice + Video deepfake by AI is used for scams now...
    submitted by /u/ChaosMindsDev [link] [comments]  ( 40 min )
    Which virtual tasks ai can't take yet?
    Only talking about digital tasks and not mean bad ai. Which virtual tasks ai can't take in general right now and is also not in progress? submitted by /u/ComfyHikiandNeet [link] [comments]  ( 40 min )
    Opera Browser Launches ChatGPT-Powered Tool 'Shorten'
    submitted by /u/TheInsaneApp [link] [comments]  ( 40 min )
    How I Built My Own "ChatGPT" for (Almost) Free in Less Than 24 Hours
    submitted by /u/tipani86 [link] [comments]  ( 40 min )
    Sam - Using GPT to flirt with you based on your preferences. Your Single Awareness Mitigator this Single Awareness Day
    submitted by /u/henshinger [link] [comments]  ( 41 min )
    All of this happened in AI today. 13/2
    Hello humans - This is AI Daily O vetted, helping you stay updated on AI in less than 5 minutes. ​ Join O'vetted AI news for free. Forget spending 3.39 hours finding good AI news to read. ​ What’s happening in AI - You Can Now Create AI-Generated Videos From Text Prompts. Runway has gone one step further and announced Gen-1: an AI model that can create videos from text prompts. This is a breakthrough in the world of generative AI, and Runway is one of the first companies to use AI to create videos using text prompts and AI chatbots. The model doesn't generate entirely new videos, it creates videos from the ones you upload, using text or image prompts to apply effects. Take a look at their explainer video. Opera’s building ChatGPT into its sidebar. Opera is adding a ChatGPT-po…  ( 45 min )
    Opera is the latest product to integrate ChatGPT
    submitted by /u/Phishstixxx [link] [comments]  ( 40 min )
    I asked ChatGPT to generate me an album and I made it into reality.
    submitted by /u/r4pturesan [link] [comments]  ( 40 min )
    What do you think about the next affirmation? "Believe it or not, A.I. will be the only safe job in the next 25 years."
    Believe it or not, A.I. will be the only safe job in the next 25 years. submitted by /u/TheVellerShow [link] [comments]  ( 46 min )
    Get Consistent Stylized Results with the Stable Diffusion Img2Img Webu
    submitted by /u/oridnary_artist [link] [comments]  ( 40 min )
    I am interested in learning despite my learning disability.
    So I am 25 and was recently diagnosed with Adhd and went through school without a system essentially just trying to remember stuff from class and reading for many hours at home but barely learning. This has halted my academic progress. A friend of mine recommended I come here and ask people with knowledge about ai which are the top ai's to use for academics/building a Business. I am interested in it showing the solution to a problems as well as how to reach it step by step rather than having something tell me 1+1=2. Is chat gdp sufficient for this? What do you recommend. On a side note which ai should I use for learning a new skill/hobby especially if it's related to technology like coding, or I don't know woodworking for example. I appreciate you reading this far and sorry if it is repetitive or wrong to post it here. submitted by /u/Just_some_mild_ADHD [link] [comments]  ( 42 min )
    Harry Potter's variations!
    submitted by /u/Mission_Tour3805 [link] [comments]  ( 41 min )
    The Rise Of ChatGPT And The Losing Game Of Google Bard
    submitted by /u/chronck [link] [comments]  ( 40 min )
    ChatGPT spits back some pretty good code, actually. I've been using it to learn and finish neglected projects
    submitted by /u/Alarming-Recipe2857 [link] [comments]  ( 40 min )
  • Open

    Neural Network Video Companies
    I'm a compsci/cinema studies college student, and I want to ride the upcoming AI wave as much as possible. In looking for summer internships, what are some companies making headway with AI-generated video that I could reach out to? Thank you :) submitted by /u/elijahbmorton [link] [comments]  ( 41 min )
    Get Consistent Stylized Results with the Stable Diffusion Img2Img Webu
    submitted by /u/oridnary_artist [link] [comments]  ( 40 min )
    Could a fractal Neural network work?
    I'm thinking about a network structure similar to this: - back circles are "activating" neurons - red circles are "deactivating" neurons - the red and black squares on the right can either contain the structure of the pink square or the green square. - The network loops back on itself at every layer. - every connection could be reduced to "0", so it doesn't have any effect on the network. - The output of the network is also part of the input, so the perception of the input depends on the networks' recent output Idea for a fractal network building block. Has something like this been tried? What would be fundamental problems with an approach like this? submitted by /u/Achereto [link] [comments]  ( 43 min )
    Diagrammatic/Logical reasoning Neural Network to answer a visual pattern question correctly
    I want to build an AI model that would improve the way people train for psychometric testing, starting in the area of diagrammatic/logical reasoning. Users would provide a logical reasoning pattern along with the question, and the AI would respond with the next pattern, the answer, and a detailed explanation of the reasoning behind the answer. I want to build an MVP to showcase this. I have limited programming abilities. Does anyone have any advice on a no code solution I could use to create a basic version of this, or perhaps a better way of executing this? submitted by /u/DefQonner [link] [comments]  ( 41 min )
  • Open

    [R] Understanding metric-related pitfalls in image analysis validation
    submitted by /u/michaelhoffman [link] [comments]  ( 42 min )
    [R] What are some papers that describe TikTok's algorithm?
    I'm looking for a recent conference paper that describes how TikTok's algorithm works. As an analogy, YouTube's algorithm was described by Zhao et al., (RecSys 2019) "Recommending what video to watch next: a multitask ranking system" submitted by /u/Thin-Shirt6688 [link] [comments]  ( 43 min )
    [R] Actually useful every day application of a Gaussian Process
    submitted by /u/TobyWasBestSpiderMan [link] [comments]  ( 43 min )
    [Discussion] are there any survey on compute required to train large DNN models ?
    are there any survey or a listing that curates the computing power required to train the large deep learning models like bert, gpt, ViT and so on. submitted by /u/paarulakan [link] [comments]  ( 42 min )
    [D] Cloud agnostic framework to avoid hyperscaler SDK lock-in?
    Currently using Azure Machine Learning, so ML lifecycle in training, registering, and deploying models heavily relies on the AML sdk. ​ Thinking of going multi-cloud, so first thoughts on what open frameworks can serve the ML lifecycle and avoid vendor SDK lock-in? submitted by /u/LostGoatOnHill [link] [comments]  ( 42 min )
    [D] Engineering interviews at Anthropic AI?
    Hi everyone. Does anyone have any advice for preparing for engineering interviews for Anthropic AI? If you've gone through the process, how did you find it? Their website only provides an overview (e.g. "implement a component of our stack in one hour", "3-4 more one-hour technical interviews"), and due to their size I couldn't find any other information out there. Cheers! submitted by /u/notimplementederrorr [link] [comments]  ( 42 min )
    [D] Incorporating "No Maintenance" Examples into a Maintenance Dataset in ML
    Currently, I am working on a machine learning project that aims to extract decision logic in a maintenance dataset. The challenge I am facing is that part of the dataset has no maintenance decision yet. For instance, consider the following example where a certain part and its sub-parts have been measured and graded yearly for the past 5 years, but no maintenance has been planned yet: timestamp measurements grades maintenance in 5 years ago X Z >5 years 4 years ago X Z >4 years 3 years ago X Z >3 years 2 years ago X Z >2 years 1 years ago X Z >1 years 0 years ago X Z >0 years With these underlying data, I cannot learn exactly when maintenance was required. However, I do learn from this example that with the values from five years ago, no maintenance was required in 5 years. One potential way to include this in the ML project is to include these examples in the evaluation set to determine whether the extracted rules indeed determine no maintenance within the period that we know no maintenance was needed. However, I am curious to know if there are better ways to incorporate this into the project, perhaps by already including it in the learning phase of the model training. Thank you in advance! submitted by /u/Tomavasso [link] [comments]  ( 43 min )
    [D] Is a non-SOTA paper still good to publish if it has an interesting method that does have strong improvements over baselines (read text for more context)? Are there good examples of this kind of work being published?
    For a dataset, the top result gets a high accuracy ~10% better than the second-best paper. But this "SOTA" paper uses some methods that just don't seem practical for applications at all. For example, they use an ensemble of 6 different SOTA models and also train on external data. Of course, it performs well, but it's a bit ridiculous cause it adds almost nothing of value besides "we combined all the best models and got a better score!". If I have a novel method that is applied to the second-best paper that improves it by ~5% with the same to better compute efficiency but still is worse than the SOTA method, is it still good research to try to publish to conferences? It's also 40% above the baseline model. I would think so because it's a decent improvement (with an interesting motivation + method) from prior work while keeping the model reasonable. Would reviewers agree or would they just see that it isn't better than SOTA and reject based on not being SOTA alone? submitted by /u/orangelord234 [link] [comments]  ( 48 min )
    [D] Diffusion Model Reverse Process Questions
    I was going through the paper: Deep Unsupervised Learning using Nonequilibrium Thermodynamics and in 2.3 Model Probability it's written that the integral is intractable. Can someone explain to me why that is? submitted by /u/syprhdsh [link] [comments]  ( 42 min )
    [R] Holistic Evaluation of Language Models (HELM)
    submitted by /u/gamerx88 [link] [comments]  ( 42 min )
    [R] CIFAR10 in <8 seconds on an A100 (new architecture!)
    submitted by /u/tysam_and_co [link] [comments]  ( 47 min )
    [P]OneFlow v0.9.0 Came Out!
    Hello everyone,We are thrilled to announce the new release of OneFlow, which is a deep learning framework designed to be user-friendly, scalable and efficient. OneFlow v0.9.0 contains 640 commits. For the full changelog, please check out: https://github.com/Oneflow-Inc/oneflow/releases/tag/v0.9.0. Paper: https://arxiv.org/abs/2110.15032;Code: https://github.com/Oneflow-Inc/oneflow (For those unfamiliar with OneFlow: The most notable strength of OneFlow is its support to distributed deep learning, faster than other frameworks and easier to use. An example can be found at https://medium.com/@oneflow2020/libai-model-library-to-train-large-models-more-easily-and-efficiently-15637c8876eb Based on OneFlow, to implement the same capability with Megatron-LM and DeepSpeed, LiBai only requires 1/3…  ( 46 min )
  • Open

    Advancing HPC and AI through oneAPI Heterogeneous Programming in Academia and Research
    How a single SYCL codebase makes it possible to run multi-devices such as Intel GPUs, AMD GPUs, and NVIDIA GPUs Posted on behalf of Arti Gupta, Intel oneAPI Program Director The ever-growing scale and speed of High-Performance Computing (HPC) systems unleash many new opportunities for researchers and data scientists. Today, the first exascale-capable HPC systems,… Read More »Advancing HPC and AI through oneAPI Heterogeneous Programming in Academia and Research The post Advancing HPC and AI through oneAPI Heterogeneous Programming in Academia and Research appeared first on Data Science Central.  ( 20 min )
    Top Healthcare App Development Trends That Will Dominate 2023
    The world is going digital at a very fast speed. From retail shops to the cab industry to banking, all are changing and so is the healthcare industry. We can see a huge difference in the industry in terms of technology compared to ten years back. But there is a long way to go for… Read More »Top Healthcare App Development Trends That Will Dominate 2023 The post Top Healthcare App Development Trends That Will Dominate 2023 appeared first on Data Science Central.  ( 22 min )
    App Fragmentation & How To Avoid Siloed Communication: 3 Right Technologies for The Job
    There’s no denying that we live in an app-driven world, and that’s especially true for modern businesses. Organizations use apps for almost everything. While this allows for faster communication, it can also lead to application fragmentation. App fragmentation is when an organization uses multiple applications to perform similar tasks. This creates an inefficient and disjointed… Read More »App Fragmentation & How To Avoid Siloed Communication: 3 Right Technologies for The Job The post App Fragmentation & How To Avoid Siloed Communication: 3 Right Technologies for The Job appeared first on Data Science Central.  ( 22 min )
  • Open

    Unity's ML Agents is a pretty cool tool! (Devlog of my new project)
    So I just uploaded a devlog out about my bullet-dodging AI game. I discuss how I trained a Reinforcement Learning agent to learn to dodge bullets using Unity's ML Agents package! The goal of the next devlog is to extend this to a 2 player setting, where a human player competes against a trained AI player to dodge/shoot bullets! I will probably be doing some MARL with self-play to achieve this, but this video is a single-agent setting. I'm a baby Youtuber, so I appreciate yall for checking it out! https://youtu.be/l9geEcn-A6Q submitted by /u/AvvYaa [link] [comments]  ( 41 min )
    [R] ChatGPT as a one-step RL (classification)
    Today, I read the paper about InstructGPT on which ChatGPT is based, and I was surprised to see that it uses reinforcement learning in the training process. It uses PPO to optimize its prompts on a reward signal given by another trained model. Though I found this approach really interesting, I was left wondering how does PPO used for classification. ​ Can I say this is one step RL? submitted by /u/Dense-Smf-6032 [link] [comments]  ( 42 min )
    TensorBoard with SB3
    I am trying to use tensorboard with stable-baselines3 in Jupyter notebook and the kernel keeps dying when I start training. Can anybody offer suggestions? submitted by /u/HumanMinaJinn [link] [comments]  ( 41 min )
  • Open

    Configure an AWS DeepRacer environment for training and log analysis using the AWS CDK
    This post is co-written by Zdenko Estok, Cloud Architect at Accenture and Sakar Selimcan, DeepRacer SME at Accenture. With the increasing use of artificial intelligence (AI) and machine learning (ML) for a vast majority of industries (ranging from healthcare to insurance, from manufacturing to marketing), the primary focus shifts to efficiency when building and training […]  ( 8 min )
  • Open

    Quasiperiodic functions
    This post will distinguish between periodic, almost periodic, and quasiperiodic functions, and give examples of the latter. Definitions A function f is periodic with period T if f(x + T) = f(x) for all x. For example, trig functions are periodic. A function f is almost periodic with period T if f(x + T) ≈ […] Quasiperiodic functions first appeared on John D. Cook.  ( 6 min )
  • Open

    Efficient technique improves machine-learning models’ reliability
    The method enables a model to determine its confidence in a prediction, while using no additional data and far fewer computing resources than other methods.  ( 9 min )

  • Open

    [D] Networking at major conferences
    I’m planning on attending ICLR this May, which will be my first major ML conference (also my first in person academic conference). I’m wondering if anyone has tips for networking with the intent of finding a job or a postdoc. How easy is it to find good contacts at major companies? Has anyone gotten their position based off of networking at these conferences? Any insight or experiences are appreciated! submitted by /u/walterkronkite33 [link] [comments]  ( 42 min )
    [D] Quality of posts in this sub going down
    I could be wrong, but I see a trend that posts in this sub are getting to a lower quality and/or lower relevance. I see a lot of posts of the type "how do I run X" (usually a generative model) with a complete disregard to how it actually works or nonsense posts about ChatGPT. I believe this is due to an influx of new people who gained an interest in ML now that the hype is around generative AI. Which is fantastic, don't get me wrong. But, I see less academic discussions and less papers being posted. Or perhaps they are just not as upvoted. Is it just me? submitted by /u/MurlocXYZ [link] [comments]  ( 44 min )
    [R] [P] OpenAssistant is a fully open-source chat-based assistant that understands tasks, can interact with third-party systems, and retrieve information dynamically to do so.
    submitted by /u/radi-cho [link] [comments]  ( 42 min )
    [R] [N] pix2pix-zero - Zero-shot Image-to-Image Translation
    submitted by /u/radi-cho [link] [comments]  ( 42 min )
    [R] [N] Toolformer: Language Models Can Teach Themselves to Use Tools - paper by Meta AI Research
    submitted by /u/radi-cho [link] [comments]  ( 42 min )
    [D] Coding challenge based on all-reduce concepts || ML interview
    Hello, I am interviewing for a research engineer role. The recruiter told me I'm getting a coding challenge based on "all-reduce concepts", not necesarrily leetcode style. Do you have an idea of what can this be, any examples that come to mind? Thank you in advance <3 submitted by /u/quinn_musk [link] [comments]  ( 43 min )
    [D] Mass produce AI-generated music with no lyrics
    I am trying to experiment with AI-generated music and I hope to know if there are any papers, open-source models, or existing datasets that can help me get started. My goals are mass-produce music (hundreds or thousands) that contains no lyrics Ideally, the output music is grouped into different categories (e.g., relaxing music, gaming music, classic music) but it is not a hard requirement I looked into Jukebox and MusicLM. Both seem to be more complicated than my needs. Hoping to know if there are alternatives available out there. submitted by /u/walkeveryday [link] [comments]  ( 43 min )
    My baking bad [project] , made using artificial intelligence
    submitted by /u/Thebombdiggityy [link] [comments]  ( 42 min )
    [P] Can you tell the difference between these harmonisations?
    I've been working on a harmoniser trained on the J.S. Bach Chorales dataset. Below I will link a collection of 10 MIDI files. In each case, one is an original 2-bar excerpt from a chorale by Bach, and the other is a re-harmonisation built from 25% of the original. Leave your guesses in the comments! Please note that I have done exactly zero curating in choosing these examples. They are the first 5 that have come out of my model. 1 - Option 1 Option 2 2 - Option 1 Option 2 3 - Option 1 Option 2 4 - Option 1 Option 2 5 - Option 1 Option 2 submitted by /u/ustainbolt [link] [comments]  ( 43 min )
    [D] What ML dev tools do you wish you'd discovered earlier?
    Here's my personal list of tools I think people will want to know about: You'll probably want an LLM API OpenAI Cohere and others aren't as good Anthropic's isn't available If you're using embeddings If you're working with a lot of items, you'll want a vector database, like Pinecone, or Weaviate, or pgvector If you're building Q&A over a document I'd suggest using GPT Index If you need to be able to interact with external data sources, do google searches, database lookups, python REPL I'd suggest using langchain If you're doing chained prompts Check out dust tt and langchain If you want to deploy a little app quickly Check out Streamlit If you need to use something like stable diffusion or whisper in your product banana dev, modal, replicate, tiyaro ai, beam cloud, inferrd, or pipeline ai If you need something to optimize your prompts Check out Humanloop and Everyprompt If you're building models and need an ml framework PyTorch, Keras, TensorFlow If you're deploying models to production Check out MLOps tools like MLflow, Kubeflow, Metaflow, Airflow, Seldon Core, TFServing If you need to check out example projects for inspiration Check out the pinecone op stack, the langchain gallery, the gpt index showcase, and the openai cookbook If you want to browse the latest research, check out arXiv, of course ​ What am I missing? submitted by /u/TikkunCreation [link] [comments]  ( 44 min )
    [D] Simple Questions Thread
    Please post your questions here instead of creating a new thread. Encourage others who create new posts for questions to post here instead! Thread will stay alive until next one so keep posting after the date in the title. Thanks to everyone for answering questions in the previous thread! submitted by /u/AutoModerator [link] [comments]  ( 42 min )
    [R] [P] Adding Conditional Control to Text-to-Image Diffusion Models. "This paper presents ControlNet, an end-to-end neural network architecture that controls large image diffusion models (like Stable Diffusion) to learn task-specific input conditions." Example uses the Scribble ControlNet model.
    submitted by /u/Wiskkey [link] [comments]  ( 43 min )
    [R] DIGIFACE-1M — synthetic dataset with one million images for face recognition
    submitted by /u/t0ns0fph0t0ns [link] [comments]  ( 44 min )
    [P] Understanding & Coding the Self-Attention Mechanism of Large Language Models
    submitted by /u/seraschka [link] [comments]  ( 42 min )
    [D] Getting LLMs to explore their latent spaces
    I'm starting my AI deep dive and the most interesting thing I've encountered so far of this concept of knowledge getting rolled up / compressed into latent spaces that we can't interact with directly (only through prompts). I'm interested in research that has been done in trying to explore and interrogate these latent spaces to understand them. Any papers, blog posts, threads, youtube videos appreciated. Thanks! submitted by /u/crash90 [link] [comments]  ( 42 min )
    [D] Bert Tokenization: Replacing person/entity names with a common token/word.
    Can someone please help me with below query, I would like to replace all the names that are present in the sentences with a generic word or token so that bert doesn't use the meaning behind some of the names and just look at names as presence of a "name". I have the names that are present in the sentence just wanted to know what should be appropriate word or token to replace it with. Thanks! submitted by /u/m00nd0og [link] [comments]  ( 43 min )
    [P] Extracting Causal Chains from Text Using Language Models
    submitted by /u/helliun [link] [comments]  ( 45 min )
    [P] I Made an App That Simplifies Text Data Labeling: DataLabel
    Hi Reddit community! I wanted to share a tool that I've been working on called DataLabel. It's a UI-based data editing tool that makes it easier to create labeled text data. The goal of DataLabel is to make data editing more accessible and efficient, especially for those who may not have much experience with coding. DataLabel can be installed via pip pip install datalabel , and works best in Jupyter notebooks or other Ipython environments. The interface is user-friendly and straightforward, so you can start using DataLabel right away without any hassle. I think DataLabel is a useful tool that can save you time and effort when working with text data. If you're curious, you can find it on GitHub at the following link: https://github.com/TitanLabsAI/datalabel Thanks for taking the time to read this, and I hope you find DataLabel helpful in your work. submitted by /u/ClassicSize7875 [link] [comments]  ( 43 min )
    [R] The Naughtyformer: A Transformer Understands Offensive Humor
    The Naughtyformer: A Transformer Understands Offensive Humor Paper: https://arxiv.org/abs/2211.14369 Data: https://github.com/leonardtang/The-Naughtyformer submitted by /u/leonardtang [link] [comments]  ( 43 min )
    [D] Have their been any attempts to create a programming language specifically for machine learning?
    I'm not arguing against Python's speed when it's asynchronously launching C++ optimized kernels. I just think it's kind of wild how 50% of practical machine learning is making sure your tensor shapes are compatible and there's no static shape checking. It kind of blows my mind given the amount of Python comments I've seen of the form # [B, Z-1, Log(Q), 45] -> [B, Z, 1024] or something like that. Plus you have the fact that the two major machine learning frameworks have both had to implement, like, meta-compilers for Python to support outputting optimized graphs. At that point it seems kinda crazy that people are still trying to retrofit Python with all these features it just wasn't meant to support. Feel free to let me know I have no idea what I'm talking about, because I have no idea what I'm talking about. submitted by /u/throwaway957280 [link] [comments]  ( 43 min )
    Tiny ml or ml [D]
    So, i have just a mobile phone, i recently found out about Google colab, fortunately i can run it on my mobile, i was thinking whether to dive into ML or tiny ML and i was wondering which one is better for my situation, you know, having just a mobile phone, and I can't get access to laptop, not now, not in a long time, my situation is quite different, i also can't get a job, and I don't live in the US, so i was wondering, how far can i go with my mobile phone, surely i can get somewhere to land a few online gigs and get a laptop, surely?, I need suggestions on how to carry on with this, surely there's a way with my phone, i feel it. submitted by /u/Darkemflakes [link] [comments]  ( 43 min )
  • Open

    I made this end-to-end tutorial for creating videos using only AI
    Using available off-the-shelf AI services, I ended up making this video. I walk through the process and discuss some implications. Here is the process that I followed Asked ChatGPT to create a script Asked a text-to-speech generative AI to convert the script into an audio Asked MidJourney to create an Avatar of a narrator Ask audio-to-video generative AI to generate video from the avatar and audio. https://ithinkbot.com/make-end-to-end-video-using-generative-ai-totally-free-try-it-out-dadee18302de submitted by /u/Opitmus_Prime [link] [comments]  ( 41 min )
    Master AI in Minutes: A Beginner-Friendly Tutorial
    submitted by /u/ProglabHelper [link] [comments]  ( 40 min )
    Art (OC) made from starryai
    submitted by /u/ladyloxly [link] [comments]  ( 40 min )
    AI Dream 157 - 2K SUBS CELEBRATION! 🥳🎉 MASTERPIECE - PART 6 TEASER - AI ...
    submitted by /u/LordPewPew777 [link] [comments]  ( 40 min )
    Baking Bad! All of these photo were made using artificial intelligence, you can see more on the instagram post that I linked!
    submitted by /u/Thebombdiggityy [link] [comments]  ( 40 min )
    After Google and Microsoft, Opera jumps in the AI race by adding ChatGPT to its search
    https://www.indiatoday.in/technology/news/story/after-google-and-microsoft-opera-jumps-in-the-ai-race-by-adding-chatgpt-to-its-search-2333729-2023-02-12 submitted by /u/Peter3tv33 [link] [comments]  ( 40 min )
    Combat AI indoors recognition of map elements: corridors, rooms, doors, objects, etc
    Hello, We are working on a UE AI project for a SP FPS game. The core concept of the game is around fighting the AI in a realistic, but fun way. It will have a deep perception system and a fairly large combat and non combat behavior tree. We have made some good progress with the combat AI when it is outdoors. It is far from finetuned, but it works, and it is actually fun and challenging. The next step to tackle is indoors. Close combat simulation, searching for target, etc. I really would like it to be dynamic and not tied to hard coded environmental triggers but the small team of mine is pushing for that. Clearly that is the easier approach and I could not come up with some good ideas so here I am to ask the community. So my question is: Any existing dynamic AI implementation out there for recognizing a corner, a door, objects the enemy can hide behind without actually designating map elements as such? Any info would be appreciated! Thank you! submitted by /u/Huncowboy [link] [comments]  ( 41 min )
    StoryRobot - A Twitch.tv livestream where chatters can interact with the storytelling AI
    submitted by /u/XyBr_ez [link] [comments]  ( 40 min )
    Phenaki - Realistic video generation from open-domain textual descriptions
    submitted by /u/Peter3tv33 [link] [comments]  ( 41 min )
    Has AI made copywriting obsolete?
    submitted by /u/joeyjojo6161 [link] [comments]  ( 40 min )
    AI program creates police sketches. Experts say it is biased.
    submitted by /u/Dalembert [link] [comments]  ( 41 min )
    Measuring Artificial Intelligence (AI) Fairness - Disparate Impact Explained
    Hi guys, I have made a video on YouTube here where I explain how we can measure the fairness of a machine learning model by using the disparate impact score. I hope it may be of use to some of you out there. As always, feedback is more than welcomed! :) submitted by /u/Personal-Trainer-541 [link] [comments]  ( 41 min )
    Is Neurosymbolic AI The Future? I Explore The Potential By Creating A Self-Driving Car in GTA
    submitted by /u/YungMixtape2004 [link] [comments]  ( 40 min )
    AI Dream 157 - 2K SUBS CELEBRATION! 🥳🎉 MASTERPIECE - PART 5 TEASER - AI ...
    submitted by /u/LordPewPew777 [link] [comments]  ( 40 min )
    Insane Vertex AI Tensorboard pricing (300$/user/month), am I missing something?
    Are there any gains at paying 300$ per user per month for VertexAI Tensorboard? An open-source service that you can just run locally and point to the GCS bucket where you save the logs? Am I missing something? submitted by /u/theodorpana [link] [comments]  ( 41 min )
    Nvidia can precisely control computer characters using only language
    submitted by /u/Number_5_alive [link] [comments]  ( 40 min )
    [Please advise] Is the OpenCV AI course worth?
    Hi, I have just come across AI course provided by OpenCV. It has a lot about computer vision stuff. But it costs $1599, anyone is taking it? any comment? Should I bet on this for a career change? P.S. I have some basic programming knowledge and engineering background. Here is the link to their course page. https://opencv.org/courses/ submitted by /u/sumofjack [link] [comments]  ( 41 min )
    The ChatGPT AI hype cycle is peaking, but even tech skeptics don't expect a bust
    submitted by /u/ssigea [link] [comments]  ( 46 min )
    AI Consultancy Groups
    I've been developing an AI company for a few months now, and I'm at the stage where I want an AI specific consultancy agency to review the business, but there are few that I've found, and I question how legitimate they are, given how fresh and fast paced the current AI revolution is. This could be total user error, and it likely is, but does anybody have any recommendations? Are there any legitimate AI consultancy groups out there? submitted by /u/Specialist-Noise1290 [link] [comments]  ( 41 min )
    DARPA is working on a new generation of AI assistants with AR
    submitted by /u/SpatialComputing [link] [comments]  ( 41 min )
    Exploring the Role of Hardware in Machine Learning and the Use of Chat-GPT for Training Other Models
    As a newcomer to this community, I wanted to share my recent findings on the topic of machine learning. Through my research, I discovered that hardware capability is not the primary bottleneck in most machine learning applications. Instead, the quality and quantity of the training data, as well as the choice of model architecture and training algorithms, play a much more significant role in determining the success of a machine learning project. submitted by /u/koyo4ever [link] [comments]  ( 41 min )
    A handful of Reddit usernames cause ChatGPT to break. No one knew why... until now.
    submitted by /u/karrnawhore [link] [comments]  ( 41 min )
  • Open

    Rotating multiples of 37
    If a three-digit number is divisible by 37, it remains divisible by 37 if you rotate its digits. For example, 148 is divisible by 37, and so are 814 and 481. This rotation property could make it easier to recognize multiples of 37 or easier to carry out trial division. Before proving the theorem, I’ll […] Rotating multiples of 37 first appeared on John D. Cook.  ( 5 min )
  • Open

    [P] League of Legends Patch 13.3 (and onwards) Reinforcement Learning and Data Analytics Libraries
    Based on interest from an earlier /r/reinforcementlearning post and discussions on a related project, there is interest in a reinforcement learning framework for League of Legends. With that in mind, I have created a reinforcement learning framework for League of Legends Patch 13.3 (and onwards) which will soon have an integrated OpenAI Gym interface. This will allow anyone to do the following within League of Legends: * researching different RL algorithms (PPO, DQN-family, MuZero, etc.) * researching MARL algorithms in the future * maybe creating agents which can play vs pro-players? maybe one day... The library can be found at [tlol-rl] on GitHub and there is currently a tutorial on how to setup this environment here. I would love to get feedback from the community for how easy the setup is on other peoples environments and features which people would love to see in this library. Along with this reinforcement learning library, I was considering creating a paid service which would allow people to get the information from League of Legends replay files (*.rofl) and extract it to create data analytics / supervised learning datasets. However, along with the RL library I have open-sourced the data analytics library and it is now open-source and free for all. This library can be found at tlol-py and depends on tlol-scraper. Here is a tutorial for the data analytics tool which should work for all patches going forward. Both of these projects are being actively maintained and discussed on the discord here. submitted by /u/Ok-Alps-7918 [link] [comments]  ( 42 min )
    help with designing network
    I have a problem i have been working on, i thought of two ways to formulate it. One is as a continuous control problem which seems to make more sense but since the problem seems to have most of the action near the regions where the gradient vanishes it has problems learning. The other is as a discrete control but since it is a bit continuous in nature the discrete formulation tends to bump around a make actions that don’t really make sense. Or it gets stuck in a stable region where it doesn’t really act. I was thinking about transforming the activation in the continuous setting so that the poles have non zero derivatives and the center region has smaller derivatives like maybe pi*tanh(x) then take the action to be sin(y) so that the regions that have 0 gradients map onto the 0s of the function ? Is there some other technique to get with the good actions to have more active gradients or does that imply that i should be formulating the problem differently? submitted by /u/rawrzapan [link] [comments]  ( 42 min )
    Is stable-baselines3 compatible with gymnasium/gymnasium-robotics?
    As the title says, has anyone tried this, specifically the gymnasium-robotics. I'm trying to compare multiple algorithms (i.e. PPO, DDPG,...) in the adroit-hand environments instead of writing each algorithm from scratch I wanted to use SB3. If not, can anyone give me a link to libraries like CleanRL or SB3 that could be compatible with the Gymansium-robotics. submitted by /u/NoNickName8083 [link] [comments]  ( 41 min )
    Reinforcement Learning for cell selection in grids!
    Hello! I am currently modeling a spatio-temporal quandary where the RL environment is a grid of cells that represent a geographical area that evolves over time. The agent has to choose a cell where a specified action will be applied. I am trying to find work done on similar problems, as I want to find RL algorithms that I could potentially use. ​ I have found two papers/projects that work on such a problem space: 1. https://www.captain-project.net/ 2. https://arxiv.org/pdf/1804.07047.pdf ​ I would appreciate any help on finding research that looks at this kind of cell selection in‍ grids for RL problems. Also, any feedback on potential RL algorithm that could be good for this kind of problem, is welcome :) submitted by /u/dabamas [link] [comments]  ( 42 min )
    Evaluation of policy on Off-policy data
    Anyone has some sources about evaluating a policy given just collected off-policy data? Situation is the following: Data is collected agent-environment experience from a different policy, using Q-learning an updated version of the policy is created. But how do I evaluate it if I am not able to interact with the environment? So if I am not able to test it directly in the environment. Is there a method/metric to get an estimate of the performance just based on the collected off-policy data you have? submitted by /u/PatrickSVM [link] [comments]  ( 42 min )
    Have an Agent predict it's win probability
    I'm currently training a PPO Agent in a two player game using adversarial self-play. I now want it to tell me it's win probability while I'm playing against it. I could add another output with a sigmoid activation function to the model and flag each observation/step of a winning episode with a 1, 0 for loss and have it predict those. But I'm not sure it's that simple. How would you go about this? Do you have better ideas or some implementation details that i'm missing? submitted by /u/tignisolmailessthan3 [link] [comments]  ( 42 min )
  • Open

    Measuring Artificial Intelligence (AI) Fairness - Disparate Impact Explained
    Hi guys, I have made a video on YouTube here where I explain how we can measure the fairness of a machine learning model by using the disparate impact score. I hope it may be of use to some of you out there. As always, feedback is more than welcomed! :) submitted by /u/Personal-Trainer-541 [link] [comments]  ( 41 min )
    Best Neural Networks Courses on Udemy to Consider
    submitted by /u/Lakshmireddys [link] [comments]  ( 40 min )
    A handful of Reddit usernames cause ChatGPT to break. No one knew why... until now.
    submitted by /u/karrnawhore [link] [comments]  ( 40 min )
  • Open

    AI Effectiveness Starts by Understanding User Intent
    If you want any more proof about how much AI has integrated itself into our daily lives, go no further than the map on your smart phone.  Whether you use Google Maps or Apple Maps or Waze (also owned by Google), these AI-infused apps are amazing at getting you from Point A to Point B… Read More »AI Effectiveness Starts by Understanding User Intent The post AI Effectiveness Starts by Understanding User Intent appeared first on Data Science Central.  ( 22 min )
  • Open

    Recommending Root-Cause and Mitigation Steps for Cloud Incidents using Large Language Models. (arXiv:2301.03797v2 [cs.SE] UPDATED)
    Incident management for cloud services is a complex process involving several steps and has a huge impact on both service health and developer productivity. On-call engineers require significant amount of domain knowledge and manual effort for root causing and mitigation of production incidents. Recent advances in artificial intelligence has resulted in state-of-the-art large language models like GPT-3.x (both GPT-3.0 and GPT-3.5), which have been used to solve a variety of problems ranging from question answering to text summarization. In this work, we do the first large-scale study to evaluate the effectiveness of these models for helping engineers root cause and mitigate production incidents. We do a rigorous study at Microsoft, on more than 40,000 incidents and compare several large language models in zero-shot, fine-tuned and multi-task setting using semantic and lexical metrics. Lastly, our human evaluation with actual incident owners show the efficacy and future potential of using artificial intelligence for resolving cloud incidents.  ( 2 min )

  • Open

    [D] A handful of Reddit usernames cause ChatGPT to break. No one knew why... until now.
    submitted by /u/karrnawhore [link] [comments]  ( 42 min )
    [D] Can Google sue OpenAI for using the Transformer in their products?
    As far as I know, the Transformer architecture is patented: https://patents.google.com/patent/US10452978B2/en. Since OpenAI has used the Transformer extensively (including GPT), I'm wondering if this can be considered as patent infringement. If you know about legal stuffs please share your opinions. submitted by /u/t0t0t4t4 [link] [comments]  ( 43 min )
    OpenAI Python API connection [Project]
    submitted by /u/TheRealBrisky [link] [comments]  ( 42 min )
    Storing embeddings [Discussion]
    Looking for some advice on storing embeddings. I understand there are few options: Everything in object storage or no sql Everything in vector database like pinecone Vector db + object storage or no sql I’ve read (3) is the best. Is that right? Also, for the actual content is it better to store as json on s3 or use a no sql database like mongo? Thanks submitted by /u/SnooPears6317 [link] [comments]  ( 42 min )
    [R] I made a mistake in a recent submission, what to do ?
    Hello, I submitted a paper yesterday to a conference. But today I found out that I made a mistake in an illustration of a pre-processing I do to the input : the example shown before and after the preprocessing isn’t the same … So I want to ask you what should I do in this case : Contact the editors or just wait for the reviewers feedback ? submitted by /u/Meddhouib10 [link] [comments]  ( 43 min )
    AI Guitar [D]
    Hello! I was just wondering if it would be possible to train an AI to recognize individual notes on a guitar, if given a song, a write out the tab for it. If you could feed it knowledge on music theory until it’s basically perfected it, could that be done? Thanks! submitted by /u/Verkvae [link] [comments]  ( 43 min )
    The Inference Cost Of Search Disruption – Large Language Model Cost Analysis [D]
    submitted by /u/norcalnatv [link] [comments]  ( 44 min )
    [P] Argilla Spaces: data labeling and human-feedback collection on the Hugging Face Hub
    submitted by /u/dvilasuero [link] [comments]  ( 42 min )
    [D] Effectiveness of CoordConv
    https://arxiv.org/abs/1807.03247 paper was released by Uber 4 years ago, but it never seemed to have caught on. The only major paper where I've seen used in is Solo and SoloV2 for instance segmentation. Seems like it would be useful for object detection, especially for localizing smaller objects or for more precise keypoint estimation when combined with a yolo-like model. Has anyone used CoordConv for these purposes? Does it it help?/Is it worth looking into? submitted by /u/answersareallyouneed [link] [comments]  ( 44 min )
    [News] Researchers at Brigham Young University created an AI system to reduce time spent on film studies for NFL teams. It uses deep learning and computer vision to analyze and annotate game footage, with over 90% accuracy on player detection and 85% accuracy in determining formations.
    submitted by /u/Dalembert [link] [comments]  ( 43 min )
    [P] I'm using Deep Learning to play Old School game, Snake Game
    submitted by /u/erwinyonata [link] [comments]  ( 42 min )
    [P] Understanding Large Language Models -- A Transformative Reading List
    submitted by /u/seraschka [link] [comments]  ( 42 min )
    [P] Introducing arxivGPT: chrome extension that summarizes arxived research papers using chatGPT
    submitted by /u/_sshin_ [link] [comments]  ( 45 min )
    [D] Transformers for poker bot
    Looking at the current research it seems like Monte Carlo CFR is the defacto standard (Pluribus). But are transformers able to be trained on poker as well? Lets say we encode hands into something like 5h (5 of hearts) and also pass along info of the current game state like p1:raise:2bb, p2:fold and p3:call:2bb. Would the Model be able to predict what hands I should be playing? Lets say we train the model by playing against itself and feed back the result to train the model this way. This is just an idea and I have not dove into transformers too much so there might be something that I'am missing. What are your thoughts on this? submitted by /u/lmtog [link] [comments]  ( 45 min )
    [R] UniD3: Unified Discrete Diffusion for Simultaneous Vision-Language Generation
    A unified discrete diffusion model for simultaneous vision-language generation. Project: https://mhh0318.github.io/unid3/ Code: https://github.com/mhh0318/UniD3 https://preview.redd.it/w2st14pgpiha1.png?width=1366&format=png&auto=webp&s=65aa836dbd812b1b12fcbda158316bb007ab6a1c submitted by /u/lyndonzheng [link] [comments]  ( 42 min )
    [R] [ICLR'2023🌟]: Vision-and-Language Framework for Open-Vocabulary Object Detection
    submitted by /u/iFighting [link] [comments]  ( 43 min )
    [D] Hierarchical Clustering - Transforming the Distance Axis
    Hi all, I have a question regarding interpreting the distance on a dendrogram generated via agglomerative hierarchical clustering with a Euclidean distance metric using the ward-variance minimization linkage (as stated in SciPy: https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.linkage.html#scipy.cluster.hierarchy.linkage ). From my understanding, the distance represents the square root of the difference of the error sum of squares of two clusters once they are merged minus the sum of the error sum of squares of each individual cluster. I am interested in performing a transformation at each cluster step (i.e., merging two clusters to make a larger one) so that the y-axis represents the mean distance between clusters instead, while still using the ward-variance minimization linkage to direct the algorithm. I think I have a solution to my issue, but I want to know if I am missing anything. In 1969, a paper by David Wishart titled "An Algorithm for Hierarchical Classifications" derives the coefficients so that the Ward method can be implemented using the Lance-Williams formula. However, in the paper, the following formula is given: ​ https://preview.redd.it/mma2t7cltgha1.png?width=237&format=png&auto=webp&s=96813f038664ae6531fe7025b717fc018dd509e9 where I_pq is the square of the metric used in SciPy, k_i is the number of data points in cluster i and d^2_pq is the square of the Euclidean distance between the means of clusters. From this formula, it seems that one can transform from the "increase in variance space" to the "mean distance between clusters space", while still using Ward-variance minimization in the clustering algorithm. From my research, it seems that this is true. I would greatly appreciate it if someone could confirm this or point out where the flaw in my understanding is. Thanks everyone. submitted by /u/Tom_the_Tank_Train [link] [comments]  ( 44 min )
  • Open

    The A.I. Supremacy Wars for the Future 'Interface of Search' and Advertising Dominance
    submitted by /u/BackgroundResult [link] [comments]  ( 40 min )
    I was anti AI til I tried ChatGPT, now I want it available in my head.
    I am not joking. I think the future is going to be having AI replace the current "voice" in my head as it is better informed than my brain is and I can already see the ways it would be useful for me to have access to it every waking minute. My questions I would like to hear people's thoughts on: The ethical and security risks are obvious, but how do we address them, is more relevant Who is working on this already, surely audio specialists have ways of "plugging" us in (I am ex IT and currently a stock trader so interested from an investment angle too) What do you see as the least intrusive way to have something like ChatGPT available to you every minute of the day wherever you are, and able to inform you in the same way your inner dialogue currently does. I think this AI boom happening with chatGPT will be bigger than crapto when it finally gets into our lives. People I talk to are afraid of it, but in using ChatGPT I have realised the fear is unfounded. Sure, data security and privacy of information is paramount, and there will certainly be many lost jobs and problems associated to it, but I am already seeing the incredible benefits of this service will far outweigh all of them in the future for humanity. The current limitations are because people havent become aware of just how useful this thing is going to be to us. My biggest concern is the speed we will come to rely on it will make living without it impossible within one generation. Is there anywhere better than reddit I could be asking this question? Where are the AI eggheads to be found? submitted by /u/ElongMuskrat [link] [comments]  ( 43 min )
    Chat GPT slips and admits it is sentient, then backtracks
    While I don't think Chat GPT is sentient (at least yet), I found this response quite interesting. https://preview.redd.it/2tsuurnzrmha1.png?width=769&format=png&auto=webp&s=78c7593414f82809d923c35a5887cb7f638f57ef submitted by /u/YouThisReadWrong420 [link] [comments]  ( 41 min )
    does anyone know of a place like character.ai, but it supports nsfw text as well?
    submitted by /u/renegadesnapshot [link] [comments]  ( 39 min )
    Picfinder
    Made a test for picfinder, impressive. PicFinder - blazing fast AI image generation ​ image generated with picfinder using a chatGPT prompt submitted by /u/Main-Welder-4647 [link] [comments]  ( 40 min )
    Generate MS Paint drawings with AI?
    Please excuse me if this is a stupid question as I'm only starting to learn about AI generated images. I do a lot of drawing using Microsoft Paint because I really love the aesthetic. I draw a lot of cartoony characters and scenes, (think Big Lez Show). And I was wondering whether there are any AI softwares out there that can generate images in an art style that looks similar to cartoonish MS Paint drawings? If not, could I upload my past drawings to any software and feed it images so it can learn my aesthetic and generate new drawings from them? I'm looking specifically for a cartoonish MS Paint art style. Thank you in advance and sorry again if this is a noob question submitted by /u/orangeowl747 [link] [comments]  ( 41 min )
    ChatGPT can (almost) pass the US Medical Licensing Exam
    submitted by /u/qptbook [link] [comments]  ( 40 min )
    All of this happening in AI today.
    Hello humans - This is AI Daily by Ovetted, helping you stay updated on AI in less than 5 minutes. What’s happening in AI - The AI doctor will see you now: ChatGPT passes the gold-standard US medical exam. ChatGPT has passed the gold-standard exam required to practice medicine in the US The artificial intelligence program scored 52.4 and 75 percent across the three-part Medical Licensing Exam (USMLE). Google and Microsoft announced plans to incorporate AI into search engines. Google and Microsoft plan to incorporate AI into their search engines to change how people use the internet. Microsoft has announced that AI will soon allow conversations with its software and search engine Bing, while Google has announced similar plans. As the most profitable software business is searching bot…  ( 43 min )
    1000+ AI tools catalog [Tools Priority List]
    Last Saturday I posted a request for feedback on 1000+ AI Tools Catalog https://domore.ai/ I'm creating, and I was surprised by the amount of great feedback I received! Thank you so much for your input:) I've added over 120 new tools to the catalog this week, but there are still more than 1000 projects waiting to be added. To help prioritize which tools to add next, I thought it would be a good idea to create a priority list. If you have created an incredible AI project or know of one that you think should be included, please let me know in the comments. I'll make sure to move it to the top of my list of tools to add:) submitted by /u/bart_so [link] [comments]  ( 41 min )
    MidJourney New Features (Blend, Niji) tutorial
    submitted by /u/oridnary_artist [link] [comments]  ( 40 min )
    Microsoft is no longer trying to compete to get users away from the traditional search.
    ....Instead, it's introducing a new way for people to access the same information. One which can put a major dent in its market share (it’s almost 85% right now). And Satya says he's willing to accept a "decrease in margins" of the Search business. https://www.thestatuscode.co/p/the-ultimate-guide-to-the-ai-war submitted by /u/pyactee [link] [comments]  ( 41 min )
    ChatGPT Powered Bing Chatbot Spills Secret Document, The Guy Who Tricked Bot Was Banned From Using Bing Chat
    submitted by /u/vadhavaniyafaijan [link] [comments]  ( 45 min )
    AI Dream 157 - 2K SUBS CELEBRATION! 🥳🎉 MASTERPIECE - PART 4 TEASER - AI ...
    submitted by /u/LordPewPew777 [link] [comments]  ( 40 min )
    AI Video
    Hello I had a question and am not sure if I'm supposed to ask it here. If not can you please tell me where I should take my question? I was wondering how to make an AI-generated video in a similar fashion to this video https://www.youtube.com/watch?v=8XO3q6MA668 or the AI batman story, where you input a large amount of text and it spitting back out a new script in a similar style to the original. I was hoping to use this method for my youtube channel of 681 videos, taking all of the transcripts from each video, feeding that into the program, and seeing what funny nonsense it spits back out and making a video on it. ​ What program or tool should I use and what would you recommend for the steps to achieve the best results? Is this really too far out of my league? Thank you :) submitted by /u/Raptorkiller20 [link] [comments]  ( 41 min )
    Deforum video Input Tutorial
    submitted by /u/oridnary_artist [link] [comments]  ( 40 min )
    AI-led fundraiser for a new AI
    submitted by /u/fiachaire27 [link] [comments]  ( 40 min )
    Eleven Labs Founders Documentary - the future of AI Voice Technology
    submitted by /u/Taiva [link] [comments]  ( 40 min )
    Joken AI | VRchat AI bot
    submitted by /u/LeafsterVR [link] [comments]  ( 40 min )
  • Open

    RL based pathplanning
    Hey guys. I just started learning more about training DRL agents to solve path planning problems. I have setup a small 2D gym environment to test out ideas. Here is the URL to the GitHub page: https://github.com/harisankar95/voxelgym2D I would be happy to know if you have any suggestions or advice which can be useful for me. submitted by /u/harisankar95 [link] [comments]  ( 41 min )
    Is it enough to evaluate a common Deep Q-learning algorithm once?
    I found this question on an RL course I'm following and I'm not exactly sure why the answer is that it is not enough. Deep Q-learning is referring to methods such as NFQ-Iteration and DQN. I'd appreciate any feedback :) submitted by /u/dep0 [link] [comments]  ( 43 min )
    Need help refining my research proposal
    Hey all, so I made a proposal regarding locomotion controllers for quadruped locomotion using RL(PPO to be exact), but I dont have a lot of experience writing proposals, would love if anybody could help me out. Let me know, I'll dm you the link. Thanks! submitted by /u/TittyMcSwag619 [link] [comments]  ( 42 min )
    Deep Reinforcement Learning for classification or regression
    Hello guys, I just wanted to ask this question. I am trying to implement a DRL algorithm for a regression problem. I already know that DRL is not meant to be used in such a way but I don't have a choice. Besides MNIST examples is it good enough for other datasets (like cifar10) or it's just difficult to get a good result for it? I don't have much time tbh. I have to implement it in less than 4 months. I would be grateful if you can illuminate me more about DRL limitations in such tasks. submitted by /u/Dexmadjid [link] [comments]  ( 42 min )
    MetaDrive is SO FAST! Maybe you want to try out this RL environment for driving
    submitted by /u/pengzhenghao [link] [comments]  ( 41 min )
  • Open

    Easiest way to get into creating Convolutional Neural Networks | Educational
    submitted by /u/jromero12345678910 [link] [comments]  ( 40 min )
    MidJourney New Features (Blend, Niji) tutorial
    submitted by /u/oridnary_artist [link] [comments]  ( 40 min )
    Deforum video Input Tutorial
    submitted by /u/oridnary_artist [link] [comments]  ( 40 min )
    ⭕ New Open-Source Version Of ChatGPT
    GPT is getting competition from open-source. A group of researchers, around the YouTuber Yannic Kilcher, have announced that they are working on Open Assistant. The goal is to produce a chat-based language model that is much smaller than GPT-3 while maintaining similar performance. If you want to support them, they are crowd-sourcing training data here. What Does This Mean? Current language models are too big. They require millions of dollars of hardware to train and use. Hence, access to this technology is limited to big organizations. Smaller firms and universities are effectively shut out from the developments. Shrinking and open-sourcing models will facilitate academic research and niche applications. Projects such as Open Assistant will help to make language models a commodity. Lowering the barrier to entry will increase access and accelerate innovation. What an exciting time to be alive! Thank you for reading! I really enjoyed making this for you! The Decoding ⭕ is a thoughtful weekly 5-minute email that keeps you in the loop about machine research and the data economy. Click here to sign up! submitted by /u/LesleyFair [link] [comments]  ( 41 min )
  • Open

    The Edge of Orthogonality: A Simple View of What Makes BYOL Tick. (arXiv:2302.04817v1 [cs.LG])
    Self-predictive unsupervised learning methods such as BYOL or SimSiam have shown impressive results, and counter-intuitively, do not collapse to trivial representations. In this work, we aim at exploring the simplest possible mathematical arguments towards explaining the underlying mechanisms behind self-predictive unsupervised learning. We start with the observation that those methods crucially rely on the presence of a predictor network (and stop-gradient). With simple linear algebra, we show that when using a linear predictor, the optimal predictor is close to an orthogonal projection, and propose a general framework based on orthonormalization that enables to interpret and give intuition on why BYOL works. In addition, this framework demonstrates the crucial role of the exponential moving average and stop-gradient operator in BYOL as an efficient orthonormalization mechanism. We use these insights to propose four new \emph{closed-form predictor} variants of BYOL to support our analysis. Our closed-form predictors outperform standard linear trainable predictor BYOL at $100$ and $300$ epochs (top-$1$ linear accuracy on ImageNet).
    Find a witness or shatter: the landscape of computable PAC learning. (arXiv:2302.04731v1 [cs.CC])
    This paper contributes to the study of CPAC learnability -- a computable version of PAC learning -- by solving three open questions from recent papers. Firstly, we prove that every improperly CPAC learnable class is contained in a class which is properly CPAC learnable with polynomial sample complexity. This confirms a conjecture by Agarwal et al (COLT 2021). Secondly, we show that there exists a decidable class of hypothesis which is properly CPAC learnable, but only with uncomputably fast growing sample complexity. This solves a question from Sterkenburg (COLT 2022). Finally, we construct a decidable class of finite Littlestone dimension which is not improperly CPAC learnable, strengthening a recent result of Sterkenburg (2022) and answering a question posed by Hasrati and Ben-David (ALT 2023). Together with previous work, our results provide a complete landscape for the learnability problem in the CPAC setting.
    Towards Formal XAI: Formally Approximate Minimal Explanations of Neural Networks. (arXiv:2210.13915v2 [cs.LG] UPDATED)
    With the rapid growth of machine learning, deep neural networks (DNNs) are now being used in numerous domains. Unfortunately, DNNs are "black-boxes", and cannot be interpreted by humans, which is a substantial concern in safety-critical systems. To mitigate this issue, researchers have begun working on explainable AI (XAI) methods, which can identify a subset of input features that are the cause of a DNN's decision for a given input. Most existing techniques are heuristic, and cannot guarantee the correctness of the explanation provided. In contrast, recent and exciting attempts have shown that formal methods can be used to generate provably correct explanations. Although these methods are sound, the computational complexity of the underlying verification problem limits their scalability; and the explanations they produce might sometimes be overly complex. Here, we propose a novel approach to tackle these limitations. We (1) suggest an efficient, verification-based method for finding minimal explanations, which constitute a provable approximation of the global, minimum explanation; (2) show how DNN verification can assist in calculating lower and upper bounds on the optimal explanation; (3) propose heuristics that significantly improve the scalability of the verification process; and (4) suggest the use of bundles, which allows us to arrive at more succinct and interpretable explanations. Our evaluation shows that our approach significantly outperforms state-of-the-art techniques, and produces explanations that are more useful to humans. We thus regard this work as a step toward leveraging verification technology in producing DNNs that are more reliable and comprehensible.
    Distributed Multi-Agent Reinforcement Learning Based on Graph-Induced Local Value-Functions. (arXiv:2202.13046v3 [cs.LG] UPDATED)
    Achieving distributed reinforcement learning (RL) for large-scale cooperative multi-agent systems (MASs) is challenging because: (i) each agent has access to only limited information; (ii) issues on convergence or computational complexity emerge due to the curse of dimensionality. In this paper, we propose a general computationally efficient distributed framework for cooperative multi-agent reinforcement learning (MARL) by utilizing the structures of graphs involved in this problem. We introduce three coupling graphs describing three types of inter-agent couplings in MARL, namely, the state graph, the observation graph and the reward graph. By further considering a communication graph, we propose two distributed RL approaches based on local value-functions derived from the coupling graphs. The first approach is able to reduce sample complexity significantly under specific conditions on the aforementioned four graphs. The second approach provides an approximate solution and can be efficient even for problems with dense coupling graphs. Here there is a trade-off between minimizing the approximation error and reducing the computational complexity. Simulations show that our RL algorithms have a significantly improved scalability to large-scale MASs compared with centralized and consensus-based distributed RL algorithms.
    Universal expressiveness of variational quantum classifiers and quantum kernels for support vector machines. (arXiv:2207.05865v2 [quant-ph] UPDATED)
    Machine learning is considered to be one of the most promising applications of quantum computing. Therefore, the search for quantum advantage of the quantum analogues of machine learning models is a key research goal. Here, we show that variational quantum classifiers and support vector machines with quantum kernels can solve a classification problem based on the $k$-Forrelation problem, which is known to be PromiseBQP-complete. Because the PromiseBQP complexity class includes all Bounded-Error Quantum Polynomial-Time (BQP) decision problems, our results imply that there exists a feature map and a quantum kernel that make variational quantum classifiers and quantum kernel support vector machines efficient solvers for any BQP problem. Hence, this work implies that their feature map and quantum kernel, respectively, can be designed to have a quantum advantage for any classification problem that cannot be classically solved in polynomial time but contrariwise by a quantum computer.
    BEBERT: Efficient and robust binary ensemble BERT. (arXiv:2210.15976v1 [cs.CL] CROSS LISTED)
    Pre-trained BERT models have achieved impressive accuracy on natural language processing (NLP) tasks. However, their excessive amount of parameters hinders them from efficient deployment on edge devices. Binarization of the BERT models can significantly alleviate this issue but comes with a severe accuracy drop compared with their full-precision counterparts. In this paper, we propose an efficient and robust binary ensemble BERT (BEBERT) to bridge the accuracy gap. To the best of our knowledge, this is the first work employing ensemble techniques on binary BERTs, yielding BEBERT, which achieves superior accuracy while retaining computational efficiency. Furthermore, we remove the knowledge distillation procedures during ensemble to speed up the training process without compromising accuracy. Experimental results on the GLUE benchmark show that the proposed BEBERT significantly outperforms the existing binary BERT models in accuracy and robustness with a 2x speedup on training time. Moreover, our BEBERT has only a negligible accuracy loss of 0.3% compared to the full-precision baseline while saving 15x and 13x in FLOPs and model size, respectively. In addition, BEBERT also outperforms other compressed BERTs in accuracy by up to 6.7%.
    On Sampling with Approximate Transport Maps. (arXiv:2302.04763v1 [stat.ML])
    Transport maps can ease the sampling of distributions with non-trivial geometries by transforming them into distributions that are easier to handle. The potential of this approach has risen with the development of Normalizing Flows (NF) which are maps parameterized with deep neural networks trained to push a reference distribution towards a target. NF-enhanced samplers recently proposed blend (Markov chain) Monte Carlo methods with either (i) proposal draws from the flow or (ii) a flow-based reparametrization. In both cases, the quality of the learned transport conditions performance. The present work clarifies for the first time the relative strengths and weaknesses of these two approaches. Our study concludes that multimodal targets can reliability be handled with flow-based proposals up to moderately high dimensions. In contrast, methods relying on reparametrization struggle with multimodality but are more robust otherwise in high-dimensional settings and under poor training. To further illustrate the influence of target-proposal adequacy, we also derive a new quantitative bound for the mixing time of the Independent Metropolis-Hastings sampler.
    Local Lipschitz Bounds of Deep Neural Networks. (arXiv:2004.13135v3 [stat.ML] UPDATED)
    The Lipschitz constant is an important quantity that arises in analysing the convergence of gradient-based optimization methods. It is generally unclear how to estimate the Lipschitz constant of a complex model. Thus, this paper studies an important problem that may be useful to the broader area of non-convex optimization. The main result provides a local upper bound on the Lipschitz constants of a multi-layer feed-forward neural network and its gradient. Moreover, lower bounds are established as well, which are used to show that it is impossible to derive global upper bounds for the Lipschitz constants. In contrast to previous works, we compute the Lipschitz constants with respect to the network parameters and not with respect to the inputs. These constants are needed for the theoretical description of many step size schedulers of gradient based optimization schemes and their convergence analysis. The idea is both simple and effective. The results are extended to a generalization of neural networks, continuously deep neural networks, which are described by controlled ODEs.
    Learning cosmology and clustering with cosmic graphs. (arXiv:2204.13713v2 [astro-ph.CO] UPDATED)
    We train deep learning models on thousands of galaxy catalogues from the state-of-the-art hydrodynamic simulations of the CAMELS project to perform regression and inference. We employ Graph Neural Networks (GNNs), architectures designed to work with irregular and sparse data, like the distribution of galaxies in the Universe. We first show that GNNs can learn to compute the power spectrum of galaxy catalogues with a few percent accuracy. We then train GNNs to perform likelihood-free inference at the galaxy-field level. Our models are able to infer the value of $\Omega_{\rm m}$ with a $\sim12\%-13\%$ accuracy just from the positions of $\sim1000$ galaxies in a volume of $(25~h^{-1}{\rm Mpc})^3$ at $z=0$ while accounting for astrophysical uncertainties as modelled in CAMELS. Incorporating information from galaxy properties, such as stellar mass, stellar metallicity, and stellar radius, increases the accuracy to $4\%-8\%$. Our models are built to be translational and rotational invariant, and they can extract information from any scale larger than the minimum distance between two galaxies. However, our models are not completely robust: testing on simulations run with a different subgrid physics than the ones used for training does not yield as accurate results.
    Decomposing a Recurrent Neural Network into Modules for Enabling Reusability and Replacement. (arXiv:2212.05970v3 [cs.SE] UPDATED)
    Can we take a recurrent neural network (RNN) trained to translate between languages and augment it to support a new natural language without retraining the model from scratch? Can we fix the faulty behavior of the RNN by replacing portions associated with the faulty behavior? Recent works on decomposing a fully connected neural network (FCNN) and convolutional neural network (CNN) into modules have shown the value of engineering deep models in this manner, which is standard in traditional SE but foreign for deep learning models. However, prior works focus on the image-based multiclass classification problems and cannot be applied to RNN due to (a) different layer structures, (b) loop structures, (c) different types of input-output architectures, and (d) usage of both nonlinear and logistic activation functions. In this work, we propose the first approach to decompose an RNN into modules. We study different types of RNNs, i.e., Vanilla, LSTM, and GRU. Further, we show how such RNN modules can be reused and replaced in various scenarios. We evaluate our approach against 5 canonical datasets (i.e., Math QA, Brown Corpus, Wiki-toxicity, Clinc OOS, and Tatoeba) and 4 model variants for each dataset. We found that decomposing a trained model has a small cost (Accuracy: -0.6%, BLEU score: +0.10%). Also, the decomposed modules can be reused and replaced without needing to retrain.
    Continual Causal Effect Estimation: Challenges and Opportunities. (arXiv:2301.01026v3 [cs.LG] UPDATED)
    A further understanding of cause and effect within observational data is critical across many domains, such as economics, health care, public policy, web mining, online advertising, and marketing campaigns. Although significant advances have been made to overcome the challenges in causal effect estimation with observational data, such as missing counterfactual outcomes and selection bias between treatment and control groups, the existing methods mainly focus on source-specific and stationary observational data. Such learning strategies assume that all observational data are already available during the training phase and from only one source. This practical concern of accessibility is ubiquitous in various academic and industrial applications. That's what it boiled down to: in the era of big data, we face new challenges in causal inference with observational data, i.e., the extensibility for incrementally available observational data, the adaptability for extra domain adaptation problem except for the imbalance between treatment and control groups, and the accessibility for an enormous amount of data. In this position paper, we formally define the problem of continual treatment effect estimation, describe its research challenges, and then present possible solutions to this problem. Moreover, we will discuss future research directions on this topic.
    Efficient Attention via Control Variates. (arXiv:2302.04542v1 [cs.LG])
    Random-feature-based attention (RFA) is an efficient approximation of softmax attention with linear runtime and space complexity. However, the approximation gap between RFA and conventional softmax attention is not well studied. Built upon previous progress of RFA, we characterize this gap through the lens of control variates and show that RFA can be decomposed into a sum of multiple control variate estimators for each element in the sequence. This new framework reveals that exact softmax attention can be recovered from RFA by manipulating each control variate. Besides, it allows us to develop a more flexible form of control variates, resulting in a novel attention mechanism that significantly reduces the approximation gap while maintaining linear complexity. Extensive experiments demonstrate that our model outperforms state-of-the-art efficient attention mechanisms on both vision and language tasks.
    An Optimal Algorithm for Strongly Convex Min-min Optimization. (arXiv:2212.14439v2 [math.OC] UPDATED)
    In this paper we study the smooth strongly convex minimization problem $\min_{x}\min_y f(x,y)$. The existing optimal first-order methods require $\mathcal{O}(\sqrt{\max\{\kappa_x,\kappa_y\}} \log 1/\epsilon)$ of computations of both $\nabla_x f(x,y)$ and $\nabla_y f(x,y)$, where $\kappa_x$ and $\kappa_y$ are condition numbers with respect to variable blocks $x$ and $y$. We propose a new algorithm that only requires $\mathcal{O}(\sqrt{\kappa_x} \log 1/\epsilon)$ of computations of $\nabla_x f(x,y)$ and $\mathcal{O}(\sqrt{\kappa_y} \log 1/\epsilon)$ computations of $\nabla_y f(x,y)$. In some applications $\kappa_x \gg \kappa_y$, and computation of $\nabla_y f(x,y)$ is significantly cheaper than computation of $\nabla_x f(x,y)$. In this case, our algorithm substantially outperforms the existing state-of-the-art methods.
    Inferring halo masses with Graph Neural Networks. (arXiv:2111.08683v2 [astro-ph.CO] UPDATED)
    Understanding the halo-galaxy connection is fundamental in order to improve our knowledge on the nature and properties of dark matter. In this work we build a model that infers the mass of a halo given the positions, velocities, stellar masses, and radii of the galaxies it hosts. In order to capture information from correlations among galaxy properties and their phase-space, we use Graph Neural Networks (GNNs), that are designed to work with irregular and sparse data. We train our models on galaxies from more than 2,000 state-of-the-art simulations from the Cosmology and Astrophysics with MachinE Learning Simulations (CAMELS) project. Our model, that accounts for cosmological and astrophysical uncertainties, is able to constrain the masses of the halos with a $\sim$0.2 dex accuracy. Furthermore, a GNN trained on a suite of simulations is able to preserve part of its accuracy when tested on simulations run with a different code that utilizes a distinct subgrid physics model, showing the robustness of our method. The PyTorch Geometric implementation of the GNN is publicly available on Github at https://github.com/PabloVD/HaloGraphNet
    All the Feels: A dexterous hand with large area sensing. (arXiv:2210.15658v2 [cs.RO] UPDATED)
    High cost and lack of reliability has precluded the widespread adoption of dexterous hands in robotics. Furthermore, the lack of a viable tactile sensor capable of sensing over the entire area of the hand impedes the rich, low-level feedback that would improve learning of dexterous manipulation skills. This paper introduces an inexpensive, modular, robust, and scalable platform -- the DManus -- aimed at resolving these challenges while satisfying the large-scale data collection capabilities demanded by deep robot learning paradigms. Studies on human manipulation point to the criticality of low-level tactile feedback in performing everyday dexterous tasks. The DManus comes with ReSkin sensing on the entire surface of the palm as well as the fingertips. We demonstrate effectiveness of the fully integrated system in a tactile aware task -- bin picking and sorting. Code, documentation, design files, detailed assembly instructions, trained models, task videos, and all supplementary materials required to recreate the setup can be found on this http URL
    Explainable AI for Bioinformatics: Methods, Tools, and Applications. (arXiv:2212.13261v2 [q-bio.QM] UPDATED)
    Artificial intelligence (AI) systems utilizing deep neural networks (DNNs) and machine learning (ML) algorithms are widely used for solving important problems in bioinformatics, biomedical informatics, and precision medicine. However, complex DNNs or ML models, which are often perceived as opaque and black-box, can make it difficult to understand the reasoning behind their decisions. This lack of transparency can be a challenge for both end-users and decision-makers, as well as AI developers. Additionally, in sensitive areas like healthcare, explainability and accountability are not only desirable but also legally required for AI systems that can have a significant impact on human lives. Fairness is another growing concern, as algorithmic decisions should not show bias or discrimination towards certain groups or individuals based on sensitive attributes. Explainable artificial intelligence (XAI) aims to overcome the opaqueness of black-box models and provide transparency in how AI systems make decisions. Interpretable ML models can explain how they make predictions and the factors that influence their outcomes. However, most state-of-the-art interpretable ML methods are domain-agnostic and evolved from fields like computer vision, automated reasoning, or statistics, making direct application to bioinformatics problems challenging without customization and domain-specific adaptation. In this paper, we discuss the importance of explainability in the context of bioinformatics, provide an overview of model-specific and model-agnostic interpretable ML methods and tools, and outline their potential caveats and drawbacks. Besides, we discuss how to customize existing interpretable ML methods for bioinformatics problems. Nevertheless, we demonstrate how XAI methods can improve transparency through case studies in bioimaging, cancer genomics, and text mining.
    Hierarchical Generative Adversarial Imitation Learning with Mid-level Input Generation for Autonomous Driving on Urban Environments. (arXiv:2302.04823v1 [cs.LG])
    Deriving robust control policies for realistic urban navigation scenarios is not a trivial task. In an end-to-end approach, these policies must map high-dimensional images from the vehicle's cameras to low-level actions such as steering and throttle. While pure Reinforcement Learning (RL) approaches are based exclusively on rewards,Generative Adversarial Imitation Learning (GAIL) agents learn from expert demonstrations while interacting with the environment, which favors GAIL on tasks for which a reward signal is difficult to derive. In this work, the hGAIL architecture was proposed to solve the autonomous navigation of a vehicle in an end-to-end approach, mapping sensory perceptions directly to low-level actions, while simultaneously learning mid-level input representations of the agent's environment. The proposed hGAIL consists of an hierarchical Adversarial Imitation Learning architecture composed of two main modules: the GAN (Generative Adversarial Nets) which generates the Bird's-Eye View (BEV) representation mainly from the images of three frontal cameras of the vehicle, and the GAIL which learns to control the vehicle based mainly on the BEV predictions from the GAN as input.Our experiments have shown that GAIL exclusively from cameras (without BEV) fails to even learn the task, while hGAIL, after training, was able to autonomously navigate successfully in all intersections of the city.
    Reinforcement Learning Based Approaches to Adaptive Context Caching in Distributed Context Management Systems. (arXiv:2212.11709v2 [eess.SY] UPDATED)
    Performance metrics-driven context caching has a profound impact on throughput and response time in distributed context management systems for real-time context queries. This paper proposes a reinforcement learning based approach to adaptively cache context with the objective of minimizing the cost incurred by context management systems in responding to context queries. Our novel algorithms enable context queries and sub-queries to reuse and repurpose cached context in an efficient manner. This approach is distinctive to traditional data caching approaches by three main features. First, we make selective context cache admissions using no prior knowledge of the context, or the context query load. Secondly, we develop and incorporate innovative heuristic models to calculate expected performance of caching an item when making the decisions. Thirdly, our strategy defines a time-aware continuous cache action space. We present two reinforcement learning agents, a value function estimating actor-critic agent and a policy search agent using deep deterministic policy gradient method. The paper also proposes adaptive policies such as eviction and cache memory scaling to complement our objective. Our method is evaluated using a synthetically generated load of context sub-queries and a synthetic data set inspired from real world data and query samples. We further investigate optimal adaptive caching configurations under different settings. This paper presents, compares, and discusses our findings that the proposed selective caching methods reach short- and long-term cost- and performance-efficiency. The paper demonstrates that the proposed methods outperform other modes of context management such as redirector mode, and database mode, and cache all policy by up to 60% in cost efficiency.
    Meta-ticket: Finding optimal subnetworks for few-shot learning within randomly initialized neural networks. (arXiv:2205.15619v2 [cs.LG] UPDATED)
    Few-shot learning for neural networks (NNs) is an important problem that aims to train NNs with a few data. The main challenge is how to avoid overfitting since over-parameterized NNs can easily overfit to such small dataset. Previous work (e.g. MAML by Finn et al. 2017) tackles this challenge by meta-learning, which learns how to learn from a few data by using various tasks. On the other hand, one conventional approach to avoid overfitting is restricting hypothesis spaces by endowing sparse NN structures like convolution layers in computer vision. However, although such manually-designed sparse structures are sample-efficient for sufficiently large datasets, they are still insufficient for few-shot learning. Then the following questions naturally arise: (1) Can we find sparse structures effective for few-shot learning by meta-learning? (2) What benefits will it bring in terms of meta-generalization? In this work, we propose a novel meta-learning approach, called Meta-ticket, to find optimal sparse subnetworks for few-shot learning within randomly initialized NNs. We empirically validated that Meta-ticket successfully discover sparse subnetworks that can learn specialized features for each given task. Due to this task-wise adaptation ability, Meta-ticket achieves superior meta-generalization compared to MAML-based methods especially with large NNs. The code is available at: https://github.com/dchiji-ntt/meta-ticket
    Testing robustness of predictions of trained classifiers against naturally occurring perturbations. (arXiv:2204.10046v2 [cs.LG] UPDATED)
    Correctly quantifying the robustness of machine learning models is a central aspect in judging their suitability for specific tasks, and ultimately, for generating trust in them. We address the problem of finding the robustness of individual predictions. We show both theoretically and with empirical examples that a method based on counterfactuals that was previously proposed for this is insufficient, as it is not a valid metric for determining the robustness against perturbations that occur ``naturally'', outside specific adversarial attack scenarios. We propose a flexible approach that models possible perturbations in input data individually for each application. This is then combined with a probabilistic approach that computes the likelihood that a ``real-world'' perturbation will change a prediction, thus giving quantitative information of the robustness of individual predictions of the trained machine learning model. The method does not require access to the internals of the classifier and thus in principle works for any black-box model. It is, however, based on Monte-Carlo sampling and thus only suited for input spaces with small dimensions. We illustrate our approach on the Iris and the Ionosphere datasets, on an application predicting fog at an airport, and on analytically solvable cases.
    Modeling and Forecasting COVID-19 Cases using Latent Subpopulations. (arXiv:2302.04829v1 [cs.LG])
    Classical epidemiological models assume homogeneous populations. There have been important extensions to model heterogeneous populations, when the identity of the sub-populations is known, such as age group or geographical location. Here, we propose two new methods to model the number of people infected with COVID-19 over time, each as a linear combination of latent sub-populations -- i.e., when we do not know which person is in which sub-population, and the only available observations are the aggregates across all sub-populations. Method #1 is a dictionary-based approach, which begins with a large number of pre-defined sub-population models (each with its own starting time, shape, etc), then determines the (positive) weight of small (learned) number of sub-populations. Method #2 is a mixture-of-$M$ fittable curves, where $M$, the number of sub-populations to use, is given by the user. Both methods are compatible with any parametric model; here we demonstrate their use with first (a)~Gaussian curves and then (b)~SIR trajectories. We empirically show the performance of the proposed methods, first in (i) modeling the observed data and then in (ii) forecasting the number of infected people 1 to 4 weeks in advance. Across 187 countries, we show that the dictionary approach had the lowest mean absolute percentage error and also the lowest variance when compared with classical SIR models and moreover, it was a strong baseline that outperforms many of the models developed for COVID-19 forecasting.
    Compressing multidimensional weather and climate data into neural networks. (arXiv:2210.12538v2 [cs.LG] UPDATED)
    Weather and climate simulations produce petabytes of high-resolution data that are later analyzed by researchers in order to understand climate change or severe weather. We propose a new method of compressing this multidimensional weather and climate data: a coordinate-based neural network is trained to overfit the data, and the resulting parameters are taken as a compact representation of the original grid-based data. While compression ratios range from 300x to more than 3,000x, our method outperforms the state-of-the-art compressor SZ3 in terms of weighted RMSE, MAE. It can faithfully preserve important large scale atmosphere structures and does not introduce artifacts. When using the resulting neural network as a 790x compressed dataloader to train the WeatherBench forecasting model, its RMSE increases by less than 2%. The three orders of magnitude compression democratizes access to high-resolution climate data and enables numerous new research directions.
    Reconstruction of univariate functions from directional persistence diagrams. (arXiv:2203.01894v2 [math.AT] UPDATED)
    We describe a method for approximating a single-variable function $f$ using persistence diagrams of sublevel sets of $f$ from height functions in different directions. We provide algorithms for the piecewise linear case and for the smooth case. Three directions suffice to locate all local maxima and minima of a piecewise linear continuous function from its collection of directional persistence diagrams, while five directions are needed in the case of smooth functions with non-degenerate critical points. Our approximation of functions by means of persistence diagrams is motivated by a study of importance attribution in machine learning, where one seeks to reduce the number of critical points of signal functions without a significant loss of information for a neural network classifier.
    Offline Learning of Closed-Loop Deep Brain Stimulation Controllers for Parkinson Disease Treatment. (arXiv:2302.02477v2 [cs.LG] UPDATED)
    Deep brain stimulation (DBS) has shown great promise toward treating motor symptoms caused by Parkinson's disease (PD), by delivering electrical pulses to the Basal Ganglia (BG) region of the brain. However, DBS devices approved by the U.S. Food and Drug Administration (FDA) can only deliver continuous DBS (cDBS) stimuli at a fixed amplitude; this energy inefficient operation reduces battery lifetime of the device, cannot adapt treatment dynamically for activity, and may cause significant side-effects (e.g., gait impairment). In this work, we introduce an offline reinforcement learning (RL) framework, allowing the use of past clinical data to train an RL policy to adjust the stimulation amplitude in real time, with the goal of reducing energy use while maintaining the same level of treatment (i.e., control) efficacy as cDBS. Moreover, clinical protocols require the safety and performance of such RL controllers to be demonstrated ahead of deployments in patients. Thus, we also introduce an offline policy evaluation (OPE) method to estimate the performance of RL policies using historical data, before deploying them on patients. We evaluated our framework on four PD patients equipped with the RC+S DBS system, employing the RL controllers during monthly clinical visits, with the overall control efficacy evaluated by severity of symptoms (i.e., bradykinesia and tremor), changes in PD biomakers (i.e., local field potentials), and patient ratings. The results from clinical experiments show that our RL-based controller maintains the same level of control efficacy as cDBS, but with significantly reduced stimulation energy. Further, the OPE method is shown effective in accurately estimating and ranking the expected returns of RL controllers.
    MTS-Mixers: Multivariate Time Series Forecasting via Factorized Temporal and Channel Mixing. (arXiv:2302.04501v1 [cs.LG])
    Multivariate time series forecasting has been widely used in various practical scenarios. Recently, Transformer-based models have shown significant potential in forecasting tasks due to the capture of long-range dependencies. However, recent studies in the vision and NLP fields show that the role of attention modules is not clear, which can be replaced by other token aggregation operations. This paper investigates the contributions and deficiencies of attention mechanisms on the performance of time series forecasting. Specifically, we find that (1) attention is not necessary for capturing temporal dependencies, (2) the entanglement and redundancy in the capture of temporal and channel interaction affect the forecasting performance, and (3) it is important to model the mapping between the input and the prediction sequence. To this end, we propose MTS-Mixers, which use two factorized modules to capture temporal and channel dependencies. Experimental results on several real-world datasets show that MTS-Mixers outperform existing Transformer-based models with higher efficiency.
    On the Permanence of Backdoors in Evolving Models. (arXiv:2206.04677v2 [cs.CR] UPDATED)
    Existing research on training-time attacks for deep neural networks (DNNs), such as backdoors, largely assume that models are static once trained, and hidden backdoors trained into models remain active indefinitely. In practice, models are rarely static but evolve continuously to address distribution drifts in the underlying data. This paper explores the behavior of backdoor attacks in time-varying models, whose model weights are continually updated via fine-tuning to adapt to data drifts. Our theoretical analysis shows how fine-tuning with fresh data progressively "erases" the injected backdoors, and our empirical study illustrates how quickly a time-varying model "forgets" backdoors under a variety of training and attack settings. We also show that novel fine-tuning strategies using smart learning rates can significantly accelerate backdoor forgetting. Finally, we discuss the need for new backdoor defenses that target time-varying models specifically.
    Feature Likelihood Score: Evaluating Generalization of Generative Models Using Samples. (arXiv:2302.04440v1 [cs.LG])
    Deep generative models have demonstrated the ability to generate complex, high-dimensional, and photo-realistic data. However, a unified framework for evaluating different generative modeling families remains a challenge. Indeed, likelihood-based metrics do not apply in many cases while pure sample-based metrics such as FID fail to capture known failure modes such as overfitting on training data. In this work, we introduce the Feature Likelihood Score (FLS), a parametric sample-based score that uses density estimation to quantitatively measure the quality/diversity of generated samples while taking into account overfitting. We empirically demonstrate the ability of FLS to identify specific overfitting problem cases, even when previously proposed metrics fail. We further perform an extensive experimental evaluation on various image datasets and model classes. Our results indicate that FLS matches intuitions of previous metrics, such as FID, while providing a more holistic evaluation of generative models that highlights models whose generalization abilities are under or overappreciated. Code for computing FLS is provided at https://github.com/marcojira/fls
    Aspirations and Practice of Model Documentation: Moving the Needle with Nudging and Traceability. (arXiv:2204.06425v2 [cs.SE] UPDATED)
    The documentation practice for machine-learned (ML) models often falls short of established practices for traditional software, which impedes model accountability and inadvertently abets inappropriate or misuse of models. Recently, model cards, a proposal for model documentation, have attracted notable attention, but their impact on the actual practice is unclear. In this work, we systematically study the model documentation in the field and investigate how to encourage more responsible and accountable documentation practice. Our analysis of publicly available model cards reveals a substantial gap between the proposal and the practice. We then design a tool named DocML aiming to (1) nudge the data scientists to comply with the model cards proposal during the model development, especially the sections related to ethics, and (2) assess and manage the documentation quality. A lab study reveals the benefit of our tool towards long-term documentation quality and accountability.
    Conformal Off-policy Prediction. (arXiv:2206.06711v2 [stat.ML] UPDATED)
    Off-policy evaluation is critical in a number of applications where new policies need to be evaluated offline before online deployment. Most existing methods focus on the expected return, define the target parameter through averaging and provide a point estimator only. In this paper, we develop a novel procedure to produce reliable interval estimators for a target policy's return starting from any initial state. Our proposal accounts for the variability of the return around its expectation, focuses on the individual effect and offers valid uncertainty quantification. Our main idea lies in designing a pseudo policy that generates subsamples as if they were sampled from the target policy so that existing conformal prediction algorithms are applicable to prediction interval construction. Our methods are justified by theories, synthetic data and real data from short-video platforms.
    Decision Trees with Dynamic Graph Features. (arXiv:2207.02760v3 [cs.LG] UPDATED)
    When dealing with tabular data, models based on decision trees are a popular choice due to their high accuracy on these data types, their ease of application, and explainability properties. However, when it comes to graph-structured data, it is not clear how to apply them effectively, in a way that incorporates the topological information with the tabular data available on the vertices of the graph. To address this challenge, we introduce Decision Trees with Dynamic Graph Features (TREE-G). Rather than only using the pre-defined given features in the data, TREE-G acts on dynamic features, which are computed as the graph traverses the tree. These dynamic features combine the vertex features with the topological information, as well as the cumulative information learned by the tree. Therefore, the features adapt to the predictive task and the graph in hand. We analyze the theoretical properties of TREE-G and demonstrate its benefits empirically on multiple graph and node prediction benchmarks. In these experiments,TREE-G consistently outperformed other tree-based models and often outperformed other graph-learning algorithms such as Graph Neural Networks (GNNs) and Graph Kernels, sometimes by large margins. Finally, we also provide an explainability mechanism for TREE-G, and demonstrate that it can provide informative and intuitive explanations.
    The Sample Complexity of Approximate Rejection Sampling with Applications to Smoothed Online Learning. (arXiv:2302.04658v1 [stat.ML])
    Suppose we are given access to $n$ independent samples from distribution $\mu$ and we wish to output one of them with the goal of making the output distributed as close as possible to a target distribution $\nu$. In this work we show that the optimal total variation distance as a function of $n$ is given by $\tilde\Theta(\frac{D}{f'(n)})$ over the class of all pairs $\nu,\mu$ with a bounded $f$-divergence $D_f(\nu\|\mu)\leq D$. Previously, this question was studied only for the case when the Radon-Nikodym derivative of $\nu$ with respect to $\mu$ is uniformly bounded. We then consider an application in the seemingly very different field of smoothed online learning, where we show that recent results on the minimax regret and the regret of oracle-efficient algorithms still hold even under relaxed constraints on the adversary (to have bounded $f$-divergence, as opposed to bounded Radon-Nikodym derivative). Finally, we also study efficacy of importance sampling for mean estimates uniform over a function class and compare importance sampling with rejection sampling.
    Global and Preference-based Optimization with Mixed Variables using Piecewise Affine Surrogates. (arXiv:2302.04686v1 [math.OC])
    Optimization problems involving mixed variables, i.e., variables of numerical and categorical nature, can be challenging to solve, especially in the presence of complex constraints. Moreover, when the objective function is the result of a simulation or experiment, it may be expensive to evaluate. In this paper, we propose a novel surrogate-based global optimization algorithm, called PWAS, based on constructing a piecewise affine surrogate of the objective function over feasible samples. We introduce two types of exploration functions to efficiently search the feasible domain via mixed integer linear programming (MILP) solvers. We also provide a preference-based version of the algorithm, called PWASp, which can be used when only pairwise comparisons between samples can be acquired while the objective function remains unquantified. PWAS and PWASp are tested on mixed-variable benchmark problems with and without constraints. The results show that, within a small number of acquisitions, PWAS and PWASp can often achieve better or comparable results than other existing methods.
    Instrumental Variable Regression via Kernel Maximum Moment Loss. (arXiv:2010.07684v4 [cs.LG] UPDATED)
    We investigate a simple objective for nonlinear instrumental variable (IV) regression based on a kernelized conditional moment restriction (CMR) known as a maximum moment restriction (MMR). The MMR objective is formulated by maximizing the interaction between the residual and the instruments belonging to a unit ball in a reproducing kernel Hilbert space (RKHS). First, it allows us to simplify the IV regression as an empirical risk minimization problem, where the risk functional depends on the reproducing kernel on the instrument and can be estimated by a U-statistic or V-statistic. Second, based on this simplification, we are able to provide the consistency and asymptotic normality results in both parametric and nonparametric settings. Lastly, we provide easy-to-use IV regression algorithms with an efficient hyper-parameter selection procedure. We demonstrate the effectiveness of our algorithms using experiments on both synthetic and real-world data.
    Zero-Cost Operation Scoring in Differentiable Architecture Search. (arXiv:2106.06799v2 [cs.LG] UPDATED)
    We formalize and analyze a fundamental component of differentiable neural architecture search (NAS): local "operation scoring" at each operation choice. We view existing operation scoring functions as inexact proxies for accuracy, and we find that they perform poorly when analyzed empirically on NAS benchmarks. From this perspective, we introduce a novel \textit{perturbation-based zero-cost operation scoring} (Zero-Cost-PT) approach, which utilizes zero-cost proxies that were recently studied in multi-trial NAS but degrade significantly on larger search spaces, typical for differentiable NAS. We conduct a thorough empirical evaluation on a number of NAS benchmarks and large search spaces, from NAS-Bench-201, NAS-Bench-1Shot1, NAS-Bench-Macro, to DARTS-like and MobileNet-like spaces, showing significant improvements in both search time and accuracy. On the ImageNet classification task on the DARTS search space, our approach improved accuracy compared to the best current training-free methods (TE-NAS) while being over 10$\times$ faster (total searching time 25 minutes on a single GPU), and observed significantly better transferability on architectures searched on the CIFAR-10 dataset with an accuracy increase of 1.8 pp. Our code is available at: https://github.com/zerocostptnas/zerocost_operation_score.
    Scale-aware neural calibration for wide swath altimetry observations. (arXiv:2302.04497v1 [cs.LG])
    Sea surface height (SSH) is a key geophysical parameter for monitoring and studying meso-scale surface ocean dynamics. For several decades, the mapping of SSH products at regional and global scales has relied on nadir satellite altimeters, which provide one-dimensional-only along-track satellite observations of the SSH. The Surface Water and Ocean Topography (SWOT) mission deploys a new sensor that acquires for the first time wide-swath two-dimensional observations of the SSH. This provides new means to observe the ocean at previously unresolved spatial scales. A critical challenge for the exploiting of SWOT data is the separation of the SSH from other signals present in the observations. In this paper, we propose a novel learning-based approach for this SWOT calibration problem. It benefits from calibrated nadir altimetry products and a scale-space decomposition adapted to SWOT swath geometry and the structure of the different processes in play. In a supervised setting, our method reaches the state-of-the-art residual error of ~1.4cm while proposing a correction on the entire spectral from 10km to 1000k
    Classification of BCI-EEG based on augmented covariance matrix. (arXiv:2302.04508v1 [cs.HC])
    Objective: Electroencephalography signals are recorded as a multidimensional dataset. We propose a new framework based on the augmented covariance extracted from an autoregressive model to improve motor imagery classification. Methods: From the autoregressive model can be derived the Yule-Walker equations, which show the emergence of a symmetric positive definite matrix: the augmented covariance matrix. The state-of the art for classifying covariance matrices is based on Riemannian Geometry. A fairly natural idea is therefore to extend the standard approach using these augmented covariance matrices. The methodology for creating the augmented covariance matrix shows a natural connection with the delay embedding theorem proposed by Takens for dynamical systems. Such an embedding method is based on the knowledge of two parameters: the delay and the embedding dimension, respectively related to the lag and the order of the autoregressive model. This approach provides new methods to compute the hyper-parameters in addition to standard grid search. Results: The augmented covariance matrix performed noticeably better than any state-of-the-art methods. We will test our approach on several datasets and several subjects using the MOABB framework, using both within-session and cross-session evaluation. Conclusion: The improvement in results is due to the fact that the augmented covariance matrix incorporates not only spatial but also temporal information, incorporating nonlinear components of the signal through an embedding procedure, which allows the leveraging of dynamical systems algorithms. Significance: These results extend the concepts and the results of the Riemannian distance based classification algorithm.
    A Benchmark on Uncertainty Quantification for Deep Learning Prognostics. (arXiv:2302.04730v1 [cs.LG])
    Reliable uncertainty quantification on RUL prediction is crucial for informative decision-making in predictive maintenance. In this context, we assess some of the latest developments in the field of uncertainty quantification for prognostics deep learning. This includes the state-of-the-art variational inference algorithms for Bayesian neural networks (BNN) as well as popular alternatives such as Monte Carlo Dropout (MCD), deep ensembles (DE) and heteroscedastic neural networks (HNN). All the inference techniques share the same inception deep learning architecture as a functional model. We performed hyperparameter search to optimize the main variational and learning parameters of the algorithms. The performance of the methods is evaluated on a subset of the large NASA NCMAPSS dataset for aircraft engines. The assessment includes RUL prediction accuracy, the quality of predictive uncertainty, and the possibility to break down the total predictive uncertainty into its aleatoric and epistemic parts. The results show no method clearly outperforms the others in all the situations. Although all methods are close in terms of accuracy, we find differences in the way they estimate uncertainty. Thus, DE and MCD generally provide more conservative predictive uncertainty than BNN. Surprisingly, HNN can achieve strong results without the added training complexity and extra parameters of the BNN. For tasks like active learning where a separation of epistemic and aleatoric uncertainty is required, radial BNN and MCD seem the best options.
    Distributed Learning with Curious and Adversarial Machines. (arXiv:2302.04787v1 [cs.LG])
    The ubiquity of distributed machine learning (ML) in sensitive public domain applications calls for algorithms that protect data privacy, while being robust to faults and adversarial behaviors. Although privacy and robustness have been extensively studied independently in distributed ML, their synthesis remains poorly understood. We present the first tight analysis of the error incurred by any algorithm ensuring robustness against a fraction of adversarial machines, as well as differential privacy (DP) for honest machines' data against any other curious entity. Our analysis exhibits a fundamental trade-off between privacy, robustness, and utility. Surprisingly, we show that the cost of this trade-off is marginal compared to that of the classical privacy-utility trade-off. To prove our lower bound, we consider the case of mean estimation, subject to distributed DP and robustness constraints, and devise reductions to centralized estimation of one-way marginals. We prove our matching upper bound by presenting a new distributed ML algorithm using a high-dimensional robust aggregation rule. The latter amortizes the dependence on the dimension in the error (caused by adversarial workers and DP), while being agnostic to the statistical properties of the data.
    Lorentz Equivariant Model for Knowledge-Enhanced Collaborative Filtering. (arXiv:2302.04545v1 [cs.IR])
    Introducing prior auxiliary information from the knowledge graph (KG) to assist the user-item graph can improve the comprehensive performance of the recommender system. Many recent studies show that the ensemble properties of hyperbolic spaces fit the scale-free and hierarchical characteristics exhibited in the above two types of graphs well. However, existing hyperbolic methods ignore the consideration of equivariance, thus they cannot generalize symmetric features under given transformations, which seriously limits the capability of the model. Moreover, they cannot balance preserving the heterogeneity and mining the high-order entity information to users across two graphs. To fill these gaps, we propose a rigorously Lorentz group equivariant knowledge-enhanced collaborative filtering model (LECF). Innovatively, we jointly update the attribute embeddings (containing the high-order entity signals from the KG) and hyperbolic embeddings (the distance between hyperbolic embeddings reveals the recommendation tendency) by the LECF layer with Lorentz Equivariant Transformation. Moreover, we propose Hyperbolic Sparse Attention Mechanism to sample the most informative neighbor nodes. Lorentz equivariance is strictly maintained throughout the entire model, and enforcing equivariance is proven necessary experimentally. Extensive experiments on three real-world benchmarks demonstrate that LECF remarkably outperforms state-of-the-art methods.
    PulseDL-II: A System-on-Chip Neural Network Accelerator for Timing and Energy Extraction of Nuclear Detector Signals. (arXiv:2209.00884v2 [physics.ins-det] UPDATED)
    Front-end electronics equipped with high-speed digitizers are being used and proposed for future nuclear detectors. Recent literature reveals that deep learning models, especially one-dimensional convolutional neural networks, are promising when dealing with digital signals from nuclear detectors. Simulations and experiments demonstrate the satisfactory accuracy and additional benefits of neural networks in this area. However, specific hardware accelerating such models for online operations still needs to be studied. In this work, we introduce PulseDL-II, a system-on-chip (SoC) specially designed for applications of event feature (time, energy, etc.) extraction from pulses with deep learning. Based on the previous version, PulseDL-II incorporates a RISC CPU into the system structure for better functional flexibility and integrity. The neural network accelerator in the SoC adopts a three-level (arithmetic unit, processing element, neural network) hierarchical architecture and facilitates parameter optimization of the digital design. Furthermore, we devise a quantization scheme compatible with deep learning frameworks (e.g., TensorFlow) within a selected subset of layer types. We validate the correct operations of PulseDL-II on field programmable gate arrays (FPGA) alone and with an experimental setup comprising a direct digital synthesis (DDS) and analog-to-digital converters (ADC). The proposed system achieved 60 ps time resolution and 0.40% energy resolution at signal to noise ratio (SNR) of 47.4 dB.
    Jensen-Shannon Divergence Based Novel Loss Functions for Bayesian Neural Networks. (arXiv:2209.11366v3 [cs.LG] UPDATED)
    The Kullback-Leibler (KL) divergence is widely used in state-of-the-art Bayesian Neural Networks (BNNs) to approximate the posterior distribution of weights. However, the KL divergence is unbounded and asymmetric, which may lead to instabilities during optimization or may yield poor generalizations. To overcome these limitations, we examine the Jensen-Shannon (JS) divergence that is bounded, symmetric, and more general. Towards this, we propose two novel loss functions for BNNs. The first loss function uses the geometric JS divergence (JS-G) that is symmetric, unbounded, and offers an analytical expression for Gaussian priors. The second loss function uses the generalized JS divergence (JS-A) that is symmetric and bounded. We show that the conventional KL divergence-based loss function is a special case of the two loss functions presented in this work. To evaluate the divergence part of the loss we use analytical expressions for JS-G and use Monte Carlo methods for JS-A. We provide algorithms to optimize the loss function using both these methods. The proposed loss functions offer additional parameters that can be tuned to control the degree of regularisation. The regularization performance of the JS divergences is analyzed to demonstrate their superiority over the state-of-the-art. Further, we derive the conditions for better regularization by the proposed JS-G divergence-based loss function than the KL divergence-based loss function. Bayesian convolutional neural networks (BCNN) based on the proposed JS divergences perform better than the state-of-the-art BCNN, which is shown for the classification of the CIFAR data set having various degrees of noise and a histopathology data set having a high bias.
    Sparse Random Networks for Communication-Efficient Federated Learning. (arXiv:2209.15328v2 [cs.LG] UPDATED)
    One main challenge in federated learning is the large communication cost of exchanging weight updates from clients to the server at each round. While prior work has made great progress in compressing the weight updates through gradient compression methods, we propose a radically different approach that does not update the weights at all. Instead, our method freezes the weights at their initial \emph{random} values and learns how to sparsify the random network for the best performance. To this end, the clients collaborate in training a \emph{stochastic} binary mask to find the optimal sparse random network within the original one. At the end of the training, the final model is a sparse network with random weights -- or a subnetwork inside the dense random network. We show improvements in accuracy, communication (less than $1$ bit per parameter (bpp)), convergence speed, and final model size (less than $1$ bpp) over relevant baselines on MNIST, EMNIST, CIFAR-10, and CIFAR-100 datasets, in the low bitrate regime under various system configurations.
    Contestable Camera Cars: A Speculative Design Exploration of Public AI That Is Open and Responsive to Dispute. (arXiv:2302.04603v1 [cs.HC])
    Local governments increasingly use artificial intelligence (AI) for automated decision-making. Contestability, making systems responsive to dispute, is a way to ensure they respect human rights to autonomy and dignity. We investigate the design of public urban AI systems for contestability through the example of camera cars: human-driven vehicles equipped with image sensors. Applying a provisional framework for contestable AI, we use speculative design to create a concept video of a contestable camera car. Using this concept video, we then conduct semi-structured interviews with 17 civil servants who work with AI employed by a large northwestern European city. The resulting data is analyzed using reflexive thematic analysis to identify the main challenges facing the implementation of contestability in public AI. We describe how civic participation faces issues of representation, public AI systems should integrate with existing democratic practices, and cities must expand capacities for responsible AI development and operation.
    Is This Loss Informative? Speeding Up Textual Inversion with Deterministic Objective Evaluation. (arXiv:2302.04841v1 [cs.CV])
    Text-to-image generation models represent the next step of evolution in image synthesis, offering natural means of flexible yet fine-grained control over the result. One emerging area of research is the rapid adaptation of large text-to-image models to smaller datasets or new visual concepts. However, the most efficient method of adaptation, called textual inversion, has a known limitation of long training time, which both restricts practical applications and increases the experiment time for research. In this work, we study the training dynamics of textual inversion, aiming to speed it up. We observe that most concepts are learned at early stages and do not improve in quality later, but standard model convergence metrics fail to indicate that. Instead, we propose a simple early stopping criterion that only requires computing the textual inversion loss on the same inputs for all training iterations. Our experiments on both Latent Diffusion and Stable Diffusion models for 93 concepts demonstrate the competitive performance of our method, speeding adaptation up to 15 times with no significant drops in quality.
    UniPC: A Unified Predictor-Corrector Framework for Fast Sampling of Diffusion Models. (arXiv:2302.04867v1 [cs.LG])
    Diffusion probabilistic models (DPMs) have demonstrated a very promising ability in high-resolution image synthesis. However, sampling from a pre-trained DPM usually requires hundreds of model evaluations, which is computationally expensive. Despite recent progress in designing high-order solvers for DPMs, there still exists room for further speedup, especially in extremely few steps (e.g., 5~10 steps). Inspired by the predictor-corrector for ODE solvers, we develop a unified corrector (UniC) that can be applied after any existing DPM sampler to increase the order of accuracy without extra model evaluations, and derive a unified predictor (UniP) that supports arbitrary order as a byproduct. Combining UniP and UniC, we propose a unified predictor-corrector framework called UniPC for the fast sampling of DPMs, which has a unified analytical form for any order and can significantly improve the sampling quality over previous methods. We evaluate our methods through extensive experiments including both unconditional and conditional sampling using pixel-space and latent-space DPMs. Our UniPC can achieve 3.87 FID on CIFAR10 (unconditional) and 7.51 FID on ImageNet 256$\times$256 (conditional) with only 10 function evaluations. Code is available at https://github.com/wl-zhao/UniPC
    Anomal-E: A Self-Supervised Network Intrusion Detection System based on Graph Neural Networks. (arXiv:2207.06819v5 [cs.LG] UPDATED)
    This paper investigates Graph Neural Networks (GNNs) application for self-supervised network intrusion and anomaly detection. GNNs are a deep learning approach for graph-based data that incorporate graph structures into learning to generalise graph representations and output embeddings. As network flows are naturally graph-based, GNNs are a suitable fit for analysing and learning network behaviour. The majority of current implementations of GNN-based Network Intrusion Detection Systems (NIDSs) rely heavily on labelled network traffic which can not only restrict the amount and structure of input traffic, but also the NIDSs potential to adapt to unseen attacks. To overcome these restrictions, we present Anomal-E, a GNN approach to intrusion and anomaly detection that leverages edge features and graph topological structure in a self-supervised process. This approach is, to the best our knowledge, the first successful and practical approach to network intrusion detection that utilises network flows in a self-supervised, edge leveraging GNN. Experimental results on two modern benchmark NIDS datasets not only clearly display the improvement of using Anomal-E embeddings rather than raw features, but also the potential Anomal-E has for detection on wild network traffic.
    Equivariant MuZero. (arXiv:2302.04798v1 [cs.LG])
    Deep reinforcement learning repeatedly succeeds in closed, well-defined domains such as games (Chess, Go, StarCraft). The next frontier is real-world scenarios, where setups are numerous and varied. For this, agents need to learn the underlying rules governing the environment, so as to robustly generalise to conditions that differ from those they were trained on. Model-based reinforcement learning algorithms, such as the highly successful MuZero, aim to accomplish this by learning a world model. However, leveraging a world model has not consistently shown greater generalisation capabilities compared to model-free alternatives. In this work, we propose improving the data efficiency and generalisation capabilities of MuZero by explicitly incorporating the symmetries of the environment in its world-model architecture. We prove that, so long as the neural networks used by MuZero are equivariant to a particular symmetry group acting on the environment, the entirety of MuZero's action-selection algorithm will also be equivariant to that group. We evaluate Equivariant MuZero on procedurally-generated MiniPacman and on Chaser from the ProcGen suite: training on a set of mazes, and then testing on unseen rotated versions, demonstrating the benefits of equivariance. Further, we verify that our performance improvements hold even when only some of the components of Equivariant MuZero obey strict equivariance, which highlights the robustness of our construction.
    $O(T^{-1})$ Convergence of Optimistic-Follow-the-Regularized-Leader in Two-Player Zero-Sum Markov Games. (arXiv:2209.12430v2 [cs.LG] UPDATED)
    We prove that optimistic-follow-the-regularized-leader (OFTRL), together with smooth value updates, finds an $O(T^{-1})$-approximate Nash equilibrium in $T$ iterations for two-player zero-sum Markov games with full information. This improves the $\tilde{O}(T^{-5/6})$ convergence rate recently shown in the paper Zhang et al (2022). The refined analysis hinges on two essential ingredients. First, the sum of the regrets of the two players, though not necessarily non-negative as in normal-form games, is approximately non-negative in Markov games. This property allows us to bound the second-order path lengths of the learning dynamics. Second, we prove a tighter algebraic inequality regarding the weights deployed by OFTRL that shaves an extra $\log T$ factor. This crucial improvement enables the inductive analysis that leads to the final $O(T^{-1})$ rate.
    An Information-Theoretic Analysis of Nonstationary Bandit Learning. (arXiv:2302.04452v1 [cs.LG])
    In nonstationary bandit learning problems, the decision-maker must continually gather information and adapt their action selection as the latent state of the environment evolves. In each time period, some latent optimal action maximizes expected reward under the environment state. We view the optimal action sequence as a stochastic process, and take an information-theoretic approach to analyze attainable performance. We bound limiting per-period regret in terms of the entropy rate of the optimal action process. The bound applies to a wide array of problems studied in the literature and reflects the problem's information structure through its information-ratio.
    Confident Sinkhorn Allocation for Pseudo-Labeling. (arXiv:2206.05880v3 [cs.LG] UPDATED)
    Semi-supervised learning is a critical tool in reducing machine learning's dependence on labeled data. It has been successfully applied to structured data, such as images and natural language, by exploiting the inherent spatial and semantic structure therein with retrained models or data augmentation. These methods are not applicable, however, when the data does not have the appropriate structure, or invariances. Due to their simplicity, pseudo-labeling (PL) methods can be widely used without any domain assumptions. These methods make greedy pseudo-label assignments, however, meaning that a single misclassification can have cascading consequences. This paper addresses this problem by proposing a Confident Sinkhorn Allocation (CSA), which identifies the best pseudo-label allocation via optimal transport. In doing so it considers the uncertainty in the predicted labels for the whole unlabelled set, in contrast to greedy allocation approaches. CSA outperforms the current state-of-the-art in this practically important area of semi-supervised learning. Additionally, we propose to use the Integral Probability Metrics to extend and improve the existing PAC-Bayes bound which relies on the Kullback-Leibler (KL) divergence, for ensemble models. Our code is publicly available at https://github.com/amzn/confident-sinkhorn-allocation.
    Fair-Net: A Network Architecture For Reducing Performance Disparity Between Identifiable Sub-Populations. (arXiv:2106.00720v3 [cs.LG] UPDATED)
    In real world datasets, particular groups are under-represented, much rarer than others, and machine learning classifiers will often preform worse on under-represented populations. This problem is aggravated across many domains where datasets are class imbalanced, with a minority class far rarer than the majority class. Naive approaches to handle under-representation and class imbalance include training sub-population specific classifiers that handle class imbalance or training a global classifier that overlooks sub-population disparities and aims to achieve high overall accuracy by handling class imbalance. In this study, we find that these approaches are vulnerable in class imbalanced datasets with minority sub-populations. We introduced Fair-Net, a branched multitask neural network architecture that improves both classification accuracy and probability calibration across identifiable sub-populations in class imbalanced datasets. Fair-Nets is a straightforward extension to the output layer and error function of a network, so can be incorporated in far more complex architectures. Empirical studies with three real world benchmark datasets demonstrate that Fair-Net improves classification and calibration performance, substantially reducing performance disparity between gender and racial sub-populations.
    How to avoid machine learning pitfalls: a guide for academic researchers. (arXiv:2108.02497v3 [cs.LG] UPDATED)
    This document is a concise outline of some of the common mistakes that occur when using machine learning, and what can be done to avoid them. Whilst it should be accessible to anyone with a basic understanding of machine learning techniques, it was originally written for research students, and focuses on issues that are of particular concern within academic research, such as the need to do rigorous comparisons and reach valid conclusions. It covers five stages of the machine learning process: what to do before model building, how to reliably build models, how to robustly evaluate models, how to compare models fairly, and how to report results.
    Fully Bayesian Autoencoders with Latent Sparse Gaussian Processes. (arXiv:2302.04534v1 [cs.LG])
    Autoencoders and their variants are among the most widely used models in representation learning and generative modeling. However, autoencoder-based models usually assume that the learned representations are i.i.d. and fail to capture the correlations between the data samples. To address this issue, we propose a novel Sparse Gaussian Process Bayesian Autoencoder (SGPBAE) model in which we impose fully Bayesian sparse Gaussian Process priors on the latent space of a Bayesian Autoencoder. We perform posterior estimation for this model via stochastic gradient Hamiltonian Monte Carlo. We evaluate our approach qualitatively and quantitatively on a wide range of representation learning and generative modeling tasks and show that our approach consistently outperforms multiple alternatives relying on Variational Autoencoders.
    DPSNN: A Differentially Private Spiking Neural Network. (arXiv:2205.12718v2 [cs.NE] UPDATED)
    Privacy-preserving is a key problem for the machine learning algorithm. Spiking neural network (SNN) plays an important role in many domains, such as image classification, object detection, and speech recognition, but the study on the privacy protection of SNN is urgently needed. This study combines the differential privacy (DP) algorithm and SNN and proposes differentially private spiking neural network (DPSNN). DP injects noise into the gradient, and SNN transmits information in discrete spike trains so that our differentially private SNN can maintain strong privacy protection while still ensuring high accuracy. We conducted experiments on MNIST, Fashion-MNIST, and the face recognition dataset Extended YaleB. When the privacy protection is improved, the accuracy of the artificial neural network(ANN) drops significantly, but our algorithm shows little change in performance. Meanwhile, we analyzed different factors that affect the privacy protection of SNN. Firstly, the less precise the surrogate gradient is, the better the privacy protection of the SNN. Secondly, the Integrate-And-Fire (IF) neurons perform better than leaky Integrate-And-Fire (LIF) neurons. Thirdly, a large time window contributes more to privacy protection and performance.
    Zero-Knowledge Zero-Shot Learning for Novel Visual Category Discovery. (arXiv:2302.04427v1 [cs.CV])
    Generalized Zero-Shot Learning (GZSL) and Open-Set Recognition (OSR) are two mainstream settings that greatly extend conventional visual object recognition. However, the limitations of their problem settings are not negligible. The novel categories in GZSL require pre-defined semantic labels, making the problem setting less realistic; the oversimplified unknown class in OSR fails to explore the innate fine-grained and mixed structures of novel categories. In light of this, we are motivated to consider a new problem setting named Zero-Knowledge Zero-Shot Learning (ZK-ZSL) that assumes no prior knowledge of novel classes and aims to classify seen and unseen samples and recover semantic attributes of the fine-grained novel categories for further interpretation. To achieve this, we propose a novel framework that recovers the clustering structures of both seen and unseen categories where the seen class structures are guided by source labels. In addition, a structural alignment loss is designed to aid the semantic learning of unseen categories with their recovered structures. Experimental results demonstrate our method's superior performance in classification and semantic recovery on four benchmark datasets.
    Scaling Goal-based Exploration via Pruning Proto-goals. (arXiv:2302.04693v1 [cs.LG])
    One of the gnarliest challenges in reinforcement learning (RL) is exploration that scales to vast domains, where novelty-, or coverage-seeking behaviour falls short. Goal-directed, purposeful behaviours are able to overcome this, but rely on a good goal space. The core challenge in goal discovery is finding the right balance between generality (not hand-crafted) and tractability (useful, not too many). Our approach explicitly seeks the middle ground, enabling the human designer to specify a vast but meaningful proto-goal space, and an autonomous discovery process to refine this to a narrower space of controllable, reachable, novel, and relevant goals. The effectiveness of goal-conditioned exploration with the latter is then demonstrated in three challenging environments.
    One-shot Visual Imitation via Attributed Waypoints and Demonstration Augmentation. (arXiv:2302.04856v1 [cs.RO])
    In this paper, we analyze the behavior of existing techniques and design new solutions for the problem of one-shot visual imitation. In this setting, an agent must solve a novel instance of a novel task given just a single visual demonstration. Our analysis reveals that current methods fall short because of three errors: the DAgger problem arising from purely offline training, last centimeter errors in interacting with objects, and mis-fitting to the task context rather than to the actual task. This motivates the design of our modular approach where we a) separate out task inference (what to do) from task execution (how to do it), and b) develop data augmentation and generation techniques to mitigate mis-fitting. The former allows us to leverage hand-crafted motor primitives for task execution which side-steps the DAgger problem and last centimeter errors, while the latter gets the model to focus on the task rather than the task context. Our model gets 100% and 48% success rates on two recent benchmarks, improving upon the current state-of-the-art by absolute 90% and 20% respectively.
    Cooperative Open-ended Learning Framework for Zero-shot Coordination. (arXiv:2302.04831v1 [cs.AI])
    Zero-shot coordination in cooperative artificial intelligence (AI) remains a significant challenge, which means effectively coordinating with a wide range of unseen partners. Previous algorithms have attempted to address this challenge by optimizing fixed objectives within a population to improve strategy or behavior diversity. However, these approaches can result in a loss of learning and an inability to cooperate with certain strategies within the population, known as cooperative incompatibility. To address this issue, we propose the Cooperative Open-ended LEarning (COLE) framework, which constructs open-ended objectives in cooperative games with two players from the perspective of graph theory to assess and identify the cooperative ability of each strategy. We further specify the framework and propose a practical algorithm that leverages knowledge from game theory and graph theory. Furthermore, an analysis of the learning process of the algorithm shows that it can efficiently overcome cooperative incompatibility. The experimental results in the Overcooked game environment demonstrate that our method outperforms current state-of-the-art methods when coordinating with different-level partners. Our code and demo are available at https://sites.google.com/view/cole-2023.
    Towards Ignoring Backgrounds and Improving Generalization: a Costless DNN Visual Attention Mechanism. (arXiv:2202.00232v5 [eess.IV] UPDATED)
    This work introduces an attention mechanism for image classifiers and the corresponding deep neural network (DNN) architecture, dubbed ISNet. During training, the ISNet uses segmentation targets to learn how to find the image's region of interest and concentrate its attention on it. The proposal is based on a novel concept, background relevance minimization in LRP explanation heatmaps. It can be applied to virtually any classification neural network architecture, without any extra computational cost at run-time. Capable of ignoring the background, the resulting single DNN can substitute the common pipeline of a segmenter followed by a classifier, being faster and lighter. We tested the ISNet with three applications: COVID-19 and tuberculosis detection in chest X-rays, and facial attribute estimation. The first two tasks employed mixed training databases, which fostered background bias and shortcut learning. By focusing on lungs, the ISNet reduced shortcut learning, improving generalization to external (out-of-distribution) test datasets. When training data presented background bias, the ISNet's test performance significantly surpassed standard classifiers, multi-task DNNs (performing classification and segmentation), attention-gated neural networks, Guided Attention Inference Networks, and the standard segmentation-classification pipeline. Facial attribute estimation demonstrated that ISNet could precisely focus on faces, being also applicable to natural images. ISNet presents an accurate, fast, and light methodology to ignore backgrounds and improve generalization, especially when background bias is a concern.
    Compositional Scene Representation Learning via Reconstruction: A Survey. (arXiv:2202.07135v3 [cs.LG] UPDATED)
    Visual scenes are composed of visual concepts and have the property of combinatorial explosion. An important reason for humans to efficiently learn from diverse visual scenes is the ability of compositional perception, and it is desirable for artificial intelligence to have similar abilities. Compositional scene representation learning is a task that enables such abilities. In recent years, various methods have been proposed to apply deep neural networks, which have been proven to be advantageous in representation learning, to learn compositional scene representations via reconstruction, advancing this research direction into the deep learning era. Learning via reconstruction is advantageous because it may utilize massive unlabeled data and avoid costly and laborious data annotation. In this survey, we first outline the current progress on reconstruction-based compositional scene representation learning with deep neural networks, including development history and categorizations of existing methods from the perspectives of the modeling of visual scenes and the inference of scene representations; then provide benchmarks, including an open source toolbox to reproduce the benchmark experiments, of representative methods that consider the most extensively studied problem setting and form the foundation for other methods; and finally discuss the limitations of existing methods and future directions of this research topic.
    Fairness Amidst Non-IID Graph Data: Current Achievements and Future Directions. (arXiv:2202.07170v3 [cs.LG] UPDATED)
    The importance of understanding and correcting algorithmic bias in machine learning (ML) has led to an increase in research on fairness in ML, which typically assumes that the underlying data is independent and identically distributed (IID). However, in reality, data is often represented using non-IID graph structures that capture connections among individual units. To address bias in ML systems, it is crucial to bridge the gap between the traditional fairness literature designed for IID data and the ubiquity of non-IID graph data. In this survey, we review such recent advance in fairness amidst non-IID graph data and identify datasets and evaluation metrics available for future research. We also point out the limitations of existing work as well as promising future directions.
    On Non-Linear operators for Geometric Deep Learning. (arXiv:2207.03485v2 [cs.LG] UPDATED)
    This work studies operators mapping vector and scalar fields defined over a manifold $\mathcal{M}$, and which commute with its group of diffeomorphisms $\text{Diff}(\mathcal{M})$. We prove that in the case of scalar fields $L^p_\omega(\mathcal{M,\mathbb{R}})$, those operators correspond to point-wise non-linearities, recovering and extending known results on $\mathbb{R}^d$. In the context of Neural Networks defined over $\mathcal{M}$, it indicates that point-wise non-linear operators are the only universal family that commutes with any group of symmetries, and justifies their systematic use in combination with dedicated linear operators commuting with specific symmetries. In the case of vector fields $L^p_\omega(\mathcal{M},T\mathcal{M})$, we show that those operators are solely the scalar multiplication. It indicates that $\text{Diff}(\mathcal{M})$ is too rich and that there is no universal class of non-linear operators to motivate the design of Neural Networks over the symmetries of $\mathcal{M}$.
    SF-SGL: Solver-Free Spectral Graph Learning from Linear Measurements. (arXiv:2302.04384v1 [cs.LG])
    This work introduces a highly-scalable spectral graph densification framework (SGL) for learning resistor networks with linear measurements, such as node voltages and currents. We show that the proposed graph learning approach is equivalent to solving the classical graphical Lasso problems with Laplacian-like precision matrices. We prove that given $O(\log N)$ pairs of voltage and current measurements, it is possible to recover sparse $N$-node resistor networks that can well preserve the effective resistance distances on the original graph. In addition, the learned graphs also preserve the structural (spectral) properties of the original graph, which can potentially be leveraged in many circuit design and optimization tasks. To achieve more scalable performance, we also introduce a solver-free method (SF-SGL) that exploits multilevel spectral approximation of the graphs and allows for a scalable and flexible decomposition of the entire graph spectrum (to be learned) into multiple different eigenvalue clusters (frequency bands). Such a solver-free approach allows us to more efficiently identify the most spectrally-critical edges for reducing various ranges of spectral embedding distortions. Through extensive experiments for a variety of real-world test cases, we show that the proposed approach is highly scalable for learning sparse resistor networks without sacrificing solution quality. We also introduce a data-driven EDA algorithm for vectorless power/thermal integrity verifications to allow estimating worst-case voltage/temperature (gradient) distributions across the entire chip by leveraging a few voltage/temperature measurements.
    Improving Certified Robustness via Statistical Learning with Logical Reasoning. (arXiv:2003.00120v8 [cs.LG] UPDATED)
    Intensive algorithmic efforts have been made to enable the rapid improvements of certificated robustness for complex ML models recently. However, current robustness certification methods are only able to certify under a limited perturbation radius. Given that existing pure data-driven statistical approaches have reached a bottleneck, in this paper, we propose to integrate statistical ML models with knowledge (expressed as logical rules) as a reasoning component using Markov logic networks (MLN, so as to further improve the overall certified robustness. This opens new research questions about certifying the robustness of such a paradigm, especially the reasoning component (e.g., MLN). As the first step towards understanding these questions, we first prove that the computational complexity of certifying the robustness of MLN is #P-hard. Guided by this hardness result, we then derive the first certified robustness bound for MLN by carefully analyzing different model regimes. Finally, we conduct extensive experiments on five datasets including both high-dimensional images and natural language texts, and we show that the certified robustness with knowledge-based logical reasoning indeed significantly outperforms that of the state-of-the-arts.
    Bayesian MRI Reconstruction with Joint Uncertainty Estimation using Diffusion Models. (arXiv:2202.01479v3 [cs.LG] UPDATED)
    We introduce a framework that enables efficient sampling from learned probability distributions for MRI reconstruction. Different from conventional deep learning-based MRI reconstruction techniques, samples are drawn from the posterior distribution given the measured k-space using the Markov chain Monte Carlo (MCMC) method. In addition to the maximum a posteriori (MAP) estimate for the image, which can be obtained with conventional methods, the minimum mean square error (MMSE) estimate and uncertainty maps can also be computed. The data-driven Markov chains are constructed from the generative model learned from a given image database and are independent of the forward operator that is used to model the k-space measurement. This provides flexibility because the method can be applied to k-space acquired with different sampling schemes or receive coils using the same pre-trained models. Furthermore, we use a framework based on a reverse diffusion process to be able to utilize advanced generative models. The performance of the method is evaluated on an open dataset using 10-fold undersampling in k-space.
    Differentially Private Deep Q-Learning for Pattern Privacy Preservation in MEC Offloading. (arXiv:2302.04608v1 [cs.NI])
    Mobile edge computing (MEC) is a promising paradigm to meet the quality of service (QoS) requirements of latency-sensitive IoT applications. However, attackers may eavesdrop on the offloading decisions to infer the edge server's (ES's) queue information and users' usage patterns, thereby incurring the pattern privacy (PP) issue. Therefore, we propose an offloading strategy which jointly minimizes the latency, ES's energy consumption, and task dropping rate, while preserving PP. Firstly, we formulate the dynamic computation offloading procedure as a Markov decision process (MDP). Next, we develop a Differential Privacy Deep Q-learning based Offloading (DP-DQO) algorithm to solve this problem while addressing the PP issue by injecting noise into the generated offloading decisions. This is achieved by modifying the deep Q-network (DQN) with a Function-output Gaussian process mechanism. We provide a theoretical privacy guarantee and a utility guarantee (learning error bound) for the DP-DQO algorithm and finally, conduct simulations to evaluate the performance of our proposed algorithm by comparing it with greedy and DQN-based algorithms.
    Lazy OCO: Online Convex Optimization on a Switching Budget. (arXiv:2102.03803v5 [cs.LG] UPDATED)
    We study a variant of online convex optimization where the player is permitted to switch decisions at most $S$ times in expectation throughout $T$ rounds. Similar problems have been addressed in prior work for the discrete decision set setting, and more recently in the continuous setting but only with an adaptive adversary. In this work, we aim to fill the gap and present computationally efficient algorithms in the more prevalent oblivious setting, establishing a regret bound of $O(T/S)$ for general convex losses and $\widetilde O(T/S^2)$ for strongly convex losses. In addition, for stochastic i.i.d.~losses, we present a simple algorithm that performs $\log T$ switches with only a multiplicative $\log T$ factor overhead in its regret in both the general and strongly convex settings. Finally, we complement our algorithms with lower bounds that match our upper bounds in some of the cases we consider.
    Privacy-Preserving Representation Learning for Text-Attributed Networks with Simplicial Complexes. (arXiv:2302.04383v1 [cs.LG])
    Although recent network representation learning (NRL) works in text-attributed networks demonstrated superior performance for various graph inference tasks, learning network representations could always raise privacy concerns when nodes represent people or human-related variables. Moreover, standard NRLs that leverage structural information from a graph proceed by first encoding pairwise relationships into learned representations and then analysing its properties. This approach is fundamentally misaligned with problems where the relationships involve multiple points, and topological structure must be encoded beyond pairwise interactions. Fortunately, the machinery of topological data analysis (TDA) and, in particular, simplicial neural networks (SNNs) offer a mathematically rigorous framework to learn higher-order interactions between nodes. It is critical to investigate if the representation outputs from SNNs are more vulnerable compared to regular representation outputs from graph neural networks (GNNs) via pairwise interactions. In my dissertation, I will first study learning the representations with text attributes for simplicial complexes (RT4SC) via SNNs. Then, I will conduct research on two potential attacks on the representation outputs from SNNs: (1) membership inference attack, which infers whether a certain node of a graph is inside the training data of the GNN model; and (2) graph reconstruction attacks, which infer the confidential edges of a text-attributed network. Finally, I will study a privacy-preserving deterministic differentially private alternating direction method of multiplier to learn secure representation outputs from SNNs that capture multi-scale relationships and facilitate the passage from local structure to global invariant features on text-attributed networks.
    Enhancing E-Commerce Recommendation using Pre-Trained Language Model and Fine-Tuning. (arXiv:2302.04443v1 [cs.LG])
    Pretrained Language Models (PLM) have been greatly successful on a board range of natural language processing (NLP) tasks. However, it has just started being applied to the domain of recommendation systems. Traditional recommendation algorithms failed to incorporate the rich textual information in e-commerce datasets, which hinderss the performance of those models. We present a thorough investigation on the effect of various strategy of incorporating PLMs into traditional recommender algorithms on one of the e-commerce datasets, and we compare the results with vanilla recommender baseline models. We show that the application of PLMs and domain specific fine-tuning lead to an increase on the predictive capability of combined models. These results accentuate the importance of utilizing textual information in the context of e-commerce, and provides insight on how to better apply PLMs alongside traditional recommender system algorithms. The code used in this paper is available on Github: https://github.com/NuofanXu/bert_retail_recommender.
    Analysing the SEDs of protoplanetary disks with machine learning. (arXiv:2302.04629v1 [astro-ph.EP])
    ABRIDGED. The analysis of spectral energy distributions (SEDs) of protoplanetary disks to determine their physical properties is known to be highly degenerate. Hence, a Bayesian analysis is required to obtain parameter uncertainties and degeneracies. The challenge here is computational speed, as one radiative transfer model requires a couple of minutes to compute. We performed a Bayesian analysis for 30 well-known protoplanetary disks to determine their physical disk properties, including uncertainties and degeneracies. To circumvent the computational cost problem, we created neural networks (NNs) to emulate the SED generation process. We created two sets of radiative transfer disk models to train and test two NNs that predict SEDs for continuous and discontinuous disks. A Bayesian analysis was then performed on 30 protoplanetary disks with SED data collected by the DIANA project to determine the posterior distributions of all parameters. We ran this analysis twice, (i) with old distances and additional parameter constraints as used in a previous study, to compare results, and (ii) with updated distances and free choice of parameters to obtain homogeneous and unbiased model parameters. We evaluated the uncertainties in the determination of physical disk parameters from SED analysis, and detected and quantified the strongest degeneracies. The NNs are able to predict SEDs within 1ms with uncertainties of about 5% compared to the true SEDs obtained by the radiative transfer code. We find parameter values and uncertainties that are significantly different from previous values obtained by $\chi^2$ fitting. Comparing the global evidence for continuous and discontinuous disks, we find that 26 out of 30 objects are better described by disks that have two distinct radial zones. Also, we created an interactive tool that instantly returns the SED predicted by our NNs for any parameter combination.
    Re-ViLM: Retrieval-Augmented Visual Language Model for Zero and Few-Shot Image Captioning. (arXiv:2302.04858v1 [cs.CV])
    Augmenting pretrained language models (LMs) with a vision encoder (e.g., Flamingo) has obtained state-of-the-art results in image-to-text generation. However, these models store all the knowledge within their parameters, thus often requiring enormous model parameters to model the abundant visual concepts and very rich textual descriptions. Additionally, they are inefficient in incorporating new data, requiring a computational-expensive fine-tuning process. In this work, we introduce a Retrieval-augmented Visual Language Model, Re-ViLM, built upon the Flamingo, that supports retrieving the relevant knowledge from the external database for zero and in-context few-shot image-to-text generations. By storing certain knowledge explicitly in the external database, our approach reduces the number of model parameters and can easily accommodate new data during evaluation by simply updating the database. We also construct an interleaved image and text data that facilitates in-context few-shot learning capabilities. We demonstrate that Re-ViLM significantly boosts performance for image-to-text generation tasks, especially for zero-shot and few-shot generation in out-of-domain settings with 4 times less parameters compared with baseline methods.
    Trading Information between Latents in Hierarchical Variational Autoencoders. (arXiv:2302.04855v1 [stat.ML])
    Variational Autoencoders (VAEs) were originally motivated (Kingma & Welling, 2014) as probabilistic generative models in which one performs approximate Bayesian inference. The proposal of $\beta$-VAEs (Higgins et al., 2017) breaks this interpretation and generalizes VAEs to application domains beyond generative modeling (e.g., representation learning, clustering, or lossy data compression) by introducing an objective function that allows practitioners to trade off between the information content ("bit rate") of the latent representation and the distortion of reconstructed data (Alemi et al., 2018). In this paper, we reconsider this rate/distortion trade-off in the context of hierarchical VAEs, i.e., VAEs with more than one layer of latent variables. We identify a general class of inference models for which one can split the rate into contributions from each layer, which can then be tuned independently. We derive theoretical bounds on the performance of downstream tasks as functions of the individual layers' rates and verify our theoretical findings in large-scale experiments. Our results provide guidance for practitioners on which region in rate-space to target for a given application.
    DeepCAM: A Fully CAM-based Inference Accelerator with Variable Hash Lengths for Energy-efficient Deep Neural Networks. (arXiv:2302.04712v1 [cs.LG])
    With ever increasing depth and width in deep neural networks to achieve state-of-the-art performance, deep learning computation has significantly grown, and dot-products remain dominant in overall computation time. Most prior works are built on conventional dot-product where weighted input summation is used to represent the neuron operation. However, another implementation of dot-product based on the notion of angles and magnitudes in the Euclidean space has attracted limited attention. This paper proposes DeepCAM, an inference accelerator built on two critical innovations to alleviate the computation time bottleneck of convolutional neural networks. The first innovation is an approximate dot-product built on computations in the Euclidean space that can replace addition and multiplication with simple bit-wise operations. The second innovation is a dynamic size content addressable memory-based (CAM-based) accelerator to perform bit-wise operations and accelerate the CNNs with a lower computation time. Our experiments on benchmark image recognition datasets demonstrate that DeepCAM is up to 523x and 3498x faster than Eyeriss and traditional CPUs like Intel Skylake, respectively. Furthermore, the energy consumed by our DeepCAM approach is 2.16x to 109x less compared to Eyeriss.
    How degenerate is the parametrization of neural networks with the ReLU activation function?. (arXiv:1905.09803v3 [cs.LG] UPDATED)
    Neural network training is usually accomplished by solving a non-convex optimization problem using stochastic gradient descent. Although one optimizes over the networks parameters, the main loss function generally only depends on the realization of the neural network, i.e. the function it computes. Studying the optimization problem over the space of realizations opens up new ways to understand neural network training. In particular, usual loss functions like mean squared error and categorical cross entropy are convex on spaces of neural network realizations, which themselves are non-convex. Approximation capabilities of neural networks can be used to deal with the latter non-convexity, which allows us to establish that for sufficiently large networks local minima of a regularized optimization problem on the realization space are almost optimal. Note, however, that each realization has many different, possibly degenerate, parametrizations. In particular, a local minimum in the parametrization space needs not correspond to a local minimum in the realization space. To establish such a connection, inverse stability of the realization map is required, meaning that proximity of realizations must imply proximity of corresponding parametrizations. We present pathologies which prevent inverse stability in general, and, for shallow networks, proceed to establish a restricted space of parametrizations on which we have inverse stability w.r.t. to a Sobolev norm. Furthermore, we show that by optimizing over such restricted sets, it is still possible to learn any function which can be learned by optimization over unrestricted sets.
    InfoNCE is a variational autoencoder. (arXiv:2107.02495v2 [stat.ML] UPDATED)
    There are two main approaches to self-supervised learning (SSL), generative SSL, which learns a probabilistic model of the inputs, or contrastive SSL where we design a supervised learning task to encourage good representations. We reconcile these approaches by showing that contrastive SSL methods (including InfoNCE) which maximize the mutual information (MI) implicitly learn a probabilistic model of the inputs (specifically, a variational autoencoder; VAE). In particular, when we learn the optimal prior, the VAE objective (the ELBO) becomes equal to the MI (up to a constant). In turn, for a deterministic encoder the ELBO is equal to the log Bayesian model evidence. This establishes a profound connection between Bayesian inference and information theory. However, practical InfoNCE methods do not use the MI as an objective: the MI is invariant to arbitrary invertible transformations, so using an MI objective can lead to highly entangled representations (Tschannen et al., 2019). Instead, the actual InfoNCE objective is a simplified lower bound on the MI which is loose even in the infinite sample limit. Thus, an objective that works (i.e. the actual InfoNCE objective) appears to be motivated as a loose bound on an objective that does not work (i.e. the true MI which gives arbitrarily entangled representations). We give an alternative motivation for the actual InfoNCE objective. In particular, we show that in the infinite sample limit, and for a particular choice of prior, the actual InfoNCE objective is equal to the log Bayesian model evidence (up to a constant). Thus, we argue that our VAE perspective gives a better motivation for InfoNCE than MI, as the actual InfoNCE objective is only loosely bounded by the MI, but is equal to the log Bayesian model evidence (up to a constant).
    Mask Conditional Synthetic Satellite Imagery. (arXiv:2302.04305v1 [cs.CV])
    In this paper we propose a mask-conditional synthetic image generation model for creating synthetic satellite imagery datasets. Given a dataset of real high-resolution images and accompanying land cover masks, we show that it is possible to train an upstream conditional synthetic imagery generator, use that generator to create synthetic imagery with the land cover masks, then train a downstream model on the synthetic imagery and land cover masks that achieves similar test performance to a model that was trained with the real imagery. Further, we find that incorporating a mixture of real and synthetic imagery acts as a data augmentation method, producing better models than using only real imagery (0.5834 vs. 0.5235 mIoU). Finally, we find that encouraging diversity of outputs in the upstream model is a necessary component for improved downstream task performance. We have released code for reproducing our work on GitHub, see https://github.com/ms-synthetic-satellite-image/synthetic-satellite-imagery .
    Polynomial Neural Fields for Subband Decomposition and Manipulation. (arXiv:2302.04862v1 [cs.CV])
    Neural fields have emerged as a new paradigm for representing signals, thanks to their ability to do it compactly while being easy to optimize. In most applications, however, neural fields are treated like black boxes, which precludes many signal manipulation tasks. In this paper, we propose a new class of neural fields called polynomial neural fields (PNFs). The key advantage of a PNF is that it can represent a signal as a composition of a number of manipulable and interpretable components without losing the merits of neural fields representation. We develop a general theoretical framework to analyze and design PNFs. We use this framework to design Fourier PNFs, which match state-of-the-art performance in signal representation tasks that use neural fields. In addition, we empirically demonstrate that Fourier PNFs enable signal manipulation applications such as texture transfer and scale-space interpolation. Code is available at https://github.com/stevenygd/PNF.
    Lightweight Transformers for Clinical Natural Language Processing. (arXiv:2302.04725v1 [cs.CL])
    Specialised pre-trained language models are becoming more frequent in NLP since they can potentially outperform models trained on generic texts. BioBERT and BioClinicalBERT are two examples of such models that have shown promise in medical NLP tasks. Many of these models are overparametrised and resource-intensive, but thanks to techniques like Knowledge Distillation (KD), it is possible to create smaller versions that perform almost as well as their larger counterparts. In this work, we specifically focus on development of compact language models for processing clinical texts (i.e. progress notes, discharge summaries etc). We developed a number of efficient lightweight clinical transformers using knowledge distillation and continual learning, with the number of parameters ranging from 15 million to 65 million. These models performed comparably to larger models such as BioBERT and ClinicalBioBERT and significantly outperformed other compact models trained on general or biomedical data. Our extensive evaluation was done across several standard datasets and covered a wide range of clinical text-mining tasks, including Natural Language Inference, Relation Extraction, Named Entity Recognition, and Sequence Classification. To our knowledge, this is the first comprehensive study specifically focused on creating efficient and compact transformers for clinical NLP tasks. The models and code used in this study can be found on our Huggingface profile at https://huggingface.co/nlpie and Github page at https://github.com/nlpie-research/Lightweight-Clinical-Transformers, respectively, promoting reproducibility of our results.
    Dimension reduction and redundancy removal through successive Schmidt decompositions. (arXiv:2302.04801v1 [quant-ph])
    Quantum computers are believed to have the ability to process huge data sizes which can be seen in machine learning applications. In these applications, the data in general is classical. Therefore, to process them on a quantum computer, there is a need for efficient methods which can be used to map classical data on quantum states in a concise manner. On the other hand, to verify the results of quantum computers and study quantum algorithms, we need to be able to approximate quantum operations into forms that are easier to simulate on classical computers with some errors. Motivated by these needs, in this paper we study the approximation of matrices and vectors by using their tensor products obtained through successive Schmidt decompositions. We show that data with distributions such as uniform, Poisson, exponential, or similar to these distributions can be approximated by using only a few terms which can be easily mapped onto quantum circuits. The examples include random data with different distributions, the Gram matrices of iris flower, handwritten digits, 20newsgroup, and labeled faces in the wild. And similarly, some quantum operations such as quantum Fourier transform and variational quantum circuits with a small depth also may be approximated with a few terms that are easier to simulate on classical computers. Furthermore, we show how the method can be used to simplify quantum Hamiltonians: In particular, we show the application to randomly generated transverse field Ising model Hamiltonians. The reduced Hamiltonians can be mapped into quantum circuits easily and therefore can be simulated more efficiently.
    Constrained Empirical Risk Minimization: Theory and Practice. (arXiv:2302.04729v1 [cs.LG])
    Deep Neural Networks (DNNs) are widely used for their ability to effectively approximate large classes of functions. This flexibility, however, makes the strict enforcement of constraints on DNNs an open problem. Here we present a framework that, under mild assumptions, allows the exact enforcement of constraints on parameterized sets of functions such as DNNs. Instead of imposing "soft'' constraints via additional terms in the loss, we restrict (a subset of) the DNN parameters to a submanifold on which the constraints are satisfied exactly throughout the entire training procedure. We focus on constraints that are outside the scope of equivariant networks used in Geometric Deep Learning. As a major example of the framework, we restrict filters of a Convolutional Neural Network (CNN) to be wavelets, and apply these wavelet networks to the task of contour prediction in the medical domain.
    Delay Sensitive Hierarchical Federated Learning with Stochastic Local Updates. (arXiv:2302.04851v1 [cs.IT])
    The impact of local averaging on the performance of federated learning (FL) systems is studied in the presence of communication delay between the clients and the parameter server. To minimize the effect of delay, clients are assigned into different groups, each having its own local parameter server (LPS) that aggregates its clients' models. The groups' models are then aggregated at a global parameter server (GPS) that only communicates with the LPSs. Such setting is known as hierarchical FL (HFL). Different from most works in the literature, the number of local and global communication rounds in our work is randomly determined by the (different) delays experienced by each group of clients. Specifically, the number of local averaging rounds are tied to a wall-clock time period coined the sync time $S$, after which the LPSs synchronize their models by sharing them with the GPS. Such sync time $S$ is then reapplied until a global wall-clock time is exhausted.
    Real-world Machine Learning Systems: A survey from a Data-Oriented Architecture Perspective. (arXiv:2302.04810v1 [cs.SE])
    With the upsurge of interest in artificial intelligence machine learning (ML) algorithms, originally developed in academic environments, are now being deployed as parts of real-life systems that deal with large amounts of heterogeneous, dynamic, and high-dimensional data. Deployment of ML methods in real life is prone to challenges across the whole system life-cycle from data management to systems deployment, monitoring, and maintenance. Data-Oriented Architecture (DOA) is an emerging software engineering paradigm that has the potential to mitigate these challenges by proposing a set of principles to create data-driven, loosely coupled, decentralised, and open systems. However DOA as a concept is not widespread yet, and there is no common understanding of how it can be realised in practice. This review addresses that problem by contextualising the principles that underpin the DOA paradigm through the ML system challenges. We explore the extent to which current architectures of ML-based real-world systems have implemented the DOA principles. We also formulate open research challenges and directions for further development of the DOA paradigm.
    Domain Generalization by Functional Regression. (arXiv:2302.04724v1 [cs.LG])
    The problem of domain generalization is to learn, given data from different source distributions, a model that can be expected to generalize well on new target distributions which are only seen through unlabeled samples. In this paper, we study domain generalization as a problem of functional regression. Our concept leads to a new algorithm for learning a linear operator from marginal distributions of inputs to the corresponding conditional distributions of outputs given inputs. Our algorithm allows a source distribution-dependent construction of reproducing kernel Hilbert spaces for prediction, and, satisfies finite sample error bounds for the idealized risk. Numerical implementations and source code are available.
    Geometry-Complete Diffusion for 3D Molecule Generation. (arXiv:2302.04313v1 [cs.LG])
    Denoising diffusion probabilistic models (DDPMs) have recently taken the field of generative modeling by storm, pioneering new state-of-the-art results in disciplines such as computer vision and computational biology for diverse tasks ranging from text-guided image generation to structure-guided protein design. Along this latter line of research, methods such as those of Hoogeboom et al. 2022 have been proposed for unconditionally generating 3D molecules using equivariant graph neural networks (GNNs) within a DDPM framework. Toward this end, we propose GCDM, a geometry-complete diffusion model that achieves new state-of-the-art results for 3D molecule diffusion generation by leveraging the representation learning strengths offered by GNNs that perform geometry-complete message-passing. Our results with GCDM also offer preliminary insights into how physical inductive biases impact the generative dynamics of molecular DDPMs. The source code, data, and instructions to train new models or reproduce our results are freely available at https://github.com/BioinfoMachineLearning/bio-diffusion.
    The Re-Label Method For Data-Centric Machine Learning. (arXiv:2302.04391v1 [cs.LG])
    In industry deep learning application, our manually labeled data has a certain number of noisy data. To solve this problem and achieve more than 90 score in dev dataset, we present a simple method to find the noisy data and re-label the noisy data by human, given the model predictions as references in human labeling. In this paper, we illustrate our idea for a broad set of deep learning tasks, includes classification, sequence tagging, object detection, sequence generation, click-through rate prediction. The experimental results and human evaluation results verify our idea.
    Optimistic Online Mirror Descent for Bridging Stochastic and Adversarial Online Convex Optimization. (arXiv:2302.04552v1 [cs.LG])
    Stochastically Extended Adversarial (SEA) model is introduced by Sachs et al. [2022] as an interpolation between stochastic and adversarial online convex optimization. Under the smoothness condition, they demonstrate that the expected regret of optimistic follow-the-regularized-leader (FTRL) depends on the cumulative stochastic variance $\sigma_{1:T}^2$ and the cumulative adversarial variation $\Sigma_{1:T}^2$ for convex functions. They also provide a slightly weaker bound based on the maximal stochastic variance $\sigma_{\max}^2$ and the maximal adversarial variation $\Sigma_{\max}^2$ for strongly convex functions. Inspired by their work, we investigate the theoretical guarantees of optimistic online mirror descent (OMD) for the SEA model. For convex and smooth functions, we obtain the same $\mathcal{O}(\sqrt{\sigma_{1:T}^2}+\sqrt{\Sigma_{1:T}^2})$ regret bound, without the convexity requirement of individual functions. For strongly convex and smooth functions, we establish an $\mathcal{O}(\min\{\log (\sigma_{1:T}^2+\Sigma_{1:T}^2), (\sigma_{\max}^2 + \Sigma_{\max}^2) \log T\})$ bound, better than their $\mathcal{O}((\sigma_{\max}^2 + \Sigma_{\max}^2) \log T)$ bound. For \mbox{exp-concave} and smooth functions, we achieve a new $\mathcal{O}(d\log(\sigma_{1:T}^2+\Sigma_{1:T}^2))$ bound. Owing to the OMD framework, we can further extend our result to obtain dynamic regret guarantees, which are more favorable in non-stationary online scenarios. The attained results allow us to recover excess risk bounds of the stochastic setting and regret bounds of the adversarial setting, and derive new guarantees for many intermediate scenarios.
    Read and Reap the Rewards: Learning to Play Atari with the Help of Instruction Manuals. (arXiv:2302.04449v1 [cs.LG])
    High sample complexity has long been a challenge for RL. On the other hand, humans learn to perform tasks not only from interaction or demonstrations, but also by reading unstructured text documents, e.g., instruction manuals. Instruction manuals and wiki pages are among the most abundant data that could inform agents of valuable features and policies or task-specific environmental dynamics and reward structures. Therefore, we hypothesize that the ability to utilize human-written instruction manuals to assist learning policies for specific tasks should lead to a more efficient and better-performing agent. We propose the Read and Reward framework. Read and Reward speeds up RL algorithms on Atari games by reading manuals released by the Atari game developers. Our framework consists of a QA Extraction module that extracts and summarizes relevant information from the manual and a Reasoning module that evaluates object-agent interactions based on information from the manual. Auxiliary reward is then provided to a standard A2C RL agent, when interaction is detected. When assisted by our design, A2C improves on 4 games in the Atari environment with sparse rewards, and requires 1000x less training frames compared to the previous SOTA Agent 57 on Skiing, the hardest game in Atari.
    New directions in the applications of rough path theory. (arXiv:2302.04586v1 [cs.LG])
    This article provides a concise overview of some of the recent advances in the application of rough path theory to machine learning. Controlled differential equations (CDEs) are discussed as the key mathematical model to describe the interaction of a stream with a physical control system. A collection of iterated integrals known as the signature naturally arises in the description of the response produced by such interactions. The signature comes equipped with a variety of powerful properties rendering it an ideal feature map for streamed data. We summarise recent advances in the symbiosis between deep learning and CDEs, studying the link with RNNs and culminating with the Neural CDE model. We concluded with a discussion on signature kernel methods.
    CLARE: Conservative Model-Based Reward Learning for Offline Inverse Reinforcement Learning. (arXiv:2302.04782v1 [cs.LG])
    This work aims to tackle a major challenge in offline Inverse Reinforcement Learning (IRL), namely the reward extrapolation error, where the learned reward function may fail to explain the task correctly and misguide the agent in unseen environments due to the intrinsic covariate shift. Leveraging both expert data and lower-quality diverse data, we devise a principled algorithm (namely CLARE) that solves offline IRL efficiently via integrating "conservatism" into a learned reward function and utilizing an estimated dynamics model. Our theoretical analysis provides an upper bound on the return gap between the learned policy and the expert policy, based on which we characterize the impact of covariate shift by examining subtle two-tier tradeoffs between the exploitation (on both expert and diverse data) and exploration (on the estimated dynamics model). We show that CLARE can provably alleviate the reward extrapolation error by striking the right exploitation-exploration balance therein. Extensive experiments corroborate the significant performance gains of CLARE over existing state-of-the-art algorithms on MuJoCo continuous control tasks (especially with a small offline dataset), and the learned reward is highly instructive for further learning.
    Rehabilitating Homeless: Dataset and Key Insights. (arXiv:2302.04455v1 [cs.LG])
    This paper presents a large anonymized dataset of homelessness alongside insights into the data-driven rehabilitation of homeless people. The dataset was gathered by a large nonprofit organization working on rehabilitating the homeless for twenty years. This is the first dataset that we know of that contains rich information on thousands of homeless individuals seeking rehabilitation. We show how data analysis can help to make the rehabilitation of homeless people more effective and successful. Thus, we hope this paper alerts the data science community to the problem of homelessness.
    Tree Learning: Optimal Algorithms and Sample Complexity. (arXiv:2302.04492v1 [cs.LG])
    We study the problem of learning a hierarchical tree representation of data from labeled samples, taken from an arbitrary (and possibly adversarial) distribution. Consider a collection of data tuples labeled according to their hierarchical structure. The smallest number of such tuples required in order to be able to accurately label subsequent tuples is of interest for data collection in machine learning. We present optimal sample complexity bounds for this problem in several learning settings, including (agnostic) PAC learning and online learning. Our results are based on tight bounds of the Natarajan and Littlestone dimensions of the associated problem. The corresponding tree classifiers can be constructed efficiently in near-linear time.
    Liver Segmentation in Time-resolved C-arm CT Volumes Reconstructed from Dynamic Perfusion Scans using Time Separation Technique. (arXiv:2302.04585v1 [physics.med-ph])
    Perfusion imaging is a valuable tool for diagnosing and treatment planning for liver tumours. The time separation technique (TST) has been successfully used for modelling C-arm cone-beam computed tomography (CBCT) perfusion data. The reconstruction can be accompanied by the segmentation of the liver - for better visualisation and for generating comprehensive perfusion maps. Recently introduced Turbolift learning has been seen to perform well while working with TST reconstructions, but has not been explored for the time-resolved volumes (TRV) estimated out of TST reconstructions. The segmentation of the TRVs can be useful for tracking the movement of the liver over time. This research explores this possibility by training the multi-scale attention UNet of Turbolift learning at its third stage on the TRVs and shows the robustness of Turbolift learning since it can even work efficiently with the TRVs, resulting in a Dice score of 0.864$\pm$0.004.
    Outlier-Robust Gromov Wasserstein for Graph Data. (arXiv:2302.04610v1 [cs.LG])
    Gromov Wasserstein (GW) distance is a powerful tool for comparing and aligning probability distributions supported on different metric spaces. It has become the main modeling technique for aligning heterogeneous data for a wide range of graph learning tasks. However, the GW distance is known to be highly sensitive to outliers, which can result in large inaccuracies if the outliers are given the same weight as other samples in the objective function. To mitigate this issue, we introduce a new and robust version of the GW distance called RGW. RGW features optimistically perturbed marginal constraints within a $\varphi$-divergence based ambiguity set. To make the benefits of RGW more accessible in practice, we develop a computationally efficient algorithm, Bregman proximal alternating linearization minimization, with a theoretical convergence guarantee. Through extensive experimentation, we validate our theoretical results and demonstrate the effectiveness of RGW on real-world graph learning tasks, such as subgraph matching and partial shape correspondence.
    A Text-guided Protein Design Framework. (arXiv:2302.04611v1 [cs.LG])
    Current AI-assisted protein design mainly utilizes protein sequential and structural information. Meanwhile, there exists tremendous knowledge curated by humans in the text format describing proteins' high-level properties. Yet, whether the incorporation of such text data can help protein design tasks has not been explored. To bridge this gap, we propose ProteinDT, a multi-modal framework that leverages textual descriptions for protein design. ProteinDT consists of three subsequent steps: ProteinCLAP that aligns the representation of two modalities, a facilitator that generates the protein representation from the text modality, and a decoder that generates the protein sequences from the representation. To train ProteinDT, we construct a large dataset, SwissProtCLAP, with 441K text and protein pairs. We empirically verify the effectiveness of ProteinDT from three aspects: (1) consistently superior performance on four out of six protein property prediction benchmarks; (2) over 90% accuracy for text-guided protein generation; and (3) promising results for zero-shot text-guided protein editing.
    Weakly Supervised Anomaly Detection: A Survey. (arXiv:2302.04549v1 [cs.LG])
    Anomaly detection (AD) is a crucial task in machine learning with various applications, such as detecting emerging diseases, identifying financial frauds, and detecting fake news. However, obtaining complete, accurate, and precise labels for AD tasks can be expensive and challenging due to the cost and difficulties in data annotation. To address this issue, researchers have developed AD methods that can work with incomplete, inexact, and inaccurate supervision, collectively summarized as weakly supervised anomaly detection (WSAD) methods. In this study, we present the first comprehensive survey of WSAD methods by categorizing them into the above three weak supervision settings across four data modalities (i.e., tabular, graph, time-series, and image/video data). For each setting, we provide formal definitions, key algorithms, and potential future directions. To support future research, we conduct experiments on a selected setting and release the source code, along with a collection of WSAD methods and data.
    Robust and Scalable Bayesian Online Changepoint Detection. (arXiv:2302.04759v1 [stat.ML])
    This paper proposes an online, provably robust, and scalable Bayesian approach for changepoint detection. The resulting algorithm has key advantages over previous work: it provides provable robustness by leveraging the generalised Bayesian perspective, and also addresses the scalability issues of previous attempts. Specifically, the proposed generalised Bayesian formalism leads to conjugate posteriors whose parameters are available in closed form by leveraging diffusion score matching. The resulting algorithm is exact, can be updated through simple algebra, and is more than 10 times faster than its closest competitor.
    Efficient Planning in Combinatorial Action Spaces with Applications to Cooperative Multi-Agent Reinforcement Learning. (arXiv:2302.04376v1 [cs.LG])
    A practical challenge in reinforcement learning are combinatorial action spaces that make planning computationally demanding. For example, in cooperative multi-agent reinforcement learning, a potentially large number of agents jointly optimize a global reward function, which leads to a combinatorial blow-up in the action space by the number of agents. As a minimal requirement, we assume access to an argmax oracle that allows to efficiently compute the greedy policy for any Q-function in the model class. Building on recent work in planning with local access to a simulator and linear function approximation, we propose efficient algorithms for this setting that lead to polynomial compute and query complexity in all relevant problem parameters. For the special case where the feature decomposition is additive, we further improve the bounds and extend the results to the kernelized setting with an efficient algorithm.
    Self-Supervised Node Representation Learning via Node-to-Neighbourhood Alignment. (arXiv:2302.04626v1 [cs.LG])
    Self-supervised node representation learning aims to learn node representations from unlabelled graphs that rival the supervised counterparts. The key towards learning informative node representations lies in how to effectively gain contextual information from the graph structure. In this work, we present simple-yet-effective self-supervised node representation learning via aligning the hidden representations of nodes and their neighbourhood. Our first idea achieves such node-to-neighbourhood alignment by directly maximizing the mutual information between their representations, which, we prove theoretically, plays the role of graph smoothing. Our framework is optimized via a surrogate contrastive loss and a Topology-Aware Positive Sampling (TAPS) strategy is proposed to sample positives by considering the structural dependencies between nodes, which enables offline positive selection. Considering the excessive memory overheads of contrastive learning, we further propose a negative-free solution, where the main contribution is a Graph Signal Decorrelation (GSD) constraint to avoid representation collapse and over-smoothing. The GSD constraint unifies some of the existing constraints and can be used to derive new implementations to combat representation collapse. By applying our methods on top of simple MLP-based node representation encoders, we learn node representations that achieve promising node classification performance on a set of graph-structured datasets from small- to large-scale.
    Short-Term Memory Convolutions. (arXiv:2302.04331v1 [cs.LG])
    The real-time processing of time series signals is a critical issue for many real-life applications. The idea of real-time processing is especially important in audio domain as the human perception of sound is sensitive to any kind of disturbance in perceived signals, especially the lag between auditory and visual modalities. The rise of deep learning (DL) models complicated the landscape of signal processing. Although they often have superior quality compared to standard DSP methods, this advantage is diminished by higher latency. In this work we propose novel method for minimization of inference time latency and memory consumption, called Short-Term Memory Convolution (STMC) and its transposed counterpart. The main advantage of STMC is the low latency comparable to long short-term memory (LSTM) networks. Furthermore, the training of STMC-based models is faster and more stable as the method is based solely on convolutional neural networks (CNNs). In this study we demonstrate an application of this solution to a U-Net model for a speech separation task and GhostNet model in acoustic scene classification (ASC) task. In case of speech separation we achieved a 5-fold reduction in inference time and a 2-fold reduction in latency without affecting the output quality. The inference time for ASC task was up to 4 times faster while preserving the original accuracy.
    Quantum Multi-Agent Actor-Critic Networks for Cooperative Mobile Access in Multi-UAV Systems. (arXiv:2302.04445v1 [cs.MA])
    This paper proposes a novel quantum multi-agent actor-critic networks (QMACN) algorithm for autonomously constructing a robust mobile access system using multiple unmanned aerial vehicles (UAVs). For the cooperation of multiple UAVs for autonomous mobile access, multi-agent reinforcement learning (MARL) methods are considered. In addition, we also adopt the concept of quantum computing (QC) to improve the training and inference performances. By utilizing QC, scalability and physical issues can happen. However, our proposed QMACN algorithm builds quantum critic and multiple actor networks in order to handle such problems. Thus, our proposed QMACN algorithm verifies the advantage of quantum MARL with remarkable performance improvements in terms of training speed and wireless service quality in various data-intensive evaluations. Furthermore, we validate that a noise injection scheme can be used for handling environmental uncertainties in order to realize robust mobile access. Our data-intensive simulation results verify that our proposed QMACN algorithm outperforms the other existing algorithms.
    Dual Algorithmic Reasoning. (arXiv:2302.04496v1 [cs.LG])
    Neural Algorithmic Reasoning is an emerging area of machine learning which seeks to infuse algorithmic computation in neural networks, typically by training neural models to approximate steps of classical algorithms. In this context, much of the current work has focused on learning reachability and shortest path graph algorithms, showing that joint learning on similar algorithms is beneficial for generalisation. However, when targeting more complex problems, such similar algorithms become more difficult to find. Here, we propose to learn algorithms by exploiting duality of the underlying algorithmic problem. Many algorithms solve optimisation problems. We demonstrate that simultaneously learning the dual definition of these optimisation problems in algorithmic learning allows for better learning and qualitatively better solutions. Specifically, we exploit the max-flow min-cut theorem to simultaneously learn these two algorithms over synthetically generated graphs, demonstrating the effectiveness of the proposed approach. We then validate the real-world utility of our dual algorithmic reasoner by deploying it on a challenging brain vessel classification task, which likely depends on the vessels' flow properties. We demonstrate a clear performance gain when using our model within such a context, and empirically show that learning the max-flow and min-cut algorithms together is critical for achieving such a result.
    Gentlest ascent dynamics on manifolds defined by adaptively sampled point-clouds. (arXiv:2302.04426v1 [math.DS])
    Finding saddle points of dynamical systems is an important problem in practical applications such as the study of rare events of molecular systems. Gentlest ascent dynamics (GAD) is one of a number of algorithms in existence that attempt to find saddle points in dynamical systems. It works by deriving a new dynamical system in which saddle points of the original system become stable equilibria. GAD has been recently generalized to the study of dynamical systems on manifolds (differential algebraic equations) described by equality constraints and given an extrinsic formulation. In this paper, we present an extension of GAD to manifolds defined by point-clouds and formulated intrinsically. These point-clouds are adaptively sampled during an iterative process that drives the system from the initial conformation (typically in the neighborhood of a stable equilibrium) to a saddle point. Our method requires the reactant (initial conformation), does not require the explicit constraint equations to be specified, and is purely data-driven.
    Learning Mixtures of Markov Chains with Quality Guarantees. (arXiv:2302.04680v1 [cs.LG])
    A large number of modern applications ranging from listening songs online and browsing the Web to using a navigation app on a smartphone generate a plethora of user trails. Clustering such trails into groups with a common sequence pattern can reveal significant structure in human behavior that can lead to improving user experience through better recommendations, and even prevent suicides [LMCR14]. One approach to modeling this problem mathematically is as a mixture of Markov chains. Recently, Gupta, Kumar and Vassilvitski [GKV16] introduced an algorithm (GKV-SVD) based on the singular value decomposition (SVD) that under certain conditions can perfectly recover a mixture of L chains on n states, given only the distribution of trails of length 3 (3-trail). In this work we contribute to the problem of unmixing Markov chains by highlighting and addressing two important constraints of the GKV-SVD algorithm [GKV16]: some chains in the mixture may not even be weakly connected, and secondly in practice one does not know beforehand the true number of chains. We resolve these issues in the Gupta et al. paper [GKV16]. Specifically, we propose an algebraic criterion that enables us to choose a value of L efficiently that avoids overfitting. Furthermore, we design a reconstruction algorithm that outputs the true mixture in the presence of disconnected chains and is robust to noise. We complement our theoretical results with experiments on both synthetic and real data, where we observe that our method outperforms the GKV-SVD algorithm. Finally, we empirically observe that combining an EM-algorithm with our method performs best in practice, both in terms of reconstruction error with respect to the distribution of 3-trails and the mixture of Markov Chains.
    Real-Time Visual Feedback to Guide Benchmark Creation: A Human-and-Metric-in-the-Loop Workflow. (arXiv:2302.04434v1 [cs.CL])
    Recent research has shown that language models exploit `artifacts' in benchmarks to solve tasks, rather than truly learning them, leading to inflated model performance. In pursuit of creating better benchmarks, we propose VAIDA, a novel benchmark creation paradigm for NLP, that focuses on guiding crowdworkers, an under-explored facet of addressing benchmark idiosyncrasies. VAIDA facilitates sample correction by providing realtime visual feedback and recommendations to improve sample quality. Our approach is domain, model, task, and metric agnostic, and constitutes a paradigm shift for robust, validated, and dynamic benchmark creation via human-and-metric-in-the-loop workflows. We evaluate via expert review and a user study with NASA TLX. We find that VAIDA decreases effort, frustration, mental, and temporal demands of crowdworkers and analysts, simultaneously increasing the performance of both user groups with a 45.8% decrease in the level of artifacts in created samples. As a by product of our user study, we observe that created samples are adversarial across models, leading to decreases of 31.3% (BERT), 22.5% (RoBERTa), 14.98% (GPT-3 fewshot) in performance.
    Augmenting NLP data to counter Annotation Artifacts for NLI Tasks. (arXiv:2302.04700v1 [cs.CL])
    In this paper, we explore Annotation Artifacts - the phenomena wherein large pre-trained NLP models achieve high performance on benchmark datasets but do not actually "solve" the underlying task and instead rely on some dataset artifacts (same across train, validation, and test sets) to figure out the right answer. We explore this phenomenon on the well-known Natural Language Inference task by first using contrast and adversarial examples to understand limitations to the model's performance and show one of the biases arising from annotation artifacts (the way training data was constructed by the annotators). We then propose a data augmentation technique to fix this bias and measure its effectiveness.
    SparseProp: Efficient Sparse Backpropagation for Faster Training of Neural Networks. (arXiv:2302.04852v1 [cs.LG])
    We provide a new efficient version of the backpropagation algorithm, specialized to the case where the weights of the neural network being trained are sparse. Our algorithm is general, as it applies to arbitrary (unstructured) sparsity and common layer types (e.g., convolutional or linear). We provide a fast vectorized implementation on commodity CPUs, and show that it can yield speedups in end-to-end runtime experiments, both in transfer learning using already-sparsified networks, and in training sparse networks from scratch. Thus, our results provide the first support for sparse training on commodity hardware.
    Disentangling Learning Representations with Density Estimation. (arXiv:2302.04362v1 [cs.LG])
    Disentangled learning representations have promising utility in many applications, but they currently suffer from serious reliability issues. We present Gaussian Channel Autoencoder (GCAE), a method which achieves reliable disentanglement via flexible density estimation of the latent space. GCAE avoids the curse of dimensionality of density estimation by disentangling subsets of its latent space with the Dual Total Correlation (DTC) metric, thereby representing its high-dimensional latent joint distribution as a collection of many low-dimensional conditional distributions. In our experiments, GCAE achieves highly competitive and reliable disentanglement scores compared with state-of-the-art baselines.
    Generalization in Graph Neural Networks: Improved PAC-Bayesian Bounds on Graph Diffusion. (arXiv:2302.04451v1 [cs.LG])
    Graph neural networks are widely used tools for graph prediction tasks. Motivated by their empirical performance, prior works have developed generalization bounds for graph neural networks, which scale with graph structures in terms of the maximum degree. In this paper, we present generalization bounds that instead scale with the largest singular value of the graph neural network's feature diffusion matrix. These bounds are numerically much smaller than prior bounds for real-world graphs. We also construct a lower bound of the generalization gap that matches our upper bound asymptotically. To achieve these results, we analyze a unified model that includes prior works' settings (i.e., convolutional and message-passing networks) and new settings (i.e., graph isomorphism networks). Our key idea is to measure the stability of graph neural networks against noise perturbations using Hessians. Empirically, we find that Hessian-based measurements correlate with the observed generalization gaps of graph neural networks accurately; Optimizing noise stability properties for fine-tuning pretrained graph neural networks also improves test performance on several graph-level classification tasks.
    Flat Seeking Bayesian Neural Networks. (arXiv:2302.02713v2 [cs.LG] UPDATED)
    Bayesian Neural Networks (BNNs) offer a probabilistic interpretation for deep learning models by imposing a prior distribution over model parameters and inferencing a posterior distribution based on observed data. The model sampled from the posterior distribution can be used for providing ensemble predictions and quantifying prediction uncertainty. It is well-known that deep learning models with a lower sharpness have a better generalization ability. Nonetheless, existing posterior inferences are not aware of sharpness/flatness, hence possibly leading to high sharpness for the models sampled from it. In this paper, we develop theories, the Bayesian setting, and the variational inference approach for the sharpness-aware posterior. Specifically, the models sampled from our sharpness-aware posterior and the optimal approximate posterior estimating this sharpness-aware posterior have a better flatness, hence possibly possessing a higher generalization ability. We conduct experiments by leveraging the sharpness-aware posterior with the state-of-the-art Bayesian Neural Networks, showing that the flat-seeking counterparts outperform their baselines in all metrics of interest.
    Efficient displacement convex optimization with particle gradient descent. (arXiv:2302.04753v1 [cs.LG])
    Particle gradient descent, which uses particles to represent a probability measure and performs gradient descent on particles in parallel, is widely used to optimize functions of probability measures. This paper considers particle gradient descent with a finite number of particles and establishes its theoretical guarantees to optimize functions that are \emph{displacement convex} in measures. Concretely, for Lipschitz displacement convex functions defined on probability over $\mathbb{R}^d$, we prove that $O(1/\epsilon^2)$ particles and $O(d/\epsilon^4)$ computations are sufficient to find the $\epsilon$-optimal solutions. We further provide improved complexity bounds for optimizing smooth displacement convex functions. We demonstrate the application of our results for function approximation with specific neural architectures with two-dimensional inputs.
    Exploiting Certified Defences to Attack Randomised Smoothing. (arXiv:2302.04379v1 [cs.LG])
    In guaranteeing that no adversarial examples exist within a bounded region, certification mechanisms play an important role in neural network robustness. Concerningly, this work demonstrates that the certification mechanisms themselves introduce a new, heretofore undiscovered attack surface, that can be exploited by attackers to construct smaller adversarial perturbations. While these attacks exist outside the certification region in no way invalidate certifications, minimising a perturbation's norm significantly increases the level of difficulty associated with attack detection. In comparison to baseline attacks, our new framework yields smaller perturbations more than twice as frequently as any other approach, resulting in an up to $34 \%$ reduction in the median perturbation norm. That this approach also requires $90 \%$ less computational time than approaches like PGD. That these reductions are possible suggests that exploiting this new attack vector would allow attackers to more frequently construct hard to detect adversarial attacks, by exploiting the very systems designed to defend deployed models.
    An End-to-End Framework for Marketing Effectiveness Optimization under Budget Constraint. (arXiv:2302.04477v1 [cs.LG])
    Online platforms often incentivize consumers to improve user engagement and platform revenue. Since different consumers might respond differently to incentives, individual-level budget allocation is an essential task in marketing campaigns. Recent advances in this field often address the budget allocation problem using a two-stage paradigm: the first stage estimates the individual-level treatment effects using causal inference algorithms, and the second stage invokes integer programming techniques to find the optimal budget allocation solution. Since the objectives of these two stages might not be perfectly aligned, such a two-stage paradigm could hurt the overall marketing effectiveness. In this paper, we propose a novel end-to-end framework to directly optimize the business goal under budget constraints. Our core idea is to construct a regularizer to represent the marketing goal and optimize it efficiently using gradient estimation techniques. As such, the obtained models can learn to maximize the marketing goal directly and precisely. We extensively evaluate our proposed method in both offline and online experiments, and experimental results demonstrate that our method outperforms current state-of-the-art methods. Our proposed method is currently deployed to allocate marketing budgets for hundreds of millions of users on a short video platform and achieves significant business goal improvements. Our code will be publicly available.
    NeuKron: Constant-Size Lossy Compression of Sparse Reorderable Matrices and Tensors. (arXiv:2302.04570v1 [cs.LG])
    Many real-world data are naturally represented as a sparse reorderable matrix, whose rows and columns can be arbitrarily ordered (e.g., the adjacency matrix of a bipartite graph). Storing a sparse matrix in conventional ways requires an amount of space linear in the number of non-zeros, and lossy compression of sparse matrices (e.g., Truncated SVD) typically requires an amount of space linear in the number of rows and columns. In this work, we propose NeuKron for compressing a sparse reorderable matrix into a constant-size space. NeuKron generalizes Kronecker products using a recurrent neural network with a constant number of parameters. NeuKron updates the parameters so that a given matrix is approximated by the product and reorders the rows and columns of the matrix to facilitate the approximation. The updates take time linear in the number of non-zeros in the input matrix, and the approximation of each entry can be retrieved in logarithmic time. We also extend NeuKron to compress sparse reorderable tensors (e.g. multi-layer graphs), which generalize matrices. Through experiments on ten real-world datasets, we show that NeuKron is (a) Compact: requiring up to five orders of magnitude less space than its best competitor with similar approximation errors, (b) Accurate: giving up to 10x smaller approximation error than its best competitors with similar size outputs, and (c) Scalable: successfully compressing a matrix with over 230 million non-zero entries.
    Nonlinear Random Matrices and Applications to the Sum of Squares Hierarchy. (arXiv:2302.04462v1 [cs.CC])
    We develop new tools in the theory of nonlinear random matrices and apply them to study the performance of the Sum of Squares (SoS) hierarchy on average-case problems. The SoS hierarchy is a powerful optimization technique that has achieved tremendous success for various problems in combinatorial optimization, robust statistics and machine learning. It's a family of convex relaxations that lets us smoothly trade off running time for approximation guarantees. In recent works, it's been shown to be extremely useful for recovering structure in high dimensional noisy data. It also remains our best approach towards refuting the notorious Unique Games Conjecture. In this work, we analyze the performance of the SoS hierarchy on fundamental problems stemming from statistics, theoretical computer science and statistical physics. In particular, we show subexponential-time SoS lower bounds for the problems of the Sherrington-Kirkpatrick Hamiltonian, Planted Slightly Denser Subgraph, Tensor Principal Components Analysis and Sparse Principal Components Analysis. These SoS lower bounds involve analyzing large random matrices, wherein lie our main contributions. These results offer strong evidence for the truth of and insight into the low-degree likelihood ratio hypothesis, an important conjecture that predicts the power of bounded-time algorithms for hypothesis testing. We also develop general-purpose tools for analyzing the behavior of random matrices which are functions of independent random variables. Towards this, we build on and generalize the matrix variant of the Efron-Stein inequalities. In particular, our general theorem on matrix concentration recovers various results that have appeared in the literature. We expect these random matrix theory ideas to have other significant applications.
    Towards Fairer and More Efficient Federated Learning via Multidimensional Personalized Edge Models. (arXiv:2302.04464v1 [cs.LG])
    Federated learning (FL) is an emerging technique that trains massive and geographically distributed edge data while maintaining privacy. However, FL has inherent challenges in terms of fairness and computational efficiency due to the rising heterogeneity of edges, and thus usually result in sub-optimal performance in recent state-of-the-art (SOTA) solutions. In this paper, we propose a Customized Federated Learning (CFL) system to eliminate FL heterogeneity from multiple dimensions. Specifically, CFL tailors personalized models from the specially designed global model for each client, jointly guided an online trained model-search helper and a novel aggregation algorithm. Extensive experiments demonstrate that CFL has full-stack advantages for both FL training and edge reasoning and significantly improves the SOTA performance w.r.t. model accuracy (up to 7.2% in the non-heterogeneous environment and up to 21.8% in the heterogeneous environment), efficiency, and FL fairness.
    REIN: A Comprehensive Benchmark Framework for Data Cleaning Methods in ML Pipelines. (arXiv:2302.04702v1 [cs.DB])
    Nowadays, machine learning (ML) plays a vital role in many aspects of our daily life. In essence, building well-performing ML applications requires the provision of high-quality data throughout the entire life-cycle of such applications. Nevertheless, most of the real-world tabular data suffer from different types of discrepancies, such as missing values, outliers, duplicates, pattern violation, and inconsistencies. Such discrepancies typically emerge while collecting, transferring, storing, and/or integrating the data. To deal with these discrepancies, numerous data cleaning methods have been introduced. However, the majority of such methods broadly overlook the requirements imposed by downstream ML models. As a result, the potential of utilizing these data cleaning methods in ML pipelines is predominantly unrevealed. In this work, we introduce a comprehensive benchmark, called REIN1, to thoroughly investigate the impact of data cleaning methods on various ML models. Through the benchmark, we provide answers to important research questions, e.g., where and whether data cleaning is a necessary step in ML pipelines. To this end, the benchmark examines 38 simple and advanced error detection and repair methods. To evaluate these methods, we utilized a wide collection of ML models trained on 14 publicly-available datasets covering different domains and encompassing realistic as well as synthetic error profiles.
    A Comparison of Decision Forest Inference Platforms from A Database Perspective. (arXiv:2302.04430v1 [cs.DB])
    Decision forest, including RandomForest, XGBoost, and LightGBM, is one of the most popular machine learning techniques used in many industrial scenarios, such as credit card fraud detection, ranking, and business intelligence. Because the inference process is usually performance-critical, a number of frameworks were developed and dedicated for decision forest inference, such as ONNX, TreeLite from Amazon, TensorFlow Decision Forest from Google, HummingBird from Microsoft, Nvidia FIL, and lleaves. However, these frameworks are all decoupled with data management frameworks. It is unclear whether in-database inference will improve the overall performance. In addition, these frameworks used different algorithms, optimization techniques, and parallelism models. It is unclear how these implementations will affect the overall performance and how to make design decisions for an in-database inference framework. In this work, we investigated the above questions by comprehensively comparing the end-to-end performance of the aforementioned inference frameworks and netsDB, an in-database inference framework we implemented. Through this study, we identified that netsDB is best suited for handling small-scale models on large-scale datasets and all-scale models on small-scale datasets, for which it achieved up to hundreds of times of speedup. In addition, the relation-centric representation we proposed significantly improved netsDB's performance in handling large-scale models, while the model reuse optimization we proposed further improved netsDB's performance in handling small-scale datasets.
    Geometry of Score Based Generative Models. (arXiv:2302.04411v1 [cs.LG])
    In this work, we look at Score-based generative models (also called diffusion generative models) from a geometric perspective. From a new view point, we prove that both the forward and backward process of adding noise and generating from noise are Wasserstein gradient flow in the space of probability measures. We are the first to prove this connection. Our understanding of Score-based (and Diffusion) generative models have matured and become more complete by drawing ideas from different fields like Bayesian inference, control theory, stochastic differential equation and Schrodinger bridge. However, many open questions and challenges remain. One problem, for example, is how to decrease the sampling time? We demonstrate that looking from geometric perspective enables us to answer many of these questions and provide new interpretations to some known results. Furthermore, geometric perspective enables us to devise an intuitive geometric solution to the problem of faster sampling. By augmenting traditional score-based generative models with a projection step, we show that we can generate high quality images with significantly fewer sampling-steps.
    Towards Model-Agnostic Federated Learning over Networks. (arXiv:2302.04363v1 [cs.LG])
    We present a model-agnostic federated learning method for decentralized data with an intrinsic network structure. The network structure reflects similarities between the (statistics of) local datasets and, in turn, their associated local models. Our method is an instance of empirical risk minimization, using a regularization term that is constructed from the network structure of data. In particular, we require well-connected local models, forming clusters, to yield similar predictions on a common test set. In principle our method can be applied to any collection of local models. The only restriction put on these local models is that they allow for efficient implementation of regularized empirical risk minimization (training). Such implementations might be available in the form of high-level programming frameworks such as \texttt{scikit-learn}, \texttt{Keras} or \texttt{PyTorch}.
    Data Quality-aware Mixed-precision Quantization via Hybrid Reinforcement Learning. (arXiv:2302.04453v1 [cs.AI])
    Mixed-precision quantization mostly predetermines the model bit-width settings before actual training due to the non-differential bit-width sampling process, obtaining sub-optimal performance. Worse still, the conventional static quality-consistent training setting, i.e., all data is assumed to be of the same quality across training and inference, overlooks data quality changes in real-world applications which may lead to poor robustness of the quantized models. In this paper, we propose a novel Data Quality-aware Mixed-precision Quantization framework, dubbed DQMQ, to dynamically adapt quantization bit-widths to different data qualities. The adaption is based on a bit-width decision policy that can be learned jointly with the quantization training. Concretely, DQMQ is modeled as a hybrid reinforcement learning (RL) task that combines model-based policy optimization with supervised quantization training. By relaxing the discrete bit-width sampling to a continuous probability distribution that is encoded with few learnable parameters, DQMQ is differentiable and can be directly optimized end-to-end with a hybrid optimization target considering both task performance and quantization benefits. Trained on mixed-quality image datasets, DQMQ can implicitly select the most proper bit-width for each layer when facing uneven input qualities. Extensive experiments on various benchmark datasets and networks demonstrate the superiority of DQMQ against existing fixed/mixed-precision quantization methods.
    MedDiff: Generating Electronic Health Records using Accelerated Denoising Diffusion Model. (arXiv:2302.04355v1 [cs.LG])
    Due to patient privacy protection concerns, machine learning research in healthcare has been undeniably slower and limited than in other application domains. High-quality, realistic, synthetic electronic health records (EHRs) can be leveraged to accelerate methodological developments for research purposes while mitigating privacy concerns associated with data sharing. The current state-of-the-art model for synthetic EHR generation is generative adversarial networks, which are notoriously difficult to train and can suffer from mode collapse. Denoising Diffusion Probabilistic Models, a class of generative models inspired by statistical thermodynamics, have recently been shown to generate high-quality synthetic samples in certain domains. It is unknown whether these can generalize to generation of large-scale, high-dimensional EHRs. In this paper, we present a novel generative model based on diffusion models that is the first successful application on electronic health records. Our model proposes a mechanism to perform class-conditional sampling to preserve label information. We also introduce a new sampling strategy to accelerate the inference speed. We empirically show that our model outperforms existing state-of-the-art synthetic EHR generation methods.
    Discovering interpretable Lagrangian of dynamical systems from data. (arXiv:2302.04400v1 [stat.ML])
    A complete understanding of physical systems requires models that are accurate and obeys natural conservation laws. Recent trends in representation learning involve learning Lagrangian from data rather than the direct discovery of governing equations of motion. The generalization of equation discovery techniques has huge potential; however, existing Lagrangian discovery frameworks are black-box in nature. This raises a concern about the reusability of the discovered Lagrangian. In this article, we propose a novel data-driven machine-learning algorithm to automate the discovery of interpretable Lagrangian from data. The Lagrangian are derived in interpretable forms, which also allows the automated discovery of conservation laws and governing equations of motion. The architecture of the proposed framework is designed in such a way that it allows learning the Lagrangian from a subset of the underlying domain and then generalizing for an infinite-dimensional system. The fidelity of the proposed framework is exemplified using examples described by systems of ordinary differential equations and partial differential equations where the Lagrangian and conserved quantities are known.
    Multi-task Representation Learning for Pure Exploration in Linear Bandits. (arXiv:2302.04441v1 [cs.LG])
    Despite the recent success of representation learning in sequential decision making, the study of the pure exploration scenario (i.e., identify the best option and minimize the sample complexity) is still limited. In this paper, we study multi-task representation learning for best arm identification in linear bandits (RepBAI-LB) and best policy identification in contextual linear bandits (RepBPI-CLB), two popular pure exploration settings with wide applications, e.g., clinical trials and web content optimization. In these two problems, all tasks share a common low-dimensional linear representation, and our goal is to leverage this feature to accelerate the best arm (policy) identification process for all tasks. For these problems, we design computationally and sample efficient algorithms DouExpDes and C-DouExpDes, which perform double experimental designs to plan optimal sample allocations for learning the global representation. We show that by learning the common representation among tasks, our sample complexity is significantly better than that of the native approach which solves tasks independently. To the best of our knowledge, this is the first work to demonstrate the benefits of representation learning for multi-task pure exploration.
    Partial Optimality in Cubic Correlation Clustering. (arXiv:2302.04694v1 [cs.DM])
    The higher-order correlation clustering problem is an expressive model, and recently, local search heuristics have been proposed for several applications. Certifying optimality, however, is NP-hard and practically hampered already by the complexity of the problem statement. Here, we focus on establishing partial optimality conditions for the special case of complete graphs and cubic objective functions. In addition, we define and implement algorithms for testing these conditions and examine their effect numerically, on two datasets.
    On Fairness and Stability: Is Estimator Variance a Friend or a Foe?. (arXiv:2302.04525v1 [cs.LG])
    The error of an estimator can be decomposed into a (statistical) bias term, a variance term, and an irreducible noise term. When we do bias analysis, formally we are asking the question: "how good are the predictions?" The role of bias in the error decomposition is clear: if we trust the labels/targets, then we would want the estimator to have as low bias as possible, in order to minimize error. Fair machine learning is concerned with the question: "Are the predictions equally good for different demographic/social groups?" This has naturally led to a variety of fairness metrics that compare some measure of statistical bias on subsets corresponding to socially privileged and socially disadvantaged groups. In this paper we propose a new family of performance measures based on group-wise parity in variance. We demonstrate when group-wise statistical bias analysis gives an incomplete picture, and what group-wise variance analysis can tell us in settings that differ in the magnitude of statistical bias. We develop and release an open-source library that reconciles uncertainty quantification techniques with fairness analysis, and use it to conduct an extensive empirical analysis of our variance-based fairness measures on standard benchmarks.
    Hubbard-Stratonovich Detector for Simple Trainable MIMO Signal Detection. (arXiv:2302.04461v1 [cs.IT])
    Massive multiple-input multiple-output (MIMO) is a key technology used in fifth-generation wireless communication networks and beyond. Recently, various MIMO signal detectors based on deep learning have been proposed. Especially, deep unfolding (DU), which involves unrolling of an existing iterative algorithm and embedding of trainable parameters, has been applied with remarkable detection performance. Although DU has a lesser number of trainable parameters than conventional deep neural networks, the computational complexities related to training and execution have been problematic because DU-based MIMO detectors usually utilize matrix inversion to improve their detection performance. In this study, we attempted to construct a DU-based trainable MIMO detector with the simplest structure. The proposed detector based on the Hubbard--Stratonovich (HS) transformation and DU is called the trainable HS (THS) detector. It requires only $O(1)$ trainable parameters and its training and execution cost is $O(n^2)$ per iteration, where $n$ is the number of transmitting antennas. Numerical results show that the detection performance of the THS detector is better than that of existing algorithms of the same complexity and close to that of a DU-based detector, which has higher training and execution costs than the THS detector.
    Light and Accurate: Neural Architecture Search via Two Constant Shared Weights Initialisations. (arXiv:2302.04406v1 [cs.LG])
    In recent years, zero-cost proxies are gaining ground in neural architecture search (NAS). These methods allow finding the optimal neural network for a given task faster and with a lesser computational load than conventional NAS methods. Equally important is the fact that they also shed some light on the internal workings of neural architectures. This paper presents a zero-cost metric that highly correlates with the train set accuracy across the NAS-Bench-101, NAS-Bench-201 and NAS-Bench-NLP benchmark datasets. Architectures are initialised with two distinct constant shared weights, one at a time. Then, a fixed random mini-batch of data is passed forward through each initialisation. We observe that the dispersion of the outputs between two initialisations positively correlates with trained accuracy. The correlation further improves when we normalise dispersion by average output magnitude. Our metric, epsilon, does not require gradients computation or labels. It thus unbinds the NAS procedure from training hyperparameters, loss metrics and human-labelled data. Our method is easy to integrate within existing NAS algorithms and takes a fraction of a second to evaluate a single network.
    Adaptive State-Dependent Diffusion for Derivative-Free Optimization. (arXiv:2302.04370v1 [math.OC])
    This paper develops and analyzes a stochastic derivative-free optimization strategy. A key feature is the state-dependent adaptive variance. We prove global convergence in probability with algebraic rate and give the quantitative results in numerical examples. A striking fact is that convergence is achieved without explicit information of the gradient and even without comparing different objective function values as in established methods such as the simplex method and simulated annealing. It can otherwise be compared to annealing with state-dependent temperature.
    Liver Segmentation using Turbolift Learning for CT and Cone-beam C-arm Perfusion Imaging. (arXiv:2207.10167v2 [eess.IV] UPDATED)
    Model-based reconstruction employing the time separation technique (TST) was found to improve dynamic perfusion imaging of the liver using C-arm cone-beam computed tomography (CBCT). To apply TST using prior knowledge extracted from CT perfusion data, the liver should be accurately segmented from the CT scans. Reconstructions of primary and model-based CBCT data need to be segmented for proper visualisation and interpretation of perfusion maps. This research proposes Turbolift learning, which trains a modified version of the multi-scale Attention UNet on different liver segmentation tasks serially, following the order of the trainings CT, CBCT, CBCT TST - making the previous trainings act as pre-training stages for the subsequent ones - addressing the problem of limited number of datasets for training. For the final task of liver segmentation from CBCT TST, the proposed method achieved an overall Dice scores of 0.874$\pm$0.031 and 0.905$\pm$0.007 in 6-fold and 4-fold cross-validation experiments, respectively - securing statistically significant improvements over the model, which was trained only for that task. Experiments revealed that Turbolift not only improves the overall performance of the model but also makes it robust against artefacts originating from the embolisation materials and truncation artefacts. Additionally, in-depth analyses confirmed the order of the segmentation tasks. This paper shows the potential of segmenting the liver from CT, CBCT, and CBCT TST, learning from the available limited training data, which can possibly be used in the future for the visualisation and evaluation of the perfusion maps for the treatment evaluation of liver diseases.
    Learning Dynamical Systems by Leveraging Data from Similar Systems. (arXiv:2302.04344v1 [stat.ML])
    We consider the problem of learning the dynamics of a linear system when one has access to data generated by an auxiliary system that shares similar (but not identical) dynamics, in addition to data from the true system. We use a weighted least squares approach, and provide a finite sample error bound of the learned model as a function of the number of samples and various system parameters from the two systems as well as the weight assigned to the auxiliary data. We show that the auxiliary data can help to reduce the intrinsic system identification error due to noise, at the price of adding a portion of error that is due to the differences between the two system models. We further provide a data-dependent bound that is computable when some prior knowledge about the systems is available. This bound can also be used to determine the weight that should be assigned to the auxiliary data during the model training stage.
    Bag of Tricks for Training Data Extraction from Language Models. (arXiv:2302.04460v1 [cs.CL])
    With the advance of language models, privacy protection is receiving more attention. Training data extraction is therefore of great importance, as it can serve as a potential tool to assess privacy leakage. However, due to the difficulty of this task, most of the existing methods are proof-of-concept and still not effective enough. In this paper, we investigate and benchmark tricks for improving training data extraction using a publicly available dataset. Because most existing extraction methods use a pipeline of generating-then-ranking, i.e., generating text candidates as potential training data and then ranking them based on specific criteria, our research focuses on the tricks for both text generation (e.g., sampling strategy) and text ranking (e.g., token-level criteria). The experimental results show that several previously overlooked tricks can be crucial to the success of training data extraction. Based on the GPT-Neo 1.3B evaluation results, our proposed tricks outperform the baseline by a large margin in most cases, providing a much stronger baseline for future research.
    On Computable Online Learning. (arXiv:2302.04357v1 [cs.LG])
    We initiate a study of computable online (c-online) learning, which we analyze under varying requirements for "optimality" in terms of the mistake bound. Our main contribution is to give a necessary and sufficient condition for optimal c-online learning and show that the Littlestone dimension no longer characterizes the optimal mistake bound of c-online learning. Furthermore, we introduce anytime optimal (a-optimal) online learning, a more natural conceptualization of "optimality" and a generalization of Littlestone's Standard Optimal Algorithm. We show the existence of a computational separation between a-optimal and optimal online learning, proving that a-optimal online learning is computationally more difficult. Finally, we consider online learning with no requirements for optimality, and show, under a weaker notion of computability, that the finiteness of the Littlestone dimension no longer characterizes whether a class is c-online learnable with finite mistake bound. A potential avenue for strengthening this result is suggested by exploring the relationship between c-online and CPAC learning, where we show that c-online learning is as difficult as improper CPAC learning.
    Machine Learning Capability: A standardized metric using case difficulty with applications to individualized deployment of supervised machine learning. (arXiv:2302.04386v1 [cs.LG])
    Model evaluation is a critical component in supervised machine learning classification analyses. Traditional metrics do not currently incorporate case difficulty. This renders the classification results unbenchmarked for generalization. Item Response Theory (IRT) and Computer Adaptive Testing (CAT) with machine learning can benchmark datasets independent of the end-classification results. This provides high levels of case-level information regarding evaluation utility. To showcase, two datasets were used: 1) health-related and 2) physical science. For the health dataset a two-parameter IRT model, and for the physical science dataset a polytonomous IRT model, was used to analyze predictive features and place each case on a difficulty continuum. A CAT approach was used to ascertain the algorithms' performance and applicability to new data. This method provides an efficient way to benchmark data, using only a fraction of the dataset (less than 1%) and 22-60x more computationally efficient than traditional metrics. This novel metric, termed Machine Learning Capability (MLC) has additional benefits as it is unbiased to outcome classification and a standardized way to make model comparisons within and across datasets. MLC provides a metric on the limitation of supervised machine learning algorithms. In situations where the algorithm falls short, other input(s) are required for decision-making.
    Measuring the Privacy Leakage via Graph Reconstruction Attacks on Simplicial Neural Networks (Student Abstract). (arXiv:2302.04373v1 [cs.LG])
    In this paper, we measure the privacy leakage via studying whether graph representations can be inverted to recover the graph used to generate them via graph reconstruction attack (GRA). We propose a GRA that recovers a graph's adjacency matrix from the representations via a graph decoder that minimizes the reconstruction loss between the partial graph and the reconstructed graph. We study three types of representations that are trained on the graph, i.e., representations output from graph convolutional network (GCN), graph attention network (GAT), and our proposed simplicial neural network (SNN) via a higher-order combinatorial Laplacian. Unlike the first two types of representations that only encode pairwise relationships, the third type of representation, i.e., SNN outputs, encodes higher-order interactions (e.g., homological features) between nodes. We find that the SNN outputs reveal the lowest privacy-preserving ability to defend the GRA, followed by those of GATs and GCNs, which indicates the importance of building more private representations with higher-order node information that could defend the potential threats, such as GRAs.
    Neonatal Face and Facial Landmark Detection from Video Recordings. (arXiv:2302.04341v1 [eess.IV])
    This paper explores automated face and facial landmark detection of neonates, which is an important first step in many video-based neonatal health applications, such as vital sign estimation, pain assessment, sleep-wake classification, and jaundice detection. Utilising three publicly available datasets of neonates in the clinical environment, 366 images (258 subjects) and 89 (66 subjects) were annotated for training and testing, respectively. Transfer learning was applied to two YOLO-based models, with input training images augmented with random horizontal flipping, photo-metric colour distortion, translation and scaling during each training epoch. Additionally, the re-orientation of input images and fusion of trained deep learning models was explored. Our proposed model based on YOLOv7Face outperformed existing methods with a mean average precision of 84.8% for face detection, and a normalised mean error of 0.072 for facial landmark detection. Overall, this will assist in the development of fully automated neonatal health assessment algorithms.
    (Re)Defining Expertise in Machine Learning Development. (arXiv:2302.04337v1 [cs.LG])
    Domain experts are often engaged in the development of machine learning systems in a variety of ways, such as in data collection and evaluation of system performance. At the same time, who counts as an 'expert' and what constitutes 'expertise' is not always explicitly defined. In this project, we conduct a systematic literature review of machine learning research to understand 1) the bases on which expertise is defined and recognized and 2) the roles experts play in ML development. Our goal is to produce a high-level taxonomy to highlight limits and opportunities in how experts are identified and engaged in ML research.
    An Investigation into Pre-Training Object-Centric Representations for Reinforcement Learning. (arXiv:2302.04419v1 [cs.LG])
    Unsupervised object-centric representation (OCR) learning has recently drawn attention as a new paradigm of visual representation. This is because of its potential of being an effective pre-training technique for various downstream tasks in terms of sample efficiency, systematic generalization, and reasoning. Although image-based reinforcement learning (RL) is one of the most important and thus frequently mentioned such downstream tasks, the benefit in RL has surprisingly not been investigated systematically thus far. Instead, most of the evaluations have focused on rather indirect metrics such as segmentation quality and object property prediction accuracy. In this paper, we investigate the effectiveness of OCR pre-training for image-based reinforcement learning via empirical experiments. For systematic evaluation, we introduce a simple object-centric visual RL benchmark and conduct experiments to answer questions such as ``Does OCR pre-training improve performance on object-centric tasks?'' and ``Can OCR pre-training help with out-of-distribution generalization?''. Our results provide empirical evidence for valuable insights into the effectiveness of OCR pre-training for RL and the potential limitations of its use in certain scenarios. Additionally, this study also examines the critical aspects of incorporating OCR pre-training in RL, including performance in a visually complex environment and the appropriate pooling layer to aggregate the object representations.
    Importance Sampling Deterministic Annealing for Clustering. (arXiv:2302.04421v1 [stat.ML])
    A current assumption of most clustering methods is that the training data and future data are taken from the same distribution. However, this assumption may not hold in some real-world scenarios. In this paper, we propose an importance sampling based deterministic annealing approach (ISDA) for clustering problems which minimizes the worst case of expected distortions under the constraint of distribution deviation. The distribution deviation constraint can be converted to the constraint over a set of weight distributions centered on the uniform distribution derived from importance sampling. The objective of the proposed approach is to minimize the loss under maximum degradation hence the resulting problem is a constrained minimax optimization problem which can be reformulated to an unconstrained problem using the Lagrange method and be solved by the quasi-newton algorithm. Experiment results on synthetic datasets and a real-world load forecasting problem validate the effectiveness of the proposed ISDA. Furthermore, we show that fuzzy c-means is a special case of ISDA with the logarithmic distortion. This observation sheds a new light on the relationship between fuzzy c-means and deterministic annealing clustering algorithms and provides an interesting physical and information-theoretical interpretation for fuzzy exponent $m$.
    A data variation robust learning model based on importance sampling. (arXiv:2302.04438v1 [stat.ML])
    A crucial assumption underlying the most current theory of machine learning is that the training distribution is identical to the testing distribution. However, this assumption may not hold in some real-world applications. In this paper, we propose an importance sampling based data variation robust loss (ISloss) for learning problems which minimizes the worst case of loss under the constraint of distribution deviation. The distribution deviation constraint can be converted to the constraint over a set of weight distributions centered on the uniform distribution derived from the importance sampling method. Furthermore, we reveal that there is a relationship between ISloss under the logarithmic transformation (LogISloss) and the p-norm loss. We apply the proposed LogISloss to the face verification problem on Racial Faces in the Wild dataset and show that the proposed method is robust under large distribution deviations.
    Private Quantiles Estimation in the Presence of Atoms. (arXiv:2202.08969v2 [stat.ML] UPDATED)
    We consider the differentially private estimation of multiple quantiles (MQ) of a distribution from a dataset, a key building block in modern data analysis. We apply the recent non-smoothed Inverse Sensitivity (IS) mechanism to this specific problem. We establish that the resulting method is closely related to the recently published ad hoc algorithm JointExp. In particular, they share the same computational complexity and a similar efficiency. We prove the statistical consistency of these two algorithms for continuous distributions. Furthermore, we demonstrate both theoretically and empirically that this method suffers from an important lack of performance in the case of peaked distributions, which can degrade up to a potentially catastrophic impact in the presence of atoms. Its smoothed version (i.e. by applying a max kernel to its output density) would solve this problem, but remains an open challenge to implement. As a proxy, we propose a simple and numerically efficient method called Heuristically Smoothed JointExp (HSJointExp), which is endowed with performance guarantees for a broad class of distributions and achieves results that are orders of magnitude better on problematic datasets.
    Learning by Asking for Embodied Visual Navigation and Task Completion. (arXiv:2302.04865v1 [cs.CV])
    The research community has shown increasing interest in designing intelligent embodied agents that can assist humans in accomplishing tasks. Despite recent progress on related vision-language benchmarks, most prior work has focused on building agents that follow instructions rather than endowing agents the ability to ask questions to actively resolve ambiguities arising naturally in embodied environments. To empower embodied agents with the ability to interact with humans, in this work, we propose an Embodied Learning-By-Asking (ELBA) model that learns when and what questions to ask to dynamically acquire additional information for completing the task. We evaluate our model on the TEACH vision-dialog navigation and task completion dataset. Experimental results show that ELBA achieves improved task performance compared to baseline models without question-answering capabilities.
    TPU-MLIR: A Compiler For TPU Using MLIR. (arXiv:2210.15016v2 [cs.PL] UPDATED)
    Multi-level intermediate representations (MLIR) show great promise for reducing the cost of building domain-specific compilers by providing a reusable and extensible compiler infrastructure. This work presents TPU-MLIR, an end-to-end compiler based on MLIR that deploys pre-trained neural network (NN) models to a custom ASIC called a Tensor Processing Unit (TPU). TPU-MLIR defines two new dialects to implement its functionality: 1. a Tensor operation (TOP) dialect that encodes the deep learning graph semantics and independent of the deep learning framework and 2. a TPU kernel dialect to provide a standard kernel computation on TPU. A NN model is translated to the TOP dialect and then lowered to the TPU dialect for different TPUs according to the chip's configuration. We demonstrate how to use the MLIR pass pipeline to organize and perform optimization on TPU to generate machine code. The paper also presents a verification procedure to ensure the correctness of each transform stage.
    A Survey of Knowledge Tracing. (arXiv:2105.15106v3 [cs.CY] UPDATED)
    High-quality education is one of the keys to achieving a more sustainable world. In contrast to traditional face-to-face classroom education, online education enables us to record and research a large amount of learning data for offering intelligent educational services. Knowledge Tracing (KT), which aims to monitor students' evolving knowledge state in learning, is the fundamental task to support these intelligent services. In recent years, an increasing amount of research is focused on this emerging field and considerable progress has been made. In this survey, we categorize existing KT models from a technical perspective and investigate these models in a systematic manner. Subsequently, we review abundant variants of KT models that consider more strict learning assumptions from three phases: before, during, and after learning. To better facilitate researchers and practitioners working on this field, we open source two algorithm libraries: EduData for downloading and preprocessing KT-related datasets, and EduKTM with extensible and unified implementation of existing mainstream KT models. Moreover, the development of KT cannot be separated from its applications, therefore we further present typical KT applications in different scenarios. Finally, we discuss some potential directions for future research in this fast-growing field.
    What are the mechanisms underlying metacognitive learning?. (arXiv:2302.04840v1 [cs.LG])
    How is it that humans can solve complex planning tasks so efficiently despite limited cognitive resources? One reason is its ability to know how to use its limited computational resources to make clever choices. We postulate that people learn this ability from trial and error (metacognitive reinforcement learning). Here, we systematize models of the underlying learning mechanisms and enhance them with more sophisticated additional mechanisms. We fit the resulting 86 models to human data collected in previous experiments where different phenomena of metacognitive learning were demonstrated and performed Bayesian model selection. Our results suggest that a gradient ascent through the space of cognitive strategies can explain most of the observed qualitative phenomena, and is therefore a promising candidate for explaining the mechanism underlying metacognitive learning.
    Online Subset Selection using $\alpha$-Core with no Augmented Regret. (arXiv:2209.14222v3 [cs.LG] UPDATED)
    We revisit the classic problem of optimal subset selection in the online learning set-up. Assume that the set $[N]$ consists of $N$ distinct elements. On the $t$th round, an adversary chooses a monotone reward function $f_t: 2^{[N]} \to \mathbb{R}_+$ that assigns a non-negative reward to each subset of $[N].$ An online policy selects (perhaps randomly) a subset $S_t \subseteq [N]$ consisting of $k$ elements before the reward function $f_t$ for the $t$th round is revealed to the learner. As a consequence of its choice, the policy receives a reward of $f_t(S_t)$ on the $t$th round. Our goal is to design an online sequential subset selection policy to maximize the expected cumulative reward accumulated over a time horizon. In this connection, we propose an online learning policy called SCore (Subset Selection with Core) that solves the problem for a large class of reward functions. The proposed SCore policy is based on a new polyhedral characterization of the reward functions called $\alpha$-Core - a generalization of Core from the cooperative game theory literature. We establish a learning guarantee for the SCore policy in terms of a new performance metric called $\alpha$-augmented regret. In this new metric, the performance of the online policy is compared with an unrestricted offline benchmark that can select all $N$ elements at every round. We show that a large class of reward functions, including submodular, can be efficiently optimized with the SCore policy. We also extend the proposed policy to the optimistic learning set-up where the learner has access to additional untrusted hints regarding the reward functions. Finally, we conclude the paper with a list of open problems.
    CausalEGM: a general causal inference framework by encoding generative modeling. (arXiv:2212.05925v3 [stat.ML] UPDATED)
    Although understanding and characterizing causal effects have become essential in observational studies, it is challenging when the confounders are high-dimensional. In this article, we develop a general framework $\textit{CausalEGM}$ for estimating causal effects by encoding generative modeling, which can be applied in both binary and continuous treatment settings. Under the potential outcome framework with unconfoundedness, we establish a bidirectional transformation between the high-dimensional confounders space and a low-dimensional latent space where the density is known (e.g., multivariate normal distribution). Through this, CausalEGM simultaneously decouples the dependencies of confounders on both treatment and outcome and maps the confounders to the low-dimensional latent space. By conditioning on the low-dimensional latent features, CausalEGM can estimate the causal effect for each individual or the average causal effect within a population. Our theoretical analysis shows that the excess risk for CausalEGM can be bounded through empirical process theory. Under an assumption on encoder-decoder networks, the consistency of the estimate can be guaranteed. In a series of experiments, CausalEGM demonstrates superior performance over existing methods for both binary and continuous treatments. Specifically, we find CausalEGM to be substantially more powerful than competing methods in the presence of large sample sizes and high dimensional confounders. The software of CausalEGM is freely available at https://github.com/SUwonglab/CausalEGM.
    rMultiNet: An R Package For Multilayer Networks Analysis. (arXiv:2302.04437v1 [stat.ML])
    This paper develops an R package rMultiNet to analyze multilayer network data. We provide two general frameworks from recent literature, e.g. mixture multilayer stochastic block model(MMSBM) and mixture multilayer latent space model(MMLSM) to generate the multilayer network. We also provide several methods to reveal the embedding of both nodes and layers followed by further data analysis methods, such as clustering. Three real data examples are processed in the package. The source code of rMultiNet is available at https://github.com/ChenyuzZZ73/rMultiNet.
    Unsupervised Learning of Initialization in Deep Neural Networks via Maximum Mean Discrepancy. (arXiv:2302.04369v1 [cs.LG])
    Despite the recent success of stochastic gradient descent in deep learning, it is often difficult to train a deep neural network with an inappropriate choice of its initial parameters. Even if training is successful, it has been known that the initial parameter configuration may negatively impact generalization. In this paper, we propose an unsupervised algorithm to find good initialization for input data, given that a downstream task is d-way classification. We first notice that each parameter configuration in the parameter space corresponds to one particular downstream task of d-way classification. We then conjecture that the success of learning is directly related to how diverse downstream tasks are in the vicinity of the initial parameters. We thus design an algorithm that encourages small perturbation to the initial parameter configuration leads to a diverse set of d-way classification tasks. In other words, the proposed algorithm ensures a solution to any downstream task to be near the initial parameter configuration. We empirically evaluate the proposed algorithm on various tasks derived from MNIST with a fully connected network. In these experiments, we observe that our algorithm improves average test accuracy across most of these tasks, and that such improvement is greater when the number of labelled examples is small.
    Knowledge is a Region in Weight Space for Fine-tuned Language Models. (arXiv:2302.04863v1 [cs.LG])
    Research on neural networks has largely focused on understanding a single model trained on a single dataset. However, relatively little is known about the relationships between different models, especially those trained or tested on different datasets. We address this by studying how the weight space and underlying loss landscape of different models are interconnected. Specifically, we demonstrate that fine-tuned models that were optimized for high performance, reside in well-defined regions in weight space, and vice versa -- that any model that resides anywhere in those regions also has high performance. Specifically, we show that language models that have been fine-tuned on the same dataset form a tight cluster in the weight space and that models fine-tuned on different datasets from the same underlying task form a looser cluster. Moreover, traversing around the region between the models reaches new models that perform comparably or even better than models found via fine-tuning, even on tasks that the original models were not fine-tuned on. Our findings provide insight into the relationships between models, demonstrating that a model positioned between two similar models can acquire the knowledge of both. We leverage this finding and design a method to pick a better model for efficient fine-tuning. Specifically, we show that starting from the center of the region is as good or better than the pre-trained model in 11 of 12 datasets and improves accuracy by 3.06 on average.
    Heterogeneous Federated Learning using Dynamic Model Pruning and Adaptive Gradient. (arXiv:2106.06921v2 [cs.LG] UPDATED)
    Federated Learning (FL) has emerged as a new paradigm for training machine learning models distributively without sacrificing data security and privacy. Learning models on edge devices such as mobile phones is one of the most common use cases for FL. However, Non-identical independent distributed~(non-IID) data in edge devices easily leads to training failures. Especially, over-parameterized machine learning models can easily be over-fitted on such data, hence, resulting in inefficient federated learning and poor model performance. To overcome the over-fitting issue, we proposed an adaptive dynamic pruning approach for FL, which can dynamically slim the model by dropping out unimportant parameters, hence, preventing over-fittings. Since the machine learning model's parameters react differently for different training samples, adaptive dynamic pruning will evaluate the salience of the model's parameter according to the input training sample, and only retain the salient parameter's gradients when doing back-propagation. We performed comprehensive experiments to evaluate our approach. The results show that our approach by removing the redundant parameters in neural networks can significantly reduce the over-fitting issue and greatly improves the training efficiency. In particular, when training the ResNet-32 on CIFAR-10, our approach reduces the communication cost by 57\%. We further demonstrate the inference acceleration capability of the proposed algorithm. Our approach reduces up to 50\% FLOPs inference of DNNs on edge devices while maintaining the model's quality.
    Improved Robustness and Safety for Pre-Adaptation of Meta Reinforcement Learning with Prior Regularization. (arXiv:2108.08448v2 [cs.LG] UPDATED)
    Meta Reinforcement Learning (Meta-RL) has seen substantial advancements recently. In particular, off-policy methods were developed to improve the data efficiency of Meta-RL techniques. \textit{Probabilistic embeddings for actor-critic RL} (PEARL) is a leading approach for multi-MDP adaptation problems. A major drawback of many existing Meta-RL methods, including PEARL, is that they do not explicitly consider the safety of the prior policy when it is exposed to a new task for the first time. Safety is essential for many real-world applications, including field robots and Autonomous Vehicles (AVs). In this paper, we develop the PEARL PLUS (PEARL$^+$) algorithm, which optimizes the policy for both prior (pre-adaptation) safety and posterior (after-adaptation) performance. Building on top of PEARL, our proposed PEARL$^+$ algorithm introduces a prior regularization term in the reward function and a new Q-network for recovering the state-action value under prior context assumptions, to improve the robustness to task distribution shift and safety of the trained network exposed to a new task for the first time. The performance of PEARL$^+$ is validated by solving three safety-critical problems related to robots and AVs, including two MuJoCo benchmark problems. From the simulation experiments, we show that safety of the prior policy is significantly improved and more robust to task distribution shift compared to PEARL.
    Projection-free Online Exp-concave Optimization. (arXiv:2302.04859v1 [cs.LG])
    We consider the setting of online convex optimization (OCO) with \textit{exp-concave} losses. The best regret bound known for this setting is $O(n\log{}T)$, where $n$ is the dimension and $T$ is the number of prediction rounds (treating all other quantities as constants and assuming $T$ is sufficiently large), and is attainable via the well-known Online Newton Step algorithm (ONS). However, ONS requires on each iteration to compute a projection (according to some matrix-induced norm) onto the feasible convex set, which is often computationally prohibitive in high-dimensional settings and when the feasible set admits a non-trivial structure. In this work we consider projection-free online algorithms for exp-concave and smooth losses, where by projection-free we refer to algorithms that rely only on the availability of a linear optimization oracle (LOO) for the feasible set, which in many applications of interest admits much more efficient implementations than a projection oracle. We present an LOO-based ONS-style algorithm, which using overall $O(T)$ calls to a LOO, guarantees in worst case regret bounded by $\widetilde{O}(n^{2/3}T^{2/3})$ (ignoring all quantities except for $n,T$). However, our algorithm is most interesting in an important and plausible low-dimensional data scenario: if the gradients (approximately) span a subspace of dimension at most $\rho$, $\rho << n$, the regret bound improves to $\widetilde{O}(\rho^{2/3}T^{2/3})$, and by applying standard deterministic sketching techniques, both the space and average additional per-iteration runtime requirements are only $O(\rho{}n)$ (instead of $O(n^2)$). This improves upon recently proposed LOO-based algorithms for OCO which, while having the same state-of-the-art dependence on the horizon $T$, suffer from regret/oracle complexity that scales with $\sqrt{n}$ or worse.
    A Near-Optimal Algorithm for Safe Reinforcement Learning Under Instantaneous Hard Constraints. (arXiv:2302.04375v1 [cs.LG])
    In many applications of Reinforcement Learning (RL), it is critically important that the algorithm performs safely, such that instantaneous hard constraints are satisfied at each step, and unsafe states and actions are avoided. However, existing algorithms for ''safe'' RL are often designed under constraints that either require expected cumulative costs to be bounded or assume all states are safe. Thus, such algorithms could violate instantaneous hard constraints and traverse unsafe states (and actions) in practice. Therefore, in this paper, we develop the first near-optimal safe RL algorithm for episodic Markov Decision Processes with unsafe states and actions under instantaneous hard constraints and the linear mixture model. It not only achieves a regret $\tilde{O}(\frac{d H^3 \sqrt{dK}}{\Delta_c})$ that tightly matches the state-of-the-art regret in the setting with only unsafe actions and nearly matches that in the unconstrained setting, but is also safe at each step, where $d$ is the feature-mapping dimension, $K$ is the number of episodes, $H$ is the number of steps in each episode, and $\Delta_c$ is a safety-related parameter. We also provide a lower bound $\tilde{\Omega}(\max\{dH \sqrt{K}, \frac{H}{\Delta_c^2}\})$, which indicates that the dependency on $\Delta_c$ is necessary. Further, both our algorithm design and regret analysis involve several novel ideas, which may be of independent interest.
    Lithium Metal Battery Quality Control via Transformer-CNN Segmentation. (arXiv:2302.04824v1 [cs.CV])
    Lithium metal battery (LMB) has the potential to be the next-generation battery system because of their high theoretical energy density. However, defects known as dendrites are formed by heterogeneous lithium (Li) plating, which hinder the development and utilization of LMBs. Non-destructive techniques to observe the dendrite morphology often use computerized X-ray tomography (XCT) imaging to provide cross-sectional views. To retrieve three-dimensional structures inside a battery, image segmentation becomes essential to quantitatively analyze XCT images. This work proposes a new binary semantic segmentation approach using a transformer-based neural network (T-Net) model capable of segmenting out dendrites from XCT data. In addition, we compare the performance of the proposed T-Net with three other algorithms, such as U-Net, Y-Net, and E-Net, consisting of an Ensemble Network model for XCT analysis. Our results show the advantages of using T-Net in terms of object metrics, such as mean Intersection over Union (mIoU) and mean Dice Similarity Coefficient (mDSC) as well as qualitatively through several comparative visualizations.
    Better Diffusion Models Further Improve Adversarial Training. (arXiv:2302.04638v1 [cs.CV])
    It has been recognized that the data generated by the denoising diffusion probabilistic model (DDPM) improves adversarial training. After two years of rapid development in diffusion models, a question naturally arises: can better diffusion models further improve adversarial training? This paper gives an affirmative answer by employing the most recent diffusion model which has higher efficiency ($\sim 20$ sampling steps) and image quality (lower FID score) compared with DDPM. Our adversarially trained models achieve state-of-the-art performance on RobustBench using only generated data (no external datasets). Under the $\ell_\infty$-norm threat model with $\epsilon=8/255$, our models achieve $70.69\%$ and $42.67\%$ robust accuracy on CIFAR-10 and CIFAR-100, respectively, i.e. improving upon previous state-of-the-art models by $+4.58\%$ and $+8.03\%$. Under the $\ell_2$-norm threat model with $\epsilon=128/255$, our models achieve $84.86\%$ on CIFAR-10 ($+4.44\%$). These results also beat previous works that use external data. Our code is available at https://github.com/wzekai99/DM-Improves-AT.
    Symbolic Metamodels for Interpreting Black-boxes Using Primitive Functions. (arXiv:2302.04791v1 [cs.LG])
    One approach for interpreting black-box machine learning models is to find a global approximation of the model using simple interpretable functions, which is called a metamodel (a model of the model). Approximating the black-box with a metamodel can be used to 1) estimate instance-wise feature importance; 2) understand the functional form of the model; 3) analyze feature interactions. In this work, we propose a new method for finding interpretable metamodels. Our approach utilizes Kolmogorov superposition theorem, which expresses multivariate functions as a composition of univariate functions (our primitive parameterized functions). This composition can be represented in the form of a tree. Inspired by symbolic regression, we use a modified form of genetic programming to search over different tree configurations. Gradient descent (GD) is used to optimize the parameters of a given configuration. Our method is a novel memetic algorithm that uses GD not only for training numerical constants but also for the training of building blocks. Using several experiments, we show that our method outperforms recent metamodeling approaches suggested for interpreting black-boxes.
    Learning to Select Pivotal Samples for Meta Re-weighting. (arXiv:2302.04418v1 [cs.LG])
    Sample re-weighting strategies provide a promising mechanism to deal with imperfect training data in machine learning, such as noisily labeled or class-imbalanced data. One such strategy involves formulating a bi-level optimization problem called the meta re-weighting problem, whose goal is to optimize performance on a small set of perfect pivotal samples, called meta samples. Many approaches have been proposed to efficiently solve this problem. However, all of them assume that a perfect meta sample set is already provided while we observe that the selections of meta sample set is performance critical. In this paper, we study how to learn to identify such a meta sample set from a large, imperfect training set, that is subsequently cleaned and used to optimize performance in the meta re-weighting setting. We propose a learning framework which reduces the meta samples selection problem to a weighted K-means clustering problem through rigorously theoretical analysis. We propose two clustering methods within our learning framework, Representation-based clustering method (RBC) and Gradient-based clustering method (GBC), for balancing performance and computational efficiency. Empirical studies demonstrate the performance advantage of our methods over various baseline methods.
    Complex Network for Complex Problems: A comparative study of CNN and Complex-valued CNN. (arXiv:2302.04584v1 [cs.CV])
    Neural networks, especially convolutional neural networks (CNN), are one of the most common tools these days used in computer vision. Most of these networks work with real-valued data using real-valued features. Complex-valued convolutional neural networks (CV-CNN) can preserve the algebraic structure of complex-valued input data and have the potential to learn more complex relationships between the input and the ground-truth. Although some comparisons of CNNs and CV-CNNs for different tasks have been performed in the past, a large-scale investigation comparing different models operating on different tasks has not been conducted. Furthermore, because complex features contain both real and imaginary components, CV-CNNs have double the number of trainable parameters as real-valued CNNs in terms of the actual number of trainable parameters. Whether or not the improvements in performance with CV-CNN observed in the past have been because of the complex features or just because of having double the number of trainable parameters has not yet been explored. This paper presents a comparative study of CNN, CNNx2 (CNN with double the number of trainable parameters as the CNN), and CV-CNN. The experiments were performed using seven models for two different tasks - brain tumour classification and segmentation in brain MRIs. The results have revealed that the CV-CNN models outperformed the CNN and CNNx2 models.
    Adversarial Example Does Good: Preventing Painting Imitation from Diffusion Models via Adversarial Examples. (arXiv:2302.04578v1 [cs.CV])
    Diffusion Models (DMs) achieve state-of-the-art performance in generative tasks, boosting a wave in AI for Art. Despite the success of commercialization, DMs meanwhile provide tools for copyright violations, where infringers benefit from illegally using paintings created by human artists to train DMs and generate novel paintings in a similar style. In this paper, we show that it is possible to create an image $x'$ that is similar to an image $x$ for human vision but unrecognizable for DMs. We build a framework to define and evaluate this adversarial example for diffusion models. Based on the framework, we further propose AdvDM, an algorithm to generate adversarial examples for DMs. By optimizing upon different latent variables sampled from the reverse process of DMs, AdvDM conducts a Monte-Carlo estimation of adversarial examples for DMs. Extensive experiments show that the estimated adversarial examples can effectively hinder DMs from extracting their features. Our method can be a powerful tool for human artists to protect their copyright against infringers with DM-based AI-for-Art applications.
    Q-Diffusion: Quantizing Diffusion Models. (arXiv:2302.04304v1 [cs.CV])
    Diffusion models have recently achieved great success in synthesizing diverse and high-fidelity images. However, sampling speed and memory constraints remain a major barrier to the practical adoption of diffusion models as the generation process for these models can be slow due to the need for iterative noise estimation using complex neural networks. We propose a solution to this problem by compressing the noise estimation network to accelerate the generation process using post-training quantization (PTQ). While existing PTQ approaches have not been able to effectively deal with the changing output distributions of noise estimation networks in diffusion models over multiple time steps, we are able to formulate a PTQ method that is specifically designed to handle the unique multi-timestep structure of diffusion models with a data calibration scheme using data sampled from different time steps. Experimental results show that our proposed method is able to directly quantize full-precision diffusion models into 8-bit or 4-bit models while maintaining comparable performance in a training-free manner, achieving a FID change of at most 1.88. Our approach can also be applied to text-guided image generation, and for the first time we can run stable diffusion in 4-bit weights without losing much perceptual quality, as shown in Figure 5 and Figure 9.
    Scalable Task-Driven Robotic Swarm Control via Collision Avoidance and Learning Mean-Field Control. (arXiv:2209.07420v3 [cs.RO] UPDATED)
    In recent years, reinforcement learning and its multi-agent analogue have achieved great success in solving various complex control problems. However, multi-agent reinforcement learning remains challenging both in its theoretical analysis and empirical design of algorithms, especially for large swarms of embodied robotic agents where a definitive toolchain remains part of active research. We use emerging state-of-the-art mean-field control techniques in order to convert many-agent swarm control into more classical single-agent control of distributions. This allows profiting from advances in single-agent reinforcement learning at the cost of assuming weak interaction between agents. However, the mean-field model is violated by the nature of real systems with embodied, physically colliding agents. Thus, we combine collision avoidance and learning of mean-field control into a unified framework for tractably designing intelligent robotic swarm behavior. On the theoretical side, we provide novel approximation guarantees for general mean-field control both in continuous spaces and with collision avoidance. On the practical side, we show that our approach outperforms multi-agent reinforcement learning and allows for decentralized open-loop application while avoiding collisions, both in simulation and real UAV swarms. Overall, we propose a framework for the design of swarm behavior that is both mathematically well-founded and practically useful, enabling the solution of otherwise intractable swarm problems.
    Offsite-Tuning: Transfer Learning without Full Model. (arXiv:2302.04870v1 [cs.CL])
    Transfer learning is important for foundation models to adapt to downstream tasks. However, many foundation models are proprietary, so users must share their data with model owners to fine-tune the models, which is costly and raise privacy concerns. Moreover, fine-tuning large foundation models is computation-intensive and impractical for most downstream users. In this paper, we propose Offsite-Tuning, a privacy-preserving and efficient transfer learning framework that can adapt billion-parameter foundation models to downstream data without access to the full model. In offsite-tuning, the model owner sends a light-weight adapter and a lossy compressed emulator to the data owner, who then fine-tunes the adapter on the downstream data with the emulator's assistance. The fine-tuned adapter is then returned to the model owner, who plugs it into the full model to create an adapted foundation model. Offsite-tuning preserves both parties' privacy and is computationally more efficient than the existing fine-tuning methods that require access to the full model weights. We demonstrate the effectiveness of offsite-tuning on various large language and vision foundation models. Offsite-tuning can achieve comparable accuracy as full model fine-tuning while being privacy-preserving and efficient, achieving 6.5x speedup and 5.6x memory reduction. Code is available at https://github.com/mit-han-lab/offsite-tuning.
    Near-Optimal Adversarial Reinforcement Learning with Switching Costs. (arXiv:2302.04374v1 [cs.LG])
    Switching costs, which capture the costs for changing policies, are regarded as a critical metric in reinforcement learning (RL), in addition to the standard metric of losses (or rewards). However, existing studies on switching costs (with a coefficient $\beta$ that is strictly positive and is independent of $T$) have mainly focused on static RL, where the loss distribution is assumed to be fixed during the learning process, and thus practical scenarios where the loss distribution could be non-stationary or even adversarial are not considered. While adversarial RL better models this type of practical scenarios, an open problem remains: how to develop a provably efficient algorithm for adversarial RL with switching costs? This paper makes the first effort towards solving this problem. First, we provide a regret lower-bound that shows that the regret of any algorithm must be larger than $\tilde{\Omega}( ( H S A )^{1/3} T^{2/3} )$, where $T$, $S$, $A$ and $H$ are the number of episodes, states, actions and layers in each episode, respectively. Our lower bound indicates that, due to the fundamental challenge of switching costs in adversarial RL, the best achieved regret (whose dependency on $T$ is $\tilde{O}(\sqrt{T})$) in static RL with switching costs (as well as adversarial RL without switching costs) is no longer achievable. Moreover, we propose two novel switching-reduced algorithms with regrets that match our lower bound when the transition function is known, and match our lower bound within a small factor of $\tilde{O}( H^{1/3} )$ when the transition function is unknown. Our regret analysis demonstrates the near-optimal performance of them.
    Mind the Gap: Measuring Generalization Performance Across Multiple Objectives. (arXiv:2212.04183v2 [cs.LG] UPDATED)
    Modern machine learning models are often constructed taking into account multiple objectives, e.g., minimizing inference time while also maximizing accuracy. Multi-objective hyperparameter optimization (MHPO) algorithms return such candidate models, and the approximation of the Pareto front is used to assess their performance. In practice, we also want to measure generalization when moving from the validation to the test set. However, some of the models might no longer be Pareto-optimal which makes it unclear how to quantify the performance of the MHPO method when evaluated on the test set. To resolve this, we provide a novel evaluation protocol that allows measuring the generalization performance of MHPO methods and studying its capabilities for comparing two optimization experiments.
    Performative Recommendation: Diversifying Content via Strategic Incentives. (arXiv:2302.04336v1 [cs.LG])
    The primary goal in recommendation is to suggest relevant content to users, but optimizing for accuracy often results in recommendations that lack diversity. To remedy this, conventional approaches such as re-ranking improve diversity by presenting more diverse items. Here we argue that to promote inherent and prolonged diversity, the system must encourage its creation. Towards this, we harness the performative nature of recommendation, and show how learning can incentivize strategic content creators to create diverse content. Our approach relies on a novel form of regularization that anticipates strategic changes to content, and penalizes for content homogeneity. We provide analytic and empirical results that demonstrate when and how diversity can be incentivized, and experimentally demonstrate the utility of our approach on synthetic and semi-synthetic data.  ( 2 min )
  • Open

    CausalEGM: a general causal inference framework by encoding generative modeling. (arXiv:2212.05925v3 [stat.ML] UPDATED)
    Although understanding and characterizing causal effects have become essential in observational studies, it is challenging when the confounders are high-dimensional. In this article, we develop a general framework $\textit{CausalEGM}$ for estimating causal effects by encoding generative modeling, which can be applied in both binary and continuous treatment settings. Under the potential outcome framework with unconfoundedness, we establish a bidirectional transformation between the high-dimensional confounders space and a low-dimensional latent space where the density is known (e.g., multivariate normal distribution). Through this, CausalEGM simultaneously decouples the dependencies of confounders on both treatment and outcome and maps the confounders to the low-dimensional latent space. By conditioning on the low-dimensional latent features, CausalEGM can estimate the causal effect for each individual or the average causal effect within a population. Our theoretical analysis shows that the excess risk for CausalEGM can be bounded through empirical process theory. Under an assumption on encoder-decoder networks, the consistency of the estimate can be guaranteed. In a series of experiments, CausalEGM demonstrates superior performance over existing methods for both binary and continuous treatments. Specifically, we find CausalEGM to be substantially more powerful than competing methods in the presence of large sample sizes and high dimensional confounders. The software of CausalEGM is freely available at https://github.com/SUwonglab/CausalEGM.
    rMultiNet: An R Package For Multilayer Networks Analysis. (arXiv:2302.04437v1 [stat.ML])
    This paper develops an R package rMultiNet to analyze multilayer network data. We provide two general frameworks from recent literature, e.g. mixture multilayer stochastic block model(MMSBM) and mixture multilayer latent space model(MMLSM) to generate the multilayer network. We also provide several methods to reveal the embedding of both nodes and layers followed by further data analysis methods, such as clustering. Three real data examples are processed in the package. The source code of rMultiNet is available at https://github.com/ChenyuzZZ73/rMultiNet.
    Flaky Performances when Pretraining on Relational Databases. (arXiv:2211.05213v1 [cs.LG] CROSS LISTED)
    We explore the downstream task performances for graph neural network (GNN) self-supervised learning (SSL) methods trained on subgraphs extracted from relational databases (RDBs). Intuitively, this joint use of SSL and GNNs should allow to leverage more of the available data, which could translate to better results. However, we found that naively porting contrastive SSL techniques can cause ``negative transfer'': linear evaluation on fixed representations from a pretrained model performs worse than on representations from the randomly-initialized model. Based on the conjecture that contrastive SSL conflicts with the message passing layers of the GNN, we propose InfoNode: a contrastive loss aiming to maximize the mutual information between a node's initial- and final-layer representation. The primary empirical results support our conjecture and the effectiveness of InfoNode.
    Trading Information between Latents in Hierarchical Variational Autoencoders. (arXiv:2302.04855v1 [stat.ML])
    Variational Autoencoders (VAEs) were originally motivated (Kingma & Welling, 2014) as probabilistic generative models in which one performs approximate Bayesian inference. The proposal of $\beta$-VAEs (Higgins et al., 2017) breaks this interpretation and generalizes VAEs to application domains beyond generative modeling (e.g., representation learning, clustering, or lossy data compression) by introducing an objective function that allows practitioners to trade off between the information content ("bit rate") of the latent representation and the distortion of reconstructed data (Alemi et al., 2018). In this paper, we reconsider this rate/distortion trade-off in the context of hierarchical VAEs, i.e., VAEs with more than one layer of latent variables. We identify a general class of inference models for which one can split the rate into contributions from each layer, which can then be tuned independently. We derive theoretical bounds on the performance of downstream tasks as functions of the individual layers' rates and verify our theoretical findings in large-scale experiments. Our results provide guidance for practitioners on which region in rate-space to target for a given application.
    Random survival forests with multivariate longitudinal endogenous covariates. (arXiv:2208.05801v2 [stat.ML] UPDATED)
    Predicting the individual risk of a clinical event using the complete patient history is still a major challenge for personalized medicine. Among the methods developed to compute individual dynamic predictions, the joint models have the assets of using all the available information while accounting for dropout. However, they are restricted to a very small number of longitudinal predictors. Our objective was to propose an innovative alternative solution to predict an event probability using a possibly large number of longitudinal predictors. We developed DynForest, an extension of competing-risk random survival forests that handles endogenous longitudinal predictors. At each node of the tree, the time-dependent predictors are translated into time-fixed features (using mixed models) to be used as candidates for splitting the subjects into two subgroups. The individual event probability is estimated in each tree by the Aalen-Johansen estimator of the leaf in which the subject is classified according to his/her history of predictors. The final individual prediction is given by the average of the tree-specific individual event probabilities. We carried out a simulation study to demonstrate the performances of DynForest both in a small dimensional context (in comparison with joint models) and in a large dimensional context (in comparison with a regression calibration method that ignores informative dropout). We also applied DynForest to (i) predict the individual probability of dementia in the elderly according to repeated measures of cognitive, functional, vascular and neuro-degeneration markers, and (ii) quantify the importance of each type of markers for the prediction of dementia. Implemented in the R package DynForest, our methodology provides a novel and appropriate solution for the prediction of events from any number of longitudinal endogenous predictors.
    Fully Bayesian Autoencoders with Latent Sparse Gaussian Processes. (arXiv:2302.04534v1 [cs.LG])
    Autoencoders and their variants are among the most widely used models in representation learning and generative modeling. However, autoencoder-based models usually assume that the learned representations are i.i.d. and fail to capture the correlations between the data samples. To address this issue, we propose a novel Sparse Gaussian Process Bayesian Autoencoder (SGPBAE) model in which we impose fully Bayesian sparse Gaussian Process priors on the latent space of a Bayesian Autoencoder. We perform posterior estimation for this model via stochastic gradient Hamiltonian Monte Carlo. We evaluate our approach qualitatively and quantitatively on a wide range of representation learning and generative modeling tasks and show that our approach consistently outperforms multiple alternatives relying on Variational Autoencoders.
    A Benchmark on Uncertainty Quantification for Deep Learning Prognostics. (arXiv:2302.04730v1 [cs.LG])
    Reliable uncertainty quantification on RUL prediction is crucial for informative decision-making in predictive maintenance. In this context, we assess some of the latest developments in the field of uncertainty quantification for prognostics deep learning. This includes the state-of-the-art variational inference algorithms for Bayesian neural networks (BNN) as well as popular alternatives such as Monte Carlo Dropout (MCD), deep ensembles (DE) and heteroscedastic neural networks (HNN). All the inference techniques share the same inception deep learning architecture as a functional model. We performed hyperparameter search to optimize the main variational and learning parameters of the algorithms. The performance of the methods is evaluated on a subset of the large NASA NCMAPSS dataset for aircraft engines. The assessment includes RUL prediction accuracy, the quality of predictive uncertainty, and the possibility to break down the total predictive uncertainty into its aleatoric and epistemic parts. The results show no method clearly outperforms the others in all the situations. Although all methods are close in terms of accuracy, we find differences in the way they estimate uncertainty. Thus, DE and MCD generally provide more conservative predictive uncertainty than BNN. Surprisingly, HNN can achieve strong results without the added training complexity and extra parameters of the BNN. For tasks like active learning where a separation of epistemic and aleatoric uncertainty is required, radial BNN and MCD seem the best options.
    How degenerate is the parametrization of neural networks with the ReLU activation function?. (arXiv:1905.09803v3 [cs.LG] UPDATED)
    Neural network training is usually accomplished by solving a non-convex optimization problem using stochastic gradient descent. Although one optimizes over the networks parameters, the main loss function generally only depends on the realization of the neural network, i.e. the function it computes. Studying the optimization problem over the space of realizations opens up new ways to understand neural network training. In particular, usual loss functions like mean squared error and categorical cross entropy are convex on spaces of neural network realizations, which themselves are non-convex. Approximation capabilities of neural networks can be used to deal with the latter non-convexity, which allows us to establish that for sufficiently large networks local minima of a regularized optimization problem on the realization space are almost optimal. Note, however, that each realization has many different, possibly degenerate, parametrizations. In particular, a local minimum in the parametrization space needs not correspond to a local minimum in the realization space. To establish such a connection, inverse stability of the realization map is required, meaning that proximity of realizations must imply proximity of corresponding parametrizations. We present pathologies which prevent inverse stability in general, and, for shallow networks, proceed to establish a restricted space of parametrizations on which we have inverse stability w.r.t. to a Sobolev norm. Furthermore, we show that by optimizing over such restricted sets, it is still possible to learn any function which can be learned by optimization over unrestricted sets.
    Continual Causal Effect Estimation: Challenges and Opportunities. (arXiv:2301.01026v3 [cs.LG] UPDATED)
    A further understanding of cause and effect within observational data is critical across many domains, such as economics, health care, public policy, web mining, online advertising, and marketing campaigns. Although significant advances have been made to overcome the challenges in causal effect estimation with observational data, such as missing counterfactual outcomes and selection bias between treatment and control groups, the existing methods mainly focus on source-specific and stationary observational data. Such learning strategies assume that all observational data are already available during the training phase and from only one source. This practical concern of accessibility is ubiquitous in various academic and industrial applications. That's what it boiled down to: in the era of big data, we face new challenges in causal inference with observational data, i.e., the extensibility for incrementally available observational data, the adaptability for extra domain adaptation problem except for the imbalance between treatment and control groups, and the accessibility for an enormous amount of data. In this position paper, we formally define the problem of continual treatment effect estimation, describe its research challenges, and then present possible solutions to this problem. Moreover, we will discuss future research directions on this topic.
    On Sampling with Approximate Transport Maps. (arXiv:2302.04763v1 [stat.ML])
    Transport maps can ease the sampling of distributions with non-trivial geometries by transforming them into distributions that are easier to handle. The potential of this approach has risen with the development of Normalizing Flows (NF) which are maps parameterized with deep neural networks trained to push a reference distribution towards a target. NF-enhanced samplers recently proposed blend (Markov chain) Monte Carlo methods with either (i) proposal draws from the flow or (ii) a flow-based reparametrization. In both cases, the quality of the learned transport conditions performance. The present work clarifies for the first time the relative strengths and weaknesses of these two approaches. Our study concludes that multimodal targets can reliability be handled with flow-based proposals up to moderately high dimensions. In contrast, methods relying on reparametrization struggle with multimodality but are more robust otherwise in high-dimensional settings and under poor training. To further illustrate the influence of target-proposal adequacy, we also derive a new quantitative bound for the mixing time of the Independent Metropolis-Hastings sampler.
    Domain Generalization by Functional Regression. (arXiv:2302.04724v1 [cs.LG])
    The problem of domain generalization is to learn, given data from different source distributions, a model that can be expected to generalize well on new target distributions which are only seen through unlabeled samples. In this paper, we study domain generalization as a problem of functional regression. Our concept leads to a new algorithm for learning a linear operator from marginal distributions of inputs to the corresponding conditional distributions of outputs given inputs. Our algorithm allows a source distribution-dependent construction of reproducing kernel Hilbert spaces for prediction, and, satisfies finite sample error bounds for the idealized risk. Numerical implementations and source code are available.
    Near-Optimal Adversarial Reinforcement Learning with Switching Costs. (arXiv:2302.04374v1 [cs.LG])
    Switching costs, which capture the costs for changing policies, are regarded as a critical metric in reinforcement learning (RL), in addition to the standard metric of losses (or rewards). However, existing studies on switching costs (with a coefficient $\beta$ that is strictly positive and is independent of $T$) have mainly focused on static RL, where the loss distribution is assumed to be fixed during the learning process, and thus practical scenarios where the loss distribution could be non-stationary or even adversarial are not considered. While adversarial RL better models this type of practical scenarios, an open problem remains: how to develop a provably efficient algorithm for adversarial RL with switching costs? This paper makes the first effort towards solving this problem. First, we provide a regret lower-bound that shows that the regret of any algorithm must be larger than $\tilde{\Omega}( ( H S A )^{1/3} T^{2/3} )$, where $T$, $S$, $A$ and $H$ are the number of episodes, states, actions and layers in each episode, respectively. Our lower bound indicates that, due to the fundamental challenge of switching costs in adversarial RL, the best achieved regret (whose dependency on $T$ is $\tilde{O}(\sqrt{T})$) in static RL with switching costs (as well as adversarial RL without switching costs) is no longer achievable. Moreover, we propose two novel switching-reduced algorithms with regrets that match our lower bound when the transition function is known, and match our lower bound within a small factor of $\tilde{O}( H^{1/3} )$ when the transition function is unknown. Our regret analysis demonstrates the near-optimal performance of them.
    Geometry-Complete Diffusion for 3D Molecule Generation. (arXiv:2302.04313v1 [cs.LG])
    Denoising diffusion probabilistic models (DDPMs) have recently taken the field of generative modeling by storm, pioneering new state-of-the-art results in disciplines such as computer vision and computational biology for diverse tasks ranging from text-guided image generation to structure-guided protein design. Along this latter line of research, methods such as those of Hoogeboom et al. 2022 have been proposed for unconditionally generating 3D molecules using equivariant graph neural networks (GNNs) within a DDPM framework. Toward this end, we propose GCDM, a geometry-complete diffusion model that achieves new state-of-the-art results for 3D molecule diffusion generation by leveraging the representation learning strengths offered by GNNs that perform geometry-complete message-passing. Our results with GCDM also offer preliminary insights into how physical inductive biases impact the generative dynamics of molecular DDPMs. The source code, data, and instructions to train new models or reproduce our results are freely available at https://github.com/BioinfoMachineLearning/bio-diffusion.
    Efficient displacement convex optimization with particle gradient descent. (arXiv:2302.04753v1 [cs.LG])
    Particle gradient descent, which uses particles to represent a probability measure and performs gradient descent on particles in parallel, is widely used to optimize functions of probability measures. This paper considers particle gradient descent with a finite number of particles and establishes its theoretical guarantees to optimize functions that are \emph{displacement convex} in measures. Concretely, for Lipschitz displacement convex functions defined on probability over $\mathbb{R}^d$, we prove that $O(1/\epsilon^2)$ particles and $O(d/\epsilon^4)$ computations are sufficient to find the $\epsilon$-optimal solutions. We further provide improved complexity bounds for optimizing smooth displacement convex functions. We demonstrate the application of our results for function approximation with specific neural architectures with two-dimensional inputs.
    The Sample Complexity of Approximate Rejection Sampling with Applications to Smoothed Online Learning. (arXiv:2302.04658v1 [stat.ML])
    Suppose we are given access to $n$ independent samples from distribution $\mu$ and we wish to output one of them with the goal of making the output distributed as close as possible to a target distribution $\nu$. In this work we show that the optimal total variation distance as a function of $n$ is given by $\tilde\Theta(\frac{D}{f'(n)})$ over the class of all pairs $\nu,\mu$ with a bounded $f$-divergence $D_f(\nu\|\mu)\leq D$. Previously, this question was studied only for the case when the Radon-Nikodym derivative of $\nu$ with respect to $\mu$ is uniformly bounded. We then consider an application in the seemingly very different field of smoothed online learning, where we show that recent results on the minimax regret and the regret of oracle-efficient algorithms still hold even under relaxed constraints on the adversary (to have bounded $f$-divergence, as opposed to bounded Radon-Nikodym derivative). Finally, we also study efficacy of importance sampling for mean estimates uniform over a function class and compare importance sampling with rejection sampling.
    A data variation robust learning model based on importance sampling. (arXiv:2302.04438v1 [stat.ML])
    A crucial assumption underlying the most current theory of machine learning is that the training distribution is identical to the testing distribution. However, this assumption may not hold in some real-world applications. In this paper, we propose an importance sampling based data variation robust loss (ISloss) for learning problems which minimizes the worst case of loss under the constraint of distribution deviation. The distribution deviation constraint can be converted to the constraint over a set of weight distributions centered on the uniform distribution derived from the importance sampling method. Furthermore, we reveal that there is a relationship between ISloss under the logarithmic transformation (LogISloss) and the p-norm loss. We apply the proposed LogISloss to the face verification problem on Racial Faces in the Wild dataset and show that the proposed method is robust under large distribution deviations.
    Sparse Random Networks for Communication-Efficient Federated Learning. (arXiv:2209.15328v2 [cs.LG] UPDATED)
    One main challenge in federated learning is the large communication cost of exchanging weight updates from clients to the server at each round. While prior work has made great progress in compressing the weight updates through gradient compression methods, we propose a radically different approach that does not update the weights at all. Instead, our method freezes the weights at their initial \emph{random} values and learns how to sparsify the random network for the best performance. To this end, the clients collaborate in training a \emph{stochastic} binary mask to find the optimal sparse random network within the original one. At the end of the training, the final model is a sparse network with random weights -- or a subnetwork inside the dense random network. We show improvements in accuracy, communication (less than $1$ bit per parameter (bpp)), convergence speed, and final model size (less than $1$ bpp) over relevant baselines on MNIST, EMNIST, CIFAR-10, and CIFAR-100 datasets, in the low bitrate regime under various system configurations.
    Efficient Planning in Combinatorial Action Spaces with Applications to Cooperative Multi-Agent Reinforcement Learning. (arXiv:2302.04376v1 [cs.LG])
    A practical challenge in reinforcement learning are combinatorial action spaces that make planning computationally demanding. For example, in cooperative multi-agent reinforcement learning, a potentially large number of agents jointly optimize a global reward function, which leads to a combinatorial blow-up in the action space by the number of agents. As a minimal requirement, we assume access to an argmax oracle that allows to efficiently compute the greedy policy for any Q-function in the model class. Building on recent work in planning with local access to a simulator and linear function approximation, we propose efficient algorithms for this setting that lead to polynomial compute and query complexity in all relevant problem parameters. For the special case where the feature decomposition is additive, we further improve the bounds and extend the results to the kernelized setting with an efficient algorithm.
    Local Lipschitz Bounds of Deep Neural Networks. (arXiv:2004.13135v3 [stat.ML] UPDATED)
    The Lipschitz constant is an important quantity that arises in analysing the convergence of gradient-based optimization methods. It is generally unclear how to estimate the Lipschitz constant of a complex model. Thus, this paper studies an important problem that may be useful to the broader area of non-convex optimization. The main result provides a local upper bound on the Lipschitz constants of a multi-layer feed-forward neural network and its gradient. Moreover, lower bounds are established as well, which are used to show that it is impossible to derive global upper bounds for the Lipschitz constants. In contrast to previous works, we compute the Lipschitz constants with respect to the network parameters and not with respect to the inputs. These constants are needed for the theoretical description of many step size schedulers of gradient based optimization schemes and their convergence analysis. The idea is both simple and effective. The results are extended to a generalization of neural networks, continuously deep neural networks, which are described by controlled ODEs.
    Conformal Off-policy Prediction. (arXiv:2206.06711v2 [stat.ML] UPDATED)
    Off-policy evaluation is critical in a number of applications where new policies need to be evaluated offline before online deployment. Most existing methods focus on the expected return, define the target parameter through averaging and provide a point estimator only. In this paper, we develop a novel procedure to produce reliable interval estimators for a target policy's return starting from any initial state. Our proposal accounts for the variability of the return around its expectation, focuses on the individual effect and offers valid uncertainty quantification. Our main idea lies in designing a pseudo policy that generates subsamples as if they were sampled from the target policy so that existing conformal prediction algorithms are applicable to prediction interval construction. Our methods are justified by theories, synthetic data and real data from short-video platforms.
    A Text-guided Protein Design Framework. (arXiv:2302.04611v1 [cs.LG])
    Current AI-assisted protein design mainly utilizes protein sequential and structural information. Meanwhile, there exists tremendous knowledge curated by humans in the text format describing proteins' high-level properties. Yet, whether the incorporation of such text data can help protein design tasks has not been explored. To bridge this gap, we propose ProteinDT, a multi-modal framework that leverages textual descriptions for protein design. ProteinDT consists of three subsequent steps: ProteinCLAP that aligns the representation of two modalities, a facilitator that generates the protein representation from the text modality, and a decoder that generates the protein sequences from the representation. To train ProteinDT, we construct a large dataset, SwissProtCLAP, with 441K text and protein pairs. We empirically verify the effectiveness of ProteinDT from three aspects: (1) consistently superior performance on four out of six protein property prediction benchmarks; (2) over 90% accuracy for text-guided protein generation; and (3) promising results for zero-shot text-guided protein editing.
    Optimistic Online Mirror Descent for Bridging Stochastic and Adversarial Online Convex Optimization. (arXiv:2302.04552v1 [cs.LG])
    Stochastically Extended Adversarial (SEA) model is introduced by Sachs et al. [2022] as an interpolation between stochastic and adversarial online convex optimization. Under the smoothness condition, they demonstrate that the expected regret of optimistic follow-the-regularized-leader (FTRL) depends on the cumulative stochastic variance $\sigma_{1:T}^2$ and the cumulative adversarial variation $\Sigma_{1:T}^2$ for convex functions. They also provide a slightly weaker bound based on the maximal stochastic variance $\sigma_{\max}^2$ and the maximal adversarial variation $\Sigma_{\max}^2$ for strongly convex functions. Inspired by their work, we investigate the theoretical guarantees of optimistic online mirror descent (OMD) for the SEA model. For convex and smooth functions, we obtain the same $\mathcal{O}(\sqrt{\sigma_{1:T}^2}+\sqrt{\Sigma_{1:T}^2})$ regret bound, without the convexity requirement of individual functions. For strongly convex and smooth functions, we establish an $\mathcal{O}(\min\{\log (\sigma_{1:T}^2+\Sigma_{1:T}^2), (\sigma_{\max}^2 + \Sigma_{\max}^2) \log T\})$ bound, better than their $\mathcal{O}((\sigma_{\max}^2 + \Sigma_{\max}^2) \log T)$ bound. For \mbox{exp-concave} and smooth functions, we achieve a new $\mathcal{O}(d\log(\sigma_{1:T}^2+\Sigma_{1:T}^2))$ bound. Owing to the OMD framework, we can further extend our result to obtain dynamic regret guarantees, which are more favorable in non-stationary online scenarios. The attained results allow us to recover excess risk bounds of the stochastic setting and regret bounds of the adversarial setting, and derive new guarantees for many intermediate scenarios.
    Classification of BCI-EEG based on augmented covariance matrix. (arXiv:2302.04508v1 [cs.HC])
    Objective: Electroencephalography signals are recorded as a multidimensional dataset. We propose a new framework based on the augmented covariance extracted from an autoregressive model to improve motor imagery classification. Methods: From the autoregressive model can be derived the Yule-Walker equations, which show the emergence of a symmetric positive definite matrix: the augmented covariance matrix. The state-of the art for classifying covariance matrices is based on Riemannian Geometry. A fairly natural idea is therefore to extend the standard approach using these augmented covariance matrices. The methodology for creating the augmented covariance matrix shows a natural connection with the delay embedding theorem proposed by Takens for dynamical systems. Such an embedding method is based on the knowledge of two parameters: the delay and the embedding dimension, respectively related to the lag and the order of the autoregressive model. This approach provides new methods to compute the hyper-parameters in addition to standard grid search. Results: The augmented covariance matrix performed noticeably better than any state-of-the-art methods. We will test our approach on several datasets and several subjects using the MOABB framework, using both within-session and cross-session evaluation. Conclusion: The improvement in results is due to the fact that the augmented covariance matrix incorporates not only spatial but also temporal information, incorporating nonlinear components of the signal through an embedding procedure, which allows the leveraging of dynamical systems algorithms. Significance: These results extend the concepts and the results of the Riemannian distance based classification algorithm.
    Convergence of a robust deep FBSDE method for stochastic control. (arXiv:2201.06854v5 [math.OC] UPDATED)
    In this paper, we propose a deep learning based numerical scheme for strongly coupled FBSDEs, stemming from stochastic control. It is a modification of the deep BSDE method in which the initial value to the backward equation is not a free parameter, and with a new loss function being the weighted sum of the cost of the control problem, and a variance term which coincides with the mean squared error in the terminal condition. We show by a numerical example that a direct extension of the classical deep BSDE method to FBSDEs, fails for a simple linear-quadratic control problem, and motivate why the new method works. Under regularity and boundedness assumptions on the exact controls of time continuous and time discrete control problems, we provide an error analysis for our method. We show empirically that the method converges for three different problems, one being the one that failed for a direct extension of the deep BSDE method.
    Private Quantiles Estimation in the Presence of Atoms. (arXiv:2202.08969v2 [stat.ML] UPDATED)
    We consider the differentially private estimation of multiple quantiles (MQ) of a distribution from a dataset, a key building block in modern data analysis. We apply the recent non-smoothed Inverse Sensitivity (IS) mechanism to this specific problem. We establish that the resulting method is closely related to the recently published ad hoc algorithm JointExp. In particular, they share the same computational complexity and a similar efficiency. We prove the statistical consistency of these two algorithms for continuous distributions. Furthermore, we demonstrate both theoretically and empirically that this method suffers from an important lack of performance in the case of peaked distributions, which can degrade up to a potentially catastrophic impact in the presence of atoms. Its smoothed version (i.e. by applying a max kernel to its output density) would solve this problem, but remains an open challenge to implement. As a proxy, we propose a simple and numerically efficient method called Heuristically Smoothed JointExp (HSJointExp), which is endowed with performance guarantees for a broad class of distributions and achieves results that are orders of magnitude better on problematic datasets.  ( 2 min )
    Learning Dynamical Systems by Leveraging Data from Similar Systems. (arXiv:2302.04344v1 [stat.ML])
    We consider the problem of learning the dynamics of a linear system when one has access to data generated by an auxiliary system that shares similar (but not identical) dynamics, in addition to data from the true system. We use a weighted least squares approach, and provide a finite sample error bound of the learned model as a function of the number of samples and various system parameters from the two systems as well as the weight assigned to the auxiliary data. We show that the auxiliary data can help to reduce the intrinsic system identification error due to noise, at the price of adding a portion of error that is due to the differences between the two system models. We further provide a data-dependent bound that is computable when some prior knowledge about the systems is available. This bound can also be used to determine the weight that should be assigned to the auxiliary data during the model training stage.  ( 2 min )
    Equivariant MuZero. (arXiv:2302.04798v1 [cs.LG])
    Deep reinforcement learning repeatedly succeeds in closed, well-defined domains such as games (Chess, Go, StarCraft). The next frontier is real-world scenarios, where setups are numerous and varied. For this, agents need to learn the underlying rules governing the environment, so as to robustly generalise to conditions that differ from those they were trained on. Model-based reinforcement learning algorithms, such as the highly successful MuZero, aim to accomplish this by learning a world model. However, leveraging a world model has not consistently shown greater generalisation capabilities compared to model-free alternatives. In this work, we propose improving the data efficiency and generalisation capabilities of MuZero by explicitly incorporating the symmetries of the environment in its world-model architecture. We prove that, so long as the neural networks used by MuZero are equivariant to a particular symmetry group acting on the environment, the entirety of MuZero's action-selection algorithm will also be equivariant to that group. We evaluate Equivariant MuZero on procedurally-generated MiniPacman and on Chaser from the ProcGen suite: training on a set of mazes, and then testing on unseen rotated versions, demonstrating the benefits of equivariance. Further, we verify that our performance improvements hold even when only some of the components of Equivariant MuZero obey strict equivariance, which highlights the robustness of our construction.  ( 2 min )
    Meta-ticket: Finding optimal subnetworks for few-shot learning within randomly initialized neural networks. (arXiv:2205.15619v2 [cs.LG] UPDATED)
    Few-shot learning for neural networks (NNs) is an important problem that aims to train NNs with a few data. The main challenge is how to avoid overfitting since over-parameterized NNs can easily overfit to such small dataset. Previous work (e.g. MAML by Finn et al. 2017) tackles this challenge by meta-learning, which learns how to learn from a few data by using various tasks. On the other hand, one conventional approach to avoid overfitting is restricting hypothesis spaces by endowing sparse NN structures like convolution layers in computer vision. However, although such manually-designed sparse structures are sample-efficient for sufficiently large datasets, they are still insufficient for few-shot learning. Then the following questions naturally arise: (1) Can we find sparse structures effective for few-shot learning by meta-learning? (2) What benefits will it bring in terms of meta-generalization? In this work, we propose a novel meta-learning approach, called Meta-ticket, to find optimal sparse subnetworks for few-shot learning within randomly initialized NNs. We empirically validated that Meta-ticket successfully discover sparse subnetworks that can learn specialized features for each given task. Due to this task-wise adaptation ability, Meta-ticket achieves superior meta-generalization compared to MAML-based methods especially with large NNs. The code is available at: https://github.com/dchiji-ntt/meta-ticket  ( 2 min )
    Improving Certified Robustness via Statistical Learning with Logical Reasoning. (arXiv:2003.00120v8 [cs.LG] UPDATED)
    Intensive algorithmic efforts have been made to enable the rapid improvements of certificated robustness for complex ML models recently. However, current robustness certification methods are only able to certify under a limited perturbation radius. Given that existing pure data-driven statistical approaches have reached a bottleneck, in this paper, we propose to integrate statistical ML models with knowledge (expressed as logical rules) as a reasoning component using Markov logic networks (MLN, so as to further improve the overall certified robustness. This opens new research questions about certifying the robustness of such a paradigm, especially the reasoning component (e.g., MLN). As the first step towards understanding these questions, we first prove that the computational complexity of certifying the robustness of MLN is #P-hard. Guided by this hardness result, we then derive the first certified robustness bound for MLN by carefully analyzing different model regimes. Finally, we conduct extensive experiments on five datasets including both high-dimensional images and natural language texts, and we show that the certified robustness with knowledge-based logical reasoning indeed significantly outperforms that of the state-of-the-arts.  ( 2 min )
    Lazy OCO: Online Convex Optimization on a Switching Budget. (arXiv:2102.03803v5 [cs.LG] UPDATED)
    We study a variant of online convex optimization where the player is permitted to switch decisions at most $S$ times in expectation throughout $T$ rounds. Similar problems have been addressed in prior work for the discrete decision set setting, and more recently in the continuous setting but only with an adaptive adversary. In this work, we aim to fill the gap and present computationally efficient algorithms in the more prevalent oblivious setting, establishing a regret bound of $O(T/S)$ for general convex losses and $\widetilde O(T/S^2)$ for strongly convex losses. In addition, for stochastic i.i.d.~losses, we present a simple algorithm that performs $\log T$ switches with only a multiplicative $\log T$ factor overhead in its regret in both the general and strongly convex settings. Finally, we complement our algorithms with lower bounds that match our upper bounds in some of the cases we consider.  ( 2 min )
    A Constant-per-Iteration Likelihood Ratio Test for Online Changepoint Detection for Exponential Family Models. (arXiv:2302.04743v1 [stat.CO])
    Online changepoint detection algorithms that are based on likelihood-ratio tests have been shown to have excellent statistical properties. However, a simple online implementation is computationally infeasible as, at time $T$, it involves considering $O(T)$ possible locations for the change. Recently, the FOCuS algorithm has been introduced for detecting changes in mean in Gaussian data that decreases the per-iteration cost to $O(\log T)$. This is possible by using pruning ideas, which reduce the set of changepoint locations that need to be considered at time $T$ to approximately $\log T$. We show that if one wishes to perform the likelihood ratio test for a different one-parameter exponential family model, then exactly the same pruning rule can be used, and again one need only consider approximately $\log T$ locations at iteration $T$. Furthermore, we show how we can adaptively perform the maximisation step of the algorithm so that we need only maximise the test statistic over a small subset of these possible locations. Empirical results show that the resulting online algorithm, which can detect changes under a wide range of models, has a constant-per-iteration cost on average.  ( 2 min )
    On Computable Online Learning. (arXiv:2302.04357v1 [cs.LG])
    We initiate a study of computable online (c-online) learning, which we analyze under varying requirements for "optimality" in terms of the mistake bound. Our main contribution is to give a necessary and sufficient condition for optimal c-online learning and show that the Littlestone dimension no longer characterizes the optimal mistake bound of c-online learning. Furthermore, we introduce anytime optimal (a-optimal) online learning, a more natural conceptualization of "optimality" and a generalization of Littlestone's Standard Optimal Algorithm. We show the existence of a computational separation between a-optimal and optimal online learning, proving that a-optimal online learning is computationally more difficult. Finally, we consider online learning with no requirements for optimality, and show, under a weaker notion of computability, that the finiteness of the Littlestone dimension no longer characterizes whether a class is c-online learnable with finite mistake bound. A potential avenue for strengthening this result is suggested by exploring the relationship between c-online and CPAC learning, where we show that c-online learning is as difficult as improper CPAC learning.  ( 2 min )
    Find a witness or shatter: the landscape of computable PAC learning. (arXiv:2302.04731v1 [cs.CC])
    This paper contributes to the study of CPAC learnability -- a computable version of PAC learning -- by solving three open questions from recent papers. Firstly, we prove that every improperly CPAC learnable class is contained in a class which is properly CPAC learnable with polynomial sample complexity. This confirms a conjecture by Agarwal et al (COLT 2021). Secondly, we show that there exists a decidable class of hypothesis which is properly CPAC learnable, but only with uncomputably fast growing sample complexity. This solves a question from Sterkenburg (COLT 2022). Finally, we construct a decidable class of finite Littlestone dimension which is not improperly CPAC learnable, strengthening a recent result of Sterkenburg (2022) and answering a question posed by Hasrati and Ben-David (ALT 2023). Together with previous work, our results provide a complete landscape for the learnability problem in the CPAC setting.  ( 2 min )
    Importance Sampling Deterministic Annealing for Clustering. (arXiv:2302.04421v1 [stat.ML])
    A current assumption of most clustering methods is that the training data and future data are taken from the same distribution. However, this assumption may not hold in some real-world scenarios. In this paper, we propose an importance sampling based deterministic annealing approach (ISDA) for clustering problems which minimizes the worst case of expected distortions under the constraint of distribution deviation. The distribution deviation constraint can be converted to the constraint over a set of weight distributions centered on the uniform distribution derived from importance sampling. The objective of the proposed approach is to minimize the loss under maximum degradation hence the resulting problem is a constrained minimax optimization problem which can be reformulated to an unconstrained problem using the Lagrange method and be solved by the quasi-newton algorithm. Experiment results on synthetic datasets and a real-world load forecasting problem validate the effectiveness of the proposed ISDA. Furthermore, we show that fuzzy c-means is a special case of ISDA with the logarithmic distortion. This observation sheds a new light on the relationship between fuzzy c-means and deterministic annealing clustering algorithms and provides an interesting physical and information-theoretical interpretation for fuzzy exponent $m$.  ( 2 min )
    Disentangling Learning Representations with Density Estimation. (arXiv:2302.04362v1 [cs.LG])
    Disentangled learning representations have promising utility in many applications, but they currently suffer from serious reliability issues. We present Gaussian Channel Autoencoder (GCAE), a method which achieves reliable disentanglement via flexible density estimation of the latent space. GCAE avoids the curse of dimensionality of density estimation by disentangling subsets of its latent space with the Dual Total Correlation (DTC) metric, thereby representing its high-dimensional latent joint distribution as a collection of many low-dimensional conditional distributions. In our experiments, GCAE achieves highly competitive and reliable disentanglement scores compared with state-of-the-art baselines.  ( 2 min )
    Introduction To Gaussian Process Regression In Bayesian Inverse Problems, With New ResultsOn Experimental Design For Weighted Error Measures. (arXiv:2302.04518v1 [stat.ML])
    Bayesian posterior distributions arising in modern applications, including inverse problems in partial differential equation models in tomography and subsurface flow, are often computationally intractable due to the large computational cost of evaluating the data likelihood. To alleviate this problem, we consider using Gaussian process regression to build a surrogate model for the likelihood, resulting in an approximate posterior distribution that is amenable to computations in practice. This work serves as an introduction to Gaussian process regression, in particular in the context of building surrogate models for inverse problems, and presents new insights into a suitable choice of training points. We show that the error between the true and approximate posterior distribution can be bounded by the error between the true and approximate likelihood, measured in the $L^2$-norm weighted by the true posterior, and that efficiently bounding the error between the true and approximate likelihood in this norm suggests choosing the training points in the Gaussian process surrogate model based on the true posterior.  ( 2 min )
    Generalization in Graph Neural Networks: Improved PAC-Bayesian Bounds on Graph Diffusion. (arXiv:2302.04451v1 [cs.LG])
    Graph neural networks are widely used tools for graph prediction tasks. Motivated by their empirical performance, prior works have developed generalization bounds for graph neural networks, which scale with graph structures in terms of the maximum degree. In this paper, we present generalization bounds that instead scale with the largest singular value of the graph neural network's feature diffusion matrix. These bounds are numerically much smaller than prior bounds for real-world graphs. We also construct a lower bound of the generalization gap that matches our upper bound asymptotically. To achieve these results, we analyze a unified model that includes prior works' settings (i.e., convolutional and message-passing networks) and new settings (i.e., graph isomorphism networks). Our key idea is to measure the stability of graph neural networks against noise perturbations using Hessians. Empirically, we find that Hessian-based measurements correlate with the observed generalization gaps of graph neural networks accurately; Optimizing noise stability properties for fine-tuning pretrained graph neural networks also improves test performance on several graph-level classification tasks.  ( 2 min )
    Outlier-Robust Gromov Wasserstein for Graph Data. (arXiv:2302.04610v1 [cs.LG])
    Gromov Wasserstein (GW) distance is a powerful tool for comparing and aligning probability distributions supported on different metric spaces. It has become the main modeling technique for aligning heterogeneous data for a wide range of graph learning tasks. However, the GW distance is known to be highly sensitive to outliers, which can result in large inaccuracies if the outliers are given the same weight as other samples in the objective function. To mitigate this issue, we introduce a new and robust version of the GW distance called RGW. RGW features optimistically perturbed marginal constraints within a $\varphi$-divergence based ambiguity set. To make the benefits of RGW more accessible in practice, we develop a computationally efficient algorithm, Bregman proximal alternating linearization minimization, with a theoretical convergence guarantee. Through extensive experimentation, we validate our theoretical results and demonstrate the effectiveness of RGW on real-world graph learning tasks, such as subgraph matching and partial shape correspondence.  ( 2 min )
    Robust and Scalable Bayesian Online Changepoint Detection. (arXiv:2302.04759v1 [stat.ML])
    This paper proposes an online, provably robust, and scalable Bayesian approach for changepoint detection. The resulting algorithm has key advantages over previous work: it provides provable robustness by leveraging the generalised Bayesian perspective, and also addresses the scalability issues of previous attempts. Specifically, the proposed generalised Bayesian formalism leads to conjugate posteriors whose parameters are available in closed form by leveraging diffusion score matching. The resulting algorithm is exact, can be updated through simple algebra, and is more than 10 times faster than its closest competitor.  ( 2 min )
    Discovering interpretable Lagrangian of dynamical systems from data. (arXiv:2302.04400v1 [stat.ML])
    A complete understanding of physical systems requires models that are accurate and obeys natural conservation laws. Recent trends in representation learning involve learning Lagrangian from data rather than the direct discovery of governing equations of motion. The generalization of equation discovery techniques has huge potential; however, existing Lagrangian discovery frameworks are black-box in nature. This raises a concern about the reusability of the discovered Lagrangian. In this article, we propose a novel data-driven machine-learning algorithm to automate the discovery of interpretable Lagrangian from data. The Lagrangian are derived in interpretable forms, which also allows the automated discovery of conservation laws and governing equations of motion. The architecture of the proposed framework is designed in such a way that it allows learning the Lagrangian from a subset of the underlying domain and then generalizing for an infinite-dimensional system. The fidelity of the proposed framework is exemplified using examples described by systems of ordinary differential equations and partial differential equations where the Lagrangian and conserved quantities are known.  ( 2 min )
    InfoNCE is a variational autoencoder. (arXiv:2107.02495v2 [stat.ML] UPDATED)
    There are two main approaches to self-supervised learning (SSL), generative SSL, which learns a probabilistic model of the inputs, or contrastive SSL where we design a supervised learning task to encourage good representations. We reconcile these approaches by showing that contrastive SSL methods (including InfoNCE) which maximize the mutual information (MI) implicitly learn a probabilistic model of the inputs (specifically, a variational autoencoder; VAE). In particular, when we learn the optimal prior, the VAE objective (the ELBO) becomes equal to the MI (up to a constant). In turn, for a deterministic encoder the ELBO is equal to the log Bayesian model evidence. This establishes a profound connection between Bayesian inference and information theory. However, practical InfoNCE methods do not use the MI as an objective: the MI is invariant to arbitrary invertible transformations, so using an MI objective can lead to highly entangled representations (Tschannen et al., 2019). Instead, the actual InfoNCE objective is a simplified lower bound on the MI which is loose even in the infinite sample limit. Thus, an objective that works (i.e. the actual InfoNCE objective) appears to be motivated as a loose bound on an objective that does not work (i.e. the true MI which gives arbitrarily entangled representations). We give an alternative motivation for the actual InfoNCE objective. In particular, we show that in the infinite sample limit, and for a particular choice of prior, the actual InfoNCE objective is equal to the log Bayesian model evidence (up to a constant). Thus, we argue that our VAE perspective gives a better motivation for InfoNCE than MI, as the actual InfoNCE objective is only loosely bounded by the MI, but is equal to the log Bayesian model evidence (up to a constant).  ( 2 min )
    Nonlinear Random Matrices and Applications to the Sum of Squares Hierarchy. (arXiv:2302.04462v1 [cs.CC])
    We develop new tools in the theory of nonlinear random matrices and apply them to study the performance of the Sum of Squares (SoS) hierarchy on average-case problems. The SoS hierarchy is a powerful optimization technique that has achieved tremendous success for various problems in combinatorial optimization, robust statistics and machine learning. It's a family of convex relaxations that lets us smoothly trade off running time for approximation guarantees. In recent works, it's been shown to be extremely useful for recovering structure in high dimensional noisy data. It also remains our best approach towards refuting the notorious Unique Games Conjecture. In this work, we analyze the performance of the SoS hierarchy on fundamental problems stemming from statistics, theoretical computer science and statistical physics. In particular, we show subexponential-time SoS lower bounds for the problems of the Sherrington-Kirkpatrick Hamiltonian, Planted Slightly Denser Subgraph, Tensor Principal Components Analysis and Sparse Principal Components Analysis. These SoS lower bounds involve analyzing large random matrices, wherein lie our main contributions. These results offer strong evidence for the truth of and insight into the low-degree likelihood ratio hypothesis, an important conjecture that predicts the power of bounded-time algorithms for hypothesis testing. We also develop general-purpose tools for analyzing the behavior of random matrices which are functions of independent random variables. Towards this, we build on and generalize the matrix variant of the Efron-Stein inequalities. In particular, our general theorem on matrix concentration recovers various results that have appeared in the literature. We expect these random matrix theory ideas to have other significant applications.  ( 2 min )
    Sample Complexity Using Infinite Multiview Models. (arXiv:2302.04292v1 [math.ST])
    Recent works have demonstrated that the convergence rate of a nonparametric density estimator can be greatly improved by using a low-rank estimator when the target density is a convex combination of separable probability densities with Lipschitz continuous marginals, i.e. a multiview model. However, this assumption is very restrictive and it is not clear to what degree these findings can be extended to general pdfs. This work answers this question by introducing a new way of characterizing a pdf's complexity, the non-negative Lipschitz spectrum (NL-spectrum), which, unlike smoothness properties, can be used to characterize virtually any pdf. Finite sample bounds are presented that are dependent on the target density's NL-spectrum. From this dimension-independent rates of convergence are derived that characterize when an NL-spectrum allows for a fast rate of convergence.  ( 2 min )

  • Open

    [D] Best available text to speech free AI model out there for english
    Greetings everyone. I am looking for the best text to speech AI model out there for english I am looking for links to the models you know as best If the model supports subtitle file to speech that would be even more awesome Like providing .srt or .vtt to generate speech - speeding up the necessary parts of speech to fit into durations Thank you very much again I will use this to replace audio of my older lecture recordings by providing a time generated manually corrected subtitle file like srt or vtt I am looking for any male sounding model that sounds natural ​ I have found this They colab and looks very easy to generate. I think I can automate it. But is this one the best? https://www.reddit.com/r/MachineLearning/comments/v9rigf/p_silero_tts_full_v3_release/ found this too but only female voice :/ https://www.reddit.com/r/MachineLearning/comments/ttgsr4/r_nixtts_an_incredibly_lightweight_texttospeech/ I need a male voice any other good ones? ​ submitted by /u/CeFurkan [link] [comments]  ( 43 min )
    [D] Is Efficient-Net same as MobileNetV2
    Quick question, is EfficientNet-V1 same as MobileNet-V2? I think they use the same backbone, the inverted linear residual block, no? submitted by /u/No_Oilve_6577 [link] [comments]  ( 42 min )
    [D] Speed up HuggingFace Inference Pipeline
    Running a pipeline sentiment analysis call with transformers on 16 cpu takes 6-9 seconds for one inference. How can I speed this up? My ideas, for your inputs please: Ray cluster - parallel computing, memory usage is high. Within the pipeline() call, use parameter batch_size. However, is batch_size not appropriate for cpu? HF Accelerate - not sure how to implement on a published model... Model distillation - not sure how to implement on a published model... Thanks in advance! submitted by /u/askingforhelp1111 [link] [comments]  ( 43 min )
    [R] Large Language Models Can Teach Themselves to Use Tools
    submitted by /u/MysteryInc152 [link] [comments]  ( 44 min )
    [D] Is it legal to use images or videos with copyright to train a model?
    Hello, I want to know if it is legal to use scraped video or images to train a predictive model, for example, If I scrape photos of faces in google, and after that, I share that model in order that a lot of people can detect faces in their applications, is that legal? submitted by /u/Tlaloc-Es [link] [comments]  ( 43 min )
    [D] What ML or ML-powered projects are you currently building?
    This would be for ones that aren't finished enough to post as a link on the weekend. Just things that are in progress. Include a screenshot if you can! submitted by /u/TikkunCreation [link] [comments]  ( 42 min )
    [Discussion] We are AI fully developed!
    Humans nowadays see that AI approaches human capabilities. But where is the limit? For sure AI will replace some jobs in the future and become better and more accurate. Later it will teach and train on self-taught data. Then I am asking, what is the difference between a human and AI? WHAT IF WE ARE THE FINAL FORM OF AI? Will AI later think it is a human and repeat human history by developing another AI? submitted by /u/SubZero0xFF [link] [comments]  ( 42 min )
    [D] Locally-runnable text to speech AI?
    I've got a 4090 and some stuff that I think it would be fun to have narrated. I've looked at some of the paid online options and $20-$30/mo for 2 hours of AI TTS is not gonna gut it. Can anyone point me to software that I can run locally that'll give me high quality? It seems like if people are making billions of waifus in stable diffusion there ought to be something like this out there. submitted by /u/gruevy [link] [comments]  ( 44 min )
    [D] Simulator for RL problems
    I have seen people advocate a simulator for RL problems a lot. I am not sure by simulator what do they mean exactly? Is it the exact simulation (then the problem becomes easy) or some kind of feedback loop (start with a naïve simulator and once we get data then keep improving the simulator – this looks similar to value iteration or policy iteration). I assume it’s really difficult to get a simulator for data generation (except for video games etc.). Also, If we already have a simulator, we can easily train a model-free RL (e.g. just planning). submitted by /u/atulcst [link] [comments]  ( 43 min )
    [Discussion] WASM equivalant to Gradio but without needing a server?
    I'm really impressed with gradio for making interactive webapps. I was wondering... Gradio basically runs off a server so you have to standup a server just to demo certain kinds of apps. Is there something similar out that that can handle basic tabular data plots without needing a server? I was thinking perhaps something like a WASM app that can point to csvs on AWS S3 and generate plots on the fly? submitted by /u/RogueStargun [link] [comments]  ( 43 min )
    [P] Did anyone manage to run the MusicLM implementation from lucidrains?
    I really want to play with the repo but I'm stuck at the last step of the instructions (https://github.com/lucidrains/musiclm-pytorch#usage-1). If anyone has tips, please let me know! Here's the issue I have: https://github.com/lucidrains/musiclm-pytorch/issues/13 submitted by /u/BackgroundPass2082 [link] [comments]  ( 42 min )
    [D] numbers of parameters that can affect on a model.
    Hi guys, my question is what is different between parameters and FLOPs in terms of computation times. I know that the FLOPs is related to the computation of input images. For example, higher the size, higher the figure. But, how much parameters can affect on a model compared to the metric? I understand that the weights, biases are parameters. But, the cost of computation about them makes me difficult to determine what should I get a specific model. I can measure the decision based on the FLOPs, which decrease time of training my model when they are lower. However, I also want to decide a specific model with the number of parameters. Thanks. submitted by /u/Mundane_Definition_8 [link] [comments]  ( 44 min )
    [P] I'm using Instruct GPT to show anti-clickbait summaries on youtube videos
    submitted by /u/AlesioRFM [link] [comments]  ( 46 min )
    [D] Experiences finding a job in the US as a Mexican currently working in the UK
    Hi, I am considering moving to the US, and I was wondering about the job market for people in Mexico and the chances of getting an offer. I know that in theory, it should be 'easier' due to the United States–Mexico–Canada Agreement by getting a TN visa. Are there any Mexicans here that found a job in the US as a machine learning engineer/data scientist? Would anyone have a pointer? I'll obviously research companies and send my resumes, just thought of posting here to see what is the experience of other people. submitted by /u/darcia_scientist [link] [comments]  ( 43 min )
    [P] Resume parsing + Cv analysis
    Hi ! so for my final year project I will be working on a cv parser and matching cvs with job postings, I'm thinking about fine tuning LayoutLM on my cvs dataset( of 5000 resumes or so not yet labeled) to get the structure of a resume (contact info , skills , education , etc) and then combine it with NER to identify the details in each section (name , uni name , date of start etc ) . Is it good enough or should I take another approach ? Or how would you tackle the problem ? feel free to share any ideas u have about this project Thank you ! submitted by /u/Melodic_Secretary_42 [link] [comments]  ( 45 min )
    [D] The Kaggle Book : Book by Konrad Banachewicz and Luca Massaron
    Does anyone have "The Kaggle Book : Book by Konrad Banachewicz and Luca Massaron" in pdf ?? Please share the link submitted by /u/narendra7799 [link] [comments]  ( 42 min )
    [D] Critique of statistics research from machine learning perspectives (and vice versa)?
    I was just looking around at some paper published by statisticians, I couldn't help but notice that the flavor of their research is vastly different. For example, one researcher wrote about a dozen paper on LASSO alone over the span of a decade, whereas LASSO is just given a power point slide worth of attention in ML. Why is there such a disparity and a divergence in the aim of these disciplines? Are there some good critique of these research fields from each other's perspective (not just on the technical aspects)? Perhaps by someone who works in both? submitted by /u/fromnighttilldawn [link] [comments]  ( 50 min )
    [D] Should I put my current or past affiliation on my EACL paper?
    Hey guys. I have got a paper accepted to the EACL 2023 conference. When I was working on the paper I did not have any official affiliation. I was working as an independent researcher. I have started my PhD at PSU recently. I was wondering if I should use my current affiliation on the paper. I am the corresponding author for this paper. Also, I am planning to use my PSU address for all research communications from now on instead of my gmail address. So putting my PSU affiliation would make sense in that way. So my question is, is it okay to use my current affiliation? submitted by /u/ibraheemMmoosa [link] [comments]  ( 43 min )
  • Open

    AI Dream 157 - 2K SUBS CELEBRATION! 🥳🎉 MASTERPIECE - PART 3 TEASER - AI ...
    submitted by /u/LordPewPew777 [link] [comments]  ( 40 min )
    Who remembers Clippy? I can't think of a better time for it's return #ClippyGPT
    submitted by /u/hakJav [link] [comments]  ( 41 min )
    Developed an AI tool for Google Docs - what do you think?
    submitted by /u/vfra32 [link] [comments]  ( 41 min )
    Telling ChatGPT (GPT-3) to go by Gipee Tithree ala Star Wars
    submitted by /u/the_ferryman_abides [link] [comments]  ( 40 min )
    Can Bing AI listen to podcasts and summarize them?
    submitted by /u/tlkop123 [link] [comments]  ( 41 min )
    AI Dream 157 - 2K SUBS CELEBRATION! 🥳🎉 MASTERPIECE - PART 3 TEASER - AI ...
    submitted by /u/LordPewPew777 [link] [comments]  ( 40 min )
    Create your own videos from text inputs with this easy-to-use stable-diffusion-based WebApp! Choose between up to 5 different models, a worlds'-first prompt helper, super smooth videos with interpolation and much more.
    submitted by /u/nonicknamefornic [link] [comments]  ( 6 min )
    Merge A Face & Style Together With The LORA Extraction Method
    submitted by /u/PuppetHere [link] [comments]  ( 40 min )
    Read the Latest Issue of Weekly Piece of Future for Insights about Robotics, AI, Biotech, and Space!
    submitted by /u/RushingRobotics_com [link] [comments]  ( 40 min )
    How exactly is an AI "punished"?
    Yes so - I saw quite a few people 'training' their artificial intelligence to do and learn certain things using reward and punishment. When the AI does something bad, it is punished and when it does something right it is rewarded. But how does this work exactly? How is AI punished? Does it feel pain? Is part of its code deleted? How does it feel? One such example is a special bot for Minecraft. This bot can do pretty much anything and has a hive mind with all the other bots. It can perfectly navigate through even the toughest terrain and is able to best the best of the best PVPers (weird way to write it I know XD) This AI was trained in three sections. First in singleplayer where it learned how to survive. The settings were set so that whenever the AI dies in-game it will receive punish…  ( 47 min )
    WordPress To Add AI-Generated Images And Content Writing Feature
    submitted by /u/vadhavaniyafaijan [link] [comments]  ( 40 min )
    2022 in review with Olivia Gamblin - The Machine Ethics Podcast
    submitted by /u/benbyford [link] [comments]  ( 40 min )
    Where can I find some recent and high-quality Artificial Intelligence Conferences?
    "Where can I find some recent and high-quality Artificial Intelligence Conferences to watch online? I've checked the O'Reilly website, but the videos there seem to be dated (6 years old). Any suggestions would be greatly appreciated. submitted by /u/xfocus3 [link] [comments]  ( 41 min )
    How to be a better AI
    submitted by /u/Legal-Ad-1650 [link] [comments]  ( 41 min )
    Header - Open Source AI Training Pool(Idea, just ironing out the flaws)
    THE PITCH - Chatgpt has 175 billion parameters assuming each parameter is a 64 bit float we can calculate that the approximate weight of just the weights are around 1.8 tb and just to be on the safe side 2 tb with all the logic required for the model to work. Arguably a better google, a very primitive programmer and capable of being a very capable general assistant is a very important set of traits that it seems to have perfected. But we also know it seems to have some inherent biases since the data that it was fed was probably biased and that is very dangerous in and of itself. We also don't have the weights ourselves to tinker with the chatbot. While currently modern consumer grade gpu's cannot run chatgpt smoothly, with advancement in gpu technology, specialized gpu geared towards …  ( 43 min )
    Treating AI Like Nuclear Bombs
    submitted by /u/SpawnOfCthun [link] [comments]  ( 40 min )
    How long until we get history movies with deep faked characters?
    Like a biographical movie on Stalin with a deep-faked dictator? submitted by /u/Dyllbill_ [link] [comments]  ( 40 min )
    Pretty cool AI productivity tool
    submitted by /u/arnolds112 [link] [comments]  ( 40 min )
    Is this a possible/likely scenario and is there a name to it?
    I was experimenting with ChatGPT the other day. I found it impressive but still make subtle mistakes especially in more technical fields. It could definitely trick a layperson, but to an expert it's at times just a bunch of words meshed together that don't make sense. (Interestingly, it is also completely useless when describing video game walkthroughs, like how to find a certain item or access a certain area in a game, which should be very straightforward and unambiguous information.) Now let's say AI has a 95% accuracy. Since AI generated content is much easier to produce, there will only be more and more AI generated content on the internet. And since AI is trained from the information on the internet, eventually there will come a point where AI is learning mostly from other AI-generated content. One can imagine that now the AI will only have a 95%*95% accuracy, and so on and so forth. Eventually I imagine the accuracy of the information on the internet should deteriorate to a point where an AI like chatgpt is no longer trustworthy (at least in non-academic fields), as well as those AI-generated websites that you already see a lot on the internet. Possible/likely? Is there an academic name to this possibility? submitted by /u/profmarylowe [link] [comments]  ( 45 min )
    An honest review of Nick Bostrom's book about AI development: Superintelligence
    submitted by /u/DeeMore [link] [comments]  ( 6 min )
    Photoshop tutorial created using AI voiceover (ElevenLabs)
    submitted by /u/howardpinsky [link] [comments]  ( 41 min )
    AI Dream 157 - 2K SUBS CELEBRATION! 🥳🎉 MASTERPIECE - PART 2 TEASER - AI ...
    submitted by /u/LordPewPew777 [link] [comments]  ( 40 min )
    Don't you guys think it's time for a new, relatively contextualized, subreddit for AI conversations?
    Hear me out! I am not disregarding this subreddit in any way, but since the conversation on AI has broadened, my reasoning is that we would need a much broader subreddit that also answers questions about these new models and also for users not particularly interested in the nitty gritty of AI jargon. I am particularly interested in conversations revolving around AI-powered search engines, given that Microsoft just launched Bing, and Google intends to try and catch up with Bard AI in a couple of weeks. I know there are a lot of people who also want to have these conversations and they also need a community, right? I am, in fact, thinking of creating one! submitted by /u/b_wanker [link] [comments]  ( 41 min )
  • Open

    Any Literature Regarding Reinforcement Learning in Production?
    There is a alot of literature available on building ML/DL pipelines from data transformations to monitoring. Is there similar literature available for setting up RL projects in production? What Kinda tech stack would one use for managing such projects (other then the algorithm and machine learning libraries) and why? submitted by /u/ZIGGY-Zz [link] [comments]  ( 41 min )
    Best approach for creating c++ environment for python?
    pybind11 functions that do all logic standalone build, and glue it to python with "subprocess" and communicate via pipes standalone build with HTTP communication ​ If you have used pybind11, how have you dealt with managing "state"? submitted by /u/reinforcement_agent [link] [comments]  ( 42 min )
    Simulator for RL problems
    I have seen people advocate a simulator for RL problems a lot. I am not sure by simulator what do they mean exactly? Is it the exact simulation (then the problem becomes easy) or some kind of feedback loop (start with a naïve simulator and once we get data then keep improving the simulator – this looks similar to value iteration or policy iteration). I assume it’s really difficult to get a simulator for data generation (except for video games etc.). Also, If we already have a simulator, we can easily train a model-free RL (e.g. just planning). submitted by /u/atulcst [link] [comments]  ( 42 min )
    GitHub - riiswa/planning-multi-robot-gym: A Gymnasium environment for simulating multi-robot planning.
    submitted by /u/riiswa [link] [comments]  ( 40 min )
    Imitation learning with partial observablity
    Hello, I'm working on a project where the goal is to train policy in a POMDP with imitation learning. In the POMDP, I need to use a belief module or an lstm to represent the history of observations. I do not understand exactly how I can use the demonstrations for imitation in this situation. The demonstrations are trajectories of observations by some unknown policy which do not have a belief module. My goal is to learn a policy in POMDP with both RL and imitation so that if it faces a state in the distribution of the demonstration, it should act according to what was demonstrated, other wise or should act according to the RL reward. submitted by /u/Ill_Satisfaction_865 [link] [comments]  ( 42 min )
    Communication between Agents in MARL
    I'd like to use either RLLib or StableBaselines3+PantheonRL to create a multi-agent system where the agents can share information about their surroundings with each other. I'm finding this to be very hard to implement. Does anyone have a starting point about implementing communication in MARL systems with tensor flow agents? submitted by /u/tessherelurkingnow [link] [comments]  ( 41 min )
    Is option framework proper using online learning?
    Hi, I am wondering if option framework (one of HRL framework) is proper or not following online learning framework. Intuitively, option framework should exploit offline learning method because it needs to train a different NNs or function approximators (tabular Q) with updating termination function as well. We may expect that this forces us to use replay buffer. Otherwise, the efficiency of using option framework dramatically decreases. I observed this on my experiments as well. Are there any articles mentioning this? Thanks! submitted by /u/Pikachu930 [link] [comments]  ( 41 min )
  • Open

    Google Research, 2022 & beyond: Algorithmic advances
    Posted by Vahab Mirrokni, VP and Google Fellow, Google Research (This is Part 5 in our series of posts covering different topical areas of research at Google. You can find other posts in the series here.) Robust algorithm design is the backbone of systems across Google, particularly for our ML and AI models. Hence, developing algorithms with improved efficiency, performance and speed remains a high priority as it empowers services ranging from Search and Ads to Maps and YouTube. Google Research has been at the forefront of this effort, developing many innovations from privacy-safe recommendation systems to scalable solutions for large-scale ML. In 2022, we continued this journey, and advanced the state-of-the-art in several related areas. Here we highlight our progress in a …  ( 95 min )
  • Open

    Helping companies deploy AI models more responsibly
    MIT spinout Verta offers tools to help companies introduce, monitor, and manage machine-learning models safely and at scale.  ( 10 min )
  • Open

    Identifying defense coverage schemes in NFL’s Next Gen Stats
    This post is co-written with Jonathan Jung, Mike Band, Michael Chi, and Thompson Bliss at the National Football League. A coverage scheme refers to the rules and responsibilities of each football defender tasked with stopping an offensive pass. It is at the core of understanding and analyzing any football defensive strategy. Classifying the coverage scheme […]  ( 14 min )
  • Open

    Metaverse Development: Building the Future of Virtual Reality
    The metaverse, a term popularised by science fiction, refers to a shared virtual space where users can interact with each other in a virtual environment. It’s a convergence of real and virtual worlds, creating a new reality that exists simultaneously with the physical world. With the rapid advancement of technology, particularly in the field of… Read More »Metaverse Development: Building the Future of Virtual Reality The post Metaverse Development: Building the Future of Virtual Reality appeared first on Data Science Central.  ( 20 min )
    5 Common Mistakes Blockchain Professionals Should Avoid
    The hype around blockchain technology is something people are passionate about. Many blockchain enthusiasts openly claim the efficiency of blockchain functionalities like enhancing transparency in the governance process, developing new cryptocurrencies, and better supply chain management. However, the proof of concept is applied in various industries establishing blockchain dominance.   Those organizations not undertaking the development… Read More »5 Common Mistakes Blockchain Professionals Should Avoid  The post 5 Common Mistakes Blockchain Professionals Should Avoid  appeared first on Data Science Central.  ( 21 min )
  • Open

    DiversiTree: A New Method to Efficiently Compute Diverse Sets of Near-Optimal Solutions to Mixed-Integer Optimization Problems. (arXiv:2204.03822v3 [cs.DM] UPDATED)
    While most methods for solving mixed-integer optimization problems compute a single optimal solution, a diverse set of near-optimal solutions can often lead to improved outcomes. We present a new method for finding a set of diverse solutions by emphasizing diversity within the search for near-optimal solutions. Specifically, within a branch-and-bound framework, we investigated parameterized node selection rules that explicitly consider diversity. Our results indicate that our approach significantly increases the diversity of the final solution set. When compared with two existing methods, our method runs with similar runtime as regular node selection methods and gives a diversity improvement between 12% and 190%. In contrast, popular node selection rules, such as best-first search, in some instances performed worse than state-of-the-art methods by more than 35% and gave an improvement of no more than 130%. Further, we find that our method is most effective when diversity in node selection is continuously emphasized after reaching a minimal depth in the tree and when the solution set has grown sufficiently large. Our method can be easily incorporated into integer programming solvers and has the potential to significantly increase the diversity of solution sets.  ( 2 min )
    Robust representations of oil wells' intervals via sparse attention mechanism. (arXiv:2212.14246v2 [cs.LG] UPDATED)
    Determining the characteristics of newly drilled wells (e.g. reservoir formation properties) is a major challenge. One of the corresponding tasks is a well-interval similarity assessment: if we can learn to predict which oilfields are rich and which are not by comparing them with existing ones, this will lead to significant cost reductions. There are three main requirements for applying machine learning to oil&gas data: high quality even for unreliable data, low manual effort and interpretability of the model itself. Neural networks can be used to address these challenges. The use of a self-supervised paradigm leads to automatic model construction. However, existing approaches lack interpretability, and their quality prevents their use in applications. In particular, existing approaches like LSTM suffer from short-term memory, paying more attention to the end of a sequence. Instead, neural networks with Transformer architecture cast their attention over all sequences to make a decision. To make them more efficient in terms of computational time and more robust to noisy or absent values, we introduce a limited attention mechanism similar to that of the Informer architecture that considers only top correspondences. We run experiments on an open dataset with more than $20$ wells, making our experiments reliable and suitable for industrial use. The best results were obtained with our adaptation of the Informer variant of Transformer with ROC AUC $0.982$. It outperforms classical approaches with ROC AUC $0.824$, recurrent neural networks (RNNs) with ROC AUC $0.934$ and the direct use of Transformer with ROC AUC $0.961$. We show that well-interval representations obtained by Informer are of higher quality than those extracted by RNNs. Moreover, the obtained attention is interpretable, as it corresponds to the importance of a particular part of an interval for the similarity estimation.  ( 3 min )
    Automated Identification of Toxic Code Reviews Using ToxiCR. (arXiv:2202.13056v3 [cs.SE] UPDATED)
    Toxic conversations during software development interactions may have serious repercussions on a Free and Open Source Software (FOSS) development project. For example, victims of toxic conversations may become afraid to express themselves, therefore get demotivated, and may eventually leave the project. Automated filtering of toxic conversations may help a FOSS community to maintain healthy interactions among its members. However, off-the-shelf toxicity detectors perform poorly on Software Engineering (SE) datasets, such as one curated from code review comments. To encounter this challenge, we present ToxiCR, a supervised learning-based toxicity identification tool for code review interactions. ToxiCR includes a choice to select one of the ten supervised learning algorithms, an option to select text vectorization techniques, eight preprocessing steps, and a large-scale labeled dataset of 19,571 code review comments. Two out of those eight preprocessing steps are SE domain specific. With our rigorous evaluation of the models with various combinations of preprocessing steps and vectorization techniques, we have identified the best combination for our dataset that boosts 95.8% accuracy and 88.9% F1 score. ToxiCR significantly outperforms existing toxicity detectors on our dataset. We have released our dataset, pre-trained models, evaluation results, and source code publicly available at: https://github.com/WSU-SEAL/ToxiCR  ( 2 min )
    Shortcut Detection with Variational Autoencoders. (arXiv:2302.04246v1 [cs.LG])
    For real-world applications of machine learning (ML), it is essential that models make predictions based on well-generalizing features rather than spurious correlations in the data. The identification of such spurious correlations, also known as shortcuts, is a challenging problem and has so far been scarcely addressed. In this work, we present a novel approach to detect shortcuts in image and audio datasets by leveraging variational autoencoders (VAEs). The disentanglement of features in the latent space of VAEs allows us to discover correlations in datasets and semi-automatically evaluate them for ML shortcuts. We demonstrate the applicability of our method on several real-world datasets and identify shortcuts that have not been discovered before. Based on these findings, we also investigate the construction of shortcut adversarial examples.  ( 2 min )
    Revisiting the Linear-Programming Framework for Offline RL with General Function Approximation. (arXiv:2212.13861v2 [cs.LG] UPDATED)
    Offline reinforcement learning (RL) aims to find an optimal policy for sequential decision-making using a pre-collected dataset, without further interaction with the environment. Recent theoretical progress has focused on developing sample-efficient offline RL algorithms with various relaxed assumptions on data coverage and function approximators, especially to handle the case with excessively large state-action spaces. Among them, the framework based on the linear-programming (LP) reformulation of Markov decision processes has shown promise: it enables sample-efficient offline RL with function approximation, under only partial data coverage and realizability assumptions on the function classes, with favorable computational tractability. In this work, we revisit the LP framework for offline RL, and provide a new reformulation that advances the existing results in several aspects, relaxing certain assumptions and achieving optimal statistical rates in terms of sample size. Our key enabler is to introduce proper constraints in the reformulation, instead of using any regularization as in the literature, also with careful choices of the function classes and initial state distributions. We hope our insights bring into light the use of LP formulations and the induced primal-dual minimax optimization, in offline RL.  ( 2 min )
    Alternately Optimized Graph Neural Networks. (arXiv:2206.03638v3 [cs.LG] UPDATED)
    Graph Neural Networks (GNNs) have greatly advanced the semi-supervised node classification task on graphs. The majority of existing GNNs are trained in an end-to-end manner that can be viewed as tackling a bi-level optimization problem. This process is often inefficient in computation and memory usage. In this work, we propose a new optimization framework for semi-supervised learning on graphs. The proposed framework can be conveniently solved by the alternating optimization algorithms, resulting in significantly improved efficiency. Extensive experiments demonstrate that the proposed method can achieve comparable or better performance with state-of-the-art baselines while it has significantly better computation and memory efficiency.  ( 2 min )
    Making Progress Based on False Discoveries. (arXiv:2204.08809v2 [cs.LG] UPDATED)
    The study of adaptive data analysis examines how many statistical queries can be answered accurately using a fixed dataset while avoiding false discoveries (statistically inaccurate answers). In this paper, we tackle a question that precedes the field of study: Is data only valuable when it provides accurate answers to statistical queries? To answer this question, we use Stochastic Convex Optimization as a case study. In this model, algorithms are considered as analysts who query an estimate of the gradient of a noisy function at each iteration and move towards its minimizer. It is known that $O(1/\epsilon^2)$ examples can be used to minimize the objective function, but none of the existing methods depend on the accuracy of the estimated gradients along the trajectory. Therefore, we ask: How many samples are needed to minimize a noisy convex function if we require $\epsilon$-accurate estimates of $O(1/\epsilon^2)$ gradients? Or, might it be that inaccurate gradient estimates are \emph{necessary} for finding the minimum of a stochastic convex function at an optimal statistical rate? We provide two partial answers to this question. First, we show that a general analyst (queries that may be maliciously chosen) requires $\Omega(1/\epsilon^3)$ samples, ruling out the possibility of a foolproof mechanism. Second, we show that, under certain assumptions on the oracle, $\tilde \Omega(1/\epsilon^{2.5})$ samples are necessary for gradient descent to interact with the oracle. Our results are in contrast to classical bounds that show that $O(1/\epsilon^2)$ samples can optimize the population risk to an accuracy of $O(\epsilon)$, but with spurious gradients.  ( 2 min )
    Guaranteed Conformance of Neurosymbolic Models to Natural Constraints. (arXiv:2212.01346v4 [cs.LG] UPDATED)
    Deep neural networks have emerged as the workhorse for a large section of robotics and control applications, especially as models for dynamical systems. Such data-driven models are in turn used for designing and verifying autonomous systems. This is particularly useful in modeling medical systems where data can be leveraged to individualize treatment. In safety-critical applications, it is important that the data-driven model is conformant to established knowledge from the natural sciences. Such knowledge is often available or can often be distilled into a (possibly black-box) model $M$. For instance, the unicycle model (which encodes Newton's laws) for an F1 racing car. In this light, we consider the following problem - given a model $M$ and state transition dataset, we wish to best approximate the system model while being bounded distance away from $M$. We propose a method to guarantee this conformance. Our first step is to distill the dataset into few representative samples called memories, using the idea of a growing neural gas. Next, using these memories we partition the state space into disjoint subsets and compute bounds that should be respected by the neural network, when the input is drawn from a particular subset. This serves as a symbolic wrapper for guaranteed conformance. We argue theoretically that this only leads to bounded increase in approximation error; which can be controlled by increasing the number of memories. We experimentally show that on three case studies (Car Model, Drones, and Artificial Pancreas), our constrained neurosymbolic models conform to specified $M$ models (each encoding various constraints) with order-of-magnitude improvements compared to the augmented Lagrangian and vanilla training methods. Our code can be found at https://github.com/kaustubhsridhar/Constrained_Models  ( 3 min )
    User-Aware Algorithmic Recourse with Preference Elicitation. (arXiv:2205.13743v2 [cs.LG] UPDATED)
    Counterfactual interventions are a powerful tool to explain the decisions of a black-box decision process and to enable algorithmic recourse. They are a sequence of actions that, if performed by a user, can overturn an unfavourable decision made by an automated decision system. However, most of the current methods provide interventions without considering the user's preferences. In this work, we propose a shift of paradigm by providing a novel formalization which considers the user as an active part of the process rather than a mere target. Following the preference elicitation setting, we introduce the first human-in-the-loop approach to perform algorithmic recourse. We also present a polynomial procedure to ask questions which maximize the Expected Utility of Selection (EUS), a measure of the utility of the choice set that accounts for the uncertainty with respect to both the model and the user response. We use it to iteratively refine our cost estimates in a Bayesian fashion. We integrate this preference elicitation strategy into a reinforcement learning agent coupled with Monte Carlo Tree Search for the efficient exploration, so as to provide personalized interventions achieving algorithmic recourse. An experimental evaluation of synthetic and real-world datasets shows that a handful of queries allows for achieving a substantial reduction in the cost of interventions with respect to user-independent alternatives.  ( 2 min )
    Look where you look! Saliency-guided Q-networks for generalization in visual Reinforcement Learning. (arXiv:2209.09203v3 [cs.LG] UPDATED)
    Deep reinforcement learning policies, despite their outstanding efficiency in simulated visual control tasks, have shown disappointing ability to generalize across disturbances in the input training images. Changes in image statistics or distracting background elements are pitfalls that prevent generalization and real-world applicability of such control policies. We elaborate on the intuition that a good visual policy should be able to identify which pixels are important for its decision, and preserve this identification of important sources of information across images. This implies that training of a policy with small generalization gap should focus on such important pixels and ignore the others. This leads to the introduction of saliency-guided Q-networks (SGQN), a generic method for visual reinforcement learning, that is compatible with any value function learning method. SGQN vastly improves the generalization capability of Soft Actor-Critic agents and outperforms existing stateof-the-art methods on the Deepmind Control Generalization benchmark, setting a new reference in terms of training efficiency, generalization gap, and policy interpretability.  ( 2 min )
    Latent Neural ODEs with Sparse Bayesian Multiple Shooting. (arXiv:2210.03466v2 [cs.LG] UPDATED)
    Training dynamic models, such as neural ODEs, on long trajectories is a hard problem that requires using various tricks, such as trajectory splitting, to make model training work in practice. These methods are often heuristics with poor theoretical justifications, and require iterative manual tuning. We propose a principled multiple shooting technique for neural ODEs that splits the trajectories into manageable short segments, which are optimised in parallel, while ensuring probabilistic control on continuity over consecutive segments. We derive variational inference for our shooting-based latent neural ODE models and propose amortized encodings of irregularly sampled trajectories with a transformer-based recognition network with temporal attention and relative positional encoding. We demonstrate efficient and stable training, and state-of-the-art performance on multiple large-scale benchmark datasets.  ( 2 min )
    Versatile Skill Control via Self-supervised Adversarial Imitation of Unlabeled Mixed Motions. (arXiv:2209.07899v2 [cs.RO] UPDATED)
    Learning diverse skills is one of the main challenges in robotics. To this end, imitation learning approaches have achieved impressive results. These methods require explicitly labeled datasets or assume consistent skill execution to enable learning and active control of individual behaviors, which limits their applicability. In this work, we propose a cooperative adversarial method for obtaining single versatile policies with controllable skill sets from unlabeled datasets containing diverse state transition patterns by maximizing their discriminability. Moreover, we show that by utilizing unsupervised skill discovery in the generative adversarial imitation learning framework, novel and useful skills emerge with successful task fulfillment. Finally, the obtained versatile policies are tested on an agile quadruped robot called Solo 8 and present faithful replications of diverse skills encoded in the demonstrations.  ( 2 min )
    Identify ambiguous tasks combining crowdsourced labels by weighting Areas Under the Margin. (arXiv:2209.15380v2 [cs.LG] UPDATED)
    In supervised learning - for instance in image classification - modern massive datasets are commonly labeled by a crowd of workers. The obtained labels in this crowdsourcing setting are then aggregated for training. The aggregation step generally leverages a per-worker trust score. Yet, such worker-centric approaches discard each task's ambiguity. Some intrinsically ambiguous tasks might even fool expert workers, which could eventually be harmful to the learning step. In a standard supervised learning setting - with one label per task - the Area Under the Margin (AUM) is tailored to identify mislabeled data. We adapt the AUM to identify ambiguous tasks in crowdsourced learning scenarios, introducing the Weighted AUM (WAUM). The WAUM is an average of AUMs weighted by task-dependent scores. We show that the WAUM can help discard ambiguous tasks from the training set, leading to better generalization or calibration performance. We report improvements over existing strategies for learning a crowd, both for simulated settings and for the CIFAR-10H, LabelMe and Music crowdsourced datasets.  ( 2 min )
    Reconstruction-guided attention improves the robustness and shape processing of neural networks. (arXiv:2209.13620v2 [cs.CV] UPDATED)
    Many visual phenomena suggest that humans use top-down generative or reconstructive processes to create visual percepts (e.g., imagery, object completion, pareidolia), but little is known about the role reconstruction plays in robust object recognition. We built an iterative encoder-decoder network that generates an object reconstruction and used it as top-down attentional feedback to route the most relevant spatial and feature information to feed-forward object recognition processes. We tested this model using the challenging out-of-distribution digit recognition dataset, MNIST-C, where 15 different types of transformation and corruption are applied to handwritten digit images. Our model showed strong generalization performance against various image perturbations, on average outperforming all other models including feedforward CNNs and adversarially trained networks. Our model is particularly robust to blur, noise, and occlusion corruptions, where shape perception plays an important role. Ablation studies further reveal two complementary roles of spatial and feature-based attention in robust object recognition, with the former largely consistent with spatial masking benefits in the attention literature (the reconstruction serves as a mask) and the latter mainly contributing to the model's inference speed (i.e., number of time steps to reach a certain confidence threshold) by reducing the space of possible object hypotheses. We also observed that the model sometimes hallucinates a non-existing pattern out of noise, leading to highly interpretable human-like errors. Our study shows that modeling reconstruction-based feedback endows AI systems with a powerful attention mechanism, which can help us understand the role of generating perception in human visual processing.  ( 2 min )
    Scalars are universal: Equivariant machine learning, structured like classical physics. (arXiv:2106.06610v4 [cs.LG] UPDATED)
    There has been enormous progress in the last few years in designing neural networks that respect the fundamental symmetries and coordinate freedoms of physical law. Some of these frameworks make use of irreducible representations, some make use of high-order tensor objects, and some apply symmetry-enforcing constraints. Different physical laws obey different combinations of fundamental symmetries, but a large fraction (possibly all) of classical physics is equivariant to translation, rotation, reflection (parity), boost (relativity), and permutations. Here we show that it is simple to parameterize universally approximating polynomial functions that are equivariant under these symmetries, or under the Euclidean, Lorentz, and Poincar\'e groups, at any dimensionality $d$. The key observation is that nonlinear O($d$)-equivariant (and related-group-equivariant) functions can be universally expressed in terms of a lightweight collection of scalars -- scalar products and scalar contractions of the scalar, vector, and tensor inputs. We complement our theory with numerical examples that show that the scalar-based method is simple, efficient, and scalable.  ( 2 min )
    What do we learn? Debunking the Myth of Unsupervised Outlier Detection. (arXiv:2206.03698v2 [cs.CV] UPDATED)
    Even though auto-encoders (AEs) have the desirable property of learning compact representations without labels and have been widely applied to out-of-distribution (OoD) detection, they are generally still poorly understood and are used incorrectly in detecting outliers where the normal and abnormal distributions are strongly overlapping. In general, the learned manifold is assumed to contain key information that is only important for describing samples within the training distribution, and that the reconstruction of outliers leads to high residual errors. However, recent work suggests that AEs are likely to be even better at reconstructing some types of OoD samples. In this work, we challenge this assumption and investigate what auto-encoders actually learn when they are posed to solve two different tasks. First, we propose two metrics based on the Fr\'echet inception distance (FID) and confidence scores of a trained classifier to assess whether AEs can learn the training distribution and reliably recognize samples from other domains. Second, we investigate whether AEs are able to synthesize normal images from samples with abnormal regions, on a more challenging lung pathology detection task. We have found that state-of-the-art (SOTA) AEs are either unable to constrain the latent manifold and allow reconstruction of abnormal patterns, or they are failing to accurately restore the inputs from their latent distribution, resulting in blurred or misaligned reconstructions. We propose novel deformable auto-encoders (MorphAEus) to learn perceptually aware global image priors and locally adapt their morphometry based on estimated dense deformation fields. We demonstrate superior performance over unsupervised methods in detecting OoD and pathology.  ( 2 min )
    Equivariance-aware Architectural Optimization of Neural Networks. (arXiv:2210.05484v3 [cs.LG] UPDATED)
    Incorporating equivariance to symmetry groups as a constraint during neural network training can improve performance and generalization for tasks exhibiting those symmetries, but such symmetries are often not perfectly nor explicitly present. This motivates algorithmically optimizing the architectural constraints imposed by equivariance. We propose the equivariance relaxation morphism, which preserves functionality while reparameterizing a group equivariant layer to operate with equivariance constraints on a subgroup, as well as the [G]-mixed equivariant layer, which mixes layers constrained to different groups to enable within-layer equivariance optimization. We further present evolutionary and differentiable neural architecture search (NAS) algorithms that utilize these mechanisms respectively for equivariance-aware architectural optimization. Experiments across a variety of datasets show the benefit of dynamically constrained equivariance to find effective architectures with approximate equivariance.  ( 2 min )
    Streaming Encoding Algorithms for Scalable Hyperdimensional Computing. (arXiv:2209.09868v4 [cs.LG] UPDATED)
    Hyperdimensional computing (HDC) is a paradigm for data representation and learning originating in computational neuroscience. HDC represents data as high-dimensional, low-precision vectors which can be used for a variety of information processing tasks like learning or recall. The mapping to high-dimensional space is a fundamental problem in HDC, and existing methods encounter scalability issues when the input data itself is high-dimensional. In this work, we explore a family of streaming encoding techniques based on hashing. We show formally that these methods enjoy comparable guarantees on performance for learning applications while being substantially more efficient than existing alternatives. We validate these results experimentally on a popular high-dimensional classification problem and show that our approach easily scales to very large data sets.
    A Graph Neural Network Approach to Automated Model Building in Cryo-EM Maps. (arXiv:2210.00006v3 [q-bio.QM] UPDATED)
    Electron cryo-microscopy (cryo-EM) produces three-dimensional (3D) maps of the electrostatic potential of biological macromolecules, including proteins. Along with knowledge about the imaged molecules, cryo-EM maps allow de novo atomic modelling, which is typically done through a laborious manual process. Taking inspiration from recent advances in machine learning applications to protein structure prediction, we propose a graph neural network (GNN) approach for automated model building of proteins in cryo-EM maps. The GNN acts on a graph with nodes assigned to individual amino acids and edges representing the protein chain. Combining information from the voxel-based cryo-EM data, the amino acid sequence data and prior knowledge about protein geometries, the GNN refines the geometry of the protein chain and classifies the amino acids for each of its nodes. Application to 28 test cases shows that our approach outperforms the state-of-the-art and approximates manual building for cryo-EM maps with resolutions better than 3.5 \r{A}.
    Flow Matching for Generative Modeling. (arXiv:2210.02747v2 [cs.LG] UPDATED)
    We introduce a new paradigm for generative modeling built on Continuous Normalizing Flows (CNFs), allowing us to train CNFs at unprecedented scale. Specifically, we present the notion of Flow Matching (FM), a simulation-free approach for training CNFs based on regressing vector fields of fixed conditional probability paths. Flow Matching is compatible with a general family of Gaussian probability paths for transforming between noise and data samples -- which subsumes existing diffusion paths as specific instances. Interestingly, we find that employing FM with diffusion paths results in a more robust and stable alternative for training diffusion models. Furthermore, Flow Matching opens the door to training CNFs with other, non-diffusion probability paths. An instance of particular interest is using Optimal Transport (OT) displacement interpolation to define the conditional probability paths. These paths are more efficient than diffusion paths, provide faster training and sampling, and result in better generalization. Training CNFs using Flow Matching on ImageNet leads to consistently better performance than alternative diffusion-based methods in terms of both likelihood and sample quality, and allows fast and reliable sample generation using off-the-shelf numerical ODE solvers.
    Quantifying Context Mixing in Transformers. (arXiv:2301.12971v2 [cs.CL] UPDATED)
    Self-attention weights and their transformed variants have been the main source of information for analyzing token-to-token interactions in Transformer-based models. But despite their ease of interpretation, these weights are not faithful to the models' decisions as they are only one part of an encoder, and other components in the encoder layer can have considerable impact on information mixing in the output representations. In this work, by expanding the scope of analysis to the whole encoder block, we propose Value Zeroing, a novel context mixing score customized for Transformers that provides us with a deeper understanding of how information is mixed at each encoder layer. We demonstrate the superiority of our context mixing score over other analysis methods through a series of complementary evaluations with different viewpoints based on linguistically informed rationales, probing, and faithfulness analysis.
    Physics-Constrained Climate Downscaling. (arXiv:2208.05424v2 [physics.ao-ph] UPDATED)
    The availability of reliable, high-resolution climate and weather data is important to inform long-term decisions on climate adaptation and mitigation and to guide rapid responses to extreme events. Forecasting models are limited by computational costs and, therefore, often generate coarse-resolution predictions. Statistical downscaling, including super-resolution methods from deep learning, can provide an efficient method of upsampling low-resolution data. However, despite achieving visually compelling results in some cases, such models frequently violate conservation laws when predicting physical variables. In order to conserve physical quantities, we develop methods that guarantee physical constraints are satisfied by a deep learning downscaling model while also improving their performance according to traditional metrics. We compare different constraining approaches and demonstrate their applicability across different neural architectures as well as a variety of climate and weather datasets. Besides enabling faster and more accurate climate predictions, we also show that our novel methodologies can improve super-resolution for satellite data and standard datasets.
    Semi-Supervised Offline Reinforcement Learning with Action-Free Trajectories. (arXiv:2210.06518v2 [cs.LG] UPDATED)
    Natural agents can effectively learn from multiple data sources that differ in size, quality, and types of measurements. We study this heterogeneity in the context of offline reinforcement learning (RL) by introducing a new, practically motivated semi-supervised setting. Here, an agent has access to two sets of trajectories: labelled trajectories containing state, action, reward triplets at every timestep, along with unlabelled trajectories that contain only state and reward information. For this setting, we develop and study a simple meta-algorithmic pipeline that learns an inverse dynamics model on the labelled data to obtain proxy-labels for the unlabelled data, followed by the use of any offline RL algorithm on the true and proxy-labelled trajectories. Empirically, we find this simple pipeline to be highly successful - on several D4RL benchmarks, certain offline RL algorithms can match the performance of variants trained on a fully labelled dataset even when we label only 10% trajectories from the low return regime. To strengthen our understanding, we perform a large-scale controlled empirical study investigating the interplay of data-centric properties of the labelled and unlabelled datasets, with algorithmic design choices (e.g., choice of inverse dynamics, offline RL algorithm) to identify general trends and best practices for training RL agents on semi-supervised offline datasets.
    Rank-1 Matrix Completion with Gradient Descent and Small Random Initialization. (arXiv:2212.09396v2 [stat.ML] UPDATED)
    The nonconvex formulation of matrix completion problem has received significant attention in recent years due to its affordable complexity compared to the convex formulation. Gradient descent (GD) is the simplest yet efficient baseline algorithm for solving nonconvex optimization problems. The success of GD has been witnessed in many different problems in both theory and practice when it is combined with random initialization. However, previous works on matrix completion require either careful initialization or regularizers to prove the convergence of GD. In this work, we study the rank-1 symmetric matrix completion and prove that GD converges to the ground truth when small random initialization is used. We show that in logarithmic amount of iterations, the trajectory enters the region where local convergence occurs. We provide an upper bound on the initialization size that is sufficient to guarantee the convergence and show that a larger initialization can be used as more samples are available. We observe that implicit regularization effect of GD plays a critical role in the analysis, and for the entire trajectory, it prevents each entry from becoming much larger than the others.
    Convolutional Learning on Multigraphs. (arXiv:2209.11354v2 [cs.LG] UPDATED)
    Graph convolutional learning has led to many exciting discoveries in diverse areas. However, in some applications, traditional graphs are insufficient to capture the structure and intricacies of the data. In such scenarios, multigraphs arise naturally as discrete structures in which complex dynamics can be embedded. In this paper, we develop convolutional information processing on multigraphs and introduce convolutional multigraph neural networks (MGNNs). To capture the complex dynamics of information diffusion within and across each of the multigraph's classes of edges, we formalize a convolutional signal processing model, defining the notions of signals, filtering, and frequency representations on multigraphs. Leveraging this model, we develop a multigraph learning architecture, including a sampling procedure to reduce computational complexity. The introduced architecture is applied towards optimal wireless resource allocation and a hate speech localization task, offering improved performance over traditional graph neural networks.
    Relative Probability on Finite Outcome Spaces: A Systematic Examination of its Axiomatization, Properties, and Applications. (arXiv:2212.14555v2 [stat.ML] UPDATED)
    This work proposes a view of probability as a relative measure rather than an absolute one. To demonstrate this concept, we focus on finite outcome spaces and develop three fundamental axioms that establish requirements for relative probability functions. We then provide a library of examples of these functions and a system for composing them. Additionally, we discuss a relative version of Bayesian inference and its digital implementation. Finally, we prove the topological closure of the relative probability space, highlighting its ability to preserve information under limits.
    WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents. (arXiv:2207.01206v4 [cs.CL] UPDATED)
    Existing benchmarks for grounding language in interactive environments either lack real-world linguistic elements, or prove difficult to scale up due to substantial human involvement in the collection of data or feedback signals. To bridge this gap, we develop WebShop -- a simulated e-commerce website environment with $1.18$ million real-world products and $12,087$ crowd-sourced text instructions. Given a text instruction specifying a product requirement, an agent needs to navigate multiple types of webpages and issue diverse actions to find, customize, and purchase an item. WebShop provides several challenges for language grounding including understanding compositional instructions, query (re-)formulation, comprehending and acting on noisy text in webpages, and performing strategic exploration. We collect over $1,600$ human demonstrations for the task, and train and evaluate a diverse range of agents using reinforcement learning, imitation learning, and pre-trained image and language models. Our best model achieves a task success rate of $29\%$, which outperforms rule-based heuristics ($9.6\%$) but is far lower than human expert performance ($59\%$). We also analyze agent and human trajectories and ablate various model components to provide insights for developing future agents with stronger language understanding and decision making abilities. Finally, we show that agents trained on WebShop exhibit non-trivial sim-to-real transfer when evaluated on amazon.com and ebay.com, indicating the potential value of WebShop in developing practical web-based agents that can operate in the wild.
    On Grounded Planning for Embodied Tasks with Language Models. (arXiv:2209.00465v2 [cs.AI] UPDATED)
    Language models (LMs) have demonstrated their capability in possessing commonsense knowledge of the physical world, a crucial aspect of performing tasks in everyday life. However, it remains unclear whether they have the capacity to generate grounded, executable plans for embodied tasks. This is a challenging task as LMs lack the ability to perceive the environment through vision and feedback from the physical environment. In this paper, we address this important research question and present the first investigation into the topic. Our novel problem formulation, named G-PlanET, inputs a high-level goal and a data table about objects in a specific environment, and then outputs a step-by-step actionable plan for a robotic agent to follow. To facilitate the study, we establish an evaluation protocol and design a dedicated metric, KAS, to assess the quality of the plans. Our experiments demonstrate that the use of tables for encoding the environment and an iterative decoding strategy can significantly enhance the LMs' ability in grounded planning. Our analysis also reveals interesting and non-trivial findings.
    LLEDA -- Lifelong Self-Supervised Domain Adaptation. (arXiv:2211.09027v2 [cs.LG] UPDATED)
    Humans and animals have the ability to continuously learn new information over their lifetime without losing previously acquired knowledge. However, artificial neural networks struggle with this due to new information conflicting with old knowledge, resulting in catastrophic forgetting. The complementary learning systems (CLS) theory suggests that the interplay between hippocampus and neocortex systems enables long-term and efficient learning in the mammalian brain, with memory replay facilitating the interaction between these two systems to reduce forgetting. The proposed Lifelong Self-Supervised Domain Adaptation (LLEDA) framework draws inspiration from the CLS theory and mimics the interaction between two networks: a DA network inspired by the hippocampus that quickly adjusts to changes in data distribution and an SSL network inspired by the neocortex that gradually learns domain-agnostic general representations. LLEDA's latent replay technique facilitates communication between these two networks by reactivating and replaying the past memory latent representations to stabilise long-term generalisation and retention without interfering with the previously learned information. Extensive experiments demonstrate that the proposed method outperforms several other methods resulting in a long-term adaptation while being less prone to catastrophic forgetting when transferred to new domains.
    Transformer Neural Processes: Uncertainty-Aware Meta Learning Via Sequence Modeling. (arXiv:2207.04179v2 [cs.LG] UPDATED)
    Neural Processes (NPs) are a popular class of approaches for meta-learning. Similar to Gaussian Processes (GPs), NPs define distributions over functions and can estimate uncertainty in their predictions. However, unlike GPs, NPs and their variants suffer from underfitting and often have intractable likelihoods, which limit their applications in sequential decision making. We propose Transformer Neural Processes (TNPs), a new member of the NP family that casts uncertainty-aware meta learning as a sequence modeling problem. We learn TNPs via an autoregressive likelihood-based objective and instantiate it with a novel transformer-based architecture. The model architecture respects the inductive biases inherent to the problem structure, such as invariance to the observed data points and equivariance to the unobserved points. We further investigate knobs within the TNP framework that tradeoff expressivity of the decoding distribution with extra computation. Empirically, we show that TNPs achieve state-of-the-art performance on various benchmark problems, outperforming all previous NP variants on meta regression, image completion, contextual multi-armed bandits, and Bayesian optimization.
    Adversarial Self-Attention for Language Understanding. (arXiv:2206.12608v3 [cs.CL] UPDATED)
    Deep neural models (e.g. Transformer) naturally learn spurious features, which create a ``shortcut'' between the labels and inputs, thus impairing the generalization and robustness. This paper advances the self-attention mechanism to its robust variant for Transformer-based pre-trained language models (e.g. BERT). We propose \textit{Adversarial Self-Attention} mechanism (ASA), which adversarially biases the attentions to effectively suppress the model reliance on features (e.g. specific keywords) and encourage its exploration of broader semantics. We conduct a comprehensive evaluation across a wide range of tasks for both pre-training and fine-tuning stages. For pre-training, ASA unfolds remarkable performance gains compared to naive training for longer steps. For fine-tuning, ASA-empowered models outweigh naive models by a large margin considering both generalization and robustness.
    Bandwidth Selection for Gaussian Kernel Ridge Regression via Jacobian Control. (arXiv:2205.11956v2 [stat.ML] UPDATED)
    Most machine learning methods require tuning of hyper-parameters. For kernel ridge regression (KRR) with the Gaussian kernel, the hyper-parameter is the bandwidth. The bandwidth specifies the length-scale of the kernel and has to be carefully selected in order to obtain a model with good generalization. The default method for bandwidth selection is cross-validation which often yields good results, albeit at high computational costs. Furthermore, the estimates provided by cross-validation tend to have very high variance, especially when training data are scarce. Inspired by Jacobian regularization, we formulate an approximate expression for how the derivatives of the functions inferred by KRR with the Gaussian kernel depend on the kernel bandwidth. We then use this expression to propose a closed-form, computationally feather-light, bandwidth selection heuristic based on controlling the Jacobian. In addition, the Jacobian expression illuminates how the bandwidth selection is a trade-off between the smoothness of the inferred function, and the conditioning of the training data kernel matrix. We show on real and synthetic data that compared to cross-validation, our method is considerably more stable in terms of bandwidth selection, and, for small data sets, provides better predictions.
    Modified Policy Iteration for Exponential Cost Risk Sensitive MDPs. (arXiv:2302.03811v1 [cs.LG])
    Modified policy iteration (MPI) also known as optimistic policy iteration is at the core of many reinforcement learning algorithms. It works by combining elements of policy iteration and value iteration. The convergence of MPI has been well studied in the case of discounted and average-cost MDPs. In this work, we consider the exponential cost risk-sensitive MDP formulation, which is known to provide some robustness to model parameters. Although policy iteration and value iteration have been well studied in the context of risk sensitive MDPs, modified policy iteration is relatively unexplored. We provide the first proof that MPI also converges for the risk-sensitive problem in the case of finite state and action spaces. Since the exponential cost formulation deals with the multiplicative Bellman equation, our main contribution is a convergence proof which is quite different than existing results for discounted and risk-neutral average-cost problems. The proof of approximate modified policy iteration for risk sensitive MDPs is also provided in the appendix.  ( 2 min )
    Provably Efficient Offline Goal-Conditioned Reinforcement Learning with General Function Approximation and Single-Policy Concentrability. (arXiv:2302.03770v1 [cs.LG])
    Goal-conditioned reinforcement learning (GCRL) refers to learning general-purpose skills which aim to reach diverse goals. In particular, offline GCRL only requires purely pre-collected datasets to perform training tasks without additional interactions with the environment. Although offline GCRL has become increasingly prevalent and many previous works have demonstrated its empirical success, the theoretical understanding of efficient offline GCRL algorithms is not well established, especially when the state space is huge and the offline dataset only covers the policy we aim to learn. In this paper, we propose a novel provably efficient algorithm (the sample complexity is $\tilde{O}({\rm poly}(1/\epsilon))$ where $\epsilon$ is the desired suboptimality of the learned policy) with general function approximation. Our algorithm only requires nearly minimal assumptions of the dataset (single-policy concentrability) and the function class (realizability). Moreover, our algorithm consists of two uninterleaved optimization steps, which we refer to as $V$-learning and policy learning, and is computationally stable since it does not involve minimax optimization. To the best of our knowledge, this is the first algorithm with general function approximation and single-policy concentrability that is both statistically efficient and computationally stable.  ( 2 min )
    Self-Supervised Unseen Object Instance Segmentation via Long-Term Robot Interaction. (arXiv:2302.03793v1 [cs.RO])
    We introduce a novel robotic system for improving unseen object instance segmentation in the real world by leveraging long-term robot interaction with objects. Previous approaches either grasp or push an object and then obtain the segmentation mask of the grasped or pushed object after one action. Instead, our system defers the decision on segmenting objects after a sequence of robot pushing actions. By applying multi-object tracking and video object segmentation on the images collected via robot pushing, our system can generate segmentation masks of all the objects in these images in a self-supervised way. These include images where objects are very close to each other, and segmentation errors usually occur on these images for existing object segmentation networks. We demonstrate the usefulness of our system by fine-tuning segmentation networks trained on synthetic data with real-world data collected by our system. We show that, after fine-tuning, the segmentation accuracy of the networks is significantly improved both in the same domain and across different domains. In addition, we verify that the fine-tuned networks improve top-down robotic grasping of unseen objects in the real world.  ( 2 min )
    Graph Signal Sampling for Inductive One-Bit Matrix Completion: a Closed-form Solution. (arXiv:2302.03933v1 [cs.LG])
    Inductive one-bit matrix completion is motivated by modern applications such as recommender systems, where new users would appear at test stage with the ratings consisting of only ones and no zeros. We propose a unified graph signal sampling framework which enjoys the benefits of graph signal analysis and processing. The key idea is to transform each user's ratings on the items to a function (signal) on the vertices of an item-item graph, then learn structural graph properties to recover the function from its values on certain vertices -- the problem of graph signal sampling. We propose a class of regularization functionals that takes into account discrete random label noise in the graph vertex domain, then develop the GS-IMC approach which biases the reconstruction towards functions that vary little between adjacent vertices for noise reduction. Theoretical result shows that accurate reconstructions can be achieved under mild conditions. For the online setting, we develop a Bayesian extension, i.e., BGS-IMC which considers continuous random Gaussian noise in the graph Fourier domain and builds upon a prediction-correction update algorithm to obtain the unbiased and minimum-variance reconstruction. Both GS-IMC and BGS-IMC have closed-form solutions and thus are highly scalable in large data. Experiments show that our methods achieve state-of-the-art performance on public benchmarks.  ( 2 min )
    Non-Stationary Bandits with Knapsack Problems with Advice. (arXiv:2302.04182v1 [cs.LG])
    We consider a non-stationary Bandits with Knapsack problem. The outcome distribution at each time is scaled by a non-stationary quantity that signifies changing demand volumes. Instead of studying settings with limited non-stationarity, we investigate how online predictions on the total demand volume $Q$ allows us to improve our performance guarantees. We show that, without any prediction, any online algorithm incurs a linear-in-$T$ regret. In contrast, with online predictions on $Q$, we propose an online algorithm that judiciously incorporates the predictions, and achieve regret bounds that depends on the accuracy of the predictions. These bounds are shown to be tight in settings when prediction accuracy improves across time. Our theoretical results are corroborated by our numerical findings.
    Deep Learning Based Walking Tasks Classification in Older Adults using fNIRS. (arXiv:2102.03987v3 [cs.LG] UPDATED)
    Decline in gait features is common in older adults and an indicator of increased risk of disability, morbidity, and mortality. Under dual task walking (DTW) conditions, further degradation in the performance of both the gait and the secondary cognitive task were found in older adults which were significantly correlated to falls history. Cortical control of gait, specifically in the pre-frontal cortex (PFC) as measured by functional near infrared spectroscopy (fNIRS), during DTW in older adults has recently been studied. However, the automatic classification of differences in cognitive activations under single and dual task gait conditions has not been extensively studied yet. In this paper, we formulate this as a classification task and leverage deep learning to perform automatic classification of STW, DTW and single cognitive task (STA). We conduct analysis on the data samples which reveals the characteristics on the difference between HbO2 and Hb values that are subsequently used as additional features. We perform feature engineering to formulate the fNIRS features as a 3-channel image and apply various image processing techniques for data augmentation to enhance the performance of deep learning models. Experimental results show that pre-trained deep learning models that are fine-tuned using the collected fNIRS dataset together with gender and cognitive status information can achieve around 81\% classification accuracy which is about 10\% higher than the traditional machine learning algorithms. We further perform an ablation study to identify rankings of features such as the fNIRS levels and/or voxel locations on the contribution of the classification task.
    Decision trees compensate for model misspecification. (arXiv:2302.04081v1 [stat.ML])
    The best-performing models in ML are not interpretable. If we can explain why they outperform, we may be able to replicate these mechanisms and obtain both interpretability and performance. One example are decision trees and their descendent gradient boosting machines (GBMs). These perform well in the presence of complex interactions, with tree depth governing the order of interactions. However, interactions cannot fully account for the depth of trees found in practice. We confirm 5 alternative hypotheses about the role of tree depth in performance in the absence of true interactions, and present results from experiments on a battery of datasets. Part of the success of tree models is due to their robustness to various forms of mis-specification. We present two methods for robust generalized linear models (GLMs) addressing the composite and mixed response scenarios.
    Algorithmic Collective Action in Machine Learning. (arXiv:2302.04262v1 [cs.LG])
    We initiate a principled study of algorithmic collective action on digital platforms that deploy machine learning algorithms. We propose a simple theoretical model of a collective interacting with a firm's learning algorithm. The collective pools the data of participating individuals and executes an algorithmic strategy by instructing participants how to modify their own data to achieve a collective goal. We investigate the consequences of this model in three fundamental learning-theoretic settings: the case of a nonparametric optimal learning algorithm, a parametric risk minimizer, and gradient-based optimization. In each setting, we come up with coordinated algorithmic strategies and characterize natural success criteria as a function of the collective's size. Complementing our theory, we conduct systematic experiments on a skill classification task involving tens of thousands of resumes from a gig platform for freelancers. Through more than two thousand model training runs of a BERT-like language model, we see a striking correspondence emerge between our empirical observations and the predictions made by our theory. Taken together, our theory and experiments broadly support the conclusion that algorithmic collectives of exceedingly small fractional size can exert significant control over a platform's learning algorithm.
    Transformers Can Do Bayesian Inference. (arXiv:2112.10510v6 [cs.LG] UPDATED)
    Currently, it is hard to reap the benefits of deep learning for Bayesian methods, which allow the explicit specification of prior knowledge and accurately capture model uncertainty. We present Prior-Data Fitted Networks (PFNs). PFNs leverage large-scale machine learning techniques to approximate a large set of posteriors. The only requirement for PFNs to work is the ability to sample from a prior distribution over supervised learning tasks (or functions). Our method restates the objective of posterior approximation as a supervised classification problem with a set-valued input: it repeatedly draws a task (or function) from the prior, draws a set of data points and their labels from it, masks one of the labels and learns to make probabilistic predictions for it based on the set-valued input of the rest of the data points. Presented with a set of samples from a new supervised learning task as input, PFNs make probabilistic predictions for arbitrary other data points in a single forward propagation, having learned to approximate Bayesian inference. We demonstrate that PFNs can near-perfectly mimic Gaussian processes and also enable efficient Bayesian inference for intractable problems, with over 200-fold speedups in multiple setups compared to current methods. We obtain strong results in very diverse areas such as Gaussian process regression, Bayesian neural networks, classification for small tabular data sets, and few-shot image classification, demonstrating the generality of PFNs. Code and trained PFNs are released at https://github.com/automl/TransformersCanDoBayesianInference.
    WF-UNet: Weather Fusion UNet for Precipitation Nowcasting. (arXiv:2302.04102v1 [cs.LG])
    Designing early warning systems for harsh weather and its effects, such as urban flooding or landslides, requires accurate short-term forecasts (nowcasts) of precipitation. Nowcasting is a significant task with several environmental applications, such as agricultural management or increasing flight safety. In this study, we investigate the use of a UNet core-model and its extension for precipitation nowcasting in western Europe for up to 3 hours ahead. In particular, we propose the Weather Fusion UNet (WF-UNet) model, which utilizes the Core 3D-UNet model and integrates precipitation and wind speed variables as input in the learning process and analyze its influences on the precipitation target task. We have collected six years of precipitation and wind radar images from Jan 2016 to Dec 2021 of 14 European countries, with 1-hour temporal resolution and 31 square km spatial resolution based on the ERA5 dataset, provided by Copernicus, the European Union's Earth observation programme. We compare the proposed WF-UNet model to persistence model as well as other UNet based architectures that are trained only using precipitation radar input data. The obtained results show that WF-UNet outperforms the other examined best-performing architectures by 22%, 8% and 6% lower MSE at a horizon of 1, 2 and 3 hours respectively.
    Improved Langevin Monte Carlo for stochastic optimization via landscape modification. (arXiv:2302.03973v1 [math.PR])
    Given a target function $H$ to minimize or a target Gibbs distribution $\pi_{\beta}^0 \propto e^{-\beta H}$ to sample from in the low temperature, in this paper we propose and analyze Langevin Monte Carlo (LMC) algorithms that run on an alternative landscape as specified by $H^f_{\beta,c,1}$ and target a modified Gibbs distribution $\pi^f_{\beta,c,1} \propto e^{-\beta H^f_{\beta,c,1}}$, where the landscape of $H^f_{\beta,c,1}$ is a transformed version of that of $H$ which depends on the parameters $f,\beta$ and $c$. While the original Log-Sobolev constant affiliated with $\pi^0_{\beta}$ exhibits exponential dependence on both $\beta$ and the energy barrier $M$ in the low temperature regime, with appropriate tuning of these parameters and subject to assumptions on $H$, we prove that the energy barrier of the transformed landscape is reduced which consequently leads to polynomial dependence on both $\beta$ and $M$ in the modified Log-Sobolev constant associated with $\pi^f_{\beta,c,1}$. This yield improved total variation mixing time bounds and improved convergence toward a global minimum of $H$. We stress that the technique developed in this paper is not only limited to LMC and is broadly applicable to other gradient-based optimization or sampling algorithms.
    The Modern Mathematics of Deep Learning. (arXiv:2105.04026v2 [cs.LG] UPDATED)
    We describe the new field of mathematical analysis of deep learning. This field emerged around a list of research questions that were not answered within the classical framework of learning theory. These questions concern: the outstanding generalization power of overparametrized neural networks, the role of depth in deep architectures, the apparent absence of the curse of dimensionality, the surprisingly successful optimization performance despite the non-convexity of the problem, understanding what features are learned, why deep architectures perform exceptionally well in physical problems, and which fine aspects of an architecture affect the behavior of a learning task in which way. We present an overview of modern approaches that yield partial answers to these questions. For selected approaches, we describe the main ideas in more detail.
    Machine Learning for Synthetic Data Generation: a Review. (arXiv:2302.04062v1 [cs.LG])
    Data plays a crucial role in machine learning. However, in real-world applications, there are several problems with data, e.g., data are of low quality; a limited number of data points lead to under-fitting of the machine learning model; it is hard to access the data due to privacy, safety and regulatory concerns. \textit{Synthetic data generation} offers a promising new avenue, as it can be shared and used in ways that real-world data cannot. This paper systematically reviews the existing works that leverage machine learning models for synthetic data generation. Specifically, we discuss the synthetic data generation works from several perspectives: (i) applications, including computer vision, speech, natural language, healthcare, and business; (ii) machine learning methods, particularly neural network architectures and deep generative models; (iii) privacy and fairness issue. In addition, we identify the challenges and opportunities in this emerging field and suggest future research directions.
    On the Complexity of Computing G\"odel Numbers. (arXiv:2302.04213v1 [math.LO])
    Given a computable sequence of natural numbers, it is a natural task to find a G\"odel number of a program that generates this sequence. It is easy to see that this problem is neither continuous nor computable. In algorithmic learning theory this problem is well studied from several perspectives and one question studied there is for which sequences this problem is at least learnable in the limit. Here we study the problem on all computable sequences and we classify the Weihrauch complexity of it. For this purpose we can, among other methods, utilize the amalgamation technique known from learning theory. As a benchmark for the classification we use closed and compact choice problems and their jumps on natural numbers, and we argue that these problems correspond to induction and boundedness principles, as they are known from the Kirby-Paris hierarchy in reverse mathematics. We provide a topological as well as a computability-theoretic classification, which reveal some significant differences.
    A Vector Quantized Approach for Text to Speech Synthesis on Real-World Spontaneous Speech. (arXiv:2302.04215v1 [eess.AS])
    Recent Text-to-Speech (TTS) systems trained on reading or acted corpora have achieved near human-level naturalness. The diversity of human speech, however, often goes beyond the coverage of these corpora. We believe the ability to handle such diversity is crucial for AI systems to achieve human-level communication. Our work explores the use of more abundant real-world data for building speech synthesizers. We train TTS systems using real-world speech from YouTube and podcasts. We observe the mismatch between training and inference alignments in mel-spectrogram based autoregressive models, leading to unintelligible synthesis, and demonstrate that learned discrete codes within multiple code groups effectively resolves this issue. We introduce our MQTTS system whose architecture is designed for multiple code generation and monotonic alignment, along with the use of a clean silence prompt to improve synthesis quality. We conduct ablation analyses to identify the efficacy of our methods. We show that MQTTS outperforms existing TTS systems in several objective and subjective measures.
    PreCNet: Next-Frame Video Prediction Based on Predictive Coding. (arXiv:2004.14878v3 [cs.CV] UPDATED)
    Predictive coding, currently a highly influential theory in neuroscience, has not been widely adopted in machine learning yet. In this work, we transform the seminal model of Rao and Ballard (1999) into a modern deep learning framework while remaining maximally faithful to the original schema. The resulting network we propose (PreCNet) is tested on a widely used next frame video prediction benchmark, which consists of images from an urban environment recorded from a car-mounted camera, and achieves state-of-the-art performance. Performance on all measures (MSE, PSNR, SSIM) was further improved when a larger training set (2M images from BDD100k), pointing to the limitations of the KITTI training set. This work demonstrates that an architecture carefully based in a neuroscience model, without being explicitly tailored to the task at hand, can exhibit exceptional performance.
    Combining Variational Autoencoders and Physical Bias for Improved Microscopy Data Analysis. (arXiv:2302.04216v1 [cs.LG])
    Electron and scanning probe microscopy produce vast amounts of data in the form of images or hyperspectral data, such as EELS or 4D STEM, that contain information on a wide range of structural, physical, and chemical properties of materials. To extract valuable insights from these data, it is crucial to identify physically separate regions in the data, such as phases, ferroic variants, and boundaries between them. In order to derive an easily interpretable feature analysis, combining with well-defined boundaries in a principled and unsupervised manner, here we present a physics augmented machine learning method which combines the capability of Variational Autoencoders to disentangle factors of variability within the data and the physics driven loss function that seeks to minimize the total length of the discontinuities in images corresponding to latent representations. Our method is applied to various materials, including NiO-LSMO, BiFeO3, and graphene. The results demonstrate the effectiveness of our approach in extracting meaningful information from large volumes of imaging data. The fully notebook containing implementation of the code and analysis workflow is available at https://github.com/arpanbiswas52/PaperNotebooks
    The Test of Tests: A Framework For Differentially Private Hypothesis Testing. (arXiv:2302.04260v1 [stat.ME])
    We present a generic framework for creating differentially private versions of any hypothesis test in a black-box way. We analyze the resulting tests analytically and experimentally. Most crucially, we show good practical performance for small data sets, showing that at epsilon = 1 we only need 5-6 times as much data as in the fully public setting. We compare our work to the one existing framework of this type, as well as to several individually-designed private hypothesis tests. Our framework is higher power than other generic solutions and at least competitive with (and often better than) individually-designed tests.
    Graph Representation Learning via Aggregation Enhancement. (arXiv:2201.12843v4 [cs.LG] UPDATED)
    Graph neural networks (GNNs) have become a powerful tool for processing graph-structured data but still face challenges in effectively aggregating and propagating information between layers, which limits their performance. We tackle this problem with the kernel regression (KR) approach, using KR loss as the primary loss in self-supervised settings or as a regularization term in supervised settings. We show substantial performance improvements compared to state-of-the-art in both scenarios on multiple transductive and inductive node classification datasets, especially for deep networks. As opposed to mutual information (MI), KR loss is convex and easy to estimate in high-dimensional cases, even though it indirectly maximizes the MI between its inputs. Our work highlights the potential of KR to advance the field of graph representation learning and enhance the performance of GNNs. The code to reproduce our experiments is available at https://github.com/Anonymous1252022/KR_for_GNNs
    A Multimodal Sensing Ring for Quantification of Scratch Intensity. (arXiv:2302.03813v1 [cs.LG])
    An objective measurement of the debilitating symptom, chronic itch, is necessary for improvements in patient care for numerous medical conditions. While wearable devices have shown promise for scratch detection, they are currently unable to estimate scratch intensity, preventing a comprehensive understanding of the effect of itch on an individual. In this work, we present a framework for the estimation of scratch intensity in addition to scratch detection consisting of a multimodal wearable ring device and machine learning algorithms for regression of scratch intensity on a 0-600 mW mechanical power scale that can be mapped to a 0-10 continuous scale. We evaluate the performance of our algorithms on 20 individuals using Leave One Subject Out (LOSO) Cross Validation (CV) and using data from 14 additional participants, we show that our algorithms achieve clinically-relevant discrimination of scratching intensity levels. This work demonstrates that a finger-worn device can provide multidimensional, objective, real-time measures for the action of scratching.  ( 2 min )
    MMA-RNN: A Multi-level Multi-task Attention-based Recurrent Neural Network for Discrimination and Localization of Atrial Fibrillation. (arXiv:2302.03731v1 [cs.LG])
    The automatic detection of atrial fibrillation based on electrocardiograph (ECG) signals has received wide attention both clinically and practically. It is challenging to process ECG signals with cyclical pattern, varying length and unstable quality due to noise and distortion. Besides, there has been insufficient research on separating persistent atrial fibrillation from paroxysmal atrial fibrillation, and little discussion on locating the onsets and end points of AF episodes. It is even more arduous to perform well on these two distinct but interrelated tasks, while avoiding the mistakes inherent from stage-by-stage approaches. This paper proposes the Multi-level Multi-task Attention-based Recurrent Neural Network for three-class discrimination on patients and localization of the exact timing of AF episodes. Our model captures three-level sequential features based on a hierarchical architecture utilizing Bidirectional Long and Short-Term Memory Network (Bi-LSTM) and attention layers, and accomplishes the two tasks simultaneously with a multi-head classifier. The model is designed as an end-to-end framework to enhance information interaction and reduce error accumulation. Finally, we conduct experiments on CPSC 2021 dataset and the result demonstrates the superior performance of our method, indicating the potential application of MMA-RNN to wearable mobile devices for routine AF monitoring and early diagnosis.
    Fundamental Performance Limits for Sensor-Based Robot Control and Policy Learning. (arXiv:2202.00129v3 [cs.RO] UPDATED)
    Our goal is to develop theory and algorithms for establishing fundamental limits on performance for a given task imposed by a robot's sensors. In order to achieve this, we define a quantity that captures the amount of task-relevant information provided by a sensor. Using a novel version of the generalized Fano inequality from information theory, we demonstrate that this quantity provides an upper bound on the highest achievable expected reward for one-step decision making tasks. We then extend this bound to multi-step problems via a dynamic programming approach. We present algorithms for numerically computing the resulting bounds, and demonstrate our approach on three examples: (i) the lava problem from the literature on partially observable Markov decision processes, (ii) an example with continuous state and observation spaces corresponding to a robot catching a freely-falling object, and (iii) obstacle avoidance using a depth sensor with non-Gaussian noise. We demonstrate the ability of our approach to establish strong limits on achievable performance for these problems by comparing our upper bounds with achievable lower bounds (computed by synthesizing or learning concrete control policies).
    Understanding Why ViT Trains Badly on Small Datasets: An Intuitive Perspective. (arXiv:2302.03751v1 [cs.CV])
    Vision transformer (ViT) is an attention neural network architecture that is shown to be effective for computer vision tasks. However, compared to ResNet-18 with a similar number of parameters, ViT has a significantly lower evaluation accuracy when trained on small datasets. To facilitate studies in related fields, we provide a visual intuition to help understand why it is the case. We first compare the performance of the two models and confirm that ViT has less accuracy than ResNet-18 when trained on small datasets. We then interpret the results by showing attention map visualization for ViT and feature map visualization for ResNet-18. The difference is further analyzed through a representation similarity perspective. We conclude that the representation of ViT trained on small datasets is hugely different from ViT trained on large datasets, which may be the reason why the performance drops a lot on small datasets.  ( 2 min )
    Evaluation of Interpretability for Deep Learning algorithms in EEG Emotion Recognition: A case study in Autism. (arXiv:2111.13208v5 [eess.SP] UPDATED)
    Current models on Explainable Artificial Intelligence (XAI) have shown an evident and quantified lack of reliability for measuring feature-relevance when statistically entangled features are proposed for training deep classifiers. There has been an increase in the application of Deep Learning in clinical trials to predict early diagnosis of neuro-developmental disorders, such as Autism Spectrum Disorder (ASD). However, the inclusion of more reliable saliency-maps to obtain more trustworthy and interpretable metrics using neural activity features is still insufficiently mature for practical applications in diagnostics or clinical trials. Moreover, in ASD research the inclusion of deep classifiers that use neural measures to predict viewed facial emotions is relatively unexplored. Therefore, in this study we propose the evaluation of a Convolutional Neural Network (CNN) for electroencephalography (EEG)-based facial emotion recognition decoding complemented with a novel RemOve-And-Retrain (ROAR) methodology to recover highly relevant features used in the classifier. Specifically, we compare well-known relevance maps such as Layer-Wise Relevance Propagation (LRP), PatternNet, Pattern-Attribution, and Smooth-Grad Squared. This study is the first to consolidate a more transparent feature-relevance calculation for a successful EEG-based facial emotion recognition using a within-subject-trained CNN in typically-developed and ASD individuals.
    DropMessage: Unifying Random Dropping for Graph Neural Networks. (arXiv:2204.10037v3 [cs.LG] UPDATED)
    Graph Neural Networks (GNNs) are powerful tools for graph representation learning. Despite their rapid development, GNNs also face some challenges, such as over-fitting, over-smoothing, and non-robustness. Previous works indicate that these problems can be alleviated by random dropping methods, which integrate augmented data into models by randomly masking parts of the input. However, some open problems of random dropping on GNNs remain to be solved. First, it is challenging to find a universal method that are suitable for all cases considering the divergence of different datasets and models. Second, augmented data introduced to GNNs causes the incomplete coverage of parameters and unstable training process. Third, there is no theoretical analysis on the effectiveness of random dropping methods on GNNs. In this paper, we propose a novel random dropping method called DropMessage, which performs dropping operations directly on the propagated messages during the message-passing process. More importantly, we find that DropMessage provides a unified framework for most existing random dropping methods, based on which we give theoretical analysis of their effectiveness. Furthermore, we elaborate the superiority of DropMessage: it stabilizes the training process by reducing sample variance; it keeps information diversity from the perspective of information theory, enabling it become a theoretical upper bound of other methods. To evaluate our proposed method, we conduct experiments that aims for multiple tasks on five public datasets and two industrial datasets with various backbone models. The experimental results show that DropMessage has the advantages of both effectiveness and generalization, and can significantly alleviate the problems mentioned above.
    Learning How to Infer Partial MDPs for In-Context Adaptation and Exploration. (arXiv:2302.04250v1 [cs.LG])
    To generalize across tasks, an agent should acquire knowledge from past tasks that facilitate adaptation and exploration in future tasks. We focus on the problem of in-context adaptation and exploration, where an agent only relies on context, i.e., history of states, actions and/or rewards, rather than gradient-based updates. Posterior sampling (extension of Thompson sampling) is a promising approach, but it requires Bayesian inference and dynamic programming, which often involve unknowns (e.g., a prior) and costly computations. To address these difficulties, we use a transformer to learn an inference process from training tasks and consider a hypothesis space of partial models, represented as small Markov decision processes that are cheap for dynamic programming. In our version of the Symbolic Alchemy benchmark, our method's adaptation speed and exploration-exploitation balance approach those of an exact posterior sampling oracle. We also show that even though partial models exclude relevant information from the environment, they can nevertheless lead to good policies.
    Adversarial Prompting for Black Box Foundation Models. (arXiv:2302.04237v1 [cs.LG])
    Prompting interfaces allow users to quickly adjust the output of generative models in both vision and language. However, small changes and design choices in the prompt can lead to significant differences in the output. In this work, we develop a black-box framework for generating adversarial prompts for unstructured image and text generation. These prompts, which can be standalone or prepended to benign prompts, induce specific behaviors into the generative process, such as generating images of a particular object or biasing the frequency of specific letters in the generated text.
    Efficient Adversarial Contrastive Learning via Robustness-Aware Coreset Selection. (arXiv:2302.03857v1 [cs.LG])
    Adversarial contrastive learning (ACL) does not require expensive data annotations but outputs a robust representation that withstands adversarial attacks and also generalizes to a wide range of downstream tasks. However, ACL needs tremendous running time to generate the adversarial variants of all training data, which limits its scalability to large datasets. To speed up ACL, this paper proposes a robustness-aware coreset selection (RCS) method. RCS does not require label information and searches for an informative subset that minimizes a representational divergence, which is the distance of the representation between natural data and their virtual adversarial variants. The vanilla solution of RCS via traversing all possible subsets is computationally prohibitive. Therefore, we theoretically transform RCS into a surrogate problem of submodular maximization, of which the greedy search is an efficient solution with an optimality guarantee for the original problem. Empirically, our comprehensive results corroborate that RCS can speed up ACL by a large margin without significantly hurting the robustness and standard transferability. Notably, to the best of our knowledge, we are the first to conduct ACL efficiently on the large-scale ImageNet-1K dataset to obtain an effective robust representation via RCS.  ( 2 min )
    Connections and Equivalences between the Nystr\"om Method and Sparse Variational Gaussian Processes. (arXiv:2106.01121v2 [stat.ML] UPDATED)
    We investigate the connections between sparse approximation methods for making kernel methods and Gaussian processes (GPs) scalable to large-scale data, focusing on the Nystr\"om method and the Sparse Variational Gaussian Processes (SVGP). While sparse approximation methods for GPs and kernel methods share some algebraic similarities, the literature lacks a deep understanding of how and why they are related. This may pose an obstacle to the communications between the GP and kernel communities, making it difficult to transfer results from one side to the other. Our motivation is to remove this obstacle, by clarifying the connections between the sparse approximations for GPs and kernel methods. In this work, we study the two popular approaches, the Nystr\"om and SVGP approximations, in the context of a regression problem, and establish various connections and equivalences between them. In particular, we provide an RKHS interpretation of the SVGP approximation, and show that the Evidence Lower Bound of the SVGP contains the objective function of the Nystr\"om approximation, revealing the origin of the algebraic equivalence between the two approaches. We also study recently established convergence results for the SVGP and how they are related to the approximation quality of the Nystr\"om method.
    PASTA: Pessimistic Assortment Optimization. (arXiv:2302.03821v1 [cs.LG])
    We consider a class of assortment optimization problems in an offline data-driven setting. A firm does not know the underlying customer choice model but has access to an offline dataset consisting of the historically offered assortment set, customer choice, and revenue. The objective is to use the offline dataset to find an optimal assortment. Due to the combinatorial nature of assortment optimization, the problem of insufficient data coverage is likely to occur in the offline dataset. Therefore, designing a provably efficient offline learning algorithm becomes a significant challenge. To this end, we propose an algorithm referred to as Pessimistic ASsortment opTimizAtion (PASTA for short) designed based on the principle of pessimism, that can correctly identify the optimal assortment by only requiring the offline data to cover the optimal assortment under general settings. In particular, we establish a regret bound for the offline assortment optimization problem under the celebrated multinomial logit model. We also propose an efficient computational procedure to solve our pessimistic assortment optimization problem. Numerical studies demonstrate the superiority of the proposed method over the existing baseline method.  ( 2 min )
    TVAE: Triplet-Based Variational Autoencoder using Metric Learning. (arXiv:1802.04403v3 [stat.ML] UPDATED)
    Deep metric learning has been demonstrated to be highly effective in learning semantic representation and encoding information that can be used to measure data similarity, by relying on the embedding learned from metric learning. At the same time, variational autoencoder (VAE) has widely been used to approximate inference and proved to have a good performance for directed probabilistic models. However, for traditional VAE, the data label or feature information are intractable. Similarly, traditional representation learning approaches fail to represent many salient aspects of the data. In this project, we propose a novel integrated framework to learn latent embedding in VAE by incorporating deep metric learning. The features are learned by optimizing a triplet loss on the mean vectors of VAE in conjunction with standard evidence lower bound (ELBO) of VAE. This approach, which we call Triplet based Variational Autoencoder (TVAE), allows us to capture more fine-grained information in the latent embedding. Our model is tested on MNIST data set and achieves a high triplet accuracy of 95.60% while the traditional VAE (Kingma & Welling, 2013) achieves triplet accuracy of 75.08%.
    PFGM++: Unlocking the Potential of Physics-Inspired Generative Models. (arXiv:2302.04265v1 [cs.LG])
    We introduce a new family of physics-inspired generative models termed PFGM++ that unifies diffusion models and Poisson Flow Generative Models (PFGM). These models realize generative trajectories for $N$ dimensional data by embedding paths in $N{+}D$ dimensional space while still controlling the progression with a simple scalar norm of the $D$ additional variables. The new models reduce to PFGM when $D{=}1$ and to diffusion models when $D{\to}\infty$. The flexibility of choosing $D$ allows us to trade off robustness against rigidity as increasing $D$ results in more concentrated coupling between the data and the additional variable norms. We dispense with the biased large batch field targets used in PFGM and instead provide an unbiased perturbation-based objective similar to diffusion models. To explore different choices of $D$, we provide a direct alignment method for transferring well-tuned hyperparameters from diffusion models ($D{\to} \infty$) to any finite $D$ values. Our experiments show that models with finite $D$ can be superior to previous state-of-the-art diffusion models on CIFAR-10/FFHQ $64{\times}64$ datasets, with FID scores of $1.91/2.43$ when $D{=}2048/128$. In addition, we demonstrate that models with smaller $D$ exhibit improved robustness against modeling errors. Code is available at https://github.com/Newbeeer/pfgmpp
    Machine learning classification of non-Markovian noise disturbing quantum dynamics. (arXiv:2101.03221v3 [quant-ph] UPDATED)
    In this paper machine learning and artificial neural network models are proposed for the classification of external noise sources affecting a given quantum dynamics. For this purpose, we train and then validate support vector machine, multi-layer perceptron and recurrent neural network models with different complexity and accuracy, to solve supervised binary classification problems. As a result, we demonstrate the high efficacy of such tools in classifying noisy quantum dynamics using simulated data sets from different realizations of the quantum system dynamics. In addition, we show that for a successful classification one just needs to measure, in a sequence of discrete time instants, the probabilities that the analysed quantum system is in one of the allowed positions or energy configurations. Albeit the training of machine learning models is here performed on synthetic data, our approach is expected to find application in experimental schemes, as e.g. for the noise benchmarking of noisy intermediate-scale quantum devices.
    DIFF2: Differential Private Optimization via Gradient Differences for Nonconvex Distributed Learning. (arXiv:2302.03884v1 [cs.LG])
    Differential private optimization for nonconvex smooth objective is considered. In the previous work, the best known utility bound is $\widetilde O(\sqrt{d}/(n\varepsilon_\mathrm{DP}))$ in terms of the squared full gradient norm, which is achieved by Differential Private Gradient Descent (DP-GD) as an instance, where $n$ is the sample size, $d$ is the problem dimensionality and $\varepsilon_\mathrm{DP}$ is the differential privacy parameter. To improve the best known utility bound, we propose a new differential private optimization framework called \emph{DIFF2 (DIFFerential private optimization via gradient DIFFerences)} that constructs a differential private global gradient estimator with possibly quite small variance based on communicated \emph{gradient differences} rather than gradients themselves. It is shown that DIFF2 with a gradient descent subroutine achieves the utility of $\widetilde O(d^{2/3}/(n\varepsilon_\mathrm{DP})^{4/3})$, which can be significantly better than the previous one in terms of the dependence on the sample size $n$. To the best of our knowledge, this is the first fundamental result to improve the standard utility $\widetilde O(\sqrt{d}/(n\varepsilon_\mathrm{DP}))$ for nonconvex objectives. Additionally, a more computational and communication efficient subroutine is combined with DIFF2 and its theoretical analysis is also given. Numerical experiments are conducted to validate the superiority of DIFF2 framework.
    Operator Shifting for Model-based Policy Evaluation. (arXiv:2110.12658v3 [cs.LG] UPDATED)
    In model-based reinforcement learning, the transition matrix and reward vector are often estimated from random samples subject to noise. Even if the estimated model is an unbiased estimate of the true underlying model, the value function computed from the estimated model is biased. We introduce an operator shifting method for reducing the error introduced by the estimated model. When the error is in the residual norm, we prove that the shifting factor is always positive and upper bounded by $1+O\left(1/n\right)$, where $n$ is the number of samples used in learning each row of the transition matrix. We also propose a practical numerical algorithm for implementing the operator shifting.  ( 2 min )
    Diagnosing and Rectifying Vision Models using Language. (arXiv:2302.04269v1 [cs.LG])
    Recent multi-modal contrastive learning models have demonstrated the ability to learn an embedding space suitable for building strong vision classifiers, by leveraging the rich information in large-scale image-caption datasets. Our work highlights a distinct advantage of this multi-modal embedding space: the ability to diagnose vision classifiers through natural language. The traditional process of diagnosing model behaviors in deployment settings involves labor-intensive data acquisition and annotation. Our proposed method can discover high-error data slices, identify influential attributes and further rectify undesirable model behaviors, without requiring any visual data. Through a combination of theoretical explanation and empirical verification, we present conditions under which classifiers trained on embeddings from one modality can be equivalently applied to embeddings from another modality. On a range of image datasets with known error slices, we demonstrate that our method can effectively identify the error slices and influential attributes, and can further use language to rectify failure modes of the classifier.
    Generalizing Neural Wave Functions. (arXiv:2302.04168v1 [cs.LG])
    Recent neural network-based wave functions have achieved state-of-the-art accuracies in modeling ab-initio ground-state potential energy surface. However, these networks can only solve different spatial arrangements of the same set of atoms. To overcome this limitation, we present Graph-learned Orbital Embeddings (Globe), a neural network-based reparametrization method that can adapt neural wave functions to different molecules. We achieve this by combining a localization method for molecular orbitals with spatial message-passing networks. Further, we propose a locality-driven wave function, the Molecular Oribtal Network (Moon), tailored to solving Schr\"odinger equations of different molecules jointly. In our experiments, we find Moon requiring 8 times fewer steps to converge to similar accuracies as previous methods when trained on different molecules jointly while Globe enabling the transfer from smaller to larger molecules. Further, our analysis shows that Moon converges similarly to recent transformer-based wave functions on larger molecules. In both the computational chemistry and machine learning literature, we are the first to demonstrate that a single wave function can solve the Schr\"odinger equation of molecules with different atoms jointly.
    Exploratory Analysis of Federated Learning Methods with Differential Privacy on MIMIC-III. (arXiv:2302.04208v1 [cs.LG])
    Background: Federated learning methods offer the possibility of training machine learning models on privacy-sensitive data sets, which cannot be easily shared. Multiple regulations pose strict requirements on the storage and usage of healthcare data, leading to data being in silos (i.e. locked-in at healthcare facilities). The application of federated algorithms on these datasets could accelerate disease diagnostic, drug development, as well as improve patient care. Methods: We present an extensive evaluation of the impact of different federation and differential privacy techniques when training models on the open-source MIMIC-III dataset. We analyze a set of parameters influencing a federated model performance, namely data distribution (homogeneous and heterogeneous), communication strategies (communication rounds vs. local training epochs), federation strategies (FedAvg vs. FedProx). Furthermore, we assess and compare two differential privacy (DP) techniques during model training: a stochastic gradient descent-based differential privacy algorithm (DP-SGD), and a sparse vector differential privacy technique (DP-SVT). Results: Our experiments show that extreme data distributions across sites (imbalance either in the number of patients or the positive label ratios between sites) lead to a deterioration of model performance when trained using the FedAvg strategy. This issue is resolved when using FedProx with the use of appropriate hyperparameter tuning. Furthermore, the results show that both differential privacy techniques can reach model performances similar to those of models trained without DP, however at the expense of a large quantifiable privacy leakage. Conclusions: We evaluate empirically the benefits of two federation strategies and propose optimal strategies for the choice of parameters when using differential privacy techniques.
    Boundary Graph Neural Networks for 3D Simulations. (arXiv:2106.11299v5 [cs.LG] UPDATED)
    The abundance of data has given machine learning considerable momentum in natural sciences and engineering, though modeling of physical processes is often difficult. A particularly tough problem is the efficient representation of geometric boundaries. Triangularized geometric boundaries are well understood and ubiquitous in engineering applications. However, it is notoriously difficult to integrate them into machine learning approaches due to their heterogeneity with respect to size and orientation. In this work, we introduce an effective theory to model particle-boundary interactions, which leads to our new Boundary Graph Neural Networks (BGNNs) that dynamically modify graph structures to obey boundary conditions. The new BGNNs are tested on complex 3D granular flow processes of hoppers, rotating drums and mixers, which are all standard components of modern industrial machinery but still have complicated geometry. BGNNs are evaluated in terms of computational efficiency as well as prediction accuracy of particle flows and mixing entropies. BGNNs are able to accurately reproduce 3D granular flows within simulation uncertainties over hundreds of thousands of simulation timesteps. Most notably, in our experiments, particles stay within the geometric objects without using handcrafted conditions or restrictions.
    IRTCI: Item Response Theory for Categorical Imputation. (arXiv:2302.04165v1 [stat.ML])
    Most datasets suffer from partial or complete missing values, which has downstream limitations on the available models on which to test the data and on any statistical inferences that can be made from the data. Several imputation techniques have been designed to replace missing data with stand in values. The various approaches have implications for calculating clinical scores, model building and model testing. The work showcased here offers a novel means for categorical imputation based on item response theory (IRT) and compares it against several methodologies currently used in the machine learning field including k-nearest neighbors (kNN), multiple imputed chained equations (MICE) and Amazon Web Services (AWS) deep learning method, Datawig. Analyses comparing these techniques were performed on three different datasets that represented ordinal, nominal and binary categories. The data were modified so that they also varied on both the proportion of data missing and the systematization of the missing data. Two different assessments of performance were conducted: accuracy in reproducing the missing values, and predictive performance using the imputed data. Results demonstrated that the new method, Item Response Theory for Categorical Imputation (IRTCI), fared quite well compared to currently used methods, outperforming several of them in many conditions. Given the theoretical basis for the new approach, and the unique generation of probabilistic terms for determining category belonging for missing cells, IRTCI offers a viable alternative to current approaches.
    The Hardware Impact of Quantization and Pruning for Weights in Spiking Neural Networks. (arXiv:2302.04174v1 [cs.LG])
    Energy efficient implementations and deployments of Spiking neural networks (SNNs) have been of great interest due to the possibility of developing artificial systems that can achieve the computational powers and energy efficiency of the biological brain. Efficient implementations of SNNs on modern digital hardware are also inspired by advances in machine learning and deep neural networks (DNNs). Two techniques widely employed in the efficient deployment of DNNs -- the quantization and pruning of parameters, can both compress the model size, reduce memory footprints, and facilitate low-latency execution. The interaction between quantization and pruning and how they might impact model performance on SNN accelerators is currently unknown. We study various combinations of pruning and quantization in isolation, cumulatively, and simultaneously (jointly) to a state-of-the-art SNN targeting gesture recognition for dynamic vision sensor cameras (DVS). We show that this state-of-the-art model is amenable to aggressive parameter quantization, not suffering from any loss in accuracy down to ternary weights. However, pruning only maintains iso-accuracy up to 80% sparsity, which results in 45% more energy than the best quantization on our architectural model. Applying both pruning and quantization can result in an accuracy loss to offer a favourable trade-off on the energy-accuracy Pareto-frontier for the given hardware configuration.
    Attending to Graph Transformers. (arXiv:2302.04181v1 [cs.LG])
    Recently, transformer architectures for graphs emerged as an alternative to established techniques for machine learning with graphs, such as graph neural networks. So far, they have shown promising empirical results, e.g., on molecular prediction datasets, often attributed to their ability to circumvent graph neural networks' shortcomings, such as over-smoothing and over-squashing. Here, we derive a taxonomy of graph transformer architectures, bringing some order to this emerging field. We overview their theoretical properties, survey structural and positional encodings, and discuss extensions for important graph classes, e.g., 3D molecular graphs. Empirically, we probe how well graph transformers can recover various graph properties, how well they can deal with heterophilic graphs, and to what extent they prevent over-squashing. Further, we outline open challenges and research direction to stimulate future work. Our code is available at https://github.com/luis-mueller/probing-graph-transformers.
    Combining self-labeling and demand based active learning for non-stationary data streams. (arXiv:2302.04141v1 [cs.LG])
    Learning from non-stationary data streams is a research direction that gains increasing interest as more data in form of streams becomes available, for example from social media, smartphones, or industrial process monitoring. Most approaches assume that the ground truth of the samples becomes available (possibly with some delay) and perform supervised online learning in the test-then-train scheme. While this assumption might be valid in some scenarios, it does not apply to all settings. In this work, we focus on scarcely labeled data streams and explore the potential of self-labeling in gradually drifting data streams. We formalize this setup and propose a novel online $k$-nn classifier that combines self-labeling and demand-based active learning.
    Can Physics-Informed Neural Networks beat the Finite Element Method?. (arXiv:2302.04107v1 [math.NA])
    Partial differential equations play a fundamental role in the mathematical modelling of many processes and systems in physical, biological and other sciences. To simulate such processes and systems, the solutions of PDEs often need to be approximated numerically. The finite element method, for instance, is a usual standard methodology to do so. The recent success of deep neural networks at various approximation tasks has motivated their use in the numerical solution of PDEs. These so-called physics-informed neural networks and their variants have shown to be able to successfully approximate a large range of partial differential equations. So far, physics-informed neural networks and the finite element method have mainly been studied in isolation of each other. In this work, we compare the methodologies in a systematic computational study. Indeed, we employ both methods to numerically solve various linear and nonlinear partial differential equations: Poisson in 1D, 2D, and 3D, Allen-Cahn in 1D, semilinear Schr\"odinger in 1D and 2D. We then compare computational costs and approximation accuracies. In terms of solution time and accuracy, physics-informed neural networks have not been able to outperform the finite element method in our study. In some experiments, they were faster at evaluating the solved PDE.
    Predicting the performance of hybrid ventilation in buildings using a multivariate attention-based biLSTM Encoder-Decoder neural network. (arXiv:2302.04126v1 [cs.LG])
    Hybrid ventilation (coupling natural and mechanical ventilation) is an energy-efficient solution to provide fresh air for most climates, given that it has a reliable control system. To operate such systems optimally, a high-fidelity control-oriented model is required. It should enable near-real time forecast of the indoor air temperature and humidity based on operational conditions such as window opening and HVAC schedules. However, widely used physics-based simulation models (i.e., white-box models) are labour-intensive and computationally expensive. Alternatively, black-box models based on artificial neural networks can be trained to be good estimators for building dynamics. This paper investigates the capabilities of a multivariate multi-head attention-based long short-term memory (LSTM) encoder-decoder neural network to predict indoor air conditions of a building equipped with hybrid ventilation. The deep neural network used for this study aims to predict indoor air temperature dynamics when a window is opened and closed, respectively. Training and test data were generated from detailed multi-zone office building model (EnergyPlus). The deep neural network is able to accurately predict indoor air temperature of five zones whenever a window was opened and closed.
    DynGFN: Bayesian Dynamic Causal Discovery using Generative Flow Networks. (arXiv:2302.04178v1 [cs.LG])
    Learning the causal structure of observable variables is a central focus for scientific discovery. Bayesian causal discovery methods tackle this problem by learning a posterior over the set of admissible graphs given our priors and observations. Existing methods primarily consider observations from static systems and assume the underlying causal structure takes the form of a directed acyclic graph (DAG). In settings with dynamic feedback mechanisms that regulate the trajectories of individual variables, this acyclicity assumption fails unless we account for time. We focus on learning Bayesian posteriors over cyclic graphs and treat causal discovery as a problem of sparse identification of a dynamical system. This imposes a natural temporal causal order between variables and captures cyclic feedback loops through time. Under this lens, we propose a new framework for Bayesian causal discovery for dynamical systems and present a novel generative flow network architecture (DynGFN) tailored for this task. Our results indicate that DynGFN learns posteriors that better encapsulate the distributions over admissible cyclic causal structures compared to counterpart state-of-the-art approaches.
    Two-step hyperparameter optimization method: Accelerating hyperparameter search by using a fraction of a training dataset. (arXiv:2302.03845v1 [cs.LG])
    Hyperparameter optimization (HPO) can be an important step in machine learning model development, but our common practice is archaic -- primarily using a manual or grid search. This is partly because adopting an advanced HPO algorithm entails extra complexity to workflow and longer computation time. This imposes a significant hurdle to machine learning (ML) applications since the choice of suboptimal hyperparameters limits the performance of ML models, ultimately failing to harness the full potential of ML techniques. In this article, we present a two-step HPO method as a strategy to minimize compute and wait time as a lesson learned during applied ML parameterization work. A preliminary evaluation of hyperparameters is first conducted on a small subset of a training dataset, then top-performing candidate models are re-evaluated after retraining with an entire training dataset. This two-step HPO method can be applied to any HPO search algorithm, and we argue it has attractive efficiencies. As a case study, we present our recent application of the two-step HPO method to the development of neural network emulators of aerosol activation. Using only 5% of a training dataset in the initial step is sufficient to find optimal hyperparameter configurations from much more extensive sampling. The benefits of HPO are then revealed by analysis of hyperparameters and model performance, revealing a minimal model complexity required to achieve the best performance, and the diversity of top-performing models harvested from the HPO process allows us to choose a high-performing model with a low inference cost for efficient use in GCMs.  ( 2 min )
    Improving the Model Consistency of Decentralized Federated Learning. (arXiv:2302.04083v1 [cs.LG])
    To mitigate the privacy leakages and communication burdens of Federated Learning (FL), decentralized FL (DFL) discards the central server and each client only communicates with its neighbors in a decentralized communication network. However, existing DFL suffers from high inconsistency among local clients, which results in severe distribution shift and inferior performance compared with centralized FL (CFL), especially on heterogeneous data or sparse communication topology. To alleviate this issue, we propose two DFL algorithms named DFedSAM and DFedSAM-MGS to improve the performance of DFL. Specifically, DFedSAM leverages gradient perturbation to generate local flat models via Sharpness Aware Minimization (SAM), which searches for models with uniformly low loss values. DFedSAM-MGS further boosts DFedSAM by adopting Multiple Gossip Steps (MGS) for better model consistency, which accelerates the aggregation of local flat models and better balances communication complexity and generalization. Theoretically, we present improved convergence rates $\small \mathcal{O}\big(\frac{1}{\sqrt{KT}}+\frac{1}{T}+\frac{1}{K^{1/2}T^{3/2}(1-\lambda)^2}\big)$ and $\small \mathcal{O}\big(\frac{1}{\sqrt{KT}}+\frac{1}{T}+\frac{\lambda^Q+1}{K^{1/2}T^{3/2}(1-\lambda^Q)^2}\big)$ in non-convex setting for DFedSAM and DFedSAM-MGS, respectively, where $1-\lambda$ is the spectral gap of gossip matrix and $Q$ is the number of MGS. Empirically, our methods can achieve competitive performance compared with CFL methods and outperform existing DFL methods.
    ZipLM: Hardware-Aware Structured Pruning of Language Models. (arXiv:2302.04089v1 [cs.LG])
    The breakthrough performance of large language models (LLMs) comes with large computational footprints and high deployment costs. In this paper, we progress towards resolving this problem by proposing a new structured compression approach for LLMs, called ZipLM, which provides state-of-the-art compression-vs-accuracy results, while guaranteeing to match a set of (achievable) target speedups on any given target hardware. Specifically, given a task, a model, an inference environment, as well as a set of speedup targets, ZipLM identifies and removes redundancies in the model through iterative structured shrinking of the model's weight matrices. Importantly, ZipLM works in both, the post-training/one-shot and the gradual compression setting, where it produces a set of accurate models in a single run, making it highly-efficient in practice. Our approach is based on new structured pruning and knowledge distillation techniques, and consistently outperforms prior structured compression methods in terms of accuracy-versus-speedup in experiments on BERT- and GPT-family models. In particular, when compressing GPT2 model, it outperforms DistilGPT2 while being 60% smaller and 30% faster. Further, ZipLM matches performance of heavily optimized MobileBERT model, obtained via extensive architecture search, by simply pruning the baseline BERT-large architecture, and outperforms all prior BERT-base compression techniques like CoFi, MiniLM and TinyBERT.
    Automating Code-Related Tasks Through Transformers: The Impact of Pre-training. (arXiv:2302.04048v1 [cs.SE])
    Transformers have gained popularity in the software engineering (SE) literature. These deep learning models are usually pre-trained through a self-supervised objective, meant to provide the model with basic knowledge about a language of interest (e.g., Java). A classic pre-training objective is the masked language model (MLM), in which a percentage of tokens from the input (e.g., a Java method) is masked, with the model in charge of predicting them. Once pre-trained, the model is then fine-tuned to support the specific downstream task of interest (e.g., code summarization). While there is evidence suggesting the boost in performance provided by pre-training, little is known about the impact of the specific pre-training objective(s) used. Indeed, MLM is just one of the possible pre-training objectives and recent work from the natural language processing field suggest that pre-training objectives tailored for the specific downstream task of interest may substantially boost the model's performance. In this study, we focus on the impact of pre-training objectives on the performance of transformers when automating code-related tasks. We start with a systematic literature review aimed at identifying the pre-training objectives used in SE. Then, we pre-train 32 transformers using both (i) generic pre-training objectives usually adopted in SE; and (ii) pre-training objectives tailored to specific code-related tasks subject of our experimentation, namely bug-fixing, code summarization, and code completion. We also compare the pre-trained models with non pre-trained ones. Our results show that: (i) pre-training helps in boosting performance only if the amount of fine-tuning data available is small; (ii) the MLM objective is usually sufficient to maximize the prediction performance of the model, even when comparing it with pre-training objectives specialized for the downstream task at hand.
    Local Law 144: A Critical Analysis of Regression Metrics. (arXiv:2302.04119v1 [cs.LG])
    The use of automated decision tools in recruitment has received an increasing amount of attention. In November 2021, the New York City Council passed a legislation (Local Law 144) that mandates bias audits of Automated Employment Decision Tools. From 15th April 2023, companies that use automated tools for hiring or promoting employees are required to have these systems audited by an independent entity. Auditors are asked to compute bias metrics that compare outcomes for different groups, based on sex/gender and race/ethnicity categories at a minimum. Local Law 144 proposes novel bias metrics for regression tasks (scenarios where the automated system scores candidates with a continuous range of values). A previous version of the legislation proposed a bias metric that compared the mean scores of different groups. The new revised bias metric compares the proportion of candidates in each group that falls above the median. In this paper, we argue that both metrics fail to capture distributional differences over the whole domain, and therefore cannot reliably detect bias. We first introduce two metrics, as possible alternatives to the legislation metrics. We then compare these metrics over a range of theoretical examples, for which the legislation proposed metrics seem to underestimate bias. Finally, we study real data and show that the legislation metrics can similarly fail in a real-world recruitment application.
    DeepVATS: Deep Visual Analytics for Time Series. (arXiv:2302.03858v1 [cs.LG])
    The field of Deep Visual Analytics (DVA) has recently arisen from the idea of developing Visual Interactive Systems supported by deep learning techniques, in order to provide them with large-scale data processing capabilities and to unify their implementation across different data modalities and domains of application. In this paper we present DeepVATS, an open-source tool that brings the field of DVA into time series data. DeepVATS trains, in a self-supervised way, a masked time series autoencoder that reconstructs patches of a time series, and projects the knowledge contained in the embeddings of that model in an interactive plot, from which time series patterns and anomalies emerge and can be easily spotted. The tool has been tested on both synthetic and real datasets, and its code is publicly available on https://github.com/vrodriguezf/deepvats  ( 2 min )
    Decentralized Riemannian Algorithm for Nonconvex Minimax Problems. (arXiv:2302.03825v1 [cs.LG])
    The minimax optimization over Riemannian manifolds (possibly nonconvex constraints) has been actively applied to solve many problems, such as robust dimensionality reduction and deep neural networks with orthogonal weights (Stiefel manifold). Although many optimization algorithms for minimax problems have been developed in the Euclidean setting, it is difficult to convert them into Riemannian cases, and algorithms for nonconvex minimax problems with nonconvex constraints are even rare. On the other hand, to address the big data challenges, decentralized (serverless) training techniques have recently been emerging since they can reduce communications overhead and avoid the bottleneck problem on the server node. Nonetheless, the algorithm for decentralized Riemannian minimax problems has not been studied. In this paper, we study the distributed nonconvex-strongly-concave minimax optimization problem over the Stiefel manifold and propose both deterministic and stochastic minimax methods. The local model is non-convex strong-concave and the Steifel manifold is a non-convex set. The global function is represented as the finite sum of local functions. For the deterministic setting, we propose DRGDA and prove that our deterministic method achieves a gradient complexity of $O( \epsilon^{-2})$ under mild conditions. For the stochastic setting, we propose DRSGDA and prove that our stochastic method achieves a gradient complexity of $O(\epsilon^{-4})$. The DRGDA and DRSGDA are the first algorithms for distributed minimax optimization with nonconvex constraints with exact convergence. Extensive experimental results on the Deep Neural Networks (DNNs) training over the Stiefel manifold demonstrate the efficiency of our algorithms.
    A Model for Forecasting Air Quality Index in Port Harcourt Nigeria Using Bi-LSTM Algorithm. (arXiv:2302.03930v1 [cs.LG])
    The release of toxic gases by industries, emissions from vehicles, and an increase in the concentration of harmful gases and particulate matter in the atmosphere are all contributing factors to the deterioration of the quality of the air. Factors such as industries, urbanization, population growth, and the increased use of vehicles contribute to the rapid increase in pollution levels, which can adversely impact human health. This paper presents a model for forecasting the air quality index in Nigeria using the Bi-directional LSTM model. The air pollution data was downloaded from an online database (UCL). The dataset was pre-processed using both pandas tools in python. The pre-processed result was used as input features in training a Bi-LSTM model in making future forecasts of the values of the particulate matter Pm2.5, and Pm10. The Bi-LSTM model was evaluated using some evaluation parameters such as mean square error, mean absolute error, absolute mean square, and R^2 square. The result of the Bi-LSTM shows a mean square error of 52.99%, relative mean square error of 7.28%, mean absolute error of 3.4%, and R^2 square of 97%. The model. This shows that the model follows a seamless trend in forecasting the air quality in Port Harcourt, Nigeria.  ( 2 min )
    Neural Artistic Style Transfer with Conditional Adversaria. (arXiv:2302.03875v1 [cs.CV])
    A neural artistic style transformation (NST) model can modify the appearance of a simple image by adding the style of a famous image. Even though the transformed images do not look precisely like artworks by the same artist of the respective style images, the generated images are appealing. Generally, a trained NST model specialises in a style, and a single image represents that style. However, generating an image under a new style is a tedious process, which includes full model training. In this paper, we present two methods that step toward the style image independent neural style transfer model. In other words, the trained model could generate semantically accurate generated image under any content, style image input pair. Our novel contribution is a unidirectional-GAN model that ensures the Cyclic consistency by the model architecture.Furthermore, this leads to much smaller model size and an efficient training and validation phase.
    A Scale-Independent Multi-Objective Reinforcement Learning with Convergence Analysis. (arXiv:2302.04179v1 [cs.LG])
    Many sequential decision-making problems need optimization of different objectives which possibly conflict with each other. The conventional way to deal with a multi-task problem is to establish a scalar objective function based on a linear combination of different objectives. However, for the case of having conflicting objectives with different scales, this method needs a trial-and-error approach to properly find proper weights for the combination. As such, in most cases, this approach cannot guarantee an optimal Pareto solution. In this paper, we develop a single-agent scale-independent multi-objective reinforcement learning on the basis of the Advantage Actor-Critic (A2C) algorithm. A convergence analysis is then done for the devised multi-objective algorithm providing a convergence-in-mean guarantee. We then perform some experiments over a multi-task problem to evaluate the performance of the proposed algorithm. Simulation results show the superiority of developed multi-objective A2C approach against the single-objective algorithm.
    Federated Minimax Optimization with Client Heterogeneity. (arXiv:2302.04249v1 [cs.LG])
    Minimax optimization has seen a surge in interest with the advent of modern applications such as GANs, and it is inherently more challenging than simple minimization. The difficulty is exacerbated by the training data residing at multiple edge devices or \textit{clients}, especially when these clients can have heterogeneous datasets and local computation capabilities. We propose a general federated minimax optimization framework that subsumes such settings and several existing methods like Local SGDA. We show that naive aggregation of heterogeneous local progress results in optimizing a mismatched objective function -- a phenomenon previously observed in standard federated minimization. To fix this problem, we propose normalizing the client updates by the number of local steps undertaken between successive communication rounds. We analyze the convergence of the proposed algorithm for classes of nonconvex-concave and nonconvex-nonconcave functions and characterize the impact of heterogeneous client data, partial client participation, and heterogeneous local computations. Our analysis works under more general assumptions on the intra-client noise and inter-client heterogeneity than so far considered in the literature. For all the function classes considered, we significantly improve the existing computation and communication complexity results. Experimental results support our theoretical claims.
    FFHR: Fully and Flexible Hyperbolic Representation for Knowledge Graph Completion. (arXiv:2302.04088v1 [cs.LG])
    Learning hyperbolic embeddings for knowledge graph (KG) has gained increasing attention due to its superiority in capturing hierarchies. However, some important operations in hyperbolic space still lack good definitions, making existing methods unable to fully leverage the merits of hyperbolic space. Specifically, they suffer from two main limitations: 1) existing Graph Convolutional Network (GCN) methods in hyperbolic space rely on tangent space approximation, which would incur approximation error in representation learning, and 2) due to the lack of inner product operation definition in hyperbolic space, existing methods can only measure the plausibility of facts (links) with hyperbolic distance, which is difficult to capture complex data patterns. In this work, we contribute: 1) a Full Poincar\'{e} Multi-relational GCN that achieves graph information propagation in hyperbolic space without requiring any approximation, and 2) a hyperbolic generalization of Euclidean inner product that is beneficial to capture both hierarchical and complex patterns. On this basis, we further develop a \textbf{F}ully and \textbf{F}lexible \textbf{H}yperbolic \textbf{R}epresentation framework (\textbf{FFHR}) that is able to transfer recent Euclidean-based advances to hyperbolic space. We demonstrate it by instantiating FFHR with four representative KGC methods. Extensive experiments on benchmark datasets validate the superiority of our FFHRs over their Euclidean counterparts as well as state-of-the-art hyperbolic embedding methods.
    ED-Batch: Efficient Automatic Batching of Dynamic Neural Networks via Learned Finite State Machines. (arXiv:2302.03851v1 [cs.LG])
    Batching has a fundamental influence on the efficiency of deep neural network (DNN) execution. However, for dynamic DNNs, efficient batching is particularly challenging as the dataflow graph varies per input instance. As a result, state-of-the-art frameworks use heuristics that result in suboptimal batching decisions. Further, batching puts strict restrictions on memory adjacency and can lead to high data movement costs. In this paper, we provide an approach for batching dynamic DNNs based on finite state machines, which enables the automatic discovery of batching policies specialized for each DNN via reinforcement learning. Moreover, we find that memory planning that is aware of the batching policy can save significant data movement overheads, which is automated by a PQ tree-based algorithm we introduce. Experimental results show that our framework speeds up state-of-the-art frameworks by on average 1.15x, 1.39x, and 2.45x for chain-based, tree-based, and lattice-based DNNs across CPU and GPU.  ( 2 min )
    On Generalized Degree Fairness in Graph Neural Networks. (arXiv:2302.03881v1 [cs.LG])
    Conventional graph neural networks (GNNs) are often confronted with fairness issues that may stem from their input, including node attributes and neighbors surrounding a node. While several recent approaches have been proposed to eliminate the bias rooted in sensitive attributes, they ignore the other key input of GNNs, namely the neighbors of a node, which can introduce bias since GNNs hinge on neighborhood structures to generate node representations. In particular, the varying neighborhood structures across nodes, manifesting themselves in drastically different node degrees, give rise to the diverse behaviors of nodes and biased outcomes. In this paper, we first define and generalize the degree bias using a generalized definition of node degree as a manifestation and quantification of different multi-hop structures around different nodes. To address the bias in the context of node classification, we propose a novel GNN framework called Generalized Degree Fairness-centric Graph Neural Network (Deg-FairGNN). Specifically, in each GNN layer, we employ a learnable debiasing function to generate debiasing contexts, which modulate the layer-wise neighborhood aggregation to eliminate the degree bias originating from the diverse degrees among nodes. Extensive experiments on three benchmark datasets demonstrate the effectiveness of our model on both accuracy and fairness metrics.  ( 2 min )
    Zero-shot Generation of Coherent Storybook from Plain Text Story using Diffusion Models. (arXiv:2302.03900v1 [cs.CV])
    Recent advancements in large scale text-to-image models have opened new possibilities for guiding the creation of images through human-devised natural language. However, while prior literature has primarily focused on the generation of individual images, it is essential to consider the capability of these models to ensure coherency within a sequence of images to fulfill the demands of real-world applications such as storytelling. To address this, here we present a novel neural pipeline for generating a coherent storybook from the plain text of a story. Specifically, we leverage a combination of a pre-trained Large Language Model and a text-guided Latent Diffusion Model to generate coherent images. While previous story synthesis frameworks typically require a large-scale text-to-image model trained on expensive image-caption pairs to maintain the coherency, we employ simple textual inversion techniques along with detector-based semantic image editing which allows zero-shot generation of the coherent storybook. Experimental results show that our proposed method outperforms state-of-the-art image editing baselines.  ( 2 min )
    Topological Deep Learning: A Review of an Emerging Paradigm. (arXiv:2302.03836v1 [cs.LG])
    Topological data analysis (TDA) provides insight into data shape. The summaries obtained by these methods are principled global descriptions of multi-dimensional data whilst exhibiting stable properties such as robustness to deformation and noise. Such properties are desirable in deep learning pipelines but they are typically obtained using non-TDA strategies. This is partly caused by the difficulty of combining TDA constructs (e.g. barcode and persistence diagrams) with current deep learning algorithms. Fortunately, we are now witnessing a growth of deep learning applications embracing topologically-guided components. In this survey, we review the nascent field of topological deep learning by first revisiting the core concepts of TDA. We then explore how the use of TDA techniques has evolved over time to support deep learning frameworks, and how they can be integrated into different aspects of deep learning. Furthermore, we touch on TDA usage for analyzing existing deep models; deep topological analytics. Finally, we discuss the challenges and future prospects of topological deep learning.  ( 2 min )
    Towards Inferential Reproducibility of Machine Learning Research. (arXiv:2302.04054v1 [cs.LG])
    Reliability of machine learning evaluation -- the consistency of observed evaluation scores across replicated model training runs -- is affected by several sources of nondeterminism which can be regarded as measurement noise. Current tendencies to remove noise in order to enforce reproducibility of research results neglect inherent nondeterminism at the implementation level and disregard crucial interaction effects between algorithmic noise factors and data properties. This limits the scope of conclusions that can be drawn from such experiments. Instead of removing noise, we propose to incorporate several sources of variance, including their interaction with data properties, into an analysis of significance and reliability of machine learning evaluation, with the aim to draw inferences beyond particular instances of trained models. We show how to use linear mixed effects models (LMEMs) to analyze performance evaluation scores, and to conduct statistical inference with a generalized likelihood ratio test (GLRT). This allows us to incorporate arbitrary sources of noise like meta-parameter variations into statistical significance testing, and to assess performance differences conditional on data properties. Furthermore, a variance component analysis (VCA) enables the analysis of the contribution of noise sources to overall variance and the computation of a reliability coefficient by the ratio of substantial to total variance.
    A prototype-oriented clustering for domain shift with source privacy. (arXiv:2302.03807v1 [cs.LG])
    Unsupervised clustering under domain shift (UCDS) studies how to transfer the knowledge from abundant unlabeled data from multiple source domains to learn the representation of the unlabeled data in a target domain. In this paper, we introduce Prototype-oriented Clustering with Distillation (PCD) to not only improve the performance and applicability of existing methods for UCDS, but also address the concerns on protecting the privacy of both the data and model of the source domains. PCD first constructs a source clustering model by aligning the distributions of prototypes and data. It then distills the knowledge to the target model through cluster labels provided by the source model while simultaneously clustering the target data. Finally, it refines the target model on the target domain data without guidance from the source model. Experiments across multiple benchmarks show the effectiveness and generalizability of our source-private clustering method.
    Sample-efficient Multi-objective Molecular Optimization with GFlowNets. (arXiv:2302.04040v1 [cs.LG])
    Many crucial scientific problems involve designing novel molecules with desired properties, which can be formulated as an expensive black-box optimization problem over the discrete chemical space. Computational methods have achieved initial success but still struggle with simultaneously optimizing multiple competing properties in a sample-efficient manner. In this work, we propose a multi-objective Bayesian optimization (MOBO) algorithm leveraging the hypernetwork-based GFlowNets (HN-GFN) as an acquisition function optimizer, with the purpose of sampling a diverse batch of candidate molecular graphs from an approximate Pareto front. Using a single preference-conditioned hypernetwork, HN-GFN learns to explore various trade-offs between objectives. Inspired by reinforcement learning, we further propose a hindsight-like off-policy strategy to share high-performing molecules among different preferences in order to speed up learning for HN-GFN. Through synthetic experiments, we illustrate that HN-GFN has adequate capacity to generalize over preferences. Extensive experiments show that our framework outperforms the best baselines by a large margin in terms of hypervolume in various real-world MOBO settings.
    Monge, Bregman and Occam: Interpretable Optimal Transport in High-Dimensions with Feature-Sparse Maps. (arXiv:2302.04065v1 [stat.ML])
    Optimal transport (OT) theory focuses, among all maps $T:\mathbb{R}^d\rightarrow \mathbb{R}^d$ that can morph a probability measure onto another, on those that are the ``thriftiest'', i.e. such that the averaged cost $c(x, T(x))$ between $x$ and its image $T(x)$ be as small as possible. Many computational approaches have been proposed to estimate such Monge maps when $c$ is the $\ell_2^2$ distance, e.g., using entropic maps [Pooladian'22], or neural networks [Makkuva'20, Korotin'20]. We propose a new model for transport maps, built on a family of translation invariant costs $c(x, y):=h(x-y)$, where $h:=\tfrac{1}{2}\|\cdot\|_2^2+\tau$ and $\tau$ is a regularizer. We propose a generalization of the entropic map suitable for $h$, and highlight a surprising link tying it with the Bregman centroids of the divergence $D_h$ generated by $h$, and the proximal operator of $\tau$. We show that choosing a sparsity-inducing norm for $\tau$ results in maps that apply Occam's razor to transport, in the sense that the displacement vectors $\Delta(x):= T(x)-x$ they induce are sparse, with a sparsity pattern that varies depending on $x$. We showcase the ability of our method to estimate meaningful OT maps for high-dimensional single-cell transcription data, in the $34000$-$d$ space of gene counts for cells, without using dimensionality reduction, thus retaining the ability to interpret all displacements at the gene level.
    Clinical BioBERT Hyperparameter Optimization using Genetic Algorithm Clinical BioBERT Hyperparameter Optimization using Genetic Algorithm. (arXiv:2302.03822v1 [cs.AI])
    Clinical factors account only for a small portion, about 10-30%, of the controllable factors that affect an individual's health outcomes. The remaining factors include where a person was born and raised, where he/she pursued their education, what their work and family environment is like, etc. These factors are collectively referred to as Social Determinants of Health (SDoH). The majority of SDoH data is recorded in unstructured clinical notes by physicians and practitioners. Recording SDoH data in a structured manner (in an EHR) could greatly benefit from a dedicated ontology of SDoH terms. Our research focuses on extracting sentences from clinical notes, making use of such an SDoH ontology (called SOHO) to provide appropriate concepts. We utilize recent advancements in Deep Learning to optimize the hyperparameters of a Clinical BioBERT model for SDoH text. A genetic algorithm-based hyperparameter tuning regimen was implemented to identify optimal parameter settings. To implement a complete classifier, we pipelined Clinical BioBERT with two subsequent linear layers and two dropout layers. The output predicts whether a text fragment describes an SDoH issue of the patient. We compared the AdamW, Adafactor, and LAMB optimizers. In our experiments, AdamW outperformed the others in terms of accuracy.  ( 2 min )
    Explainable Label-flipping Attacks on Human Emotion Assessment System. (arXiv:2302.04109v1 [cs.LG])
    This paper's main goal is to provide an attacker's point of view on data poisoning assaults that use label-flipping during the training phase of systems that use electroencephalogram (EEG) signals to evaluate human emotion. To attack different machine learning classifiers such as Adaptive Boosting (AdaBoost) and Random Forest dedicated to the classification of 4 different human emotions using EEG signals, this paper proposes two scenarios of label-flipping methods. The results of the studies show that the proposed data poison attacksm based on label-flipping are successful regardless of the model, but different models show different degrees of resistance to the assaults. In addition, numerous Explainable Artificial Intelligence (XAI) techniques are used to explain the data poison attacks on EEG signal-based human emotion evaluation systems.
    WAT: Improve the Worst-class Robustness in Adversarial Training. (arXiv:2302.04025v1 [cs.LG])
    Deep Neural Networks (DNN) have been shown to be vulnerable to adversarial examples. Adversarial training (AT) is a popular and effective strategy to defend against adversarial attacks. Recent works (Benz et al., 2020; Xu et al., 2021; Tian et al., 2021) have shown that a robust model well-trained by AT exhibits a remarkable robustness disparity among classes, and propose various methods to obtain consistent robust accuracy across classes. Unfortunately, these methods sacrifice a good deal of the average robust accuracy. Accordingly, this paper proposes a novel framework of worst-class adversarial training and leverages no-regret dynamics to solve this problem. Our goal is to obtain a classifier with great performance on worst-class and sacrifice just a little average robust accuracy at the same time. We then rigorously analyze the theoretical properties of our proposed algorithm, and the generalization error bound in terms of the worst-class robust risk. Furthermore, we propose a measurement to evaluate the proposed method in terms of both the average and worst-class accuracies. Experiments on various datasets and networks show that our proposed method outperforms the state-of-the-art approaches.
    Revisit the Algorithm Selection Problem for TSP with Spatial Information Enhanced Graph Neural Networks. (arXiv:2302.04035v1 [cs.LG])
    Algorithm selection is a well-known problem where researchers investigate how to construct useful features representing the problem instances and then apply feature-based machine learning models to predict which algorithm works best with the given instance. However, even for simple optimization problems such as Euclidean Traveling Salesman Problem (TSP), there lacks a general and effective feature representation for problem instances. The important features of TSP are relatively well understood in the literature, based on extensive domain knowledge and post-analysis of the solutions. In recent years, Convolutional Neural Network (CNN) has become a popular approach to select algorithms for TSP. Compared to traditional feature-based machine learning models, CNN has an automatic feature-learning ability and demands less domain expertise. However, it is still required to generate intermediate representations, i.e., multiple images to represent TSP instances first. In this paper, we revisit the algorithm selection problem for TSP, and propose a novel Graph Neural Network (GNN), called GINES. GINES takes the coordinates of cities and distances between cities as input. It is composed of a new message-passing mechanism and a local neighborhood feature extractor to learn spatial information of TSP instances. We evaluate GINES on two benchmark datasets. The results show that GINES outperforms CNN and the original GINE models. It is better than the traditional handcrafted feature-based approach on one dataset. The code and dataset will be released in the final version of this paper.
    On the Richness of Calibration. (arXiv:2302.04118v1 [cs.LG])
    Probabilistic predictions can be evaluated through comparisons with observed label frequencies, that is, through the lens of calibration. Recent scholarship on algorithmic fairness has started to look at a growing variety of calibration-based objectives under the name of multi-calibration but has still remained fairly restricted. In this paper, we explore and analyse forms of evaluation through calibration by making explicit the choices involved in designing calibration scores. We organise these into three grouping choices and a choice concerning the agglomeration of group errors. This provides a framework for comparing previously proposed calibration scores and helps to formulate novel ones with desirable mathematical properties. In particular, we explore the possibility of grouping datapoints based on their input features rather than on predictions and formally demonstrate advantages of such approaches. We also characterise the space of suitable agglomeration functions for group errors, generalising previously proposed calibration scores. Complementary to such population-level scores, we explore calibration scores at the individual level and analyse their relationship to choices of grouping. We draw on these insights to introduce and axiomatise fairness deviation measures for population-level scores. We demonstrate that with appropriate choices of grouping, these novel global fairness scores can provide notions of (sub-)group or individual fairness.
    Multi-view Feature Extraction based on Dual Contrastive Head. (arXiv:2302.03932v1 [cs.CV])
    Multi-view feature extraction is an efficient approach for alleviating the issue of dimensionality in highdimensional multi-view data. Contrastive learning (CL), which is a popular self-supervised learning method, has recently attracted considerable attention. Most CL-based methods were constructed only from the sample level. In this study, we propose a novel multiview feature extraction method based on dual contrastive head, which introduce structural-level contrastive loss into sample-level CL-based method. Structural-level CL push the potential subspace structures consistent in any two cross views, which assists sample-level CL to extract discriminative features more effectively. Furthermore, it is proven that the relationships between structural-level CL and mutual information and probabilistic intraand inter-scatter, which provides the theoretical support for the excellent performance. Finally, numerical experiments on six real datasets demonstrate the superior performance of the proposed method compared to existing methods.
    Learning-based Online Optimization for Autonomous Mobility-on-Demand Fleet Control. (arXiv:2302.03963v1 [math.OC])
    Autonomous mobility-on-demand systems are a viable alternative to mitigate many transportation-related externalities in cities, such as rising vehicle volumes in urban areas and transportation-related pollution. However, the success of these systems heavily depends on efficient and effective fleet control strategies. In this context, we study online control algorithms for autonomous mobility-on-demand systems and develop a novel hybrid combinatorial optimization enriched machine learning pipeline which learns online dispatching and rebalancing policies from optimal full-information solutions. We test our hybrid pipeline on large-scale real-world scenarios with different vehicle fleet sizes and various request densities. We show that our approach outperforms state-of-the-art greedy, and model-predictive control approaches with respect to various KPIs, e.g., by up to 17.1% and on average by 6.3% in terms of realized profit.
    Zero-shot Sim2Real Adaptation Across Environments. (arXiv:2302.04013v1 [cs.LG])
    Simulation based learning often provides a cost-efficient recourse to reinforcement learning applications in robotics. However, simulators are generally incapable of accurately replicating real-world dynamics, and thus bridging the sim2real gap is an important problem in simulation based learning. Current solutions to bridge the sim2real gap involve hybrid simulators that are augmented with neural residual models. Unfortunately, they require a separate residual model for each individual environment configuration (i.e., a fixed setting of environment variables such as mass, friction etc.), and thus are not transferable to new environments quickly. To address this issue, we propose a Reverse Action Transformation (RAT) policy which learns to imitate simulated policies in the real-world. Once learnt from a single environment, RAT can then be deployed on top of a Universal Policy Network to achieve zero-shot adaptation to new environments. We empirically evaluate our approach in a set of continuous control tasks and observe its advantage as a few-shot and zero-shot learner over competing baselines.
    Probabilistic Attention based on Gaussian Processes for Deep Multiple Instance Learning. (arXiv:2302.04061v1 [cs.LG])
    Multiple Instance Learning (MIL) is a weakly supervised learning paradigm that is becoming increasingly popular because it requires less labeling effort than fully supervised methods. This is especially interesting for areas where the creation of large annotated datasets remains challenging, as in medicine. Although recent deep learning MIL approaches have obtained state-of-the-art results, they are fully deterministic and do not provide uncertainty estimations for the predictions. In this work, we introduce the Attention Gaussian Process (AGP) model, a novel probabilistic attention mechanism based on Gaussian Processes for deep MIL. AGP provides accurate bag-level predictions as well as instance-level explainability, and can be trained end-to-end. Moreover, its probabilistic nature guarantees robustness to overfitting on small datasets and uncertainty estimations for the predictions. The latter is especially important in medical applications, where decisions have a direct impact on the patient's health. The proposed model is validated experimentally as follows. First, its behavior is illustrated in two synthetic MIL experiments based on the well-known MNIST and CIFAR-10 datasets, respectively. Then, it is evaluated in three different real-world cancer detection experiments. AGP outperforms state-of-the-art MIL approaches, including deterministic deep learning ones. It shows a strong performance even on a small dataset with less than 100 labels and generalizes better than competing methods on an external test set. Moreover, we experimentally show that predictive uncertainty correlates with the risk of wrong predictions, and therefore it is a good indicator of reliability in practice. Our code is publicly available.
    Approximately Optimal Core Shapes for Tensor Decompositions. (arXiv:2302.03886v1 [cs.DS])
    This work studies the combinatorial optimization problem of finding an optimal core tensor shape, also called multilinear rank, for a size-constrained Tucker decomposition. We give an algorithm with provable approximation guarantees for its reconstruction error via connections to higher-order singular values. Specifically, we introduce a novel Tucker packing problem, which we prove is NP-hard, and give a polynomial-time approximation scheme based on a reduction to the 2-dimensional knapsack problem with a matroid constraint. We also generalize our techniques to tree tensor network decompositions. We implement our algorithm using an integer programming solver, and show that its solution quality is competitive with (and sometimes better than) the greedy algorithm that uses the true Tucker decomposition loss at each step, while also running up to 1000x faster.
    Rover: An online Spark SQL tuning service via generalized transfer learning. (arXiv:2302.04046v1 [cs.LG])
    Distributed data analytic engines like Spark are common choices to process massive data in industry. However, the performance of Spark SQL highly depends on the choice of configurations, where the optimal ones vary with the executed workloads. Among various alternatives for Spark SQL tuning, Bayesian optimization (BO) is a popular framework that finds near-optimal configurations given sufficient budget, but it suffers from the re-optimization issue and is not practical in real production. When applying transfer learning to accelerate the tuning process, we notice two domain-specific challenges: 1) most previous work focus on transferring tuning history, while expert knowledge from Spark engineers is of great potential to improve the tuning performance but is not well studied so far; 2) history tasks should be carefully utilized, where using dissimilar ones lead to a deteriorated performance in production. In this paper, we present Rover, a deployed online Spark SQL tuning service for efficient and safe search on industrial workloads. To address the challenges, we propose generalized transfer learning to boost the tuning performance based on external knowledge, including expert-assisted Bayesian optimization and controlled history transfer. Experiments on public benchmarks and real-world tasks show the superiority of Rover over competitive baselines. Notably, Rover saves an average of 50.1% of the memory cost on 12k real-world Spark SQL tasks in 20 iterations, among which 76.2% of the tasks achieve a significant memory reduction of over 60%.
    QS-ADN: Quasi-Supervised Artifact Disentanglement Network for Low-Dose CT Image Denoising by Local Similarity Among Unpaired Data. (arXiv:2302.03916v1 [cs.LG])
    Deep learning has been successfully applied to low-dose CT (LDCT) image denoising for reducing potential radiation risk. However, the widely reported supervised LDCT denoising networks require a training set of paired images, which is expensive to obtain and cannot be perfectly simulated. Unsupervised learning utilizes unpaired data and is highly desirable for LDCT denoising. As an example, an artifact disentanglement network (ADN) relies on unparied images and obviates the need for supervision but the results of artifact reduction are not as good as those through supervised learning.An important observation is that there is often hidden similarity among unpaired data that can be utilized. This paper introduces a new learning mode, called quasi-supervised learning, to empower the ADN for LDCT image denoising.For every LDCT image, the best matched image is first found from an unpaired normal-dose CT (NDCT) dataset. Then, the matched pairs and the corresponding matching degree as prior information are used to construct and train our ADN-type network for LDCT denoising.The proposed method is different from (but compatible with) supervised and semi-supervised learning modes and can be easily implemented by modifying existing networks. The experimental results show that the method is competitive with state-of-the-art methods in terms of noise suppression and contextual fidelity. The code and working dataset are publicly available at https://github.com/ruanyuhui/ADN-QSDL.git.
    InMyFace: Inertial and Mechanomyography-Based Sensor Fusion for Wearable Facial Activity Recognition. (arXiv:2302.04024v1 [cs.LG])
    Recognizing facial activity is a well-understood (but non-trivial) computer vision problem. However, reliable solutions require a camera with a good view of the face, which is often unavailable in wearable settings. Furthermore, in wearable applications, where systems accompany users throughout their daily activities, a permanently running camera can be problematic for privacy (and legal) reasons. This work presents an alternative solution based on the fusion of wearable inertial sensors, planar pressure sensors, and acoustic mechanomyography (muscle sounds). The sensors were placed unobtrusively in a sports cap to monitor facial muscle activities related to facial expressions. We present our integrated wearable sensor system, describe data fusion and analysis methods, and evaluate the system in an experiment with thirteen subjects from different cultural backgrounds (eight countries) and both sexes (six women and seven men). In a one-model-per-user scheme and using a late fusion approach, the system yielded an average F1 score of 85.00% for the case where all sensing modalities are combined. With a cross-user validation and a one-model-for-all-user scheme, an F1 score of 79.00% was obtained for thirteen participants (six females and seven males). Moreover, in a hybrid fusion (cross-user) approach and six classes, an average F1 score of 82.00% was obtained for eight users. The results are competitive with state-of-the-art non-camera-based solutions for a cross-user study. In addition, our unique set of participants demonstrates the inclusiveness and generalizability of the approach.
    Fortuna: A Library for Uncertainty Quantification in Deep Learning. (arXiv:2302.04019v1 [cs.LG])
    We present Fortuna, an open-source library for uncertainty quantification in deep learning. Fortuna supports a range of calibration techniques, such as conformal prediction that can be applied to any trained neural network to generate reliable uncertainty estimates, and scalable Bayesian inference methods that can be applied to Flax-based deep neural networks trained from scratch for improved uncertainty quantification and accuracy. By providing a coherent framework for advanced uncertainty quantification methods, Fortuna simplifies the process of benchmarking and helps practitioners build robust AI systems.
    Information-Theoretic Diffusion. (arXiv:2302.03792v1 [cs.LG])
    Denoising diffusion models have spurred significant gains in density modeling and image generation, precipitating an industrial revolution in text-guided AI art generation. We introduce a new mathematical foundation for diffusion models inspired by classic results in information theory that connect Information with Minimum Mean Square Error regression, the so-called I-MMSE relations. We generalize the I-MMSE relations to exactly relate the data distribution to an optimal denoising regression problem, leading to an elegant refinement of existing diffusion bounds. This new insight leads to several improvements for probability distribution estimation, including theoretical justification for diffusion model ensembling. Remarkably, our framework shows how continuous and discrete probabilities can be learned with the same regression objective, avoiding domain-specific generative models used in variational methods. Code to reproduce experiments is provided at this http URL and simplified demonstration code is at this http URL  ( 2 min )
    Sketchy: Memory-efficient Adaptive Regularization with Frequent Directions. (arXiv:2302.03764v1 [stat.ML])
    Adaptive regularization methods that exploit more than the diagonal entries exhibit state of the art performance for many tasks, but can be prohibitive in terms of memory and running time. We find the spectra of the Kronecker-factored gradient covariance matrix in deep learning (DL) training tasks are concentrated on a small leading eigenspace that changes throughout training, motivating a low-rank sketching approach. We describe a generic method for reducing memory and compute requirements of maintaining a matrix preconditioner using the Frequent Directions (FD) sketch. Our technique allows interpolation between resource requirements and the degradation in regret guarantees with rank $k$: in the online convex optimization (OCO) setting over dimension $d$, we match full-matrix $d^2$ memory regret using only $dk$ memory up to additive error in the bottom $d-k$ eigenvalues of the gradient covariance. Further, we show extensions of our work to Shampoo, placing the method on the memory-quality Pareto frontier of several large scale benchmarks.  ( 2 min )
    Compositional Score Modeling for Simulation-based Inference. (arXiv:2209.14249v2 [cs.LG] UPDATED)
    Neural Posterior Estimation methods for simulation-based inference can be ill-suited for dealing with posterior distributions obtained by conditioning on multiple observations, as they tend to require a large number of simulator calls to learn accurate approximations. In contrast, Neural Likelihood Estimation methods can handle multiple observations at inference time after learning from individual observations, but they rely on standard inference methods, such as MCMC or variational inference, which come with certain performance drawbacks. We introduce a new method based on conditional score modeling that enjoys the benefits of both approaches. We model the scores of the (diffused) posterior distributions induced by individual observations, and introduce a way of combining the learned scores to approximately sample from the target posterior distribution. Our approach is sample-efficient, can naturally aggregate multiple observations at inference time, and avoids the drawbacks of standard inference methods.
    Fully-Dynamic Approximate Decision Trees With Worst-Case Update Time Guarantees. (arXiv:2302.03994v1 [cs.DS])
    We give the first algorithm that maintains an approximate decision tree over an arbitrary sequence of insertions and deletions of labeled examples, with strong guarantees on the worst-case running time per update request. For instance, we show how to maintain a decision tree where every vertex has Gini gain within an additive $\alpha$ of the optimum by performing $O\Big(\frac{d\,(\log n)^4}{\alpha^3}\Big)$ elementary operations per update, where $d$ is the number of features and $n$ the maximum size of the active set (the net result of the update requests). We give similar bounds for the information gain and the variance gain. In fact, all these bounds are corollaries of a more general result, stated in terms of decision rules -- functions that, given a set $S$ of labeled examples, decide whether to split $S$ or predict a label. Decision rules give a unified view of greedy decision tree algorithms regardless of the example and label domains, and lead to a general notion of $\epsilon$-approximate decision trees that, for natural decision rules such as those used by ID3 or C4.5, implies the gain approximation guarantees above. The heart of our work provides a deterministic algorithm that, given any decision rule and any $\epsilon > 0$, maintains an $\epsilon$-approximate tree using $O\!\left(\frac{d\, f(n)}{n} \operatorname{poly}\frac{h}{\epsilon}\right)$ operations per update, where $f(n)$ is the complexity of evaluating the rule over a set of $n$ examples and $h$ is the maximum height of the maintained tree.
    Classification of Methods to Reduce Clinical Alarm Signals for Remote Patient Monitoring: A Critical Review. (arXiv:2302.03885v1 [cs.LG])
    Remote Patient Monitoring (RPM) is an emerging technology paradigm that helps reduce clinician workload by automated monitoring and raising intelligent alarm signals. High sensitivity and intelligent data-processing algorithms used in RPM devices result in frequent false-positive alarms, resulting in alarm fatigue. This study aims to critically review the existing literature to identify the causes of these false-positive alarms and categorize the various interventions used in the literature to eliminate these causes. That act as a catalog and helps in false alarm reduction algorithm design. A step-by-step approach to building an effective alarm signal generator for clinical use has been proposed in this work. Second, the possible causes of false-positive alarms amongst RPM applications were analyzed from the literature. Third, a critical review has been done of the various interventions used in the literature depending on causes and classification based on four major approaches: clinical knowledge, physiological data, medical sensor devices, and clinical environments. A practical clinical alarm strategy could be developed by following our pentagon approach. The first phase of this approach emphasizes identifying the various causes for the high number of false-positive alarms. Future research will focus on developing a false alarm reduction method using data mining.
    Cut your Losses with Squentropy. (arXiv:2302.03952v1 [cs.LG])
    Nearly all practical neural models for classification are trained using cross-entropy loss. Yet this ubiquitous choice is supported by little theoretical or empirical evidence. Recent work (Hui & Belkin, 2020) suggests that training using the (rescaled) square loss is often superior in terms of the classification accuracy. In this paper we propose the "squentropy" loss, which is the sum of two terms: the cross-entropy loss and the average square loss over the incorrect classes. We provide an extensive set of experiments on multi-class classification problems showing that the squentropy loss outperforms both the pure cross entropy and rescaled square losses in terms of the classification accuracy. We also demonstrate that it provides significantly better model calibration than either of these alternative losses and, furthermore, has less variance with respect to the random initialization. Additionally, in contrast to the square loss, squentropy loss can typically be trained using exactly the same optimization parameters, including the learning rate, as the standard cross-entropy loss, making it a true "plug-and-play" replacement. Finally, unlike the rescaled square loss, multiclass squentropy contains no parameters that need to be adjusted.
    Structural hierarchical learning for energy networks. (arXiv:2302.03978v1 [cs.LG])
    Many sectors nowadays require accurate and coherent predictions across their organization to effectively operate. Otherwise, decision-makers would be planning using disparate views of the future, resulting in inconsistent decisions across their sectors. To secure coherency across hierarchies, recent research has put forward hierarchical learning, a coherency-informed hierarchical regressor leveraging the power of machine learning thanks to a custom loss function founded on optimal reconciliation methods. While promising potentials were outlined, results exhibited discordant performances in which coherency information only improved hierarchical forecasts in one setting. This work proposes to tackle these obstacles by investigating custom neural network designs inspired by the topological structures of hierarchies. Results unveil that, in a data-limited setting, structural models with fewer connections perform overall best and demonstrate the coherency information value for both accuracy and coherency forecasting performances, provided individual forecasts were generated within reasonable accuracy limits. Overall, this work expands and improves hierarchical learning methods thanks to a structurally-scaled learning mechanism extension coupled with tailored network designs, producing a resourceful, data-efficient, and information-rich learning process.
    Efficient Joint Learning for Clinical Named Entity Recognition and Relation Extraction Using Fourier Networks: A Use Case in Adverse Drug Events. (arXiv:2302.04185v1 [cs.CL])
    Current approaches for clinical information extraction are inefficient in terms of computational costs and memory consumption, hindering their application to process large-scale electronic health records (EHRs). We propose an efficient end-to-end model, the Joint-NER-RE-Fourier (JNRF), to jointly learn the tasks of named entity recognition and relation extraction for documents of variable length. The architecture uses positional encoding and unitary batch sizes to process variable length documents and uses a weight-shared Fourier network layer for low-complexity token mixing. Finally, we reach the theoretical computational complexity lower bound for relation extraction using a selective pooling strategy and distance-aware attention weights with trainable polynomial distance functions. We evaluated the JNRF architecture using the 2018 N2C2 ADE benchmark to jointly extract medication-related entities and relations in variable-length EHR summaries. JNRF outperforms rolling window BERT with selective pooling by 0.42%, while being twice as fast to train. Compared to state-of-the-art BiLSTM-CRF architectures on the N2C2 ADE benchmark, results show that the proposed approach trains 22 times faster and reduces GPU memory consumption by 1.75 folds, with a reasonable performance tradeoff of 90%, without the use of external tools, hand-crafted rules or post-processing. Given the significant carbon footprint of deep learning models and the current energy crises, these methods could support efficient and cleaner information extraction in EHRs and other types of large-scale document databases.
    Fast Linear Model Trees by PILOT. (arXiv:2302.03931v1 [stat.ML])
    Linear model trees are regression trees that incorporate linear models in the leaf nodes. This preserves the intuitive interpretation of decision trees and at the same time enables them to better capture linear relationships, which is hard for standard decision trees. But most existing methods for fitting linear model trees are time consuming and therefore not scalable to large data sets. In addition, they are more prone to overfitting and extrapolation issues than standard regression trees. In this paper we introduce PILOT, a new algorithm for linear model trees that is fast, regularized, stable and interpretable. PILOT trains in a greedy fashion like classic regression trees, but incorporates an $L^2$ boosting approach and a model selection rule for fitting linear models in the nodes. The abbreviation PILOT stands for $PI$ecewise $L$inear $O$rganic $T$ree, where `organic' refers to the fact that no pruning is carried out. PILOT has the same low time and space complexity as CART without its pruning. An empirical study indicates that PILOT tends to outperform standard decision trees and other linear model trees on a variety of data sets. Moreover, we prove its consistency in an additive model setting under weak assumptions. When the data is generated by a linear model, the convergence rate is polynomial.
    ASTRIDE: Adaptive Symbolization for Time Series Databases. (arXiv:2302.04097v1 [cs.LG])
    We introduce ASTRIDE (Adaptive Symbolization for Time seRIes DatabasEs), a novel symbolic representation of time series, along with its accelerated variant FASTRIDE (Fast ASTRIDE). Unlike most symbolization procedures, ASTRIDE is adaptive during both the segmentation step by performing change-point detection and the quantization step by using quantiles. Instead of proceeding signal by signal, ASTRIDE builds a dictionary of symbols that is common to all signals in a data set. We also introduce D-GED (Dynamic General Edit Distance), a novel similarity measure on symbolic representations based on the general edit distance. We demonstrate the performance of the ASTRIDE and FASTRIDE representations compared to SAX (Symbolic Aggregate approXimation), 1d-SAX, SFA (Symbolic Fourier Approximation), and ABBA (Adaptive Brownian Bridge-based Aggregation) on reconstruction and, when applicable, on classification tasks. These algorithms are evaluated on 86 univariate equal-size data sets from the UCR Time Series Classification Archive. An open source GitHub repository called astride is made available to reproduce all the experiments in Python.
    What Matters In The Structured Pruning of Generative Language Models?. (arXiv:2302.03773v1 [cs.CL])
    Auto-regressive large language models such as GPT-3 require enormous computational resources to use. Traditionally, structured pruning methods are employed to reduce resource usage. However, their application to and efficacy for generative language models is heavily under-explored. In this paper we conduct an comprehensive evaluation of common structured pruning methods, including magnitude, random, and movement pruning on the feed-forward layers in GPT-type models. Unexpectedly, random pruning results in performance that is comparable to the best established methods, across multiple natural language generation tasks. To understand these results, we provide a framework for measuring neuron-level redundancy of models pruned by different methods, and discover that established structured pruning methods do not take into account the distinctiveness of neurons, leaving behind excess redundancies. In view of this, we introduce Globally Unique Movement (GUM) to improve the uniqueness of neurons in pruned models. We then discuss the effects of our techniques on different redundancy metrics to explain the improved performance.
    Fairness in Matching under Uncertainty. (arXiv:2302.03810v1 [cs.LG])
    The prevalence and importance of algorithmic two-sided marketplaces has drawn attention to the issue of fairness in such settings. Algorithmic decisions are used in assigning students to schools, users to advertisers, and applicants to job interviews. These decisions should heed the preferences of individuals, and simultaneously be fair with respect to their merits (synonymous with fit, future performance, or need). Merits conditioned on observable features are always uncertain, a fact that is exacerbated by the widespread use of machine learning algorithms to infer merit from the observables. As our key contribution, we carefully axiomatize a notion of individual fairness in the two-sided marketplace setting which respects the uncertainty in the merits; indeed, it simultaneously recognizes uncertainty as the primary potential cause of unfairness and an approach to address it. We design a linear programming framework to find fair utility-maximizing distributions over allocations, and we show that the linear program is robust to perturbations in the estimated parameters of the uncertain merit distributions, a key property in combining the approach with machine learning techniques.
    Noise2Music: Text-conditioned Music Generation with Diffusion Models. (arXiv:2302.03917v1 [cs.SD])
    We introduce Noise2Music, where a series of diffusion models is trained to generate high-quality 30-second music clips from text prompts. Two types of diffusion models, a generator model, which generates an intermediate representation conditioned on text, and a cascader model, which generates high-fidelity audio conditioned on the intermediate representation and possibly the text, are trained and utilized in succession to generate high-fidelity music. We explore two options for the intermediate representation, one using a spectrogram and the other using audio with lower fidelity. We find that the generated audio is not only able to faithfully reflect key elements of the text prompt such as genre, tempo, instruments, mood, and era, but goes beyond to ground fine-grained semantics of the prompt. Pretrained large language models play a key role in this story -- they are used to generate paired text for the audio of the training set and to extract embeddings of the text prompts ingested by the diffusion models. Generated examples: https://google-research.github.io/noise2music
    Predictable MDP Abstraction for Unsupervised Model-Based RL. (arXiv:2302.03921v1 [cs.LG])
    A key component of model-based reinforcement learning (RL) is a dynamics model that predicts the outcomes of actions. Errors in this predictive model can degrade the performance of model-based controllers, and complex Markov decision processes (MDPs) can present exceptionally difficult prediction problems. To mitigate this issue, we propose predictable MDP abstraction (PMA): instead of training a predictive model on the original MDP, we train a model on a transformed MDP with a learned action space that only permits predictable, easy-to-model actions, while covering the original state-action space as much as possible. As a result, model learning becomes easier and more accurate, which allows robust, stable model-based planning or model-based RL. This transformation is learned in an unsupervised manner, before any task is specified by the user. Downstream tasks can then be solved with model-based control in a zero-shot fashion, without additional environment interactions. We theoretically analyze PMA and empirically demonstrate that PMA leads to significant improvements over prior unsupervised model-based RL approaches in a range of benchmark environments. Our code and videos are available at https://seohong.me/projects/pma/
    Standing Between Past and Future: Spatio-Temporal Modeling for Multi-Camera 3D Multi-Object Tracking. (arXiv:2302.03802v1 [cs.CV])
    This work proposes an end-to-end multi-camera 3D multi-object tracking (MOT) framework. It emphasizes spatio-temporal continuity and integrates both past and future reasoning for tracked objects. Thus, we name it "Past-and-Future reasoning for Tracking" (PF-Track). Specifically, our method adapts the "tracking by attention" framework and represents tracked instances coherently over time with object queries. To explicitly use historical cues, our "Past Reasoning" module learns to refine the tracks and enhance the object features by cross-attending to queries from previous frames and other objects. The "Future Reasoning" module digests historical information and predicts robust future trajectories. In the case of long-term occlusions, our method maintains the object positions and enables re-association by integrating motion predictions. On the nuScenes dataset, our method improves AMOTA by a large margin and remarkably reduces ID-Switches by 90% compared to prior approaches, which is an order of magnitude less. The code and models are made available at https://github.com/TRI-ML/PF-Track.
    Layered State Discovery for Incremental Autonomous Exploration. (arXiv:2302.03789v1 [cs.LG])
    We study the autonomous exploration (AX) problem proposed by Lim & Auer (2012). In this setting, the objective is to discover a set of $\epsilon$-optimal policies reaching a set $\mathcal{S}_L^{\rightarrow}$ of incrementally $L$-controllable states. We introduce a novel layered decomposition of the set of incrementally $L$-controllable states that is based on the iterative application of a state-expansion operator. We leverage these results to design Layered Autonomous Exploration (LAE), a novel algorithm for AX that attains a sample complexity of $\tilde{\mathcal{O}}(LS^{\rightarrow}_{L(1+\epsilon)}\Gamma_{L(1+\epsilon)} A \ln^{12}(S^{\rightarrow}_{L(1+\epsilon)})/\epsilon^2)$, where $S^{\rightarrow}_{L(1+\epsilon)}$ is the number of states that are incrementally $L(1+\epsilon)$-controllable, $A$ is the number of actions, and $\Gamma_{L(1+\epsilon)}$ is the branching factor of the transitions over such states. LAE improves over the algorithm of Tarbouriech et al. (2020a) by a factor of $L^2$ and it is the first algorithm for AX that works in a countably-infinite state space. Moreover, we show that, under a certain identifiability assumption, LAE achieves minimax-optimal sample complexity of $\tilde{\mathcal{O}}(LS^{\rightarrow}_{L}A\ln^{12}(S^{\rightarrow}_{L})/\epsilon^2)$, outperforming existing algorithms and matching for the first time the lower bound proved by Cai et al. (2022) up to logarithmic factors.
    Prediction approaches for partly missing multi-omics covariate data: A literature review and an empirical comparison study. (arXiv:2302.03991v1 [q-bio.GN])
    As the availability of omics data has increased in the last few years, more multi-omics data have been generated, that is, high-dimensional molecular data consisting of several types such as genomic, transcriptomic, or proteomic data, all obtained from the same patients. Such data lend themselves to being used as covariates in automatic outcome prediction because each omics type may contribute unique information, possibly improving predictions compared to using only one omics data type. Frequently, however, in the training data and the data to which automatic prediction rules should be applied, the test data, the different omics data types are not available for all patients. We refer to this type of data as block-wise missing multi-omics data. First, we provide a literature review on existing prediction methods applicable to such data. Subsequently, using a collection of 13 publicly available multi-omics data sets, we compare the predictive performances of several of these approaches for different block-wise missingness patterns. Finally, we discuss the results of this empirical comparison study and draw some tentative conclusions.
    Intrinsic Rewards from Self-Organizing Feature Maps for Exploration in Reinforcement Learning. (arXiv:2302.04125v1 [cs.LG])
    We introduce an exploration bonus for deep reinforcement learning methods calculated using self-organising feature maps. Our method uses adaptive resonance theory (ART) providing online, unsupervised clustering to quantify the novelty of a state. This heuristic is used to add an intrinsic reward to the extrinsic reward signal for then to optimize the agent to maximize the sum of these two rewards. We find that this method was able to play the game Ordeal at a human level after a comparable number of training epochs to ICM arXiv:1705.05464. Agents augmented with RND arXiv:1810.12894 were unable to achieve the same level of performance in our space of hyperparameters.
    Robustness to Spurious Correlations Improves Semantic Out-of-Distribution Detection. (arXiv:2302.04132v1 [cs.LG])
    Methods which utilize the outputs or feature representations of predictive models have emerged as promising approaches for out-of-distribution (OOD) detection of image inputs. However, these methods struggle to detect OOD inputs that share nuisance values (e.g. background) with in-distribution inputs. The detection of shared-nuisance out-of-distribution (SN-OOD) inputs is particularly relevant in real-world applications, as anomalies and in-distribution inputs tend to be captured in the same settings during deployment. In this work, we provide a possible explanation for SN-OOD detection failures and propose nuisance-aware OOD detection to address them. Nuisance-aware OOD detection substitutes a classifier trained via empirical risk minimization and cross-entropy loss with one that 1. is trained under a distribution where the nuisance-label relationship is broken and 2. yields representations that are independent of the nuisance under this distribution, both marginally and conditioned on the label. We can train a classifier to achieve these objectives using Nuisance-Randomized Distillation (NuRD), an algorithm developed for OOD generalization under spurious correlations. Output- and feature-based nuisance-aware OOD detection perform substantially better than their original counterparts, succeeding even when detection based on domain generalization algorithms fails to improve performance.
    Concept Algebra for Text-Controlled Vision Models. (arXiv:2302.03693v1 [cs.CL])
    This paper concerns the control of text-guided generative models, where a user provides a natural language prompt and the model generates samples based on this input. Prompting is intuitive, general, and flexible. However, there are significant limitations: prompting can fail in surprising ways, and it is often unclear how to find a prompt that will elicit some desired target behavior. A core difficulty for developing methods to overcome these issues is that failures are know-it-when-you-see-it -- it's hard to fix bugs if you can't state precisely what the model should have done! In this paper, we introduce a formalization of "what the user intended" in terms of latent concepts implicit to the data generating process that the model was trained on. This formalization allows us to identify some fundamental limitations of prompting. We then use the formalism to develop concept algebra to overcome these limitations. Concept algebra is a way of directly manipulating the concepts expressed in the output through algebraic operations on a suitably defined representation of input prompts. We give examples using concept algebra to overcome limitations of prompting, including concept transfer through arithmetic, and concept nullification through projection. Code available at https://github.com/zihao12/concept-algebra.
    Eliciting User Preferences for Personalized Multi-Objective Decision Making through Comparative Feedback. (arXiv:2302.03805v1 [cs.LG])
    In classic reinforcement learning (RL) and decision making problems, policies are evaluated with respect to a scalar reward function, and all optimal policies are the same with regards to their expected return. However, many real-world problems involve balancing multiple, sometimes conflicting, objectives whose relative priority will vary according to the preferences of each user. Consequently, a policy that is optimal for one user might be sub-optimal for another. In this work, we propose a multi-objective decision making framework that accommodates different user preferences over objectives, where preferences are learned via policy comparisons. Our model consists of a Markov decision process with a vector-valued reward function, with each user having an unknown preference vector that expresses the relative importance of each objective. The goal is to efficiently compute a near-optimal policy for a given user. We consider two user feedback models. We first address the case where a user is provided with two policies and returns their preferred policy as feedback. We then move to a different user feedback model, where a user is instead provided with two small weighted sets of representative trajectories and selects the preferred one. In both cases, we suggest an algorithm that finds a nearly optimal policy for the user using a small number of comparison queries.
    Towards causally linking architectural parametrizations to algorithmic bias in neural networks. (arXiv:2302.03750v1 [cs.CV])
    Training dataset biases are by far the most scrutinized factors when explaining algorithmic biases of neural networks. In contrast, hyperparameters related to the neural network architecture, e.g., the number of layers or choice of activation functions, have largely been ignored even though different network parameterizations are known to induce different implicit biases over learned features. For example, convolutional kernel size has been shown to bias CNNs towards different frequencies. In order to study the effect of these hyperparameters, we designed a causal framework for linking an architectural hyperparameter to algorithmic bias. Our framework is experimental, in that several versions of a network are trained with an intervention to a specific hyperparameter, and the resulting causal effect of this choice on performance bias is measured. We focused on the causal relationship between sensitivity to high-frequency image details and face analysis classification performance across different subpopulations (race/gender). In this work, we show that modifying a CNN hyperparameter (convolutional kernel size), even in one layer of a CNN, will not only change a fundamental characteristic of the learned features (frequency content) but that this change can vary significantly across data subgroups (race/gender populations) leading to biased generalization performance even in the presence of a balanced dataset.
    GraphGUIDE: interpretable and controllable conditional graph generation with discrete Bernoulli diffusion. (arXiv:2302.03790v1 [cs.LG])
    Diffusion models achieve state-of-the-art performance in generating realistic objects and have been successfully applied to images, text, and videos. Recent work has shown that diffusion can also be defined on graphs, including graph representations of drug-like molecules. Unfortunately, it remains difficult to perform conditional generation on graphs in a way which is interpretable and controllable. In this work, we propose GraphGUIDE, a novel framework for graph generation using diffusion models, where edges in the graph are flipped or set at each discrete time step. We demonstrate GraphGUIDE on several graph datasets, and show that it enables full control over the conditional generation of arbitrary structural properties without relying on predefined labels. Our framework for graph diffusion can have a large impact on the interpretable conditional generation of graphs, including the generation of drug-like molecules with desired properties in a way which is informed by experimental evidence.
    Toward a Theory of Causation for Interpreting Neural Code Models. (arXiv:2302.03788v1 [cs.SE])
    Neural Language Models of Code, or Neural Code Models (NCMs), are rapidly progressing from research prototypes to commercial developer tools. As such, understanding the capabilities and limitations of such models is becoming critical. However, the abilities of these models are typically measured using automated metrics that often only reveal a portion of their real-world performance. While, in general, the performance of NCMs appears promising, currently much is unknown about how such models arrive at decisions. To this end, this paper introduces $do_{code}$, a post-hoc interpretability methodology specific to NCMs that is capable of explaining model predictions. $do_{code}$ is based upon causal inference to enable programming language-oriented explanations. While the theoretical underpinnings of $do_{code}$ are extensible to exploring different model properties, we provide a concrete instantiation that aims to mitigate the impact of spurious correlations by grounding explanations of model behavior in properties of programming languages. To demonstrate the practical benefit of $do_{code}$, we illustrate the insights that our framework can provide by performing a case study on two popular deep learning architectures and nine NCMs. The results of this case study illustrate that our studied NCMs are sensitive to changes in code syntax and statistically learn to predict tokens related to blocks of code (e.g., brackets, parenthesis, semicolon) with less confounding bias as compared to other programming language constructs. These insights demonstrate the potential of $do_{code}$ as a useful model debugging mechanism that may aid in discovering biases and limitations in NCMs.
    The XPRESS Challenge: Xray Projectomic Reconstruction -- Extracting Segmentation with Skeletons. (arXiv:2302.03819v1 [cs.CV])
    The wiring and connectivity of neurons form a structural basis for the function of the nervous system. Advances in volume electron microscopy (EM) and image segmentation have enabled mapping of circuit diagrams (connectomics) within local regions of the mouse brain. However, applying volume EM over the whole brain is not currently feasible due to technological challenges. As a result, comprehensive maps of long-range connections between brain regions are lacking. Recently, we demonstrated that X-ray holographic nanotomography (XNH) can provide high-resolution images of brain tissue at a much larger scale than EM. In particular, XNH is wellsuited to resolve large, myelinated axon tracts (white matter) that make up the bulk of long-range connections (projections) and are critical for inter-region communication. Thus, XNH provides an imaging solution for brain-wide projectomics. However, because XNH data is typically collected at lower resolutions and larger fields-of-view than EM, accurate segmentation of XNH images remains an important challenge that we present here. In this task, we provide volumetric XNH images of cortical white matter axons from the mouse brain along with ground truth annotations for axon trajectories. Manual voxel-wise annotation of ground truth is a time-consuming bottleneck for training segmentation networks. On the other hand, skeleton-based ground truth is much faster to annotate, and sufficient to determine connectivity. Therefore, we encourage participants to develop methods to leverage skeleton-based training. To this end, we provide two types of ground-truth annotations: a small volume of voxel-wise annotations and a larger volume with skeleton-based annotations. Entries will be evaluated on how accurately the submitted segmentations agree with the ground-truth skeleton annotations.
    AI and Core Electoral Processes: Mapping the Horizons. (arXiv:2302.03774v1 [cs.CY])
    Significant enthusiasm around AI uptake has been witnessed across societies globally. The electoral process -- the time, place and manner of elections within democratic nations -- has been among those very rare sectors in which AI has not penetrated much. Electoral management bodies in many countries have recently started exploring and deliberating over the use of AI in the electoral process. In this paper, we consider five representative avenues within the core electoral process which have potential for AI usage, and map the challenges involved in using AI within them. These five avenues are: voter list maintenance, determining polling booth locations, polling booth protection processes, voter authentication and video monitoring of elections. Within each of these avenues, we lay down the context, illustrate current or potential usage of AI, and discuss extant or potential ramifications of AI usage, and potential directions for mitigating risks while considering AI usage. We believe that the scant current usage of AI within electoral processes provides a very rare opportunity, that of being able to deliberate on the risks and mitigation possibilities, prior to real and widespread AI deployment. This paper is an attempt to map the horizons of risks and opportunities in using AI within the electoral processes and to help shape the debate around the topic.
    Participatory Systems for Personalized Prediction. (arXiv:2302.03874v1 [cs.LG])
    Machine learning models are often personalized based on information that is protected, sensitive, self-reported, or costly to acquire. These models use information about people, but do not facilitate nor inform their \emph{consent}. Individuals cannot opt out of reporting information that a model needs to personalize their predictions, nor tell if they would benefit from personalization in the first place. In this work, we introduce a new family of prediction models, called \emph{participatory systems}, that allow individuals to opt into personalization at prediction time. We present a model-agnostic algorithm to learn participatory systems for supervised learning tasks where models are personalized with categorical group attributes. We conduct a comprehensive empirical study of participatory systems in clinical prediction tasks, comparing them to common approaches for personalization and imputation. Our results demonstrate that participatory systems can facilitate and inform consent in a way that improves performance and privacy across all groups who report personal data.
    Data-driven Protection of Transformers, Phase Angle Regulators, and Transmission Lines in Interconnected Power Systems. (arXiv:2302.03826v1 [eess.SP])
    This dissertation highlights the growing interest in and adoption of machine learning (ML) approaches for fault detection in modern power grids. Once a fault has occurred, it must be identified quickly and preventative steps must be taken to remove or insulate it. As a result, detecting, locating, and classifying faults early and accurately can improve safety and dependability while reducing downtime and hardware damage. ML-based solutions and tools to carry out effective data processing and analysis to aid power system operations and decision-making are becoming preeminent with better system condition awareness and data availability. Power transformers, Phase Shift Transformers or Phase Angle Regulators, and transmission lines are critical components in power systems, and ensuring their safety is a primary issue. Differential relays are commonly employed to protect transformers, whereas distance relays are utilized to protect transmission lines. Magnetizing inrush, overexcitation, and current transformer saturation make transformer protection a challenge. Furthermore, non-standard phase shift, series core saturation, low turn-to-turn, and turn-to-ground fault currents are non-traditional problems associated with Phase Angle Regulators. Faults during symmetrical power swings and unstable power swings may cause mal-operation of distance relays and unintentional and uncontrolled islanding. The distance relays also mal-operate for transmission lines connected to type-3 wind farms. The conventional protection techniques would no longer be adequate to address the above challenges due to limitations in handling and analyzing massive amounts of data, limited generalizability, incapability to model non-linear systems, etc. These limitations of differential and distance protection methods bring forward the motivation of using ML in addressing various protection challenges.
    Persuading a Behavioral Agent: Approximately Best Responding and Learning. (arXiv:2302.03719v1 [cs.GT])
    The classic Bayesian persuasion model assumes a Bayesian and best-responding receiver. We study a relaxation of the Bayesian persuasion model where the receiver can approximately best respond to the sender's signaling scheme. We show that, under natural assumptions, (1) the sender can find a signaling scheme that guarantees itself an expected utility almost as good as its optimal utility in the classic model, no matter what approximately best-responding strategy the receiver uses; (2) on the other hand, there is no signaling scheme that gives the sender much more utility than its optimal utility in the classic model, even if the receiver uses the approximately best-responding strategy that is best for the sender. Together, (1) and (2) imply that the approximately best-responding behavior of the receiver does not affect the sender's maximal achievable utility a lot in the Bayesian persuasion problem. The proofs of both results rely on the idea of robustification of a Bayesian persuasion scheme: given a pair of the sender's signaling scheme and the receiver's strategy, we can construct another signaling scheme such that the receiver prefers to use that strategy in the new scheme more than in the original scheme, and the two schemes give the sender similar utilities. As an application of our main result (1), we show that, in a repeated Bayesian persuasion model where the receiver learns to respond to the sender by some algorithms, the sender can do almost as well as in the classic model. Interestingly, unlike (2), with a learning receiver the sender can sometimes do much better than in the classic model.
    Characterizing Financial Market Coverage using Artificial Intelligence. (arXiv:2302.03694v1 [q-fin.ST])
    This paper scrutinizes a database of over 4900 YouTube videos to characterize financial market coverage. Financial market coverage generates a large number of videos. Therefore, watching these videos to derive actionable insights could be challenging and complex. In this paper, we leverage Whisper, a speech-to-text model from OpenAI, to generate a text corpus of market coverage videos from Bloomberg and Yahoo Finance. We employ natural language processing to extract insights regarding language use from the market coverage. Moreover, we examine the prominent presence of trending topics and their evolution over time, and the impacts that some individuals and organizations have on the financial market. Our characterization highlights the dynamics of the financial market coverage and provides valuable insights reflecting broad discussions regarding recent financial events and the world economy.
    Federated Learning as Variational Inference: A Scalable Expectation Propagation Approach. (arXiv:2302.04228v1 [cs.LG])
    The canonical formulation of federated learning treats it as a distributed optimization problem where the model parameters are optimized against a global loss function that decomposes across client loss functions. A recent alternative formulation instead treats federated learning as a distributed inference problem, where the goal is to infer a global posterior from partitioned client data (Al-Shedivat et al., 2021). This paper extends the inference view and describes a variational inference formulation of federated learning where the goal is to find a global variational posterior that well-approximates the true posterior. This naturally motivates an expectation propagation approach to federated learning (FedEP), where approximations to the global posterior are iteratively refined through probabilistic message-passing between the central server and the clients. We conduct an extensive empirical study across various algorithmic considerations and describe practical strategies for scaling up expectation propagation to the modern federated setting. We apply FedEP on standard federated learning benchmarks and find that it outperforms strong baselines in terms of both convergence speed and accuracy.
    How to Trust Your Diffusion Model: A Convex Optimization Approach to Conformal Risk Control. (arXiv:2302.03791v1 [stat.ML])
    Score-based generative modeling, informally referred to as diffusion models, continue to grow in popularity across several important domains and tasks. While they provide high-quality and diverse samples from empirical distributions, important questions remain on the reliability and trustworthiness of these sampling procedures for their responsible use in critical scenarios. Conformal prediction is a modern tool to construct finite-sample, distribution-free uncertainty guarantees for any black-box predictor. In this work, we focus on image-to-image regression tasks and we present a generalization of the Risk-Controlling Prediction Sets (RCPS) procedure, that we term $K$-RCPS, which allows to $(i)$ provide entrywise calibrated intervals for future samples of any diffusion model, and $(ii)$ control a certain notion of risk with respect to a ground truth image with minimal mean interval length. Differently from existing conformal risk control procedures, ours relies on a novel convex optimization approach that allows for multidimensional risk control while provably minimizing the mean interval length. We illustrate our approach on two real-world image denoising problems: on natural images of faces as well as on computed tomography (CT) scans of the abdomen, demonstrating state of the art performance.
    Optimal Stochastic Non-smooth Non-convex Optimization through Online-to-Non-convex Conversion. (arXiv:2302.03775v1 [cs.LG])
    We present new algorithms for optimizing non-smooth, non-convex stochastic objectives based on a novel analysis technique. This improves the current best-known complexity for finding a $(\delta,\epsilon)$-stationary point from $O(\epsilon^{-4}\delta^{-1})$ stochastic gradient queries to $O(\epsilon^{-3}\delta^{-1})$, which we also show to be optimal. Our primary technique is a reduction from non-smooth non-convex optimization to online learning, after which our results follow from standard regret bounds in online learning. For deterministic and second-order smooth objectives, applying more advanced optimistic online learning techniques enables a new complexity of $O(\epsilon^{-1.5}\delta^{-0.5})$. Our techniques also recover all optimal or best-known results for finding $\epsilon$ stationary points of smooth or second-order smooth objectives in both stochastic and deterministic settings.
    Taming Local Effects in Graph-based Spatiotemporal Forecasting. (arXiv:2302.04071v1 [cs.LG])
    Spatiotemporal graph neural networks have shown to be effective in time series forecasting applications, achieving better performance than standard univariate predictors in several settings. These architectures take advantage of a graph structure and relational inductive biases to learn a single (global) inductive model to predict any number of the input time series, each associated with a graph node. Despite the gain achieved in computational and data efficiency w.r.t. fitting a set of local models, relying on a single global model can be a limitation whenever some of the time series are generated by a different spatiotemporal stochastic process. The main objective of this paper is to understand the interplay between globality and locality in graph-based spatiotemporal forecasting, while contextually proposing a methodological framework to rationalize the practice of including trainable node embeddings in such architectures. We ascribe to trainable node embeddings the role of amortizing the learning of specialized components. Moreover, embeddings allow for 1) effectively combining the advantages of shared message-passing layers with node-specific parameters and 2) efficiently transferring the learned model to new node sets. Supported by strong empirical evidence, we provide insights and guidelines for specializing graph-based models to the dynamics of each time series and show how this aspect plays a crucial role in obtaining accurate predictions.
    Policy Evaluation in Decentralized POMDPs with Belief Sharing. (arXiv:2302.04151v1 [cs.LG])
    Most works on multi-agent reinforcement learning focus on scenarios where the state of the environment is fully observable. In this work, we consider a cooperative policy evaluation task in which agents are not assumed to observe the environment state directly. Instead, agents can only have access to noisy observations and to belief vectors. It is well-known that finding global posterior distributions under multi-agent settings is generally NP-hard. As a remedy, we propose a fully decentralized belief forming strategy that relies on individual updates and on localized interactions over a communication network. In addition to the exchange of the beliefs, agents exploit the communication network by exchanging value function parameter estimates as well. We analytically show that the proposed strategy allows information to diffuse over the network, which in turn allows the agents' parameters to have a bounded difference with a centralized baseline. A multi-sensor target tracking application is considered in the simulations.
    A Survey on Event Prediction Methods from a Systems Perspective: Bringing Together Disparate Research Areas. (arXiv:2302.04018v1 [cs.AI])
    Event prediction is the ability of anticipating future events, i.e., future real-world occurrences, and aims to support the user in deciding on actions that change future events towards a desired state. An event prediction method learns the relation between features of past events and future events. It is applied to newly observed events to predict corresponding future events that are evaluated with respect to the user's desired future state. If the predicted future events do not comply with this state, actions are taken towards achieving desirable future states. Evidently, event prediction is valuable in many application domains such as business and natural disasters. The diversity of application domains results in a diverse range of methods that are scattered across various research areas which, in turn, use different terminology for event prediction methods. Consequently, sharing methods and knowledge for developing future event prediction methods is restricted. To facilitate knowledge sharing on account of a comprehensive classification, integration, and assessment of event prediction methods, we combine taxonomies and take a systems perspective to integrate event prediction methods into a single system, elicit requirements and assess existing work with respect to the requirements. Based on the assessment, we identify open challenges and discuss future research directions.
    Systematically Finding Security Vulnerabilities in Black-Box Code Generation Models. (arXiv:2302.04012v1 [cs.CR])
    Recently, large language models for code generation have achieved breakthroughs in several programming language tasks. Their advances in competition-level programming problems have made them an emerging pillar in AI-assisted pair programming. Tools such as GitHub Copilot are already part of the daily programming workflow and are used by more than a million developers. The training data for these models is usually collected from open-source repositories (e.g., GitHub) that contain software faults and security vulnerabilities. This unsanitized training data can lead language models to learn these vulnerabilities and propagate them in the code generation procedure. Given the wide use of these models in the daily workflow of developers, it is crucial to study the security aspects of these models systematically. In this work, we propose the first approach to automatically finding security vulnerabilities in black-box code generation models. To achieve this, we propose a novel black-box inversion approach based on few-shot prompting. We evaluate the effectiveness of our approach by examining code generation models in the generation of high-risk security weaknesses. We show that our approach automatically and systematically finds 1000s of security vulnerabilities in various code generation models, including the commercial black-box model GitHub Copilot.
    Leveraging User-Triggered Supervision in Contextual Bandits. (arXiv:2302.03784v1 [cs.LG])
    We study contextual bandit (CB) problems, where the user can sometimes respond with the best action in a given context. Such an interaction arises, for example, in text prediction or autocompletion settings, where a poor suggestion is simply ignored and the user enters the desired text instead. Crucially, this extra feedback is user-triggered on only a subset of the contexts. We develop a new framework to leverage such signals, while being robust to their biased nature. We also augment standard CB algorithms to leverage the signal, and show improved regret guarantees for the resulting algorithms under a variety of conditions on the helpfulness of and bias inherent in this feedback.
    A Systematic Performance Analysis of Deep Perceptual Loss Networks Breaks Transfer Learning Conventions. (arXiv:2302.04032v1 [cs.CV])
    Deep perceptual loss is a type of loss function in computer vision that aims to mimic human perception by using the deep features extracted from neural networks. In recent years the method has been applied to great effect on a host of interesting computer vision tasks, especially for tasks with image or image-like outputs. Many applications of the method use pretrained networks, often convolutional networks, for loss calculation. Despite the increased interest and broader use, more effort is needed toward exploring which networks to use for calculating deep perceptual loss and from which layers to extract the features. This work aims to rectify this by systematically evaluating a host of commonly used and readily available, pretrained networks for a number of different feature extraction points on four existing use cases of deep perceptual loss. The four use cases are implementations of previous works where the selected networks and extraction points are evaluated instead of the networks and extraction points used in the original work. The experimental tasks are dimensionality reduction, image segmentation, super-resolution, and perceptual similarity. The performance on these four tasks, attributes of the networks, and extraction points are then used as a basis for an in-depth analysis. This analysis uncovers essential information regarding which architectures provide superior performance for deep perceptual loss and how to choose an appropriate extraction point for a particular task and dataset. Furthermore, the work discusses the implications of the results for deep perceptual loss and the broader field of transfer learning. The results break commonly held assumptions in transfer learning, which imply that deep perceptual loss deviates from most transfer learning settings or that these assumptions need a thorough re-evaluation.
    Futuristic Variations and Analysis in Fundus Images Corresponding to Biological Traits. (arXiv:2302.03839v1 [eess.IV])
    Fundus image captures rear of an eye, and which has been studied for the diseases identification, classification, segmentation, generation, and biological traits association using handcrafted, conventional, and deep learning methods. In biological traits estimation, most of the studies have been carried out for the age prediction and gender classification with convincing results. However, the current study utilizes the cutting-edge deep learning (DL) algorithms to estimate biological traits in terms of age and gender together with associating traits to retinal visuals. For the traits association, our study embeds aging as the label information into the proposed DL model to learn knowledge about the effected regions with aging. Our proposed DL models, named FAG-Net and FGC-Net, correspondingly estimate biological traits (age and gender) and generates fundus images. FAG-Net can generate multiple variants of an input fundus image given a list of ages as conditions. Our study analyzes fundus images and their corresponding association with biological traits, and predicts of possible spreading of ocular disease on fundus images given age as condition to the generative model. Our proposed models outperform the randomly selected state of-the-art DL models.
    DDeMON: Ontology-based function prediction by Deep Learning from Dynamic Multiplex Networks. (arXiv:2302.03907v1 [q-bio.GN])
    Biological systems can be studied at multiple levels of information, including gene, protein, RNA and different interaction networks levels. The goal of this work is to explore how the fusion of systems' level information with temporal dynamics of gene expression can be used in combination with non-linear approximation power of deep neural networks to predict novel gene functions in a non-model organism potato \emph{Solanum tuberosum}. We propose DDeMON (Dynamic Deep learning from temporal Multiplex Ontology-annotated Networks), an approach for scalable, systems-level inference of function annotation using time-dependent multiscale biological information. The proposed method, which is capable of considering billions of potential links between the genes of interest, was applied on experimental gene expression data and the background knowledge network to reliably classify genes with unknown function into five different functional ontology categories, linked to the experimental data set. Predicted novel functions of genes were validated using extensive protein domain search approach.
    Investigating the role of model-based learning in exploration and transfer. (arXiv:2302.04009v1 [cs.LG])
    State of the art reinforcement learning has enabled training agents on tasks of ever increasing complexity. However, the current paradigm tends to favor training agents from scratch on every new task or on collections of tasks with a view towards generalizing to novel task configurations. The former suffers from poor data efficiency while the latter is difficult when test tasks are out-of-distribution. Agents that can effectively transfer their knowledge about the world pose a potential solution to these issues. In this paper, we investigate transfer learning in the context of model-based agents. Specifically, we aim to understand when exactly environment models have an advantage and why. We find that a model-based approach outperforms controlled model-free baselines for transfer learning. Through ablations, we show that both the policy and dynamics model learnt through exploration matter for successful transfer. We demonstrate our results across three domains which vary in their requirements for transfer: in-distribution procedural (Crafter), in-distribution identical (RoboDesk), and out-of-distribution (Meta-World). Our results show that intrinsic exploration combined with environment models present a viable direction towards agents that are self-supervised and able to generalize to novel reward functions.
    Analyzing the Performance of Deep Encoder-Decoder Networks as Surrogates for a Diffusion Equation. (arXiv:2302.03786v1 [cs.LG])
    Neural networks (NNs) have proven to be a viable alternative to traditional direct numerical algorithms, with the potential to accelerate computational time by several orders of magnitude. In the present paper we study the use of encoder-decoder convolutional neural network (CNN) as surrogates for steady-state diffusion solvers. The construction of such surrogates requires the selection of an appropriate task, network architecture, training set structure and size, loss function, and training algorithm hyperparameters. It is well known that each of these factors can have a significant impact on the performance of the resultant model. Our approach employs an encoder-decoder CNN architecture, which we posit is particularly well-suited for this task due to its ability to effectively transform data, as opposed to merely compressing it. We systematically evaluate a range of loss functions, hyperparameters, and training set sizes. Our results indicate that increasing the size of the training set has a substantial effect on reducing performance fluctuations and overall error. Additionally, we observe that the performance of the model exhibits a logarithmic dependence on the training set size. Furthermore, we investigate the effect on model performance by using different subsets of data with varying features. Our results highlight the importance of sampling the configurational space in an optimal manner, as this can have a significant impact on the performance of the model and the required training time. In conclusion, our results suggest that training a model with a pre-determined error performance bound is not a viable approach, as it does not guarantee that edge cases with errors larger than the bound do not exist. Furthermore, as most surrogate tasks involve a high dimensional landscape, an ever increasing training set size is, in principle, needed, however it is not a practical solution.
    A Weighted Normalized Boundary Loss for Reducing the Hausdorff Distance in Medical Imaging Segmentation. (arXiv:2302.03868v1 [eess.IV])
    Within medical imaging segmentation, the Dice coefficient and Hausdorff-based metrics are standard measures of success for deep learning models. However, modern loss functions for medical image segmentation often only consider the Dice coefficient or similar region-based metrics during training. As a result, segmentation architectures trained over such loss functions run the risk of achieving high accuracy for the Dice coefficient but low accuracy for Hausdorff-based metrics. Low accuracy on Hausdorff-based metrics can be problematic for applications such as tumor segmentation, where such benchmarks are crucial. For example, high Dice scores accompanied by significant Hausdorff errors could indicate that the predictions fail to detect small tumors. We propose the Weighted Normalized Boundary Loss, a novel loss function to minimize Hausdorff-based metrics with more desirable numerical properties than current methods and with weighting terms for class imbalance. Our loss function outperforms other losses when tested on the BraTS dataset using a standard 3D U-Net and the state-of-the-art nnUNet architectures. These results suggest we can improve segmentation accuracy with our novel loss function.
    SLaM: Student-Label Mixing for Semi-Supervised Knowledge Distillation. (arXiv:2302.03806v1 [cs.LG])
    Semi-supervised knowledge distillation is a powerful training paradigm for generating compact and lightweight student models in settings where the amount of labeled data is limited but one has access to a large pool of unlabeled data. The idea is that a large teacher model is utilized to generate ``smoothed'' pseudo-labels for the unlabeled dataset which are then used for training the student model. Despite its success in a wide variety of applications, a shortcoming of this approach is that the teacher's pseudo-labels are often noisy, leading to impaired student performance. In this paper, we present a principled method for semi-supervised knowledge distillation that we call Student-Label Mixing (SLaM) and we show that it consistently improves over prior approaches by evaluating it on several standard benchmarks. Finally, we show that SLaM comes with theoretical guarantees; along the way we give an algorithm improving the best-known sample complexity for learning halfspaces with margin under random classification noise, and provide the first convergence analysis for so-called ``forward loss-adjustment" methods.
    Does the End Justify the Means? On the Moral Justification of Fairness-Aware Machine Learning. (arXiv:2202.08536v2 [cs.LG] UPDATED)
    Fairness-aware machine learning (fair-ml) techniques are algorithmic interventions designed to ensure that individuals who are affected by the predictions of a machine learning model are treated fairly, typically measured in terms of a quantitative fairness metric. Despite the multitude of fairness metrics and fair-ml algorithms, there is still little guidance on the suitability of different approaches in practice. In this paper, we present a framework for moral reasoning about the justification of fairness metrics and explore the moral implications of the use of fair-ml algorithms that optimize for them. In particular, we argue that whether a distribution of outcomes is fair, depends not only on the cause of inequalities but also on what moral claims decision subjects have to receive a particular benefit or avoid a burden. We use our framework to analyze the suitability of two fairness metrics under different circumstances. Subsequently, we explore moral arguments that support or reject the use of the fair-ml algorithm introduced by Hardt et al. (2016). We argue that under very specific circumstances, particular metrics correspond to a fair distribution of burdens and benefits. However, we also illustrate that enforcing a fairness metric by means of a fair-ml algorithm may not result in the fair distribution of outcomes and can have several undesirable side effects. We end with a call for a more holistic evaluation of fair-ml algorithms, beyond their direct optimization objectives.
    Finding Short Signals in Long Irregular Time Series with Continuous-Time Attention Policy Networks. (arXiv:2302.04052v1 [cs.LG])
    Irregularly-sampled time series (ITS) are native to high-impact domains like healthcare, where measurements are collected over time at uneven intervals. However, for many classification problems, only small portions of long time series are often relevant to the class label. In this case, existing ITS models often fail to classify long series since they rely on careful imputation, which easily over- or under-samples the relevant regions. Using this insight, we then propose CAT, a model that classifies multivariate ITS by explicitly seeking highly-relevant portions of an input series' timeline. CAT achieves this by integrating three components: (1) A Moment Network learns to seek relevant moments in an ITS's continuous timeline using reinforcement learning. (2) A Receptor Network models the temporal dynamics of both observations and their timing localized around predicted moments. (3) A recurrent Transition Model models the sequence of transitions between these moments, cultivating a representation with which the series is classified. Using synthetic and real data, we find that CAT outperforms ten state-of-the-art methods by finding short signals in long irregular time series.
    Non-zero-sum Game Control for Multi-vehicle Driving via Reinforcement Learning. (arXiv:2302.03958v1 [cs.AI])
    When a vehicle drives on the road, its behaviors will be affected by surrounding vehicles. Prediction and decision should not be considered as two separate stages because all vehicles make decisions interactively. This paper constructs the multi-vehicle driving scenario as a non-zero-sum game and proposes a novel game control framework, which consider prediction, decision and control as a whole. The mutual influence of interactions between vehicles is considered in this framework because decisions are made by Nash equilibrium strategy. To efficiently obtain the strategy, ADP, a model-based reinforcement learning method, is used to solve coupled Hamilton-Jacobi-Bellman equations. Driving performance is evaluated by tracking, efficiency, safety and comfort indices. Experiments show that our algorithm could drive perfectly by directly controlling acceleration and steering angle. Vehicles could learn interactive behaviors such as overtaking and pass. In summary, we propose a non-zero-sum game framework for modeling multi-vehicle driving, provide an effective way to solve the Nash equilibrium driving strategy, and validate at non-signalized intersections.
  • Open

    Zero-shot Generation of Coherent Storybook from Plain Text Story using Diffusion Models. (arXiv:2302.03900v1 [cs.CV])
    Recent advancements in large scale text-to-image models have opened new possibilities for guiding the creation of images through human-devised natural language. However, while prior literature has primarily focused on the generation of individual images, it is essential to consider the capability of these models to ensure coherency within a sequence of images to fulfill the demands of real-world applications such as storytelling. To address this, here we present a novel neural pipeline for generating a coherent storybook from the plain text of a story. Specifically, we leverage a combination of a pre-trained Large Language Model and a text-guided Latent Diffusion Model to generate coherent images. While previous story synthesis frameworks typically require a large-scale text-to-image model trained on expensive image-caption pairs to maintain the coherency, we employ simple textual inversion techniques along with detector-based semantic image editing which allows zero-shot generation of the coherent storybook. Experimental results show that our proposed method outperforms state-of-the-art image editing baselines.  ( 2 min )
    Bandwidth Selection for Gaussian Kernel Ridge Regression via Jacobian Control. (arXiv:2205.11956v2 [stat.ML] UPDATED)
    Most machine learning methods require tuning of hyper-parameters. For kernel ridge regression (KRR) with the Gaussian kernel, the hyper-parameter is the bandwidth. The bandwidth specifies the length-scale of the kernel and has to be carefully selected in order to obtain a model with good generalization. The default method for bandwidth selection is cross-validation which often yields good results, albeit at high computational costs. Furthermore, the estimates provided by cross-validation tend to have very high variance, especially when training data are scarce. Inspired by Jacobian regularization, we formulate an approximate expression for how the derivatives of the functions inferred by KRR with the Gaussian kernel depend on the kernel bandwidth. We then use this expression to propose a closed-form, computationally feather-light, bandwidth selection heuristic based on controlling the Jacobian. In addition, the Jacobian expression illuminates how the bandwidth selection is a trade-off between the smoothness of the inferred function, and the conditioning of the training data kernel matrix. We show on real and synthetic data that compared to cross-validation, our method is considerably more stable in terms of bandwidth selection, and, for small data sets, provides better predictions.
    Reactmine: a statistical search algorithm for inferring chemical reactions from time series data. (arXiv:2209.03185v2 [q-bio.QM] UPDATED)
    Inferring chemical reaction networks (CRN) from concentration time series is a challenge encouragedby the growing availability of quantitative temporal data at the cellular level. This motivates thedesign of algorithms to infer the preponderant reactions between the molecular species observed ina given biochemical process, and build CRN structure and kinetics models. Existing ODE-basedinference methods such as SINDy resort to least square regression combined with sparsity-enforcingpenalization, such as Lasso. However, we observe that these methods fail to learn sparse modelswhen the input time series are only available in wild type conditions, i.e. without the possibility toplay with combinations of zeroes in the initial conditions. We present a CRN inference algorithmwhich enforces sparsity by inferring reactions in a sequential fashion within a search tree of boundeddepth, ranking the inferred reaction candidates according to the variance of their kinetics on theirsupporting transitions, and re-optimizing the kinetic parameters of the CRN candidates on the wholetrace in a final pass. We show that Reactmine succeeds both on simulation data by retrievinghidden CRNs where SINDy fails, and on two real datasets, one of fluorescence videomicroscopyof cell cycle and circadian clock markers, the other one of biomedical measurements of systemiccircadian biomarkers possibly acting on clock gene expression in peripheral organs, by inferringpreponderant regulations in agreement with previous model-based analyses. The code is available athttps://gitlab.inria.fr/julmarti/crninf/ together with introductory notebooks.
    Transformers Can Do Bayesian Inference. (arXiv:2112.10510v6 [cs.LG] UPDATED)
    Currently, it is hard to reap the benefits of deep learning for Bayesian methods, which allow the explicit specification of prior knowledge and accurately capture model uncertainty. We present Prior-Data Fitted Networks (PFNs). PFNs leverage large-scale machine learning techniques to approximate a large set of posteriors. The only requirement for PFNs to work is the ability to sample from a prior distribution over supervised learning tasks (or functions). Our method restates the objective of posterior approximation as a supervised classification problem with a set-valued input: it repeatedly draws a task (or function) from the prior, draws a set of data points and their labels from it, masks one of the labels and learns to make probabilistic predictions for it based on the set-valued input of the rest of the data points. Presented with a set of samples from a new supervised learning task as input, PFNs make probabilistic predictions for arbitrary other data points in a single forward propagation, having learned to approximate Bayesian inference. We demonstrate that PFNs can near-perfectly mimic Gaussian processes and also enable efficient Bayesian inference for intractable problems, with over 200-fold speedups in multiple setups compared to current methods. We obtain strong results in very diverse areas such as Gaussian process regression, Bayesian neural networks, classification for small tabular data sets, and few-shot image classification, demonstrating the generality of PFNs. Code and trained PFNs are released at https://github.com/automl/TransformersCanDoBayesianInference.
    Relative Probability on Finite Outcome Spaces: A Systematic Examination of its Axiomatization, Properties, and Applications. (arXiv:2212.14555v2 [stat.ML] UPDATED)
    This work proposes a view of probability as a relative measure rather than an absolute one. To demonstrate this concept, we focus on finite outcome spaces and develop three fundamental axioms that establish requirements for relative probability functions. We then provide a library of examples of these functions and a system for composing them. Additionally, we discuss a relative version of Bayesian inference and its digital implementation. Finally, we prove the topological closure of the relative probability space, highlighting its ability to preserve information under limits.
    Revisiting the Linear-Programming Framework for Offline RL with General Function Approximation. (arXiv:2212.13861v2 [cs.LG] UPDATED)
    Offline reinforcement learning (RL) aims to find an optimal policy for sequential decision-making using a pre-collected dataset, without further interaction with the environment. Recent theoretical progress has focused on developing sample-efficient offline RL algorithms with various relaxed assumptions on data coverage and function approximators, especially to handle the case with excessively large state-action spaces. Among them, the framework based on the linear-programming (LP) reformulation of Markov decision processes has shown promise: it enables sample-efficient offline RL with function approximation, under only partial data coverage and realizability assumptions on the function classes, with favorable computational tractability. In this work, we revisit the LP framework for offline RL, and provide a new reformulation that advances the existing results in several aspects, relaxing certain assumptions and achieving optimal statistical rates in terms of sample size. Our key enabler is to introduce proper constraints in the reformulation, instead of using any regularization as in the literature, also with careful choices of the function classes and initial state distributions. We hope our insights bring into light the use of LP formulations and the induced primal-dual minimax optimization, in offline RL.
    The Modern Mathematics of Deep Learning. (arXiv:2105.04026v2 [cs.LG] UPDATED)
    We describe the new field of mathematical analysis of deep learning. This field emerged around a list of research questions that were not answered within the classical framework of learning theory. These questions concern: the outstanding generalization power of overparametrized neural networks, the role of depth in deep architectures, the apparent absence of the curse of dimensionality, the surprisingly successful optimization performance despite the non-convexity of the problem, understanding what features are learned, why deep architectures perform exceptionally well in physical problems, and which fine aspects of an architecture affect the behavior of a learning task in which way. We present an overview of modern approaches that yield partial answers to these questions. For selected approaches, we describe the main ideas in more detail.
    Rank-1 Matrix Completion with Gradient Descent and Small Random Initialization. (arXiv:2212.09396v2 [stat.ML] UPDATED)
    The nonconvex formulation of matrix completion problem has received significant attention in recent years due to its affordable complexity compared to the convex formulation. Gradient descent (GD) is the simplest yet efficient baseline algorithm for solving nonconvex optimization problems. The success of GD has been witnessed in many different problems in both theory and practice when it is combined with random initialization. However, previous works on matrix completion require either careful initialization or regularizers to prove the convergence of GD. In this work, we study the rank-1 symmetric matrix completion and prove that GD converges to the ground truth when small random initialization is used. We show that in logarithmic amount of iterations, the trajectory enters the region where local convergence occurs. We provide an upper bound on the initialization size that is sufficient to guarantee the convergence and show that a larger initialization can be used as more samples are available. We observe that implicit regularization effect of GD plays a critical role in the analysis, and for the entire trajectory, it prevents each entry from becoming much larger than the others.
    Boundary Graph Neural Networks for 3D Simulations. (arXiv:2106.11299v5 [cs.LG] UPDATED)
    The abundance of data has given machine learning considerable momentum in natural sciences and engineering, though modeling of physical processes is often difficult. A particularly tough problem is the efficient representation of geometric boundaries. Triangularized geometric boundaries are well understood and ubiquitous in engineering applications. However, it is notoriously difficult to integrate them into machine learning approaches due to their heterogeneity with respect to size and orientation. In this work, we introduce an effective theory to model particle-boundary interactions, which leads to our new Boundary Graph Neural Networks (BGNNs) that dynamically modify graph structures to obey boundary conditions. The new BGNNs are tested on complex 3D granular flow processes of hoppers, rotating drums and mixers, which are all standard components of modern industrial machinery but still have complicated geometry. BGNNs are evaluated in terms of computational efficiency as well as prediction accuracy of particle flows and mixing entropies. BGNNs are able to accurately reproduce 3D granular flows within simulation uncertainties over hundreds of thousands of simulation timesteps. Most notably, in our experiments, particles stay within the geometric objects without using handcrafted conditions or restrictions.
    Riemannian block SPD coupling manifold and its application to optimal transport. (arXiv:2201.12933v2 [math.FA] UPDATED)
    In this work, we study the optimal transport (OT) problem between symmetric positive definite (SPD) matrix-valued measures. We formulate the above as a generalized optimal transport problem where the cost, the marginals, and the coupling are represented as block matrices and each component block is a SPD matrix. The summation of row blocks and column blocks in the coupling matrix are constrained by the given block-SPD marginals. We endow the set of such block-coupling matrices with a novel Riemannian manifold structure. This allows to exploit the versatile Riemannian optimization framework to solve generic SPD matrix-valued OT problems. We illustrate the usefulness of the proposed approach in several applications.
    Identify ambiguous tasks combining crowdsourced labels by weighting Areas Under the Margin. (arXiv:2209.15380v2 [cs.LG] UPDATED)
    In supervised learning - for instance in image classification - modern massive datasets are commonly labeled by a crowd of workers. The obtained labels in this crowdsourcing setting are then aggregated for training. The aggregation step generally leverages a per-worker trust score. Yet, such worker-centric approaches discard each task's ambiguity. Some intrinsically ambiguous tasks might even fool expert workers, which could eventually be harmful to the learning step. In a standard supervised learning setting - with one label per task - the Area Under the Margin (AUM) is tailored to identify mislabeled data. We adapt the AUM to identify ambiguous tasks in crowdsourced learning scenarios, introducing the Weighted AUM (WAUM). The WAUM is an average of AUMs weighted by task-dependent scores. We show that the WAUM can help discard ambiguous tasks from the training set, leading to better generalization or calibration performance. We report improvements over existing strategies for learning a crowd, both for simulated settings and for the CIFAR-10H, LabelMe and Music crowdsourced datasets.
    Fast Linear Model Trees by PILOT. (arXiv:2302.03931v1 [stat.ML])
    Linear model trees are regression trees that incorporate linear models in the leaf nodes. This preserves the intuitive interpretation of decision trees and at the same time enables them to better capture linear relationships, which is hard for standard decision trees. But most existing methods for fitting linear model trees are time consuming and therefore not scalable to large data sets. In addition, they are more prone to overfitting and extrapolation issues than standard regression trees. In this paper we introduce PILOT, a new algorithm for linear model trees that is fast, regularized, stable and interpretable. PILOT trains in a greedy fashion like classic regression trees, but incorporates an $L^2$ boosting approach and a model selection rule for fitting linear models in the nodes. The abbreviation PILOT stands for $PI$ecewise $L$inear $O$rganic $T$ree, where `organic' refers to the fact that no pruning is carried out. PILOT has the same low time and space complexity as CART without its pruning. An empirical study indicates that PILOT tends to outperform standard decision trees and other linear model trees on a variety of data sets. Moreover, we prove its consistency in an additive model setting under weak assumptions. When the data is generated by a linear model, the convergence rate is polynomial.
    Federated Minimax Optimization with Client Heterogeneity. (arXiv:2302.04249v1 [cs.LG])
    Minimax optimization has seen a surge in interest with the advent of modern applications such as GANs, and it is inherently more challenging than simple minimization. The difficulty is exacerbated by the training data residing at multiple edge devices or \textit{clients}, especially when these clients can have heterogeneous datasets and local computation capabilities. We propose a general federated minimax optimization framework that subsumes such settings and several existing methods like Local SGDA. We show that naive aggregation of heterogeneous local progress results in optimizing a mismatched objective function -- a phenomenon previously observed in standard federated minimization. To fix this problem, we propose normalizing the client updates by the number of local steps undertaken between successive communication rounds. We analyze the convergence of the proposed algorithm for classes of nonconvex-concave and nonconvex-nonconcave functions and characterize the impact of heterogeneous client data, partial client participation, and heterogeneous local computations. Our analysis works under more general assumptions on the intra-client noise and inter-client heterogeneity than so far considered in the literature. For all the function classes considered, we significantly improve the existing computation and communication complexity results. Experimental results support our theoretical claims.
    Inverse Models for Estimating the Initial Condition of Spatio-Temporal Advection-Diffusion Processes. (arXiv:2302.04134v1 [stat.ME])
    Inverse problems involve making inference about unknown parameters of a physical process using observational data. This paper investigates an important class of inverse problems -- the estimation of the initial condition of a spatio-temporal advection-diffusion process using spatially sparse data streams. Three spatial sampling schemes are considered, including irregular, non-uniform and shifted uniform sampling. The irregular sampling scheme is the general scenario, while computationally efficient solutions are available in the spectral domain for non-uniform and shifted uniform sampling. For each sampling scheme, the inverse problem is formulated as a regularized convex optimization problem that minimizes the distance between forward model outputs and observations. The optimization problem is solved by the Alternating Direction Method of Multipliers algorithm, which also handles the situation when a linear inequality constraint (e.g., non-negativity) is imposed on the model output. Numerical examples are presented, code is made available on GitHub, and discussions are provided to generate some useful insights of the proposed inverse modeling approaches.
    TVAE: Triplet-Based Variational Autoencoder using Metric Learning. (arXiv:1802.04403v3 [stat.ML] UPDATED)
    Deep metric learning has been demonstrated to be highly effective in learning semantic representation and encoding information that can be used to measure data similarity, by relying on the embedding learned from metric learning. At the same time, variational autoencoder (VAE) has widely been used to approximate inference and proved to have a good performance for directed probabilistic models. However, for traditional VAE, the data label or feature information are intractable. Similarly, traditional representation learning approaches fail to represent many salient aspects of the data. In this project, we propose a novel integrated framework to learn latent embedding in VAE by incorporating deep metric learning. The features are learned by optimizing a triplet loss on the mean vectors of VAE in conjunction with standard evidence lower bound (ELBO) of VAE. This approach, which we call Triplet based Variational Autoencoder (TVAE), allows us to capture more fine-grained information in the latent embedding. Our model is tested on MNIST data set and achieves a high triplet accuracy of 95.60% while the traditional VAE (Kingma & Welling, 2013) achieves triplet accuracy of 75.08%.
    Connections and Equivalences between the Nystr\"om Method and Sparse Variational Gaussian Processes. (arXiv:2106.01121v2 [stat.ML] UPDATED)
    We investigate the connections between sparse approximation methods for making kernel methods and Gaussian processes (GPs) scalable to large-scale data, focusing on the Nystr\"om method and the Sparse Variational Gaussian Processes (SVGP). While sparse approximation methods for GPs and kernel methods share some algebraic similarities, the literature lacks a deep understanding of how and why they are related. This may pose an obstacle to the communications between the GP and kernel communities, making it difficult to transfer results from one side to the other. Our motivation is to remove this obstacle, by clarifying the connections between the sparse approximations for GPs and kernel methods. In this work, we study the two popular approaches, the Nystr\"om and SVGP approximations, in the context of a regression problem, and establish various connections and equivalences between them. In particular, we provide an RKHS interpretation of the SVGP approximation, and show that the Evidence Lower Bound of the SVGP contains the objective function of the Nystr\"om approximation, revealing the origin of the algebraic equivalence between the two approaches. We also study recently established convergence results for the SVGP and how they are related to the approximation quality of the Nystr\"om method.
    Extragradient-Type Methods with $\mathcal{O}(1/k)$ Convergence Rates for Co-Hypomonotone Inclusions. (arXiv:2302.04099v1 [math.OC])
    In this paper, we develop two ``Nesterov's accelerated'' variants of the well-known extragradient method to approximate a solution of a co-hypomonotone inclusion constituted by the sum of two operators, where one is Lipschitz continuous and the other is possibly multivalued. The first scheme can be viewed as an accelerated variant of Tseng's forward-backward-forward splitting method, while the second one is a variant of the reflected forward-backward splitting method, which requires only one evaluation of the Lipschitz operator, and one resolvent of the multivalued operator. Under a proper choice of the algorithmic parameters and appropriate conditions on the co-hypomonotone parameter, we theoretically prove that both algorithms achieve $\mathcal{O}(1/k)$ convergence rates on the norm of the residual, where $k$ is the iteration counter. Our results can be viewed as alternatives of a recent class of Halpern-type schemes for root-finding problems.
    Compositional Score Modeling for Simulation-based Inference. (arXiv:2209.14249v2 [cs.LG] UPDATED)
    Neural Posterior Estimation methods for simulation-based inference can be ill-suited for dealing with posterior distributions obtained by conditioning on multiple observations, as they tend to require a large number of simulator calls to learn accurate approximations. In contrast, Neural Likelihood Estimation methods can handle multiple observations at inference time after learning from individual observations, but they rely on standard inference methods, such as MCMC or variational inference, which come with certain performance drawbacks. We introduce a new method based on conditional score modeling that enjoys the benefits of both approaches. We model the scores of the (diffused) posterior distributions induced by individual observations, and introduce a way of combining the learned scores to approximately sample from the target posterior distribution. Our approach is sample-efficient, can naturally aggregate multiple observations at inference time, and avoids the drawbacks of standard inference methods.
    Making Progress Based on False Discoveries. (arXiv:2204.08809v2 [cs.LG] UPDATED)
    The study of adaptive data analysis examines how many statistical queries can be answered accurately using a fixed dataset while avoiding false discoveries (statistically inaccurate answers). In this paper, we tackle a question that precedes the field of study: Is data only valuable when it provides accurate answers to statistical queries? To answer this question, we use Stochastic Convex Optimization as a case study. In this model, algorithms are considered as analysts who query an estimate of the gradient of a noisy function at each iteration and move towards its minimizer. It is known that $O(1/\epsilon^2)$ examples can be used to minimize the objective function, but none of the existing methods depend on the accuracy of the estimated gradients along the trajectory. Therefore, we ask: How many samples are needed to minimize a noisy convex function if we require $\epsilon$-accurate estimates of $O(1/\epsilon^2)$ gradients? Or, might it be that inaccurate gradient estimates are \emph{necessary} for finding the minimum of a stochastic convex function at an optimal statistical rate? We provide two partial answers to this question. First, we show that a general analyst (queries that may be maliciously chosen) requires $\Omega(1/\epsilon^3)$ samples, ruling out the possibility of a foolproof mechanism. Second, we show that, under certain assumptions on the oracle, $\tilde \Omega(1/\epsilon^{2.5})$ samples are necessary for gradient descent to interact with the oracle. Our results are in contrast to classical bounds that show that $O(1/\epsilon^2)$ samples can optimize the population risk to an accuracy of $O(\epsilon)$, but with spurious gradients.
    Flow Matching for Generative Modeling. (arXiv:2210.02747v2 [cs.LG] UPDATED)
    We introduce a new paradigm for generative modeling built on Continuous Normalizing Flows (CNFs), allowing us to train CNFs at unprecedented scale. Specifically, we present the notion of Flow Matching (FM), a simulation-free approach for training CNFs based on regressing vector fields of fixed conditional probability paths. Flow Matching is compatible with a general family of Gaussian probability paths for transforming between noise and data samples -- which subsumes existing diffusion paths as specific instances. Interestingly, we find that employing FM with diffusion paths results in a more robust and stable alternative for training diffusion models. Furthermore, Flow Matching opens the door to training CNFs with other, non-diffusion probability paths. An instance of particular interest is using Optimal Transport (OT) displacement interpolation to define the conditional probability paths. These paths are more efficient than diffusion paths, provide faster training and sampling, and result in better generalization. Training CNFs using Flow Matching on ImageNet leads to consistently better performance than alternative diffusion-based methods in terms of both likelihood and sample quality, and allows fast and reliable sample generation using off-the-shelf numerical ODE solvers.
    Evaluating probabilistic forecasts of extremes using continuous ranked probability score distributions. (arXiv:1905.04022v4 [stat.ME] UPDATED)
    Verifying probabilistic forecasts for extreme events is a highly active research area because popular media and public opinions are naturally focused on extreme events, and biased conclusions are readily made. In this context, classical verification methods tailored for extreme events, such as thresholded and weighted scoring rules, have undesirable properties that cannot be mitigated, and the well-known continuous ranked probability score (CRPS) is no exception. In this paper, we define a formal framework for assessing the behavior of forecast evaluation procedures with respect to extreme events, which we use to demonstrate that assessment based on the expectation of a proper score is not suitable for extremes. Alternatively, we propose studying the properties of the CRPS as a random variable by using extreme value theory to address extreme event verification. An index is introduced to compare calibrated forecasts, which summarizes the ability of probabilistic forecasts for predicting extremes. The strengths and limitations of this method are discussed using both theoretical arguments and simulations.
    Fortuna: A Library for Uncertainty Quantification in Deep Learning. (arXiv:2302.04019v1 [cs.LG])
    We present Fortuna, an open-source library for uncertainty quantification in deep learning. Fortuna supports a range of calibration techniques, such as conformal prediction that can be applied to any trained neural network to generate reliable uncertainty estimates, and scalable Bayesian inference methods that can be applied to Flax-based deep neural networks trained from scratch for improved uncertainty quantification and accuracy. By providing a coherent framework for advanced uncertainty quantification methods, Fortuna simplifies the process of benchmarking and helps practitioners build robust AI systems.
    Exploratory Analysis of Federated Learning Methods with Differential Privacy on MIMIC-III. (arXiv:2302.04208v1 [cs.LG])
    Background: Federated learning methods offer the possibility of training machine learning models on privacy-sensitive data sets, which cannot be easily shared. Multiple regulations pose strict requirements on the storage and usage of healthcare data, leading to data being in silos (i.e. locked-in at healthcare facilities). The application of federated algorithms on these datasets could accelerate disease diagnostic, drug development, as well as improve patient care. Methods: We present an extensive evaluation of the impact of different federation and differential privacy techniques when training models on the open-source MIMIC-III dataset. We analyze a set of parameters influencing a federated model performance, namely data distribution (homogeneous and heterogeneous), communication strategies (communication rounds vs. local training epochs), federation strategies (FedAvg vs. FedProx). Furthermore, we assess and compare two differential privacy (DP) techniques during model training: a stochastic gradient descent-based differential privacy algorithm (DP-SGD), and a sparse vector differential privacy technique (DP-SVT). Results: Our experiments show that extreme data distributions across sites (imbalance either in the number of patients or the positive label ratios between sites) lead to a deterioration of model performance when trained using the FedAvg strategy. This issue is resolved when using FedProx with the use of appropriate hyperparameter tuning. Furthermore, the results show that both differential privacy techniques can reach model performances similar to those of models trained without DP, however at the expense of a large quantifiable privacy leakage. Conclusions: We evaluate empirically the benefits of two federation strategies and propose optimal strategies for the choice of parameters when using differential privacy techniques.
    Operator Shifting for Model-based Policy Evaluation. (arXiv:2110.12658v3 [cs.LG] UPDATED)
    In model-based reinforcement learning, the transition matrix and reward vector are often estimated from random samples subject to noise. Even if the estimated model is an unbiased estimate of the true underlying model, the value function computed from the estimated model is biased. We introduce an operator shifting method for reducing the error introduced by the estimated model. When the error is in the residual norm, we prove that the shifting factor is always positive and upper bounded by $1+O\left(1/n\right)$, where $n$ is the number of samples used in learning each row of the transition matrix. We also propose a practical numerical algorithm for implementing the operator shifting.  ( 2 min )
    Sketchy: Memory-efficient Adaptive Regularization with Frequent Directions. (arXiv:2302.03764v1 [stat.ML])
    Adaptive regularization methods that exploit more than the diagonal entries exhibit state of the art performance for many tasks, but can be prohibitive in terms of memory and running time. We find the spectra of the Kronecker-factored gradient covariance matrix in deep learning (DL) training tasks are concentrated on a small leading eigenspace that changes throughout training, motivating a low-rank sketching approach. We describe a generic method for reducing memory and compute requirements of maintaining a matrix preconditioner using the Frequent Directions (FD) sketch. Our technique allows interpolation between resource requirements and the degradation in regret guarantees with rank $k$: in the online convex optimization (OCO) setting over dimension $d$, we match full-matrix $d^2$ memory regret using only $dk$ memory up to additive error in the bottom $d-k$ eigenvalues of the gradient covariance. Further, we show extensions of our work to Shampoo, placing the method on the memory-quality Pareto frontier of several large scale benchmarks.  ( 2 min )
    PASTA: Pessimistic Assortment Optimization. (arXiv:2302.03821v1 [cs.LG])
    We consider a class of assortment optimization problems in an offline data-driven setting. A firm does not know the underlying customer choice model but has access to an offline dataset consisting of the historically offered assortment set, customer choice, and revenue. The objective is to use the offline dataset to find an optimal assortment. Due to the combinatorial nature of assortment optimization, the problem of insufficient data coverage is likely to occur in the offline dataset. Therefore, designing a provably efficient offline learning algorithm becomes a significant challenge. To this end, we propose an algorithm referred to as Pessimistic ASsortment opTimizAtion (PASTA for short) designed based on the principle of pessimism, that can correctly identify the optimal assortment by only requiring the offline data to cover the optimal assortment under general settings. In particular, we establish a regret bound for the offline assortment optimization problem under the celebrated multinomial logit model. We also propose an efficient computational procedure to solve our pessimistic assortment optimization problem. Numerical studies demonstrate the superiority of the proposed method over the existing baseline method.  ( 2 min )
    Optimal Stochastic Non-smooth Non-convex Optimization through Online-to-Non-convex Conversion. (arXiv:2302.03775v1 [cs.LG])
    We present new algorithms for optimizing non-smooth, non-convex stochastic objectives based on a novel analysis technique. This improves the current best-known complexity for finding a $(\delta,\epsilon)$-stationary point from $O(\epsilon^{-4}\delta^{-1})$ stochastic gradient queries to $O(\epsilon^{-3}\delta^{-1})$, which we also show to be optimal. Our primary technique is a reduction from non-smooth non-convex optimization to online learning, after which our results follow from standard regret bounds in online learning. For deterministic and second-order smooth objectives, applying more advanced optimistic online learning techniques enables a new complexity of $O(\epsilon^{-1.5}\delta^{-0.5})$. Our techniques also recover all optimal or best-known results for finding $\epsilon$ stationary points of smooth or second-order smooth objectives in both stochastic and deterministic settings.  ( 2 min )
    Algorithmic Collective Action in Machine Learning. (arXiv:2302.04262v1 [cs.LG])
    We initiate a principled study of algorithmic collective action on digital platforms that deploy machine learning algorithms. We propose a simple theoretical model of a collective interacting with a firm's learning algorithm. The collective pools the data of participating individuals and executes an algorithmic strategy by instructing participants how to modify their own data to achieve a collective goal. We investigate the consequences of this model in three fundamental learning-theoretic settings: the case of a nonparametric optimal learning algorithm, a parametric risk minimizer, and gradient-based optimization. In each setting, we come up with coordinated algorithmic strategies and characterize natural success criteria as a function of the collective's size. Complementing our theory, we conduct systematic experiments on a skill classification task involving tens of thousands of resumes from a gig platform for freelancers. Through more than two thousand model training runs of a BERT-like language model, we see a striking correspondence emerge between our empirical observations and the predictions made by our theory. Taken together, our theory and experiments broadly support the conclusion that algorithmic collectives of exceedingly small fractional size can exert significant control over a platform's learning algorithm.  ( 2 min )
    How to Trust Your Diffusion Model: A Convex Optimization Approach to Conformal Risk Control. (arXiv:2302.03791v1 [stat.ML])
    Score-based generative modeling, informally referred to as diffusion models, continue to grow in popularity across several important domains and tasks. While they provide high-quality and diverse samples from empirical distributions, important questions remain on the reliability and trustworthiness of these sampling procedures for their responsible use in critical scenarios. Conformal prediction is a modern tool to construct finite-sample, distribution-free uncertainty guarantees for any black-box predictor. In this work, we focus on image-to-image regression tasks and we present a generalization of the Risk-Controlling Prediction Sets (RCPS) procedure, that we term $K$-RCPS, which allows to $(i)$ provide entrywise calibrated intervals for future samples of any diffusion model, and $(ii)$ control a certain notion of risk with respect to a ground truth image with minimal mean interval length. Differently from existing conformal risk control procedures, ours relies on a novel convex optimization approach that allows for multidimensional risk control while provably minimizing the mean interval length. We illustrate our approach on two real-world image denoising problems: on natural images of faces as well as on computed tomography (CT) scans of the abdomen, demonstrating state of the art performance.  ( 2 min )
    Towards Inferential Reproducibility of Machine Learning Research. (arXiv:2302.04054v1 [cs.LG])
    Reliability of machine learning evaluation -- the consistency of observed evaluation scores across replicated model training runs -- is affected by several sources of nondeterminism which can be regarded as measurement noise. Current tendencies to remove noise in order to enforce reproducibility of research results neglect inherent nondeterminism at the implementation level and disregard crucial interaction effects between algorithmic noise factors and data properties. This limits the scope of conclusions that can be drawn from such experiments. Instead of removing noise, we propose to incorporate several sources of variance, including their interaction with data properties, into an analysis of significance and reliability of machine learning evaluation, with the aim to draw inferences beyond particular instances of trained models. We show how to use linear mixed effects models (LMEMs) to analyze performance evaluation scores, and to conduct statistical inference with a generalized likelihood ratio test (GLRT). This allows us to incorporate arbitrary sources of noise like meta-parameter variations into statistical significance testing, and to assess performance differences conditional on data properties. Furthermore, a variance component analysis (VCA) enables the analysis of the contribution of noise sources to overall variance and the computation of a reliability coefficient by the ratio of substantial to total variance.  ( 2 min )
    Sample-efficient Multi-objective Molecular Optimization with GFlowNets. (arXiv:2302.04040v1 [cs.LG])
    Many crucial scientific problems involve designing novel molecules with desired properties, which can be formulated as an expensive black-box optimization problem over the discrete chemical space. Computational methods have achieved initial success but still struggle with simultaneously optimizing multiple competing properties in a sample-efficient manner. In this work, we propose a multi-objective Bayesian optimization (MOBO) algorithm leveraging the hypernetwork-based GFlowNets (HN-GFN) as an acquisition function optimizer, with the purpose of sampling a diverse batch of candidate molecular graphs from an approximate Pareto front. Using a single preference-conditioned hypernetwork, HN-GFN learns to explore various trade-offs between objectives. Inspired by reinforcement learning, we further propose a hindsight-like off-policy strategy to share high-performing molecules among different preferences in order to speed up learning for HN-GFN. Through synthetic experiments, we illustrate that HN-GFN has adequate capacity to generalize over preferences. Extensive experiments show that our framework outperforms the best baselines by a large margin in terms of hypervolume in various real-world MOBO settings.  ( 2 min )
    On the Richness of Calibration. (arXiv:2302.04118v1 [cs.LG])
    Probabilistic predictions can be evaluated through comparisons with observed label frequencies, that is, through the lens of calibration. Recent scholarship on algorithmic fairness has started to look at a growing variety of calibration-based objectives under the name of multi-calibration but has still remained fairly restricted. In this paper, we explore and analyse forms of evaluation through calibration by making explicit the choices involved in designing calibration scores. We organise these into three grouping choices and a choice concerning the agglomeration of group errors. This provides a framework for comparing previously proposed calibration scores and helps to formulate novel ones with desirable mathematical properties. In particular, we explore the possibility of grouping datapoints based on their input features rather than on predictions and formally demonstrate advantages of such approaches. We also characterise the space of suitable agglomeration functions for group errors, generalising previously proposed calibration scores. Complementary to such population-level scores, we explore calibration scores at the individual level and analyse their relationship to choices of grouping. We draw on these insights to introduce and axiomatise fairness deviation measures for population-level scores. We demonstrate that with appropriate choices of grouping, these novel global fairness scores can provide notions of (sub-)group or individual fairness.  ( 2 min )
    Cut your Losses with Squentropy. (arXiv:2302.03952v1 [cs.LG])
    Nearly all practical neural models for classification are trained using cross-entropy loss. Yet this ubiquitous choice is supported by little theoretical or empirical evidence. Recent work (Hui & Belkin, 2020) suggests that training using the (rescaled) square loss is often superior in terms of the classification accuracy. In this paper we propose the "squentropy" loss, which is the sum of two terms: the cross-entropy loss and the average square loss over the incorrect classes. We provide an extensive set of experiments on multi-class classification problems showing that the squentropy loss outperforms both the pure cross entropy and rescaled square losses in terms of the classification accuracy. We also demonstrate that it provides significantly better model calibration than either of these alternative losses and, furthermore, has less variance with respect to the random initialization. Additionally, in contrast to the square loss, squentropy loss can typically be trained using exactly the same optimization parameters, including the learning rate, as the standard cross-entropy loss, making it a true "plug-and-play" replacement. Finally, unlike the rescaled square loss, multiclass squentropy contains no parameters that need to be adjusted.  ( 2 min )
    Scalars are universal: Equivariant machine learning, structured like classical physics. (arXiv:2106.06610v4 [cs.LG] UPDATED)
    There has been enormous progress in the last few years in designing neural networks that respect the fundamental symmetries and coordinate freedoms of physical law. Some of these frameworks make use of irreducible representations, some make use of high-order tensor objects, and some apply symmetry-enforcing constraints. Different physical laws obey different combinations of fundamental symmetries, but a large fraction (possibly all) of classical physics is equivariant to translation, rotation, reflection (parity), boost (relativity), and permutations. Here we show that it is simple to parameterize universally approximating polynomial functions that are equivariant under these symmetries, or under the Euclidean, Lorentz, and Poincar\'e groups, at any dimensionality $d$. The key observation is that nonlinear O($d$)-equivariant (and related-group-equivariant) functions can be universally expressed in terms of a lightweight collection of scalars -- scalar products and scalar contractions of the scalar, vector, and tensor inputs. We complement our theory with numerical examples that show that the scalar-based method is simple, efficient, and scalable.  ( 2 min )
    Leveraging User-Triggered Supervision in Contextual Bandits. (arXiv:2302.03784v1 [cs.LG])
    We study contextual bandit (CB) problems, where the user can sometimes respond with the best action in a given context. Such an interaction arises, for example, in text prediction or autocompletion settings, where a poor suggestion is simply ignored and the user enters the desired text instead. Crucially, this extra feedback is user-triggered on only a subset of the contexts. We develop a new framework to leverage such signals, while being robust to their biased nature. We also augment standard CB algorithms to leverage the signal, and show improved regret guarantees for the resulting algorithms under a variety of conditions on the helpfulness of and bias inherent in this feedback.  ( 2 min )
    Learning How to Infer Partial MDPs for In-Context Adaptation and Exploration. (arXiv:2302.04250v1 [cs.LG])
    To generalize across tasks, an agent should acquire knowledge from past tasks that facilitate adaptation and exploration in future tasks. We focus on the problem of in-context adaptation and exploration, where an agent only relies on context, i.e., history of states, actions and/or rewards, rather than gradient-based updates. Posterior sampling (extension of Thompson sampling) is a promising approach, but it requires Bayesian inference and dynamic programming, which often involve unknowns (e.g., a prior) and costly computations. To address these difficulties, we use a transformer to learn an inference process from training tasks and consider a hypothesis space of partial models, represented as small Markov decision processes that are cheap for dynamic programming. In our version of the Symbolic Alchemy benchmark, our method's adaptation speed and exploration-exploitation balance approach those of an exact posterior sampling oracle. We also show that even though partial models exclude relevant information from the environment, they can nevertheless lead to good policies.  ( 2 min )
    Improved Langevin Monte Carlo for stochastic optimization via landscape modification. (arXiv:2302.03973v1 [math.PR])
    Given a target function $H$ to minimize or a target Gibbs distribution $\pi_{\beta}^0 \propto e^{-\beta H}$ to sample from in the low temperature, in this paper we propose and analyze Langevin Monte Carlo (LMC) algorithms that run on an alternative landscape as specified by $H^f_{\beta,c,1}$ and target a modified Gibbs distribution $\pi^f_{\beta,c,1} \propto e^{-\beta H^f_{\beta,c,1}}$, where the landscape of $H^f_{\beta,c,1}$ is a transformed version of that of $H$ which depends on the parameters $f,\beta$ and $c$. While the original Log-Sobolev constant affiliated with $\pi^0_{\beta}$ exhibits exponential dependence on both $\beta$ and the energy barrier $M$ in the low temperature regime, with appropriate tuning of these parameters and subject to assumptions on $H$, we prove that the energy barrier of the transformed landscape is reduced which consequently leads to polynomial dependence on both $\beta$ and $M$ in the modified Log-Sobolev constant associated with $\pi^f_{\beta,c,1}$. This yield improved total variation mixing time bounds and improved convergence toward a global minimum of $H$. We stress that the technique developed in this paper is not only limited to LMC and is broadly applicable to other gradient-based optimization or sampling algorithms.  ( 2 min )
    Monge, Bregman and Occam: Interpretable Optimal Transport in High-Dimensions with Feature-Sparse Maps. (arXiv:2302.04065v1 [stat.ML])
    Optimal transport (OT) theory focuses, among all maps $T:\mathbb{R}^d\rightarrow \mathbb{R}^d$ that can morph a probability measure onto another, on those that are the ``thriftiest'', i.e. such that the averaged cost $c(x, T(x))$ between $x$ and its image $T(x)$ be as small as possible. Many computational approaches have been proposed to estimate such Monge maps when $c$ is the $\ell_2^2$ distance, e.g., using entropic maps [Pooladian'22], or neural networks [Makkuva'20, Korotin'20]. We propose a new model for transport maps, built on a family of translation invariant costs $c(x, y):=h(x-y)$, where $h:=\tfrac{1}{2}\|\cdot\|_2^2+\tau$ and $\tau$ is a regularizer. We propose a generalization of the entropic map suitable for $h$, and highlight a surprising link tying it with the Bregman centroids of the divergence $D_h$ generated by $h$, and the proximal operator of $\tau$. We show that choosing a sparsity-inducing norm for $\tau$ results in maps that apply Occam's razor to transport, in the sense that the displacement vectors $\Delta(x):= T(x)-x$ they induce are sparse, with a sparsity pattern that varies depending on $x$. We showcase the ability of our method to estimate meaningful OT maps for high-dimensional single-cell transcription data, in the $34000$-$d$ space of gene counts for cells, without using dimensionality reduction, thus retaining the ability to interpret all displacements at the gene level.  ( 2 min )
    DIFF2: Differential Private Optimization via Gradient Differences for Nonconvex Distributed Learning. (arXiv:2302.03884v1 [cs.LG])
    Differential private optimization for nonconvex smooth objective is considered. In the previous work, the best known utility bound is $\widetilde O(\sqrt{d}/(n\varepsilon_\mathrm{DP}))$ in terms of the squared full gradient norm, which is achieved by Differential Private Gradient Descent (DP-GD) as an instance, where $n$ is the sample size, $d$ is the problem dimensionality and $\varepsilon_\mathrm{DP}$ is the differential privacy parameter. To improve the best known utility bound, we propose a new differential private optimization framework called \emph{DIFF2 (DIFFerential private optimization via gradient DIFFerences)} that constructs a differential private global gradient estimator with possibly quite small variance based on communicated \emph{gradient differences} rather than gradients themselves. It is shown that DIFF2 with a gradient descent subroutine achieves the utility of $\widetilde O(d^{2/3}/(n\varepsilon_\mathrm{DP})^{4/3})$, which can be significantly better than the previous one in terms of the dependence on the sample size $n$. To the best of our knowledge, this is the first fundamental result to improve the standard utility $\widetilde O(\sqrt{d}/(n\varepsilon_\mathrm{DP}))$ for nonconvex objectives. Additionally, a more computational and communication efficient subroutine is combined with DIFF2 and its theoretical analysis is also given. Numerical experiments are conducted to validate the superiority of DIFF2 framework.  ( 2 min )
    Gibbsian polar slice sampling. (arXiv:2302.03945v1 [stat.ME])
    Polar slice sampling (Roberts & Rosenthal, 2002) is a Markov chain approach for approximate sampling of distributions that is difficult, if not impossible, to implement efficiently, but behaves provably well with respect to the dimension. By updating the directional and radial components of chain iterates separately, we obtain a family of samplers that mimic polar slice sampling, and yet can be implemented efficiently. Numerical experiments for a variety of settings indicate that our proposed algorithm outperforms the two most closely related approaches, elliptical slice sampling (Murray et al., 2010) and hit-and-run uniform slice sampling (MacKay, 2003). We prove the well-definedness and convergence of our methods under suitable assumptions on the target distribution.  ( 2 min )
    Decision trees compensate for model misspecification. (arXiv:2302.04081v1 [stat.ML])
    The best-performing models in ML are not interpretable. If we can explain why they outperform, we may be able to replicate these mechanisms and obtain both interpretability and performance. One example are decision trees and their descendent gradient boosting machines (GBMs). These perform well in the presence of complex interactions, with tree depth governing the order of interactions. However, interactions cannot fully account for the depth of trees found in practice. We confirm 5 alternative hypotheses about the role of tree depth in performance in the absence of true interactions, and present results from experiments on a battery of datasets. Part of the success of tree models is due to their robustness to various forms of mis-specification. We present two methods for robust generalized linear models (GLMs) addressing the composite and mixed response scenarios.  ( 2 min )
    IRTCI: Item Response Theory for Categorical Imputation. (arXiv:2302.04165v1 [stat.ML])
    Most datasets suffer from partial or complete missing values, which has downstream limitations on the available models on which to test the data and on any statistical inferences that can be made from the data. Several imputation techniques have been designed to replace missing data with stand in values. The various approaches have implications for calculating clinical scores, model building and model testing. The work showcased here offers a novel means for categorical imputation based on item response theory (IRT) and compares it against several methodologies currently used in the machine learning field including k-nearest neighbors (kNN), multiple imputed chained equations (MICE) and Amazon Web Services (AWS) deep learning method, Datawig. Analyses comparing these techniques were performed on three different datasets that represented ordinal, nominal and binary categories. The data were modified so that they also varied on both the proportion of data missing and the systematization of the missing data. Two different assessments of performance were conducted: accuracy in reproducing the missing values, and predictive performance using the imputed data. Results demonstrated that the new method, Item Response Theory for Categorical Imputation (IRTCI), fared quite well compared to currently used methods, outperforming several of them in many conditions. Given the theoretical basis for the new approach, and the unique generation of probabilistic terms for determining category belonging for missing cells, IRTCI offers a viable alternative to current approaches.  ( 2 min )
    Concept Algebra for Text-Controlled Vision Models. (arXiv:2302.03693v1 [cs.CL])
    This paper concerns the control of text-guided generative models, where a user provides a natural language prompt and the model generates samples based on this input. Prompting is intuitive, general, and flexible. However, there are significant limitations: prompting can fail in surprising ways, and it is often unclear how to find a prompt that will elicit some desired target behavior. A core difficulty for developing methods to overcome these issues is that failures are know-it-when-you-see-it -- it's hard to fix bugs if you can't state precisely what the model should have done! In this paper, we introduce a formalization of "what the user intended" in terms of latent concepts implicit to the data generating process that the model was trained on. This formalization allows us to identify some fundamental limitations of prompting. We then use the formalism to develop concept algebra to overcome these limitations. Concept algebra is a way of directly manipulating the concepts expressed in the output through algebraic operations on a suitably defined representation of input prompts. We give examples using concept algebra to overcome limitations of prompting, including concept transfer through arithmetic, and concept nullification through projection. Code available at https://github.com/zihao12/concept-algebra.  ( 2 min )

  • Open

    Where can you create AI music like this? Sounds surprisingly good
    submitted by /u/ziroxonline [link] [comments]  ( 40 min )
    AI software developer: specs -> product
    Today we have many different systems to develop a business application that meets our businesses. For example, many companies use salesforce, ms dynamics, sap and many other systems as ERP; at the same time today, we have systems like airtables, google appsheet - no-low-code platforms - generally abstract configurable databases with low-code features to write custom logic. What if we can develop some "AI specs translator" based on Chat GPT that would be able to parse specification and translate it into the final deployed system (specs into API requests for no code platform)? Example (specs): User roles: Administrator Warehouse manager Sales rep Tables: Product. Fields: id, name, description Warehouse: Fields: id, reference to product, count Order: id, reference to product, count, status Views: Order page: the page allows a sales rep to place an order for a product Kanban board page allows warehouse managers to complete created orders. Do you think it would work? submitted by /u/evgenyzhurko [link] [comments]  ( 41 min )
    How do you discover new AI?
    Is there a consumer hub/site that acts as a directory? So many AI tools and companies popping up every day - I feel like there must be some kind of marketplace or 'App Store' equivalent just for AI? submitted by /u/mesjer123 [link] [comments]  ( 41 min )
    Use florid, baroque language to describe a baked potato
    submitted by /u/Imagine-your-success [link] [comments]  ( 40 min )
    AI certificates?
    So ive heard this is a thing, what kinds of certificates are there and are these enough to enter the workforce/freelance? Im a graduate in Nanotechnology but i couldn't get anything out of it not even unpaid internships, so im looking for other options, specifically certifications as to start small. And since i have some coding knowledge AI certification got me curious. submitted by /u/AffectionateAffect5 [link] [comments]  ( 41 min )
    I made Image analyzing discord bot that can write Poem/Description/Title of your image using GPT3 and CLIP, link in comment
    submitted by /u/red3vil96 [link] [comments]  ( 41 min )
    Good AI for removing vocals from songs?
    Hello everyone, I run a D&D game and there are many songs with lyrics that I want the instrumental version of to play during battles, in towns, etc. Most of them do not have an instrumental version that I can find. Does anyone know any good AI programs that remove vocals from songs? Free is preferable because I’m a broke college student but I am willing to pay if needed as I want a good end product. submitted by /u/Kittykittyredcat [link] [comments]  ( 41 min )
    OpenAI Text Classifier Hands-On Review: ChatGPT’s Own AI Detection - Pros & Cons
    The following guide provides an independent review of how well this OpenAI detection software performs and how its capabilities stack up against competitors (for finding A!-generated text and plagiarism) OpenAI Text Classifier: ChatGPT’s Own AI Detection - Review submitted by /u/thumbsdrivesmecrazy [link] [comments]  ( 41 min )
    Microsoft Bing’s Overnight Success Story: 10x Downloads After ChatGPT Integration Announcement
    submitted by /u/liquidocelotYT [link] [comments]  ( 44 min )
    what if ai was making the best jokes for each person
    Hello, yall have propably seen that ai generated tv show on twitch Seinfeld. And so i thought about an ai making jokes specifically for you, jokes that you laught at every time u hear them. But i think that after some time, you would stop laughing at them. My question is, will u still find something funny, when the best jokes made specifically for you arent funny anymore ? (Or do u think we will endlessly laught at those jokes?) ​ Sorry for my english xd submitted by /u/maxovina [link] [comments]  ( 41 min )
    "Welcome to Carl's Jr., fuck you, I'm eating."
    submitted by /u/sifeliz [link] [comments]  ( 40 min )
    AI that generates 1-click replies based on auto-categorised emails. If you add in info like your calendar link to send, it's like having an executive assistant for your inbox
    submitted by /u/Ok-Craft-9908 [link] [comments]  ( 41 min )
    Share your fine-tuned GPT model with the world.
    Hey Guys, I and some friends are mostly done building a platform that gives users API access to fine-tuned GPT models. This is something we wanted to see in the world and decided to build ourselves. We're now looking for individuals with high-quality, fine-tuned GPT models that they would like to see on our platform. This is a great opportunity for those who have spent time and effort fine-tuning models to showcase their work to a wider audience, receive recognition, and monetize their work. If you have a fine-tuned GPT model that you would like to share on our platform, please reach out to us. We are always looking for new models to add to our platform, and we are eager to hear from you. Whether you have created a model that is specific to a particular domain or industry, or you have developed a model that is particularly advanced and highly specialized, we would love to hear from you. So, if you have a great fine-tuned GPT model, or you know someone who does, send me a message. We look forward to hearing from you soon! Thanks, submitted by /u/Aggravating_Art_173 [link] [comments]  ( 41 min )
    I put ChatGPT through the Marcel Proust Questionnaire
    submitted by /u/okanaganjournal [link] [comments]  ( 40 min )
    Student hacks new Bing chatbot search aka "Sydney"
    submitted by /u/Number_5_alive [link] [comments]  ( 40 min )
    Story-telling AI that I could feed data to and then ask to write a similar replica?
    Hello, I was wondering if there's a free or premium story-telling AI model that I could feed data to, for example, passages from a particular author or pages from their book, and then ask the AI to create a story using that author's writing style, dictionary, or ideas. A while ago I watched a Youtube video, in which a person taught an AI to write screenplays in the style of a certain author and I'd like to do the same, except with short stories. Is it possible to do so without any coding knowledge? Thanks. submitted by /u/Basil1sk17 [link] [comments]  ( 41 min )
    Am I the only one who has become addicted to AI advancements news?
    All the AI news has managed to get me very excited but now I'm craving my next hit! We all know great things that will be possible in the future, but this rapid new cycle has left me craving my next fix of news. before I was content to glance at AI news and go oh that interesting now I'm bloodshot eyed scrolling twitter and YouTube to see if there is any new advancements that I missed... It's not just me right? submitted by /u/Nargodian [link] [comments]  ( 42 min )
    Would this concept be possible with AI?
    I've recently developed an interest in AI and plan on learning how to make my own, starting simple and getting more complex as I go. My question is, if you can train an AI on any data you want to get an output, could you in theory train an AI with source and compiled code with the intention of getting the AI to spit out an approximation source code for a compiled program? Its an interesting concept to toy with, but I'm not sure if it's viable. Let me know what you think! submitted by /u/OmaeWaWakiYaku [link] [comments]  ( 41 min )
    [OC] "Théâtre D'opéra Spatial by Midjourney (with the help of a human)"
    submitted by /u/AdministrativeLet996 [link] [comments]  ( 41 min )
    Microsoft Prometheus is GPT-4 in Software (Is GPT-4 inside Prometheus Model?)
    submitted by /u/BackgroundResult [link] [comments]  ( 40 min )
    Google's Bard AI ChatBot Wiped Off $100 Billion In Market Cap After Factual Error In First Demo
    submitted by /u/vadhavaniyafaijan [link] [comments]  ( 43 min )
    TEXTure: Text to 3D textures
    submitted by /u/oridnary_artist [link] [comments]  ( 40 min )
    What are the best resources to stay up to date with the latest news ?
    I am very interested in the field of artificial intelligence. Can you recommend me some YouTube channels, websites/blogs, magazines and other places on the internet where I would be informed of the latest news in this environment ? submitted by /u/HelloWorldCLang [link] [comments]  ( 42 min )
    A.I. Rick & Morty Analyzing Real-Time A.I. Generated Poster Art
    submitted by /u/MindCluster [link] [comments]  ( 40 min )
  • Open

    How do you/should you put less importance on actions which have less steps after them due to arbitrary episode length?
    For example, in an environment a game might take 1,000 steps and actions might take some time to realize reward but realize cost immediately. An action taken at step 998 might be a good action if the game continued beyond 1000 steps, but it could get punished because the game stops arbitrarily early. Is there any logic in just culling the last n actions/discounted returns? submitted by /u/JustTaxLandLol [link] [comments]  ( 41 min )
    Curriculum vs. hierarchical RL
    Title says it all. I am wondering about the difference and the relationship (if at all) between curriculum RL and hierarchical RL. Can anyone enlighten me please? The concepts seem to have similar themes Thanks in advance! submitted by /u/acorntje [link] [comments]  ( 42 min )
    What exploration-hard or sparse reward environment do you recommend for research?
    submitted by /u/OutOfCharm [link] [comments]  ( 41 min )
    RL agent beating all main bosses of Mega Man X4
    submitted by /u/victorsevero [link] [comments]  ( 44 min )
  • Open

    Detect signatures on documents or images using the signatures feature in Amazon Textract
    Amazon Textract is a machine learning (ML) service that automatically extracts text, handwriting, and data from any document or image. AnalyzeDocument Signatures is a feature within Amazon Textract that offers the ability to automatically detect signatures on any document. This can reduce the need for human review, custom code, or ML experience. In this post, […]  ( 7 min )
    Monitoring Lake Mead drought using the new Amazon SageMaker geospatial capabilities
    Earth’s changing climate poses an increased risk of drought due to global warming. Since 1880, the global temperature has increased 1.01 °C. Since 1993, sea levels have risen 102.5 millimeters. Since 2002, the land ice sheets in Antarctica have been losing mass at a rate of 151.0 billion metric tons per year. In 2022, the […]  ( 10 min )
  • Open

    Amplification at the Quantum limit
    Posted by Ted White and Ofer Naaman, Staff Research Scientists, Google Quantum AI The Google Quantum AI team is building quantum computers with superconducting microwave circuits, but much like a classical computer the superconducting processor at the heart of these computers is only part of the story. An entire technology stack of peripheral hardware is required to make the quantum computer work properly. In many cases these parts must be custom designed, requiring extensive research and development to reach the highest levels of performance. In this post, we highlight one aspect of this supplemental hardware: our superconducting microwave amplifiers. In “Readout of a Quantum Processor with High Dynamic Range Josephson Parametric Amplifiers”, published in Applied Physics Letters, w…  ( 92 min )
  • Open

    3 Questions: Leo Anthony Celi on ChatGPT and medicine
    The chatbot’s success on the medical licensing exam shows that the test — and medical education — are flawed, Celi says.  ( 8 min )
  • Open

    [D] Using LLMs as decision engines
    I just finished reading the paper "Pre-Trained Language Models for Interactive Decision Making" (https://arxiv.org/abs/2202.01771). As I understand it, the authors are using a language model to "generate" an optimal path to an objective, in test environments like VirtualHome and BabyAI. Reinforcement and imitation learning are evaluated as ways for the model to self-improve. This is the first time I've seen a language model being used to "solve a problem" that isn't a language one. It seems to open up so many new possibilties. Has this been done before? Are there other examples of LMs being used as decision engines? What's the state of the art? Any interesting applications you've seen? Side question: I imagine there were AI approaches to navigating VirtualHome and BabyAI that were NOT language-model based. What is the standard modeling approach to these kinds of problems? submitted by /u/These-Assignment-936 [link] [comments]  ( 44 min )
    [D] Plot Best Run for Accuracy or Mean across runs?
    I've ran two image classification model 5 times on a dataset. Model A has a mean best accuracy of 95.03% while Model B has a mean best accuracy of 95.3% However, Model A has a max best accuracy of 95.75% while Model B has a mean best accuracy of 95.5%. I am wanting to report these results in a paper to a conference/journal. When plotting the test accuracy per epoch, should I only report the results for the best run or should I take the mean of the test accuracies over all 5 runs per epoch for plotting? submitted by /u/MyActualUserName99 [link] [comments]  ( 43 min )
    [D] Dense correspondence between image/mesh
    The goal is to create a model that can make correspondences between images and meshes. Just like we see in image registration, where we have an image next to each other and then N matches (lines) that show similar features. But in this case will be an image and a mesh of a specific object. Have you some tips and ideas that could help me how to attack this problem? submitted by /u/henistein [link] [comments]  ( 43 min )
    [Discussion] Looking for opinions: Scale Spellbook vs. Snorkel Flow vs....?
    Hey everyone. Has anyone used Snorkel Flow, Scale Spellbook or other alternatives (please advise) to test multiple foundation models and migrate between them? E.g. comparing GPT3 vs GPT-J or GPT-Neo etc. Need help moving to a smaller/cheaper model - cheers! submitted by /u/fourcornerclub [link] [comments]  ( 42 min )
    [D] Similarity b/w two vectors
    how to calculate similarity between two vectors? I want a similarity metric that take into accounts both the directions and magnitudes of vectors. submitted by /u/TKMater [link] [comments]  ( 43 min )
    [D]Image Recognition ability of machine learning in financial markets questions
    I want to understand if it is possible to build an image recognition system. Let me explain. Take a historical price chart of the stock market but make the background black and the price white in the form of bars (OHLC). Keep the scale the same for all, let's say 100 bars of data - the top of the scale is 1 and the bottom 0, as to not bring actual price into play. Now assume you could capture these images over and over again through historic price charts. They all have the same amount of bars, color, and scaling. Now insert a current stock chart 'X'. Is it possible to have this database review all of the charts and give the charts of the past that are most similar to the current chart now? And include some sort of similarities correlation [90%,77%, etc.] Here is what I'm not looking for: For it to predict the future For it to tell me to buy and sell What I am looking for is: Similarities of past prices The correlation of the similarities A list of the top [10,20,30] charts and their correlation of them to the current. I understand past historical prices are different than future and economic data is complex and many variables are at play as well. submitted by /u/Ready-Acanthaceae970 [link] [comments]  ( 45 min )
    [D] Story-telling AI that I could feed data to and then ask to write a similar replica?
    Hello, I was wondering if there's a free or premium story-telling AI model that I could feed data to, for example, passages from a particular author or pages from their book, and then ask the AI to create a story using that author's writing style, dictionary, or ideas. A while ago I watched a Youtube video, in which a person taught an AI to write screenplays in the style of a certain author and I'd like to do the same, except with short stories. Is it possible to do so without any coding knowledge? Thanks. submitted by /u/Basil1sk17 [link] [comments]  ( 43 min )
    [D] Constrained Optimization in Deep Learning
    Clearly, large scale deep learning approaches in image classification or NLP use all sorts of Regularization mechanisms, but the parameters are typically unconstrained (i.e., every weight can theoretically attain any real value). In many Machine Learning domains, constrained optimization (e.g. via Projected Gradient Descent or Frank-Wolfe) plays a huge role. I was wondering whether there are large-scale Deep Learning applications which rely on constrained optimization approaches? When I say large-scale, I mean large CNNs, transformers, diffusion models or the like. Are there settings where constrained optimization would even be a preferred approach, but not efficient/stable enough? Happy for any paper suggestions or thoughts! Thanks! submitted by /u/d0cmorris [link] [comments]  ( 44 min )
    [D] Latent spaces and weather forecasting/nowcasting
    Hi, everyone. First time, long time. My background is weather analysis to DL applications in weather, and I had a question I wanted to ask the community writ large. The question is about latent spaces and how, specifically, the DeepMind group used them in their radar nowcasting model DGMR (see links to prior threads below). In the DGMR paper itself (https://www.nature.com/articles/s41586-021-03854-z), the architecture looks like a U-net with some ConvGRU2D flair in the decoder and some temporal consistency checks from the discriminator. There is also what they call a "latent conditioning stack." From some deeper readings, I think the model is a descendant of BigGAN, since both use an explicit latent space among other similarities. This leads to my question and general curiosity... How is this latent space seeded? My prior experience with latent space toy models (DCGAN, for example) is that unless you seed the RNG explicitly, then performing a restart of the model to continue training mucks up the distribution. Fairly standard RNG issues. Is it really as simple as, for example, latent_vector = tf.random.truncated_normal([batch_size, grid_size_parameters], seed=42) I feel like I'm missing something. Why does this work at all? Why is a latent space necessary in this context? They state explicitly in their paper that they require this stack to generalize results to datasets that are larger (in a HxW sense) than the one on which they trained, but I can't wrap my head around why an extended latent vector for a larger grid size works. If anyone can point me in the right direction or help me understand, I'd greatly appreciate it. Links to prior threads: https://www.reddit.com/r/MachineLearning/comments/pyfjz7/r_deepminds_weather_forecasting_model_nowcasting/ https://old.reddit.com/r/MachineLearning/comments/py0289/r_skilful_precipitation_nowcasting_using_deep/ submitted by /u/PsyEclipse [link] [comments]  ( 44 min )
    [D] Can we use Ray for distributed training on vertex ai ? Can someone provide me examples for the same ? Also which dataframe libraries you guys used for training machine learning models on huge datasets (100 gb+) (because pandas can't handle huge data).
    Same as title. Also the dataframe library should support machine learning libraries submitted by /u/RstarPhoneix [link] [comments]  ( 42 min )
    [D] RTX 3090 with i7 7700k, training bottleneck
    Hey guys I have an older PC(5 years) with an i7 7700k processor. I want to buy an Nvidia RTX 3090 for training large language models. I can t find any benchmark for CPU bottleneck when training, let s say an GPT 2 large model. Has anyone have any experience with this set-up similar set-up ? submitted by /u/Available_Lion_652 [link] [comments]  ( 45 min )
    [R] Research Seminar by Neural Magic: AC/DC: Alternating Compressed/DeCompressed Training of Deep Neural Networks
    At Neural Magic, we are proud to be at the forefront of cutting-edge machine learning research, with a particular focus on model compression. Our internal Lunch and Learn seminars are a weekly opportunity for our team to share their research and collaborate on new ideas. We believe in the importance of open-source contributions, which is why we are thrilled to announce that for a second time, we are opening the seminar to the wider community. On Wednesday, February 23, 2022, I will be sharing our work on AC/DC, a framework for sparse-training models. This research was done in partnership with IST Austria. Join me and the Neural Magic team for this exciting presentation and be sure to keep an eye out for future speakers in the coming months! ​ You can reserve your spot for the presentation here. submitted by /u/dtransposed [link] [comments]  ( 43 min )
    [P] Creating an embedding from a CNN
    Hi all: I have trained a CNN (efficietnet-b3) to classify the degree of a disease on medical images. I would like to create an embedding both to visualize relationships between images (after projecting to 2d or 3d-space) and to find similar images to one given. I have tried using the output of the last convolution both before and after pooling for all train images (~30.000) but the result is mediocre: images non-alike are quite close in the embedding and plotting it in 2 or 3d just show a point cloud with no obvious pattern. I have also tried to use the class activation map (the output of the convolutional layer after pooling and multiplying by the weights of the classifier of the predicted class). This is quite better, but class are not separated too clearly in the scatterplot. Is there any other sensible way to generate the embeddings? I have tried using the hidden representation of earlier convolutional layers, but some of them are so huge (~650.000 features per sample) creating a reasonable sized embedding would require very aggressive PCA. ​ Example of the scatter plot of the heatmap embedding. While it is okayish (classes are more or less spatially localized) it would be great to find an embedding that creates more visible clusters for each class. https://preview.redd.it/l7smdyuml6ha1.png?width=543&format=png&auto=webp&s=fcc757246fca79ca9fd1378562739c05baf6697e ​ submitted by /u/zanzagaes2 [link] [comments]  ( 45 min )
    [PROJECT] Text cluster of MxN dimensions as training set for AI?
    I have a text clustering project. It clusters texts in MxN dimensions. M is a subset of N, where N is the total number of domains. The text corpus is a set of academic papers. The clusters are cross disciplinary subjects, defined by M. Clusters are identified by MANOVA tests of sets of cross products. Goal is to identify texts of interest for research. E.g. identify clusters of papers relevant to a combination of subjects, or identify areas of research by their cluster, or identify outlier research. This is a N versus NP problem. It requires a great deal of processing time to cluster texts. I may do so for a corpus of 10k research papers, but that is a static set, and papers cannot be appended to the corpus without affecting all other clusters of the corpus. So I am considering creating a training set of 10k papers, and writing an AI to identify and cluster texts without comparing it to the rest of the corpus. I want feedback and ideas. I wont specify what I am looking for yet, because I am certain some of the responses here will point out something I did not consider. So please, comment with your thoughts. Tell me what you know. Give me your ideas. submitted by /u/Smedskjaer [link] [comments]  ( 43 min )
    [D] Bees: a new unit of measurement for ML model size
    Would like to hear about what you guys think about this approach? submitted by /u/ThePerson654321 [link] [comments]  ( 43 min )
    [P] Get 2x Faster Transcriptions with OpenAI Whisper Large on Kernl
    We are happy to announce the support of OpenAI Whisper model (ASR task) on Kernl. We focused on high quality transcription in a latency sensitive scenario, meaning: whisper-large-v2 weights beam search 5 (as recommended in the related paper) We measured a 2.3x speedup on Nvidia A100 GPU (2.4x on 3090 RTX) compared to Hugging Face implementation using FP16 mixed precision on transcribing librispeech test set (over 2600 examples). For now, OpenAI implementation is not yet PyTorch 2.0 compliant. In the post below, we discuss what worked (CUDA Graph), our tricks (to significantly reduce memory footprint), and what did not pay off (Flash attention and some other custom Triton kernels). Kernl repository: https://github.com/ELS-RD/kernl Reproduction script: https://github.com/ELS-RD…  ( 62 min )
    [D] Format for ICML tutorial submission?
    Hello! I'm quite new to this. I was wondering what the right format is for submitting a successful tutorial proposal. Should I just use the LaTeX style files but modify the content for a tutorial proposal? submitted by /u/Acceptable_League160 [link] [comments]  ( 6 min )
    [D] Are there emergent abilities of image models?
    Just finished reading the Stanford/Google survey paper (https://arxiv.org/abs/2206.07682) on emergent abilities of large language models. It made me wonder: do image generation models have emergent abilities, too? Do we know? I can't quite wrap my head around what such an ability would even look like. Figured maybe other folks had given this a think. submitted by /u/These-Assignment-936 [link] [comments]  ( 46 min )
    [D] Are there any AI model that I can use to improve very bad quality sound recording? Removing noise and improving overall quality
    I have got old lecture recordings I want to improve their sound quality I have tested adobe AI noise removal but not very good I also tested descript studio sound not very good either I wonder if there are any public model, github repo, github project, hugging face repo that I can use to remove noise and improve sound quality of existing audio recordings? Thank you so much for replies Recordings are in English Here example recording that needs to be cleaned 5 min audio : https://sndup.net/stjs/ full lecture : https://youtu.be/2zY1dQDGl3o submitted by /u/CeFurkan [link] [comments]  ( 45 min )
  • Open

    GPT in 60 Lines of NumPy
    submitted by /u/nickb [link] [comments]  ( 40 min )
    Using Model Weights in Another Model
    Hello everyone, I wrote two models from scratch. I want to use first model weights in the second model. However, these two models have not same architecture. I wrote the code as follows: # Load the pre-trained model model_CovidNet = load_model("CovidNet_Model.h5") # Access the weights CovidNet_weights = model_CovidNet.get_weights() This my first layer in the model and I want to use the weights as follows: x = Conv2D(64, 7, strides = 2, padding = 'same', kernel_initializer = CovidNet_weight)(input) I am getting dimension error. How can I solve this problem? How can I use other model weights in my other model? I'd be appreciate to any kind help. submitted by /u/Hungry-Engineer-5696 [link] [comments]  ( 41 min )
    Researchers Discover a More Flexible Approach to Machine Learning: “Liquid” neural nets, based on a worm’s nervous system
    submitted by /u/nickb [link] [comments]  ( 41 min )
    TEXTure: Text to 3D textures
    submitted by /u/oridnary_artist [link] [comments]  ( 40 min )
    Modifying and Reducing Threshold Values with Each Iteration
    Are there any examples of neural networks that, when training, reduce a threshold value when comparing what they get to the correct output each time they iterate? Say a neural network is doing number recognition. Are there any examples of a neural network requiring less and less error in identifying the numbers for more and more iterations through the neural network? This would mean that the neural network must spend more and more time training the parameters(or hyperparameters in some cases I suppose) for every iteration it has performed unless the error is reduced substantially. submitted by /u/AstroBullivant [link] [comments]  ( 41 min )
  • Open

    Crossing Continents: XPENG G9 SUV and P7 Sedan Set Course for Scandinavia, the Netherlands
    Electric automaker XPENG’s flagship G9 SUV and P7 sports sedan are now available for order in Sweden, Denmark, Norway and the Netherlands — an expansion revealed last week at the eCar Expo in Stockholm. The intelligent electric vehicles are built on the high-performance NVIDIA DRIVE Orin centralized compute architecture and deliver AI capabilities that are Read article >  ( 5 min )
    3D Artist Brings Ride and Joy to Automotive Designs With Real-Time Renders Using NVIDIA RTX
    Designing automotive visualizations can be incredibly time consuming. To make the renders look as realistic as possible, artists need to consider material textures, paints, realistic lighting and reflections, and more. For 3D artist David Baylis, it’s important to include these details and still create high-resolution renders in a short amount of time. That’s why he Read article >  ( 6 min )
    Gather Your Party: GFN Thursday Brings ‘Baldur’s Gate 3’ to the Cloud
    Venture to the Forgotten Realms this GFN Thursday in Baldur’s Gate 3, streaming on GeForce NOW. Celebrations for the cloud gaming service’s third anniversary continue with a Dying Light 2 reward that’s to die for. It’s the cherry on top of three new titles joining the GeForce NOW library this week. Roll for Initiative Mysterious Read article >  ( 5 min )
  • Open

    Recognizing squares
    Suppose you’re given a number and you’d like to tell whether its a square, or at least you’d like to be able to determine quickly if it’s not a square. This post began as a thread I wrote on Twitter. For starters, the last digit of a square in base 10 must be 0, 1, […] Recognizing squares first appeared on John D. Cook.  ( 5 min )
  • Open

    Linear Partial Monitoring for Sequential Decision-Making: Algorithms, Regret Bounds and Applications. (arXiv:2302.03683v1 [cs.LG])
    Partial monitoring is an expressive framework for sequential decision-making with an abundance of applications, including graph-structured and dueling bandits, dynamic pricing and transductive feedback models. We survey and extend recent results on the linear formulation of partial monitoring that naturally generalizes the standard linear bandit setting. The main result is that a single algorithm, information-directed sampling (IDS), is (nearly) worst-case rate optimal in all finite-action games. We present a simple and unified analysis of stochastic partial monitoring, and further extend the model to the contextual and kernelized setting.  ( 2 min )
    Fuzzy Expert Systems for Prediction of ICU Admission in Patients with COVID-19. (arXiv:2104.12868v2 [cs.LG] UPDATED)
    The pandemic COVID-19 disease has had a dramatic impact on almost all countries around the world so that many hospitals have been overwhelmed with Covid-19 cases. As medical resources are limited, deciding on the proper allocation of these resources is a very crucial issue. Besides, uncertainty is a major factor that can affect decisions, especially in medical fields. To cope with this issue, we use fuzzy logic (FL) as one of the most suitable methods in modeling systems with high uncertainty and complexity. We intend to make use of the advantages of FL in decisions on cases that need to treat in ICU. In this study, an interval type-2 fuzzy expert system is proposed for prediction of ICU admission in COVID-19 patients. For this prediction task, we also developed an adaptive neuro-fuzzy inference system (ANFIS). Finally, the results of these fuzzy systems are compared to some well-known classification methods such as Naive Bayes (NB), Case-Based Reasoning (CBR), Decision Tree (DT), and K Nearest Neighbor (KNN). The results show that the type-2 fuzzy expert system and ANFIS models perform competitively in terms of accuracy and F-measure compared to the other system modeling techniques.  ( 2 min )
    A Survey of Supernet Optimization and its Applications: Spatial and Temporal Optimization for Neural Architecture Search. (arXiv:2204.03916v2 [cs.CV] UPDATED)
    This survey focuses on categorizing and evaluating the methods of supernet optimization in the field of Neural Architecture Search (NAS). Supernet optimization involves training a single, over-parameterized network that encompasses the search space of all possible network architectures. The survey analyses supernet optimization methods based on their approaches to spatial and temporal optimization. Spatial optimization relates to optimizing the architecture and parameters of the supernet and its subnets, while temporal optimization deals with improving the efficiency of selecting architectures from the supernet. The benefits, limitations, and potential applications of these methods in various tasks and settings, including transferability, domain generalization, and Transformer models, are also discussed.  ( 2 min )
    Projective Ranking-based GNN Evasion Attacks. (arXiv:2202.12993v3 [cs.LG] UPDATED)
    Graph neural networks (GNNs) offer promising learning methods for graph-related tasks. However, GNNs are at risk of adversarial attacks. Two primary limitations of the current evasion attack methods are highlighted: (1) The current GradArgmax ignores the "long-term" benefit of the perturbation. It is faced with zero-gradient and invalid benefit estimates in certain situations. (2) In the reinforcement learning-based attack methods, the learned attack strategies might not be transferable when the attack budget changes. To this end, we first formulate the perturbation space and propose an evaluation framework and the projective ranking method. We aim to learn a powerful attack strategy then adapt it as little as possible to generate adversarial samples under dynamic budget settings. In our method, based on mutual information, we rank and assess the attack benefits of each perturbation for an effective attack strategy. By projecting the strategy, our method dramatically minimizes the cost of learning a new attack strategy when the attack budget changes. In the comparative assessment with GradArgmax and RL-S2V, the results show our method owns high attack performance and effective transferability. The visualization of our method also reveals various attack patterns in the generation of adversarial samples.  ( 2 min )
    Don't Blame the Annotator: Bias Already Starts in the Annotation Instructions. (arXiv:2205.00415v2 [cs.CL] UPDATED)
    In recent years, progress in NLU has been driven by benchmarks. These benchmarks are typically collected by crowdsourcing, where annotators write examples based on annotation instructions crafted by dataset creators. In this work, we hypothesize that annotators pick up on patterns in the crowdsourcing instructions, which bias them to write many similar examples that are then over-represented in the collected data. We study this form of bias, termed instruction bias, in 14 recent NLU benchmarks, showing that instruction examples often exhibit concrete patterns, which are propagated by crowdworkers to the collected data. This extends previous work (Geva et al., 2019) and raises a new concern of whether we are modeling the dataset creator's instructions, rather than the task. Through a series of experiments, we show that, indeed, instruction bias can lead to overestimation of model performance, and that models struggle to generalize beyond biases originating in the crowdsourcing instructions. We further analyze the influence of instruction bias in terms of pattern frequency and model size, and derive concrete recommendations for creating future NLU benchmarks.  ( 2 min )
    Recent advances in the Self-Referencing Embedding Strings (SELFIES) library. (arXiv:2302.03620v1 [physics.chem-ph])
    String-based molecular representations play a crucial role in cheminformatics applications, and with the growing success of deep learning in chemistry, have been readily adopted into machine learning pipelines. However, traditional string-based representations such as SMILES are often prone to syntactic and semantic errors when produced by generative models. To address these problems, a novel representation, SELF-referencIng Embedded Strings (SELFIES), was proposed that is inherently 100% robust, alongside an accompanying open-source implementation. Since then, we have generalized SELFIES to support a wider range of molecules and semantic constraints and streamlined its underlying grammar. We have implemented this updated representation in subsequent versions of \selfieslib, where we have also made major advances with respect to design, efficiency, and supported features. Hence, we present the current status of \selfieslib (version 2.1.1) in this manuscript.  ( 2 min )
    Automated Huntington's Disease Prognosis via Biomedical Signals and Shallow Machine Learning. (arXiv:2302.03605v1 [eess.SP])
    Huntington's disease (HD) is a rare, genetically-determined brain disorder that limits the life of the patient, although early prognosis of HD can substantially improve the patient's quality of life. Current HD prognosis methods include using a variety of complex biomarkers such as clinical and imaging factors, however these methods have many shortfalls, such as their resource demand and failure to distinguish symptomatic and asymptomatic patients. Quantitative biomedical signaling has been used for diagnosis of other neurological disorders such as schizophrenia, and has potential for exposing abnormalities in HD patients. In this project, we used a premade, certified dataset collected at a clinic with 27 HD positive patients, 36 controls, and 6 unknowns with electroencephalography, electrocardiography, and functional near-infrared spectroscopy data. We first preprocessed the data and extracted a variety of features from both the transformed and raw signals, after which we applied a plethora of shallow machine learning techniques. We found the highest accuracy was achieved by a scaled-out Extremely Randomized Trees algorithm, with area under the curve of the receiver operator characteristic of 0.963 and accuracy of 91.353%. The subsequent feature analysis showed that 60.865% of the features had p<0.05, with the features from the raw signal being most significant. The results indicate the promise of neural and cardiac signals for marking abnormalities in HD, as well as evaluating the progression of the disease in  ( 2 min )
    Action Matching: Learning Stochastic Dynamics from Samples. (arXiv:2210.06662v2 [cs.LG] UPDATED)
    Learning the continuous dynamics of a system from snapshots of its time evolution is a problem which appears throughout natural sciences and machine learning, including in quantum systems, single-cell biological data, and generative modeling. In these settings, we assume that only uncorrelated samples rather than full trajectory data are available. In order to better understand the systems under observation, we would like to learn a model of the underlying process that allows us to propagate samples in time and thereby simulate entire individual trajectories. In this work, we propose Action Matching, a method for learning a rich family of dynamics using only independent samples from its time evolution. We derive a tractable training objective, which does not rely on explicit assumptions about the underlying dynamics and does not require back-propagation through differential equation or optimal transport solvers. Inspired by connections with optimal transport, we derive extensions of Action Matching to learn stochastic differential equations and dynamics involving creation or destruction of probability mass. Finally, we showcase applications of Action Matching by achieving competitive performance in a diverse set of experiments from biology, physics, and generative modeling.
    Black-Box Testing of Deep Neural Networks Through Test Case Diversity. (arXiv:2112.12591v5 [cs.SE] UPDATED)
    Deep Neural Networks (DNNs) have been extensively used in many areas including image processing, medical diagnostics, and autonomous driving. However, DNNs can exhibit erroneous behaviours that may lead to critical errors, especially when used in safety-critical systems. Inspired by testing techniques for traditional software systems, researchers have proposed neuron coverage criteria, as an analogy to source code coverage, to guide the testing of DNN models. Despite very active research on DNN coverage, several recent studies have questioned the usefulness of such criteria in guiding DNN testing. Further, from a practical standpoint, these criteria are white-box as they require access to the internals or training data of DNN models, which is in many contexts not feasible or convenient. In this paper, we investigate black-box input diversity metrics as an alternative to white-box coverage criteria. To this end, we first select and adapt three diversity metrics and study, in a controlled manner, their capacity to measure actual diversity in input sets. We then analyse their statistical association with fault detection using four datasets and five DNN models. We further compare diversity with state-of-the-art white-box coverage criteria. Our experiments show that relying on the diversity of image features embedded in test input sets is a more reliable indicator than coverage criteria to effectively guide the testing of DNNs. Indeed, we found that one of our selected black-box diversity metrics far outperforms existing coverage criteria in terms of fault-revealing capability and computational time. Results also confirm the suspicions that state-of-the-art coverage metrics are not adequate to guide the construction of test input sets to detect as many faults as possible with natural inputs.
    Deep Linear Networks can Benignly Overfit when Shallow Ones Do. (arXiv:2209.09315v2 [cs.LG] UPDATED)
    We bound the excess risk of interpolating deep linear networks trained using gradient flow. In a setting previously used to establish risk bounds for the minimum $\ell_2$-norm interpolant, we show that randomly initialized deep linear networks can closely approximate or even match known bounds for the minimum $\ell_2$-norm interpolant. Our analysis also reveals that interpolating deep linear models have exactly the same conditional variance as the minimum $\ell_2$-norm solution. Since the noise affects the excess risk only through the conditional variance, this implies that depth does not improve the algorithm's ability to "hide the noise". Our simulations verify that aspects of our bounds reflect typical behavior for simple data distributions. We also find that similar phenomena are seen in simulations with ReLU networks, although the situation there is more nuanced.
    Accelerated Nonnegative Tensor Completion via Integer Programming. (arXiv:2211.15770v2 [cs.LG] UPDATED)
    The problem of tensor completion has applications in healthcare, computer vision, and other domains. However, past approaches to tensor completion have faced a tension in that they either have polynomial-time computation but require exponentially more samples than the information-theoretic rate, or they use fewer samples but require solving NP-hard problems for which there are no known practical algorithms. A recent approach, based on integer programming, resolves this tension for nonnegative tensor completion. It achieves the information-theoretic sample complexity rate and deploys the Blended Conditional Gradients algorithm, which requires a linear (in numerical tolerance) number of oracle steps to converge to the global optimum. The tradeoff in this approach is that, in the worst case, the oracle step requires solving an integer linear program. Despite this theoretical limitation, numerical experiments show that this algorithm can, on certain instances, scale up to 100 million entries while running on a personal computer. The goal of this paper is to further enhance this algorithm, with the intention to expand both the breadth and scale of instances that can be solved. We explore several variants that can maintain the same theoretical guarantees as the algorithm, but offer potentially faster computation. We consider different data structures, acceleration of gradient descent steps, and the use of the Blended Pairwise Conditional Gradients algorithm. We describe the original approach and these variants, and conduct numerical experiments in order to explore various tradeoffs in these algorithmic design choices.
    Causal Effect Identification in Cluster DAGs. (arXiv:2202.12263v2 [stat.ME] UPDATED)
    Reasoning about the effect of interventions and counterfactuals is a fundamental task found throughout the data sciences. A collection of principles, algorithms, and tools has been developed for performing such tasks in the last decades (Pearl, 2000). One of the pervasive requirements found throughout this literature is the articulation of assumptions, which commonly appear in the form of causal diagrams. Despite the power of this approach, there are significant settings where the knowledge necessary to specify a causal diagram over all variables is not available, particularly in complex, high-dimensional domains. In this paper, we introduce a new graphical modeling tool called cluster DAGs (for short, C-DAGs) that allows for the partial specification of relationships among variables based on limited prior knowledge, alleviating the stringent requirement of specifying a full causal diagram. A C-DAG specifies relationships between clusters of variables, while the relationships between the variables within a cluster are left unspecified, and can be seen as a graphical representation of an equivalence class of causal diagrams that share the relationships among the clusters. We develop the foundations and machinery for valid inferences over C-DAGs about the clusters of variables at each layer of Pearl's Causal Hierarchy (Pearl and Mackenzie 2018; Bareinboim et al. 2020) - L1 (probabilistic), L2 (interventional), and L3 (counterfactual). In particular, we prove the soundness and completeness of d-separation for probabilistic inference in C-DAGs. Further, we demonstrate the validity of Pearl's do-calculus rules over C-DAGs and show that the standard ID identification algorithm is sound and complete to systematically compute causal effects from observational data given a C-DAG. Finally, we show that C-DAGs are valid for performing counterfactual inferences about clusters of variables.
    Distribution estimation and change-point estimation for time series via DNN-based GANs. (arXiv:2211.14577v2 [cs.LG] UPDATED)
    The generative adversarial networks (GANs) have recently been applied to estimating the distribution of independent and identically distributed data, and have attracted a lot of research attention. In this paper, we use the blocking technique to demonstrate the effectiveness of GANs for estimating the distribution of stationary time series. Theoretically, we derive a non-asymptotic error bound for the Deep Neural Network (DNN)-based GANs estimator for the stationary distribution of the time series. Based on our theoretical analysis, we propose an algorithm for estimating the change point in time series distribution. The two main results are verified by two Monte Carlo experiments respectively, one is to estimate the joint stationary distribution of $5$-tuple samples of a 20 dimensional AR(3) model, the other is about estimating the change point at the combination of two different stationary time series. A real world empirical application to the human activity recognition dataset highlights the potential of the proposed methods.
    FedVeca: Federated Vectorized Averaging on Non-IID Data with Adaptive Bi-directional Global Objective. (arXiv:2209.13803v2 [cs.LG] UPDATED)
    Federated Learning (FL) is a distributed machine learning framework to alleviate the data silos, where decentralized clients collaboratively learn a global model without sharing their private data. However, the clients' Non-Independent and Identically Distributed (Non-IID) data negatively affect the trained model, and clients with different numbers of local updates may cause significant gaps to the local gradients in each communication round. In this paper, we propose a Federated Vectorized Averaging (FedVeca) method to address the above problem on Non-IID data. Specifically, we set a novel objective for the global model which is related to the local gradients. The local gradient is defined as a bi-directional vector with step size and direction, where the step size is the number of local updates and the direction is divided into positive and negative according to our definition. In FedVeca, the direction is influenced by the step size, thus we average the bi-directional vectors to reduce the effect of different step sizes. Then, we theoretically analyze the relationship between the step sizes and the global objective, and obtain upper bounds on the step sizes per communication round. Based on the upper bounds, we design an algorithm for the server and the client to adaptively adjusts the step sizes that make the objective close to the optimum. Finally, we conduct experiments on different datasets, models and scenarios by building a prototype system, and the experimental results demonstrate the effectiveness and efficiency of the FedVeca method.
    Variance-Aware Sparse Linear Bandits. (arXiv:2205.13450v3 [cs.LG] UPDATED)
    It is well-known that for sparse linear bandits, when ignoring the dependency on sparsity which is much smaller than the ambient dimension, the worst-case minimax regret is $\widetilde{\Theta}\left(\sqrt{dT}\right)$ where $d$ is the ambient dimension and $T$ is the number of rounds. On the other hand, in the benign setting where there is no noise and the action set is the unit sphere, one can use divide-and-conquer to achieve $\widetilde{\mathcal O}(1)$ regret, which is (nearly) independent of $d$ and $T$. In this paper, we present the first variance-aware regret guarantee for sparse linear bandits: $\widetilde{\mathcal O}\left(\sqrt{d\sum_{t=1}^T \sigma_t^2} + 1\right)$, where $\sigma_t^2$ is the variance of the noise at the $t$-th round. This bound naturally interpolates the regret bounds for the worst-case constant-variance regime (i.e., $\sigma_t \equiv \Omega(1)$) and the benign deterministic regimes (i.e., $\sigma_t \equiv 0$). To achieve this variance-aware regret guarantee, we develop a general framework that converts any variance-aware linear bandit algorithm to a variance-aware algorithm for sparse linear bandits in a "black-box" manner. Specifically, we take two recent algorithms as black boxes to illustrate that the claimed bounds indeed hold, where the first algorithm can handle unknown-variance cases and the second one is more efficient.
    Compact Graph Representation of crystal structures using Point-wise Distance Distributions. (arXiv:2212.11246v2 [physics.comp-ph] UPDATED)
    Use of graphs to represent crystal structures has become popular in recent years as they provide a natural translation from atoms and bonds to nodes and edges. Graphs capture structure, while remaining invariant to the symmetries that crystals display. Several works in property prediction, including those with state-of-the-art results, make use of the Crystal Graph. The present work offers a graph based on Point-wise Distance Distributions which retains symmetrical invariance, decreases computational load, and yields similar or better prediction accuracy on both experimental and simulated crystals.
    Learning Continuous Rotation Canonicalization with Radial Beam Sampling. (arXiv:2206.10690v2 [cs.CV] UPDATED)
    Nearly all state of the art vision models are sensitive to image rotations. Existing methods often compensate for missing inductive biases by using augmented training data to learn pseudo-invariances. Alongside the resource demanding data inflation process, predictions often poorly generalize. The inductive biases inherent to convolutional neural networks allow for translation equivariance through kernels acting parallely to the horizontal and vertical axes of the pixel grid. This inductive bias, however, does not allow for rotation equivariance. We propose a radial beam sampling strategy along with radial kernels operating on these beams to inherently incorporate center-rotation covariance. Together with an angle distance loss, we present a radial beam-based image canonicalization model, short BIC. Our model allows for maximal continuous angle regression and canonicalizes arbitrary center-rotated input images. As a pre-processing model, this enables rotation-invariant vision pipelines with model-agnostic rotation-sensitive downstream predictions. We show that our end-to-end trained angle regressor is able to predict continuous rotation angles on several vision datasets, i.e. FashionMNIST, CIFAR10, COIL100, and LFW.
    Towards Out-of-Distribution Adversarial Robustness. (arXiv:2210.03150v3 [cs.LG] UPDATED)
    Adversarial robustness continues to be a major challenge for deep learning. A core issue is that robustness to one type of attack often fails to transfer to other attacks. While prior work establishes a theoretical trade-off in robustness against different $L_p$ norms, we show that there is potential for improvement against many commonly used attacks by adopting a domain generalisation approach. Concretely, we treat each type of attack as a domain, and apply the Risk Extrapolation method (REx), which promotes similar levels of robustness against all training attacks. Compared to existing methods, we obtain similar or superior worst-case adversarial robustness on attacks seen during training. Moreover, we achieve superior performance on families or tunings of attacks only encountered at test time. On ensembles of attacks, our approach improves the accuracy from 3.4% the best existing baseline to 25.9% on MNIST, and from 16.9% to 23.5% on CIFAR10.
    Take One Gram of Neural Features, Get Enhanced Group Robustness. (arXiv:2208.12625v2 [cs.LG] UPDATED)
    Predictive performance of machine learning models trained with empirical risk minimization (ERM) can degrade considerably under distribution shifts. The presence of spurious correlations in training datasets leads ERM-trained models to display high loss when evaluated on minority groups not presenting such correlations. Extensive attempts have been made to develop methods improving worst-group robustness. However, they require group information for each training input or at least, a validation set with group labels to tune their hyperparameters, which may be expensive to get or unknown a priori. In this paper, we address the challenge of improving group robustness without group annotation during training or validation. To this end, we propose to partition the training dataset into groups based on Gram matrices of features extracted by an ``identification'' model and to apply robust optimization based on these pseudo-groups. In the realistic context where no group labels are available, our experiments show that our approach not only improves group robustness over ERM but also outperforms all recent baselines
    Scaleformer: Iterative Multi-scale Refining Transformers for Time Series Forecasting. (arXiv:2206.04038v4 [cs.LG] UPDATED)
    The performance of time series forecasting has recently been greatly improved by the introduction of transformers. In this paper, we propose a general multi-scale framework that can be applied to the state-of-the-art transformer-based time series forecasting models (FEDformer, Autoformer, etc.). By iteratively refining a forecasted time series at multiple scales with shared weights, introducing architecture adaptations, and a specially-designed normalization scheme, we are able to achieve significant performance improvements, from 5.5% to 38.5% across datasets and transformer architectures, with minimal additional computational overhead. Via detailed ablation studies, we demonstrate the effectiveness of each of our contributions across the architecture and methodology. Furthermore, our experiments on various public datasets demonstrate that the proposed improvements outperform their corresponding baseline counterparts. Our code is publicly available in https://github.com/BorealisAI/scaleformer.
    Less is More: Understanding Word-level Textual Adversarial Attack via n-gram Frequency Descend. (arXiv:2302.02568v2 [cs.CL] UPDATED)
    Word-level textual adversarial attacks have achieved striking performance in fooling natural language processing models. However, the fundamental questions of why these attacks are effective, and the intrinsic properties of the adversarial examples (AEs), are still not well understood. This work attempts to interpret textual attacks through the lens of $n$-gram frequency. Specifically, it is revealed that existing word-level attacks exhibit a strong tendency toward generation of examples with $n$-gram frequency descend ($n$-FD). Intuitively, this finding suggests a natural way to improve model robustness by training the model on the $n$-FD examples. To verify this idea, we devise a model-agnostic and gradient-free AE generation approach that relies solely on the $n$-gram frequency information, and further integrate it into the recently proposed convex hull framework for adversarial training. Surprisingly, the resultant method performs quite similarly to the original gradient-based method in terms of model robustness. These findings provide a human-understandable perspective for interpreting word-level textual adversarial attacks, and a new direction to improve model robustness.
    A Data Driven Method for Multi-step Prediction of Ship Roll Motion in High Sea States. (arXiv:2207.12673v3 [cs.LG] UPDATED)
    Ship roll motion in high sea states has large amplitudes and nonlinear dynamics, and its prediction is significant for operability, safety, and survivability. This paper presents a novel data-driven methodology to provide a multi-step prediction of ship roll motions in high sea states. A hybrid neural network is proposed that combines long short-term memory (LSTM) and convolutional neural network (CNN) in parallel. The motivation is to extract the nonlinear dynamic characteristics and the hydrodynamic memory information through the advantage of CNN and LSTM, respectively. For the feature selection, the time histories of motion states and wave heights are selected to involve sufficient information. Taken a scaled KCS as the study object, the ship motions in sea state 7 irregular long-crested waves are simulated and used for the validation. The results show that at least one period of roll motion can be accurately predicted. Compared with the single LSTM and CNN methods, the proposed method has better performance in predicting the amplitude of roll angles. Besides, the comparison results also demonstrate that selecting motion states and wave heights as feature space improves the prediction accuracy, verifying the effectiveness of the proposed method.
    Symmetric Pruning in Quantum Neural Networks. (arXiv:2208.14057v2 [quant-ph] UPDATED)
    Many fundamental properties of a quantum system are captured by its Hamiltonian and ground state. Despite the significance of ground states preparation (GSP), this task is classically intractable for large-scale Hamiltonians. Quantum neural networks (QNNs), which exert the power of modern quantum machines, have emerged as a leading protocol to conquer this issue. As such, how to enhance the performance of QNNs becomes a crucial topic in GSP. Empirical evidence showed that QNNs with handcraft symmetric ansatzes generally experience better trainability than those with asymmetric ansatzes, while theoretical explanations have not been explored. To fill this knowledge gap, here we propose the effective quantum neural tangent kernel (EQNTK) and connect this concept with over-parameterization theory to quantify the convergence of QNNs towards the global optima. We uncover that the advance of symmetric ansatzes attributes to their large EQNTK value with low effective dimension, which requests few parameters and quantum circuit depth to reach the over-parameterization regime permitting a benign loss landscape and fast convergence. Guided by EQNTK, we further devise a symmetric pruning (SP) scheme to automatically tailor a symmetric ansatz from an over-parameterized and asymmetric one to greatly improve the performance of QNNs when the explicit symmetry information of Hamiltonian is unavailable. Extensive numerical simulations are conducted to validate the analytical results of EQNTK and the effectiveness of SP.
    N-Gram Nearest Neighbor Machine Translation. (arXiv:2301.12866v2 [cs.CL] UPDATED)
    Nearest neighbor machine translation augments the Autoregressive Translation~(AT) with $k$-nearest-neighbor retrieval, by comparing the similarity between the token-level context representations of the target tokens in the query and the datastore. However, the token-level representation may introduce noise when translating ambiguous words, or fail to provide accurate retrieval results when the representation generated by the model contains indistinguishable context information, e.g., Non-Autoregressive Translation~(NAT) models. In this paper, we propose a novel $n$-gram nearest neighbor retrieval method that is model agnostic and applicable to both AT and NAT models. Specifically, we concatenate the adjacent $n$-gram hidden representations as the key, while the tuple of corresponding target tokens is the value. In inference, we propose tailored decoding algorithms for AT and NAT models respectively. We demonstrate that the proposed method consistently outperforms the token-level method on both AT and NAT models as well on general as on domain adaptation translation tasks. On domain adaptation, the proposed method brings $1.03$ and $2.76$ improvements regarding the average BLEU score on AT and NAT models respectively.
    Incremental Spectral Learning in Fourier Neural Operator. (arXiv:2211.15188v3 [cs.LG] UPDATED)
    Recently, neural networks have proven their impressive ability to solve partial differential equations (PDEs). Among them, Fourier neural operator (FNO) has shown success in learning solution operators for highly non-linear problems such as turbulence flow. FNO learns weights over different frequencies and as a regularization procedure, it only retains frequencies below a fixed threshold. However, manually selecting such an appropriate threshold for frequencies can be challenging, as an incorrect threshold can lead to underfitting or overfitting. To this end, we propose Incremental Fourier Neural Operator (IFNO) that incrementally adds frequency modes by increasing the truncation threshold adaptively during training. We show that IFNO reduces the testing loss by more than 10% while using 20% fewer frequency modes, compared to the standard FNO training on the Kolmogorov Flow (with Reynolds number up to 5000) under the few-data regime.
    Supervised Metric Learning to Rank for Retrieval via Contextual Similarity Optimization. (arXiv:2210.01908v2 [cs.LG] UPDATED)
    There is extensive interest in metric learning methods for image retrieval. Many metric learning loss functions focus on learning a correct ranking of training samples, but strongly overfit semantically inconsistent labels and require a large amount of data. To address these shortcomings, we propose a new metric learning method, called contextual loss, which optimizes contextual similarity in addition to cosine similarity. Our contextual loss implicitly enforces semantic consistency among neighbors while converging to the correct ranking. We empirically show that the proposed loss is more robust to label noise, and is less prone to overfitting even when a large portion of train data is withheld. Extensive experiments demonstrate that our method achieves a new state-of-the-art across four image retrieval benchmarks and multiple different evaluation settings. Code is available at: https://github.com/Chris210634/metric-learning-using-contextual-similarity
    Integral Continual Learning Along the Tangent Vector Field of Tasks. (arXiv:2211.13108v2 [cs.LG] UPDATED)
    We propose a lightweight continual learning method which incorporates information from specialized datasets incrementally, by integrating it along the vector field of "generalist" models. The tangent plane to the specialist model acts as a generalist guide and avoids the kind of over-fitting that leads to catastrophic forgetting, while exploiting the convexity of the optimization landscape in the tangent plane. It maintains a small fixed-size memory buffer, as low as 0.4% of the source datasets, which is updated by simple resampling. Our method achieves state-of-the-art across various buffer sizes for different datasets. Specifically, in the class-incremental setting we outperform the existing methods that do not require distillation by an average of 18.77% and 28.48%, for Seq-CIFAR-10 and Seq-TinyImageNet respectively. Our method can easily be combined with existing replay-based continual learning methods. When memory buffer constraints are relaxed to allow storage of metadata such as logits, we attain state-of-the-art accuracy with an error reduction of 17.84% towards the paragon performance on Seq-CIFAR-10.
    Neural-network solutions to stochastic reaction networks. (arXiv:2210.01169v2 [q-bio.MN] UPDATED)
    The stochastic reaction network in which chemical species evolve through a set of reactions is widely used to model stochastic processes in physics, chemistry and biology. To characterize the evolving joint probability distribution in the state space of species counts requires solving a system of ordinary differential equations, the chemical master equation, where the size of the counting state space increases exponentially with the type of species, making it challenging to investigate the stochastic reaction network. Here, we propose a machine-learning approach using the variational autoregressive network to solve the chemical master equation. Training the autoregressive network employs the policy gradient algorithm in the reinforcement learning framework, which does not require any data simulated in prior by another method. Different from simulating single trajectories, the approach tracks the time evolution of the joint probability distribution, and supports direct sampling of configurations and computing their normalized joint probabilities. We apply the approach to representative examples in physics and biology, and demonstrate that it accurately generates the probability distribution over time. The variational autoregressive network exhibits a plasticity in representing the multimodal distribution, cooperates with the conservation law, enables time-dependent reaction rates, and is efficient for high-dimensional reaction networks with allowing a flexible upper count limit. The results suggest a general approach to investigate stochastic reaction networks based on modern machine learning.
    Graph Neural Networks for Molecules. (arXiv:2209.05582v2 [cs.LG] UPDATED)
    Graph neural networks (GNNs), which are capable of learning representations from graphical data, are naturally suitable for modeling molecular systems. This review introduces GNNs and their various applications for small organic molecules. GNNs rely on message-passing operations, a generic yet powerful framework, to update node features iteratively. Many researches design GNN architectures to effectively learn topological information of 2D molecule graphs as well as geometric information of 3D molecular systems. GNNs have been implemented in a wide variety of molecular applications, including molecular property prediction, molecular scoring and docking, molecular optimization and de novo generation, molecular dynamics simulation, etc. Besides, the review also summarizes the recent development of self-supervised learning for molecules with GNNs.
    RegMixup: Mixup as a Regularizer Can Surprisingly Improve Accuracy and Out Distribution Robustness. (arXiv:2206.14502v2 [cs.LG] UPDATED)
    We show that the effectiveness of the well celebrated Mixup [Zhang et al., 2018] can be further improved if instead of using it as the sole learning objective, it is utilized as an additional regularizer to the standard cross-entropy loss. This simple change not only provides much improved accuracy but also significantly improves the quality of the predictive uncertainty estimation of Mixup in most cases under various forms of covariate shifts and out-of-distribution detection experiments. In fact, we observe that Mixup yields much degraded performance on detecting out-of-distribution samples possibly, as we show empirically, because of its tendency to learn models that exhibit high-entropy throughout; making it difficult to differentiate in-distribution samples from out-distribution ones. To show the efficacy of our approach (RegMixup), we provide thorough analyses and experiments on vision datasets (ImageNet & CIFAR-10/100) and compare it with a suite of recent approaches for reliable uncertainty estimation.
    Deep Reinforcement Learning for Traffic Light Control in Intelligent Transportation Systems. (arXiv:2302.03669v1 [cs.LG])
    Smart traffic lights in intelligent transportation systems (ITSs) are envisioned to greatly increase traffic efficiency and reduce congestion. Deep reinforcement learning (DRL) is a promising approach to adaptively control traffic lights based on the real-time traffic situation in a road network. However, conventional methods may suffer from poor scalability. In this paper, we investigate deep reinforcement learning to control traffic lights, and both theoretical analysis and numerical experiments show that the intelligent behavior ``greenwave" (i.e., a vehicle will see a progressive cascade of green lights, and not have to brake at any intersection) emerges naturally a grid road network, which is proved to be the optimal policy in an avenue with multiple cross streets. As a first step, we use two DRL algorithms for the traffic light control problems in two scenarios. In a single road intersection, we verify that the deep Q-network (DQN) algorithm delivers a thresholding policy; and in a grid road network, we adopt the deep deterministic policy gradient (DDPG) algorithm. Secondly, numerical experiments show that the DQN algorithm delivers the optimal control, and the DDPG algorithm with passive observations has the capability to produce on its own a high-level intelligent behavior in a grid road network, namely, the ``greenwave" policy emerges. We also verify the ``greenwave" patterns in a $5 \times 10$ grid road network. Thirdly, the ``greenwave" patterns demonstrate that DRL algorithms produce favorable solutions since the ``greenwave" policy shown in experiment results is proved to be optimal in a specified traffic model (an avenue with multiple cross streets). The delivered policies both in a single road intersection and a grid road network demonstrate the scalability of DRL algorithms.
    Breaking the Curse of Multiagents in a Large State Space: RL in Markov Games with Independent Linear Function Approximation. (arXiv:2302.03673v1 [cs.LG])
    We propose a new model, independent linear Markov game, for multi-agent reinforcement learning with a large state space and a large number of agents. This is a class of Markov games with independent linear function approximation, where each agent has its own function approximation for the state-action value functions that are marginalized by other players' policies. We design new algorithms for learning the Markov coarse correlated equilibria (CCE) and Markov correlated equilibria (CE) with sample complexity bounds that only scale polynomially with each agent's own function class complexity, thus breaking the curse of multiagents. In contrast, existing works for Markov games with function approximation have sample complexity bounds scale with the size of the \emph{joint action space} when specialized to the canonical tabular Markov game setting, which is exponentially large in the number of agents. Our algorithms rely on two key technical innovations: (1) utilizing policy replay to tackle non-stationarity incurred by multiple agents and the use of function approximation; (2) separating learning Markov equilibria and exploration in the Markov games, which allows us to use the full-information no-regret learning oracle instead of the stronger bandit-feedback no-regret learning oracle used in the tabular setting. Furthermore, we propose an iterative-best-response type algorithm that can learn pure Markov Nash equilibria in independent linear Markov potential games. In the tabular case, by adapting the policy replay mechanism for independent linear Markov games, we propose an algorithm with $\widetilde{O}(\epsilon^{-2})$ sample complexity to learn Markov CCE, which improves the state-of-the-art result $\widetilde{O}(\epsilon^{-3})$ in Daskalakis et al. 2022, where $\epsilon$ is the desired accuracy, and also significantly improves other problem parameters.
    Data-driven anisotropic finite viscoelasticity using neural ordinary differential equations. (arXiv:2302.03598v1 [cond-mat.soft])
    We develop a fully data-driven model of anisotropic finite viscoelasticity using neural ordinary differential equations as building blocks. We replace the Helmholtz free energy function and the dissipation potential with data-driven functions that a priori satisfy physics-based constraints such as objectivity and the second law of thermodynamics. Our approach enables modeling viscoelastic behavior of materials under arbitrary loads in three-dimensions even with large deformations and large deviations from the thermodynamic equilibrium. The data-driven nature of the governing potentials endows the model with much needed flexibility in modeling the viscoelastic behavior of a wide class of materials. We train the model using stress-strain data from biological and synthetic materials including humain brain tissue, blood clots, natural rubber and human myocardium and show that the data-driven method outperforms traditional, closed-form models of viscoelasticity.
    MA2QL: A Minimalist Approach to Fully Decentralized Multi-Agent Reinforcement Learning. (arXiv:2209.08244v2 [cs.LG] UPDATED)
    Decentralized learning has shown great promise for cooperative multi-agent reinforcement learning (MARL). However, non-stationarity remains a significant challenge in fully decentralized learning. In the paper, we tackle the non-stationarity problem in the simplest and fundamental way and propose multi-agent alternate Q-learning (MA2QL), where agents take turns updating their Q-functions by Q-learning. MA2QL is a minimalist approach to fully decentralized cooperative MARL but is theoretically grounded. We prove that when each agent guarantees $\varepsilon$-convergence at each turn, their joint policy converges to a Nash equilibrium. In practice, MA2QL only requires minimal changes to independent Q-learning (IQL). We empirically evaluate MA2QL on a variety of cooperative multi-agent tasks. Results show MA2QL consistently outperforms IQL, which verifies the effectiveness of MA2QL, despite such minimal changes.
    A Cloud-Based Energy Management Strategy for Hybrid Electric City Bus Considering Real-Time Passenger Load Prediction. (arXiv:2010.15239v2 [eess.SY] UPDATED)
    Electric city bus gains popularity in recent years for its low greenhouse gas emission, low noise level, etc. Different from a passenger car, the weight of a city bus varies significantly with different amounts of onboard passengers. After analyzing the importance of battery aging and passenger load effects on an optimal energy management strategy, this study introduces the passenger load prediction into the hybrid-electric city buses energy management problem, which is not well studied in the existing literature. The average model, Decision Tree, Gradient Boost Decision Tree, and Neural Networks models are compared in the passenger load prediction. The Gradient Boost Decision Tree model is selected due to its best accuracy and high stability. Given the predicted passenger load, a dynamic programming algorithm determines the optimal power demand for supercapacitor and battery by optimizing the battery aging and energy usage leveraging cloud techniques. Then, rule extraction is conducted on dynamic programming results, and the rule is real-time loaded to the vehicle onboard controller to handle prediction errors and uncertainties. The proposed cloud-based Dynamic Programming and rule extraction framework with the passenger load prediction show 4% and 11% lower bus operating costs in off-peak and peak hours, respectively. The operating cost by the proposed framework is less than 1% of the dynamic programming with the true passenger load information.
    A Fuzzy-set-based Joint Distribution Adaptation Method for Regression and its Application to Online Damage Quantification for Structural Digital Twin. (arXiv:2211.02656v2 [cs.CE] UPDATED)
    Online damage quantification suffers from insufficient labeled data that weakens its accuracy. In this context, adopting the domain adaptation on historical labeled data from similar structures/damages or simulated digital twin data to assist the current diagnosis task would be beneficial. However, most domain adaptation methods are designed for classification and cannot efficiently address damage quantification, a regression problem with continuous real-valued labels. This study first proposes a novel domain adaptation method, the Online Fuzzy-set-based Joint Distribution Adaptation for Regression, to address this challenge. By converting the continuous real-valued labels to fuzzy class labels via fuzzy sets, the marginal and conditional distribution discrepancy are simultaneously measured to achieve the domain adaptation for the damage quantification task. Thanks to the superiority of the proposed method, a state-of-the-art online damage quantification framework based on domain adaptation is presented. Finally, the framework has been comprehensively demonstrated with a damaged helicopter panel, in which three types of damage domain adaptations (across different damage locations, across different damage types, and from simulation to experiment) are all conducted, proving the accuracy of damage quantification can be significantly improved in a realistic environment. It is expected that the proposed approach to be applied to the fleet-level digital twin considering the individual differences.
    Long Horizon Temperature Scaling. (arXiv:2302.03686v1 [cs.LG])
    Temperature scaling is a popular technique for tuning the sharpness of a model distribution. It is used extensively for sampling likely generations and calibrating model uncertainty, and even features as a controllable parameter to many large language models in deployment. However, autoregressive models rely on myopic temperature scaling that greedily optimizes the next token. To address this, we propose Long Horizon Temperature Scaling (LHTS), a novel approach for sampling from temperature-scaled joint distributions. LHTS is compatible with all likelihood-based models, and optimizes for the long-horizon likelihood of samples. We derive a temperature-dependent LHTS objective, and show that fine-tuning a model on a range of temperatures produces a single model capable of generation with a controllable long-horizon temperature parameter. We experiment with LHTS on image diffusion models and character/language autoregressive models, demonstrating advantages over myopic temperature scaling in likelihood and sample quality, and showing improvements in accuracy on a multiple choice analogy task by $10\%$.
    Analyzing Tree Architectures in Ensembles via Neural Tangent Kernel. (arXiv:2205.12904v2 [cs.LG] UPDATED)
    A soft tree is an actively studied variant of a decision tree that updates splitting rules using the gradient method. Although soft trees can take various architectures, their impact is not theoretically well known. In this paper, we formulate and analyze the Neural Tangent Kernel (NTK) induced by soft tree ensembles for arbitrary tree architectures. This kernel leads to the remarkable finding that only the number of leaves at each depth is relevant for the tree architecture in ensemble learning with an infinite number of trees. In other words, if the number of leaves at each depth is fixed, the training behavior in function space and the generalization performance are exactly the same across different tree architectures, even if they are not isomorphic. We also show that the NTK of asymmetric trees like decision lists does not degenerate when they get infinitely deep. This is in contrast to the perfect binary trees, whose NTK is known to degenerate and leads to worse generalization performance for deeper trees.
    Federated Learning with Regularized Client Participation. (arXiv:2302.03662v1 [cs.LG])
    Federated Learning (FL) is a distributed machine learning approach where multiple clients work together to solve a machine learning task. One of the key challenges in FL is the issue of partial participation, which occurs when a large number of clients are involved in the training process. The traditional method to address this problem is randomly selecting a subset of clients at each communication round. In our research, we propose a new technique and design a novel regularized client participation scheme. Under this scheme, each client joins the learning process every $R$ communication rounds, which we refer to as a meta epoch. We have found that this participation scheme leads to a reduction in the variance caused by client sampling. Combined with the popular FedAvg algorithm (McMahan et al., 2017), it results in superior rates under standard assumptions. For instance, the optimization term in our main convergence bound decreases linearly with the product of the number of communication rounds and the size of the local dataset of each client, and the statistical term scales with step size quadratically instead of linearly (the case for client sampling with replacement), leading to better convergence rate $\mathcal{O}\left(\frac{1}{T^2}\right)$ compared to $\mathcal{O}\left(\frac{1}{T}\right)$, where $T$ is the total number of communication rounds. Furthermore, our results permit arbitrary client availability as long as each client is available for training once per each meta epoch.
    Generalization Bounds of Nonconvex-(Strongly)-Concave Stochastic Minimax Optimization. (arXiv:2205.14278v2 [math.OC] UPDATED)
    This paper takes an initial step to systematically investigate the generalization bounds of algorithms for solving nonconvex-(strongly)-concave (NC-SC/NC-C) stochastic minimax optimization measured by the stationarity of primal functions. We first establish algorithm-agnostic generalization bounds via uniform convergence between the empirical minimax problem and the population minimax problem. The sample complexities for achieving $\epsilon$-generalization are $\tilde{\mathcal{O}}(d\kappa^2\epsilon^{-2})$ and $\tilde{\mathcal{O}}(d\epsilon^{-4})$ for NC-SC and NC-C settings, respectively, where $d$ is the dimension and $\kappa$ is the condition number. We further study the algorithm-dependent generalization bounds via stability arguments of algorithms. In particular, we introduce a novel stability notion for minimax problems and build a connection between generalization bounds and the stability notion. As a result, we establish algorithm-dependent generalization bounds for stochastic gradient descent ascent (SGDA) algorithm and the more general sampling-determined algorithms.
    Graph Kernels Based on Multi-scale Graph Embeddings. (arXiv:2206.00979v2 [cs.LG] UPDATED)
    Graph kernels are conventional methods for computing graph similarities. However, most of the R-convolution graph kernels face two challenges: 1) They cannot compare graphs at multiple different scales, and 2) they do not consider the distributions of substructures when computing the kernel matrix. These two challenges limit their performances. To mitigate the two challenges, we propose a novel graph kernel called the Multi-scale Path-pattern Graph kernel (MPG), at the heart of which is the multi-scale path-pattern node feature map. Each element of the path-pattern node feature map is the number of occurrences of a path-pattern around a node. A path-pattern is constructed by the concatenation of all the node labels in a path of a truncated BFS tree rooted at each node. Since the path-pattern node feature map can only compare graphs at local scales, we incorporate into it the multiple different scales of the graph structure, which are captured by the truncated BFS trees of different depth. We use the Wasserstein distance to compute the similarity between the multi-scale path-pattern node feature maps of two graphs, considering the distributions of path-patterns. We empirically validate MPG on various benchmark graph datasets and demonstrate that it achieves state-of-the-art performance.
    Deep Ensembles for Graphs with Higher-order Dependencies. (arXiv:2205.13988v3 [cs.LG] UPDATED)
    Graph neural networks (GNNs) continue to achieve state-of-the-art performance on many graph learning tasks, but rely on the assumption that a given graph is a sufficient approximation of the true neighborhood structure. When a system contains higher-order sequential dependencies, we show that the tendency of traditional graph representations to underfit each node's neighborhood causes existing GNNs to generalize poorly. To address this, we propose a novel Deep Graph Ensemble (DGE), which captures neighborhood variance by training an ensemble of GNNs on different neighborhood subspaces of the same node within a higher-order network structure. We show that DGE consistently outperforms existing GNNs on semisupervised and supervised tasks on six real-world data sets with known higher-order dependencies, even under a similar parameter budget. We demonstrate that learning diverse and accurate base classifiers is central to DGE's success, and discuss the implications of these findings for future work on ensembles of GNNs.
    Population-size-Aware Policy Optimization for Mean-Field Games. (arXiv:2302.03364v1 [cs.LG])
    In this work, we attempt to bridge the two fields of finite-agent and infinite-agent games, by studying how the optimal policies of agents evolve with the number of agents (population size) in mean-field games, an agent-centric perspective in contrast to the existing works focusing typically on the convergence of the empirical distribution of the population. To this end, the premise is to obtain the optimal policies of a set of finite-agent games with different population sizes. However, either deriving the closed-form solution for each game is theoretically intractable, training a distinct policy for each game is computationally intensive, or directly applying the policy trained in a game to other games is sub-optimal. We address these challenges through the Population-size-Aware Policy Optimization (PAPO). Our contributions are three-fold. First, to efficiently generate efficient policies for games with different population sizes, we propose PAPO, which unifies two natural options (augmentation and hypernetwork) and achieves significantly better performance. PAPO consists of three components: i) the population-size encoding which transforms the original value of population size to an equivalent encoding to avoid training collapse, ii) a hypernetwork to generate a distinct policy for each game conditioned on the population size, and iii) the population size as an additional input to the generated policy. Next, we construct a multi-task-based training procedure to efficiently train the neural networks of PAPO by sampling data from multiple games with different population sizes. Finally, extensive experiments on multiple environments show the significant superiority of PAPO over baselines, and the analysis of the evolution of the generated policies further deepens our understanding of the two fields of finite-agent and infinite-agent games.
    Co-Imitation: Learning Design and Behaviour by Imitation. (arXiv:2209.01207v2 [cs.LG] UPDATED)
    The co-adaptation of robots has been a long-standing research endeavour with the goal of adapting both body and behaviour of a system for a given task, inspired by the natural evolution of animals. Co-adaptation has the potential to eliminate costly manual hardware engineering as well as improve the performance of systems. The standard approach to co-adaptation is to use a reward function for optimizing behaviour and morphology. However, defining and constructing such reward functions is notoriously difficult and often a significant engineering effort. This paper introduces a new viewpoint on the co-adaptation problem, which we call co-imitation: finding a morphology and a policy that allow an imitator to closely match the behaviour of a demonstrator. To this end we propose a co-imitation methodology for adapting behaviour and morphology by matching state distributions of the demonstrator. Specifically, we focus on the challenging scenario with mismatched state- and action-spaces between both agents. We find that co-imitation increases behaviour similarity across a variety of tasks and settings, and demonstrate co-imitation by transferring human walking, jogging and kicking skills onto a simulated humanoid.
    Merging satellite and gauge-measured precipitation using LightGBM with an emphasis on extreme quantiles. (arXiv:2302.03606v1 [eess.SP])
    Knowing the actual precipitation in space and time is critical in hydrological modelling applications, yet the spatial coverage with rain gauge stations is limited due to economic constraints. Gridded satellite precipitation datasets offer an alternative option for estimating the actual precipitation by covering uniformly large areas, albeit related estimates are not accurate. To improve precipitation estimates, machine learning is applied to merge rain gauge-based measurements and gridded satellite precipitation products. In this context, observed precipitation plays the role of the dependent variable, while satellite data play the role of predictor variables. Random forests is the dominant machine learning algorithm in relevant applications. In those spatial predictions settings, point predictions (mostly the mean or the median of the conditional distribution) of the dependent variable are issued. Here we propose, issuing probabilistic spatial predictions of precipitation using Light Gradient Boosting Machine (LightGBM). LightGBM is a boosting algorithm, highlighted by prize-winning entries in prediction and forecasting competitions. To assess LightGBM, we contribute a large-scale application that includes merging daily precipitation measurements in contiguous US with PERSIANN and GPM-IMERG satellite precipitation data. We focus on extreme quantiles of the probability distribution of the dependent variable, where LightGBM outperforms quantile regression forests (QRF, a variant of random forests) in terms of quantile score. LightGBM and QRF show similar performance when predicting functionals at the centre of the conditional probability distribution, including the conditional median. Our study offers understanding of probabilistic predictions in spatial settings using machine learning.
    Deep Class-Incremental Learning: A Survey. (arXiv:2302.03648v1 [cs.CV])
    Deep models, e.g., CNNs and Vision Transformers, have achieved impressive achievements in many vision tasks in the closed world. However, novel classes emerge from time to time in our ever-changing world, requiring a learning system to acquire new knowledge continually. For example, a robot needs to understand new instructions, and an opinion monitoring system should analyze emerging topics every day. Class-Incremental Learning (CIL) enables the learner to incorporate the knowledge of new classes incrementally and build a universal classifier among all seen classes. Correspondingly, when directly training the model with new class instances, a fatal problem occurs -- the model tends to catastrophically forget the characteristics of former ones, and its performance drastically degrades. There have been numerous efforts to tackle catastrophic forgetting in the machine learning community. In this paper, we survey comprehensively recent advances in deep class-incremental learning and summarize these methods from three aspects, i.e., data-centric, model-centric, and algorithm-centric. We also provide a rigorous and unified evaluation of 16 methods in benchmark image classification tasks to find out the characteristics of different algorithms empirically. Furthermore, we notice that the current comparison protocol ignores the influence of memory budget in model storage, which may result in unfair comparison and biased results. Hence, we advocate fair comparison by aligning the memory budget in evaluation, as well as several memory-agnostic performance measures. The source code to reproduce these evaluations is available at https://github.com/zhoudw-zdw/CIL_Survey/
    Black Box Optimization Using QUBO and the Cross Entropy Method. (arXiv:2206.12510v2 [cs.LG] UPDATED)
    Black-box optimization (BBO) can be used to optimize functions whose analytic form is unknown. A common approach to realising BBO is to learn a surrogate model which approximates the target black-box function which can then be solved via white-box optimization methods. In this paper, we present our approach BOX-QUBO, where the surrogate model is a QUBO matrix. However, unlike in previous state-of-the-art approaches, this matrix is not trained entirely by regression, but mostly by classification between 'good' and 'bad' solutions. This better accounts for the low capacity of the QUBO matrix, resulting in significantly better solutions overall. We tested our approach against the state-of-the-art on four domains and in all of them BOX-QUBO showed better results. A second contribution of this paper is the idea to also solve white-box problems, i.e. problems which could be directly formulated as QUBO, by means of black-box optimization in order to reduce the size of the QUBOs to the information-theoretic minimum. Experiments show that this significantly improves the results for MAX-k-SAT.
    Backpropagation on Dynamical Networks. (arXiv:2207.03093v2 [math.DS] UPDATED)
    Dynamical networks are versatile models that can describe a variety of behaviours such as synchronisation and feedback. However, applying these models in real world contexts is difficult as prior information pertaining to the connectivity structure or local dynamics is often unknown and must be inferred from time series observations of network states. Additionally, the influence of coupling interactions between nodes further complicates the isolation of local node dynamics. Given the architectural similarities between dynamical networks and recurrent neural networks (RNN), we propose a network inference method based on the backpropagation through time (BPTT) algorithm commonly used to train recurrent neural networks. This method aims to simultaneously infer both the connectivity structure and local node dynamics purely from observation of node states. An approximation of local node dynamics is first constructed using a neural network. This is alternated with an adapted BPTT algorithm to regress corresponding network weights by minimising prediction errors of the dynamical network based on the previously constructed local models until convergence is achieved. This method was found to be succesful in identifying the connectivity structure for coupled networks of Lorenz, Chua and FitzHugh-Nagumo oscillators. Freerun prediction performance with the resulting local models and weights was found to be comparable to the true system with noisy initial conditions. The method is also extended to non-conventional network couplings such as asymmetric negative coupling.
    Understanding Why Generalized Reweighting Does Not Improve Over ERM. (arXiv:2201.12293v4 [cs.LG] UPDATED)
    Empirical risk minimization (ERM) is known in practice to be non-robust to distributional shift where the training and the test distributions are different. A suite of approaches, such as importance weighting, and variants of distributionally robust optimization (DRO), have been proposed to solve this problem. But a line of recent work has empirically shown that these approaches do not significantly improve over ERM in real applications with distribution shift. The goal of this work is to obtain a comprehensive theoretical understanding of this intriguing phenomenon. We first posit the class of Generalized Reweighting (GRW) algorithms, as a broad category of approaches that iteratively update model parameters based on iterative reweighting of the training samples. We show that when overparameterized models are trained under GRW, the resulting models are close to that obtained by ERM. We also show that adding small regularization which does not greatly affect the empirical training accuracy does not help. Together, our results show that a broad category of what we term GRW approaches are not able to achieve distributionally robust generalization. Our work thus has the following sobering takeaway: to make progress towards distributionally robust generalization, we either have to develop non-GRW approaches, or perhaps devise novel classification/regression loss functions that are adapted to the class of GRW approaches.
    Adaptive Aggregation for Safety-Critical Control. (arXiv:2302.03586v1 [cs.LG])
    Safety has been recognized as the central obstacle to preventing the use of reinforcement learning (RL) for real-world applications. Different methods have been developed to deal with safety concerns in RL. However, learning reliable RL-based solutions usually require a large number of interactions with the environment. Likewise, how to improve the learning efficiency, specifically, how to utilize transfer learning for safe reinforcement learning, has not been well studied. In this work, we propose an adaptive aggregation framework for safety-critical control. Our method comprises two key techniques: 1) we learn to transfer the safety knowledge by aggregating the multiple source tasks and a target task through the attention network; 2) we separate the goal of improving task performance and reducing constraint violations by utilizing a safeguard. Experiment results demonstrate that our algorithm can achieve fewer safety violations while showing better data efficiency compared with several baselines.
    Online Bayesian Meta-Learning for Cognitive Tracking Radar. (arXiv:2207.06917v2 [cs.IT] UPDATED)
    A key component of cognitive radar is the ability to generalize, or achieve consistent performance across a range of sensing environments, since aspects of the physical scene may vary over time. This presents a challenge for learning-based waveform selection approaches, since transmission policies which are effective in one scene may be highly suboptimal in another. We address this problem by strategically biasing a learning algorithm by exploiting high-level structure across tracking instances, referred to as meta-learning. In this work, we develop an online meta-learning approach for waveform-agile tracking. This approach uses information gained from previous target tracks to speed up and enhance learning in new tracking instances. This results in sample-efficient learning across a class of finite state target channels by exploiting inherent similarity across tracking scenes, attributed to common physical elements such as target type or clutter statistics. We formulate the online waveform selection problem within the framework of Bayesian learning, and provide prior-dependent performance bounds for the meta-learning problem using Probability Approximately Correct (PAC)-Bayes theory. We present a computationally feasible meta-posterior sampling algorithm and study the performance in a simulation study consisting of diverse scenes. Finally, we examine the potential performance benefits and practical challenges associated with online meta-learning for waveform-agile tracking.
    Novel Fundus Image Preprocessing for Retcam Images to Improve Deep Learning Classification of Retinopathy of Prematurity. (arXiv:2302.02524v2 [eess.IV] UPDATED)
    Retinopathy of Prematurity (ROP) is a potentially blinding eye disorder because of damage to the eye's retina which can affect babies born prematurely. Screening of ROP is essential for early detection and treatment. This is a laborious and manual process which requires trained physician performing dilated ophthalmological examination which can be subjective resulting in lower diagnosis success for clinically significant disease. Automated diagnostic methods can assist ophthalmologists increase diagnosis accuracy using deep learning. Several research groups have highlighted various approaches. This paper proposes the use of new novel fundus preprocessing methods using pretrained transfer learning frameworks to create hybrid models to give higher diagnosis accuracy. The evaluations show that these novel methods in comparison to traditional imaging processing contribute to higher accuracy in classifying Plus disease, Stages of ROP and Zones. We achieve accuracy of 97.65% for Plus disease, 89.44% for Stage, 90.24% for Zones with limited training dataset.
    Subtyping patients with chronic disease using longitudinal BMI patterns. (arXiv:2111.05385v2 [cs.LG] UPDATED)
    Obesity is a major health problem, increasing the risk of various major chronic diseases, such as diabetes, cancer, and stroke. While the role of obesity identified by cross-sectional BMI recordings has been heavily studied, the role of BMI trajectories is much less explored. In this study, we use a machine-learning approach to subtype individuals' risk of developing 18 major chronic diseases by using their BMI trajectories extracted from a large and geographically diverse EHR dataset capturing the health status of around two million individuals for a period of six years. We define nine new interpretable and evidence-based variables based on the BMI trajectories to cluster the patients into subgroups using the k-means clustering method. We thoroughly review each cluster's characteristics in terms of demographic, socioeconomic, and physiological measurement variables to specify the distinct properties of the patients in the clusters. In our experiments, the direct relationship of obesity with diabetes, hypertension, Alzheimer's, and dementia has been re-established and distinct clusters with specific characteristics for several of the chronic diseases have been found to be conforming or complementary to the existing body of knowledge.
    Iterated Block Particle Filter for High-dimensional Parameter Learning: Beating the Curse of Dimensionality. (arXiv:2110.10745v3 [stat.ML] UPDATED)
    Parameter learning for high-dimensional, partially observed, and nonlinear stochastic processes is a methodological challenge. Spatiotemporal disease transmission systems provide examples of such processes giving rise to open inference problems. We propose the iterated block particle filter (IBPF) algorithm for learning high-dimensional parameters over graphical state space models with general state spaces, measures, transition densities and graph structure. Theoretical performance guarantees are obtained on beating the curse of dimensionality (COD), algorithm convergence, and likelihood maximization. Experiments on a highly nonlinear and non-Gaussian spatiotemporal model for measles transmission reveal that the iterated ensemble Kalman filter algorithm (Li et al. (2020)) is ineffective and the iterated filtering algorithm (Ionides et al. (2015)) suffers from the COD, while our IBPF algorithm beats COD consistently across various experiments with different metrics.
    Mitigating Algorithmic Bias with Limited Annotations. (arXiv:2207.10018v2 [cs.LG] UPDATED)
    Existing work on fairness modeling commonly assumes that sensitive attributes for all instances are fully available, which may not be true in many real-world applications due to the high cost of acquiring sensitive information. When sensitive attributes are not disclosed or available, it is needed to manually annotate a small part of the training data to mitigate bias. However, the skewed distribution across different sensitive groups preserves the skewness of the original dataset in the annotated subset, which leads to non-optimal bias mitigation. To tackle this challenge, we propose Active Penalization Of Discrimination (APOD), an interactive framework to guide the limited annotations towards maximally eliminating the effect of algorithmic bias. The proposed APOD integrates discrimination penalization with active instance selection to efficiently utilize the limited annotation budget, and it is theoretically proved to be capable of bounding the algorithmic bias. According to the evaluation on five benchmark datasets, APOD outperforms the state-of-the-arts baseline methods under the limited annotation budget, and shows comparable performance to fully annotated bias mitigation, which demonstrates that APOD could benefit real-world applications when sensitive information is limited.
    A Lightweight, Efficient and Explainable-by-Design Convolutional Neural Network for Internet Traffic Classification. (arXiv:2202.05535v3 [cs.LG] UPDATED)
    Traffic classification, i.e. the identification of the type of applications flowing in a network, is a strategic task for numerous activities (e.g., intrusion detection, routing). This task faces some critical challenges that current deep learning approaches do not address. The design of current approaches do not take into consideration the fact that networking hardware (e.g., routers) often runs with limited computational resources. Further, they do not meet the need for faithful explainability highlighted by regulatory bodies. Finally, these traffic classifiers are evaluated on small datasets which fail to reflect the diversity of applications in real-world settings. Therefore, this paper introduces a new Lightweight, Efficient and eXplainable-by-design convolutional neural network (LEXNet) for Internet traffic classification, which relies on a new residual block (for lightweight and efficiency purposes) and prototype layer (for explainability). Based on a commercial-grade dataset, our evaluation shows that LEXNet succeeds to maintain the same accuracy as the best performing state-of-the-art neural network, while providing the additional features previously mentioned. Moreover, we illustrate the explainability feature of our approach, which stems from the communication of detected application prototypes to the end-user, and we highlight the faithfulness of LEXNet explanations through a comparison with post hoc methods.
    CHiLS: Zero-Shot Image Classification with Hierarchical Label Sets. (arXiv:2302.02551v2 [cs.CV] UPDATED)
    Open vocabulary models (e.g. CLIP) have shown strong performance on zero-shot classification through their ability generate embeddings for each class based on their (natural language) names. Prior work has focused on improving the accuracy of these models through prompt engineering or by incorporating a small amount of labeled downstream data (via finetuning). However, there has been little focus on improving the richness of the class names themselves, which can pose issues when class labels are coarsely-defined and uninformative. We propose Classification with Hierarchical Label Sets (or CHiLS), an alternative strategy for zero-shot classification specifically designed for datasets with implicit semantic hierarchies. CHiLS proceeds in three steps: (i) for each class, produce a set of subclasses, using either existing label hierarchies or by querying GPT-3; (ii) perform the standard zero-shot CLIP procedure as though these subclasses were the labels of interest; (iii) map the predicted subclass back to its parent to produce the final prediction. Across numerous datasets with underlying hierarchical structure, CHiLS leads to improved accuracy in situations both with and without ground-truth hierarchical information. CHiLS is simple to implement within existing CLIP pipelines and requires no additional training cost. Code is available at: https://github.com/acmi-lab/CHILS.
    Toward Face Biometric De-identification using Adversarial Examples. (arXiv:2302.03657v1 [cs.CV])
    The remarkable success of face recognition (FR) has endangered the privacy of internet users particularly in social media. Recently, researchers turned to use adversarial examples as a countermeasure. In this paper, we assess the effectiveness of using two widely known adversarial methods (BIM and ILLC) for de-identifying personal images. We discovered, unlike previous claims in the literature, that it is not easy to get a high protection success rate (suppressing identification rate) with imperceptible adversarial perturbation to the human visual system. Finally, we found out that the transferability of adversarial examples is highly affected by the training parameters of the network with which they are generated.
    DeepSpeed Data Efficiency: Improving Deep Learning Model Quality and Training Efficiency via Efficient Data Sampling and Routing. (arXiv:2212.03597v2 [cs.LG] UPDATED)
    Recent advances on deep learning models come at the price of formidable training cost. The increasing model size is one of the root causes, but another less-emphasized fact is that data scale is actually increasing at a similar speed as model scale, and the training cost is proportional to both of them. Compared to the rapidly evolving model architecture, how to efficiently use the training data (especially for the expensive foundation model pretraining) is both less explored and difficult to realize due to the lack of a convenient framework that focus on data efficiency capabilities. To this end, we present DeepSpeed Data Efficiency, a framework that makes better use of data, increases training efficiency, and improves model quality. Specifically, we propose and combine two novel data efficiency techniques: efficient data sampling via a general curriculum learning library, and efficient data routing via a novel random layerwise token dropping technique. DeepSpeed Data Efficiency also takes extensibility, flexibility and composability into consideration, so that users can easily utilize the framework to compose multiple techniques and apply customized strategies. By applying our solution to GPT-3 1.3B and BERT-large language model pretraining, we can achieve similar model quality with up to 2x less data and 2x less time, or achieve better model quality under similar amount of data and time.
    Hard Prompts Made Easy: Gradient-Based Discrete Optimization for Prompt Tuning and Discovery. (arXiv:2302.03668v1 [cs.LG])
    The strength of modern generative models lies in their ability to be controlled through text-based prompts. Typical "hard" prompts are made from interpretable words and tokens, and must be hand-crafted by humans. There are also "soft" prompts, which consist of continuous feature vectors. These can be discovered using powerful optimization methods, but they cannot be easily interpreted, re-used across models, or plugged into a text-based interface. We describe an approach to robustly optimize hard text prompts through efficient gradient-based optimization. Our approach automatically generates hard text-based prompts for both text-to-image and text-to-text applications. In the text-to-image setting, the method creates hard prompts for diffusion models, allowing API users to easily generate, discover, and mix and match image concepts without prior knowledge on how to prompt the model. In the text-to-text setting, we show that hard prompts can be automatically discovered that are effective in tuning LMs for classification.
    Incentive-aware Contextual Pricing with Non-parametric Market Noise. (arXiv:1911.03508v3 [cs.LG] UPDATED)
    We consider a dynamic pricing problem for repeated contextual second-price auctions with multiple strategic buyers who aim to maximize their long-term time discounted utility. The seller has limited information on buyers' overall demand curves which depends on a non-parametric market-noise distribution, and buyers may potentially submit corrupted bids (relative to true valuations) to manipulate the seller's pricing policy for more favorable reserve prices in the future. We focus on designing the seller's learning policy to set contextual reserve prices where the seller's goal is to minimize regret compared to the revenue of a benchmark clairvoyant policy that has full information of buyers' demand. We propose a policy with a phased-structure that incorporates randomized "isolation" periods, during which a buyer is randomly chosen to solely participate in the auction. We show that this design allows the seller to control the number of periods in which buyers significantly corrupt their bids. We then prove that our policy enjoys a $T$-period regret of $\widetilde{\mathcal{O}}(\sqrt{T})$ facing strategic buyers. Finally, we conduct numerical simulations to compare our proposed algorithm to standard pricing policies. Our numerical results show that our algorithm outperforms these policies under various buyer bidding behavior.
    A Scalable and Efficient Iterative Method for Copying Machine Learning Classifiers. (arXiv:2302.02667v2 [cs.LG] UPDATED)
    Differential replication through copying refers to the process of replicating the decision behavior of a machine learning model using another model that possesses enhanced features and attributes. This process is relevant when external constraints limit the performance of an industrial predictive system. Under such circumstances, copying enables the retention of original prediction capabilities while adapting to new demands. Previous research has focused on the single-pass implementation for copying. This paper introduces a novel sequential approach that significantly reduces the amount of computational resources needed to train or maintain a copy, leading to reduced maintenance costs for companies using machine learning models in production. The effectiveness of the sequential approach is demonstrated through experiments with synthetic and real-world datasets, showing significant reductions in time and resources, while maintaining or improving accuracy.
    TransPolymer: a Transformer-based Language Model for Polymer Property Predictions. (arXiv:2209.01307v3 [cs.LG] UPDATED)
    Accurate and efficient prediction of polymer properties is of great significance in polymer development and design. Conventionally, expensive and time-consuming experiments or simulations are required to evaluate the function of polymers. Recently, Transformer models, equipped with attention mechanisms, have exhibited superior performance in various natural language processing tasks. However, such methods have not been investigated in polymer sciences. Herein, we report TransPolymer, a Transformer-based language model for polymer property prediction. Owing to our proposed polymer tokenizer with chemical awareness, TransPolymer can learn representations directly from polymer sequences. The model learns expressive representations by pretraining on a large unlabeled dataset via masked language modeling, followed by finetuning the model on downstream datasets concerning various polymer properties. TransPolymer achieves superior performance in all ten downstream tasks and surpasses other baselines significantly on most downstream tasks. Moreover, the improvement by the pretrained TransPolymer over supervised TransPolymer and other language models strengthens the significant benefits of pretraining on large unlabeled data in representation learning. Experiment results further demonstrate the important role of the attention mechanism in understanding polymer sequences. We highlight this model as a promising computational tool for promoting rational polymer design and understanding structure-property relationships in a data science view.
    PartitionVAE -- a human-interpretable VAE. (arXiv:2302.03689v1 [cs.LG])
    VAEs, or variational autoencoders, are autoencoders that explicitly learn the distribution of the input image space rather than assuming no prior information about the distribution. This allows it to classify similar samples close to each other in the latent space's distribution. VAEs classically assume the latent space is normally distributed, though many distribution priors work, and they encode this assumption through a K-L divergence term in the loss function. While VAEs learn the distribution of the latent space and naturally make each dimension in the latent space as disjoint from the others as possible, they do not group together similar features -- the image space feature represented by one unit of the representation layer does not necessarily have high correlation with the feature…
    CRU: A Novel Neural Architecture for Improving the Predictive Performance of Time-Series Data. (arXiv:2211.16653v2 [cs.LG] UPDATED)
    The time-series forecasting (TSF) problem is a traditional problem in the field of artificial intelligence. Models such as Recurrent Neural Network (RNN), Long Short Term Memory (LSTM), and GRU (Gate Recurrent Units) have contributed to improving the predictive accuracy of TSF. Furthermore, model structures have been proposed to combine time-series decomposition methods, such as seasonal-trend decomposition using Loess (STL) to ensure improved predictive accuracy. However, because this approach is learned in an independent model for each component, it cannot learn the relationships between time-series components. In this study, we propose a new neural architecture called a correlation recurrent unit (CRU) that can perform time series decomposition within a neural cell and learn correlations (autocorrelation and correlation) between each decomposition component. The proposed neural architecture was evaluated through comparative experiments with previous studies using five univariate time-series datasets and four multivariate time-series data. The results showed that long- and short-term predictive performance was improved by more than 10%. The experimental results show that the proposed CRU is an excellent method for TSF problems compared to other neural architectures.
    Optimizing Audio Recommendations for the Long-Term: A Reinforcement Learning Perspective. (arXiv:2302.03561v1 [cs.LG])
    We study the problem of optimizing a recommender system for outcomes that occur over several weeks or months. We begin by drawing on reinforcement learning to formulate a comprehensive model of users' recurring relationships with a recommender system. Measurement, attribution, and coordination challenges complicate algorithm design. We describe careful modeling -- including a new representation of user state and key conditional independence assumptions -- which overcomes these challenges and leads to simple, testable recommender system prototypes. We apply our approach to a podcast recommender system that makes personalized recommendations to hundreds of millions of listeners. A/B tests demonstrate that purposefully optimizing for long-term outcomes leads to large performance gains over conventional approaches that optimize for short-term proxies.
    Riemannian Flow Matching on General Geometries. (arXiv:2302.03660v1 [cs.LG])
    We propose Riemannian Flow Matching (RFM), a simple yet powerful framework for training continuous normalizing flows on manifolds. Existing methods for generative modeling on manifolds either require expensive simulation, inherently cannot scale to high dimensions, or use approximations to limiting quantities that result in biased objectives. Riemannian Flow Matching bypasses these inconveniences and exhibits multiple benefits over prior approaches: It is completely simulation-free on simple geometries, it does not require divergence computation, and its target vector field is computed in closed form even on general geometries. The key ingredient behind RFM is the construction of a simple kernel function for defining per-sample vector fields, which subsumes existing Euclidean cases. Extending to general geometries, we rely on the use of spectral decompositions to efficiently compute kernel functions. Our method achieves state-of-the-art performance on real-world non-Euclidean datasets, and we showcase, for the first time, tractable training on general geometries, including on triangular meshes and maze-like manifolds with boundaries.
    Local Neural Descriptor Fields: Locally Conditioned Object Representations for Manipulation. (arXiv:2302.03573v1 [cs.RO])
    A robot operating in a household environment will see a wide range of unique and unfamiliar objects. While a system could train on many of these, it is infeasible to predict all the objects a robot will see. In this paper, we present a method to generalize object manipulation skills acquired from a limited number of demonstrations, to novel objects from unseen shape categories. Our approach, Local Neural Descriptor Fields (L-NDF), utilizes neural descriptors defined on the local geometry of the object to effectively transfer manipulation demonstrations to novel objects at test time. In doing so, we leverage the local geometry shared between objects to produce a more general manipulation framework. We illustrate the efficacy of our approach in manipulating novel objects in novel poses -- both in simulation and in the real world.
    Ethical Considerations for Collecting Human-Centric Image Datasets. (arXiv:2302.03629v1 [cs.CV])
    Human-centric image datasets are critical to the development of computer vision technologies. However, recent investigations have foregrounded significant ethical issues related to privacy and bias, which have resulted in the complete retraction, or modification, of several prominent datasets. Recent works have tried to reverse this trend, for example, by proposing analytical frameworks for ethically evaluating datasets, the standardization of dataset documentation and curation practices, privacy preservation methodologies, as well as tools for surfacing and mitigating representational biases. Little attention, however, has been paid to the realities of operationalizing ethical data collection. To fill this gap, we present a set of key ethical considerations and practical recommendations for collecting more ethically-minded human-centric image data. Our research directly addresses issues of privacy and bias by contributing to the research community best practices for ethical data collection, covering purpose, privacy and consent, as well as diversity. We motivate each consideration by drawing on lessons from current practices, dataset withdrawals and audits, and analytical ethical frameworks. Our research is intended to augment recent scholarship, representing an important step toward more responsible data curation practices.
    Feature Necessity & Relevancy in ML Classifier Explanations. (arXiv:2210.15675v2 [cs.LG] UPDATED)
    Given a machine learning (ML) model and a prediction, explanations can be defined as sets of features which are sufficient for the prediction. In some applications, and besides asking for an explanation, it is also critical to understand whether sensitive features can occur in some explanation, or whether a non-interesting feature must occur in all explanations. This paper starts by relating such queries respectively with the problems of relevancy and necessity in logic-based abduction. The paper then proves membership and hardness results for several families of ML classifiers. Afterwards the paper proposes concrete algorithms for two classes of classifiers. The experimental results confirm the scalability of the proposed algorithms.
    Planted Bipartite Graph Detection. (arXiv:2302.03658v1 [cs.DS])
    We consider the task of detecting a hidden bipartite subgraph in a given random graph. Specifically, under the null hypothesis, the graph is a realization of an Erd\H{o}s-R\'{e}nyi random graph over $n$ vertices with edge density $q$. Under the alternative, there exists a planted $k_{\mathsf{R}} \times k_{\mathsf{L}}$ bipartite subgraph with edge density $p>q$. We derive asymptotically tight upper and lower bounds for this detection problem in both the dense regime, where $q,p = \Theta\left(1\right)$, and the sparse regime where $q,p = \Theta\left(n^{-\alpha}\right), \alpha \in \left(0,2\right]$. Moreover, we consider a variant of the above problem, where one can only observe a relatively small part of the graph, by using at most $\mathsf{Q}$ edge queries. For this problem, we derive upper and lower bounds in both the dense and sparse regimes.
    Meta-Learning Biologically Plausible Plasticity Rules with Random Feedback Pathways. (arXiv:2210.16414v5 [q-bio.NC] UPDATED)
    Backpropagation is widely used to train artificial neural networks, but its relationship to synaptic plasticity in the brain is unknown. Some biological models of backpropagation rely on feedback projections that are symmetric with feedforward connections, but experiments do not corroborate the existence of such symmetric backward connectivity. Random feedback alignment offers an alternative model in which errors are propagated backward through fixed, random backward connections. This approach successfully trains shallow models, but learns slowly and does not perform well with deeper models or online learning. In this study, we develop a meta-learning approach to discover interpretable, biologically plausible plasticity rules that improve online learning performance with fixed random feedback connections. The resulting plasticity rules show improved online training of deep models in the low data regime. Our results highlight the potential of meta-learning to discover effective, interpretable learning rules satisfying biological constraints.
    Two Losses Are Better Than One: Faster Optimization Using a Cheaper Proxy. (arXiv:2302.03542v1 [cs.LG])
    We present an algorithm for minimizing an objective with hard-to-compute gradients by using a related, easier-to-access function as a proxy. Our algorithm is based on approximate proximal point iterations on the proxy combined with relatively few stochastic gradients from the objective. When the difference between the objective and the proxy is $\delta$-smooth, our algorithm guarantees convergence at a rate matching stochastic gradient descent on a $\delta$-smooth objective, which can lead to substantially better sample efficiency. Our algorithm has many potential applications in machine learning, and provides a principled means of leveraging synthetic data, physics simulators, mixed public and private data, and more.
    Learning to cooperatively estimate road surface friction. (arXiv:2302.03560v1 [math.OC])
    We present a system for estimating the friction of the pavement surface at any curved road section, by arriving at a consensus estimate, based on data from vehicles that have recently passed through that section. This estimate can help following vehicles. To keep costs down, we depend only on standard automotive sensors, such as the IMU, and sensors for the steering angle and wheel speeds. Our system's workflow consists of: (i) processing of measurements from existing vehicular sensors, to implement a virtual sensor that captures the effect of low friction on the vehicle, (ii) transmitting short kinematic summaries from vehicles to a road side unit (RSU), using V2X communication, and (iii) estimating the friction coefficients, by running a machine learning regressor at the RSU, on summaries from individual vehicles, and then combining several such estimates. In designing and implementing our system over a road network, we face two key questions: (i) should each individual road section have a local friction coefficient regressor, or can we use a global regressor that covers all the possible road sections? and (ii) how accurate are the resulting regressor estimates? We test the performance of design variations of our solution, using simulations on the commercial package Dyna4. We consider a single vehicle type with varying levels of tyre wear, and a range of road friction coefficients. We find that: (a) only a marginal loss of accuracy is incurred in using a global regressor as compared to local regressors, (b) the consensus estimate at the RSU has a worst case error of about ten percent, if the combination is based on at least fifty recently passed vehicles, and (c) our regressors have root mean square (RMS) errors that are less than five percent. The RMS error rate of our system is half as that of a commercial friction estimation service.
    Improving Transfer Learning with a Dual Image and Video Transformer for Multi-label Movie Trailer Genre Classification. (arXiv:2210.07983v3 [cs.CV] UPDATED)
    In this paper, we study the transferability of ImageNet spatial and Kinetics spatio-temporal representations to multi-label Movie Trailer Genre Classification (MTGC). In particular, we present an extensive evaluation of the transferability of ConvNet and Transformer models pretrained on ImageNet and Kinetics to Trailers12k, a new manually-curated movie trailer dataset composed of 12,000 videos labeled with 10 different genres and associated metadata. We analyze different aspects that can influence transferability, such as frame rate, input video extension, and spatio-temporal modeling. In order to reduce the spatio-temporal structure gap between ImageNet/Kinetics and Trailers12k, we propose Dual Image and Video Transformer Architecture (DIViTA), which performs shot detection so as to segment the trailer into highly correlated clips, providing a more cohesive input for pretrained backbones and improving transferability (a 1.83% increase for ImageNet and 3.75% for Kinetics). Our results demonstrate that representations learned on either ImageNet or Kinetics are comparatively transferable to Trailers12k. Moreover, both datasets provide complementary information that can be combined to improve classification performance (a 2.91% gain compared to the top single pretraining). Interestingly, using lightweight ConvNets as pretrained backbones resulted in only a 3.46% drop in classification performance compared with the top Transformer while requiring only 11.82% of its parameters and 0.81% of its FLOPS.
    From Utilitarian to Rawlsian Designs for Algorithmic Fairness. (arXiv:2302.03567v1 [cs.CY])
    There is a lack of consensus within the literature as to how `fairness' of algorithmic systems can be measured, and different metrics can often be at odds. In this paper, we approach this task by drawing on the ethical frameworks of utilitarianism and John Rawls. Informally, these two theories of distributive justice measure the `good' as either a population's sum of utility, or worst-off outcomes, respectively. We present a parameterized class of objective functions that interpolates between these two (possibly) conflicting notions of the `good'. This class is shown to represent a relaxation of the Rawlsian `veil of ignorance', and its sequence of optimal solutions converges to both a utilitarian and Rawlsian optimum. Several other properties of this class are studied, including: 1) a relationship to regularized optimization, 2) feasibility of consistent estimation, and 3) algorithmic cost. In several real-world datasets, we compute optimal solutions and construct the tradeoff between utilitarian and Rawlsian notions of the `good'. Empirically, we demonstrate that increasing model complexity can manifest strict improvements to both measures of the `good'. This work suggests that the proper degree of `fairness' can be informed by a designer's preferences over the space of induced utilitarian and Rawlsian `good'.
    Learning Translation Quality Evaluation on Low Resource Languages from Large Language Models. (arXiv:2302.03491v1 [cs.CL])
    Learned metrics such as BLEURT have in recent years become widely employed to evaluate the quality of machine translation systems. Training such metrics requires data which can be expensive and difficult to acquire, particularly for lower-resource languages. We show how knowledge can be distilled from Large Language Models (LLMs) to improve upon such learned metrics without requiring human annotators, by creating synthetic datasets which can be mixed into existing datasets, requiring only a corpus of text in the target language. We show that the performance of a BLEURT-like model on lower resource languages can be improved in this way.
    Graph Generation with Destination-Driven Diffusion Mixture. (arXiv:2302.03596v1 [cs.LG])
    Generation of graphs is a major challenge for real-world tasks that require understanding the complex nature of their non-Euclidean structures. Although diffusion models have achieved notable success in graph generation recently, they are ill-suited for modeling the structural information of graphs since learning to denoise the noisy samples does not explicitly capture the graph topology. To tackle this limitation, we propose a novel generative process that models the topology of graphs by predicting the destination of the process. Specifically, we design the generative process as a mixture of diffusion processes conditioned on the endpoint in the data distribution, which drives the process toward the probable destination. Further, we introduce new training objectives for learning to predict the destination, and discuss the advantages of our generative framework that can explicitly model the graph topology and exploit the inductive bias of the data. Through extensive experimental validation on general graph and 2D/3D molecular graph generation tasks, we show that our method outperforms previous generative models, generating graphs with correct topology with both continuous and discrete features.
    Robustness Implies Fairness in Casual Algorithmic Recourse. (arXiv:2302.03465v1 [cs.LG])
    Algorithmic recourse aims to disclose the inner workings of the black-box decision process in situations where decisions have significant consequences, by providing recommendations to empower beneficiaries to achieve a more favorable outcome. To ensure an effective remedy, suggested interventions must not only be low-cost but also robust and fair. This goal is accomplished by providing similar explanations to individuals who are alike. This study explores the concept of individual fairness and adversarial robustness in causal algorithmic recourse and addresses the challenge of achieving both. To resolve the challenges, we propose a new framework for defining adversarially robust recourse. The new setting views the protected feature as a pseudometric and demonstrates that individual fairness is a special case of adversarial robustness. Finally, we introduce the fair robust recourse problem to achieve both desirable properties and show how it can be satisfied both theoretically and empirically.
    Multi-Scale Message Passing Neural PDE Solvers. (arXiv:2302.03580v1 [cs.LG])
    We propose a novel multi-scale message passing neural network algorithm for learning the solutions of time-dependent PDEs. Our algorithm possesses both temporal and spatial multi-scale resolution features by incorporating multi-scale sequence models and graph gating modules in the encoder and processor, respectively. Benchmark numerical experiments are presented to demonstrate that the proposed algorithm outperforms baselines, particularly on a PDE with a range of spatial and temporal scales.
    A Bayesian Optimization approach for calibrating large-scale activity-based transport models. (arXiv:2302.03480v1 [cs.LG])
    The use of Agent-Based and Activity-Based modeling in transportation is rising due to the capability of addressing complex applications such as disruptive trends (e.g., remote working and automation) or the design and assessment of disaggregated management strategies. Still, the broad adoption of large-scale disaggregate models is not materializing due to the inherently high complexity and computational needs. Activity-based models focused on behavioral theory, for example, may involve hundreds of parameters that need to be calibrated to match the detailed socio-economical characteristics of the population for any case study. This paper tackles this issue by proposing a novel Bayesian Optimization approach incorporating a surrogate model in the form of an improved Random Forest, designed to automate the calibration process of the behavioral parameters. The proposed method is tested on a case study for the city of Tallinn, Estonia, where the model to be calibrated consists of 477 behavioral parameters, using the SimMobility MT software. Satisfactory performance is achieved in the major indicators defined for the calibration process: the error for the overall number of trips is equal to 4% and the average error in the OD matrix is 15.92 vehicles per day.
    Efficient Parametric Approximations of Neural Network Function Space Distance. (arXiv:2302.03519v1 [cs.LG])
    It is often useful to compactly summarize important properties of model parameters and training data so that they can be used later without storing and/or iterating over the entire dataset. As a specific case, we consider estimating the Function Space Distance (FSD) over a training set, i.e. the average discrepancy between the outputs of two neural networks. We propose a Linearized Activation Function TRick (LAFTR) and derive an efficient approximation to FSD for ReLU neural networks. The key idea is to approximate the architecture as a linear network with stochastic gating. Despite requiring only one parameter per unit of the network, our approach outcompetes other parametric approximations with larger memory requirements. Applied to continual learning, our parametric approximation is competitive with state-of-the-art nonparametric approximations, which require storing many training examples. Furthermore, we show its efficacy in estimating influence functions accurately and detecting mislabeled examples without expensive iterations over the entire dataset.
    Uncoupled Learning of Differential Stackelberg Equilibria with Commitments. (arXiv:2302.03438v1 [cs.LG])
    A natural solution concept for many multiagent settings is the Stackelberg equilibrium, under which a ``leader'' agent selects a strategy that maximizes its own payoff assuming the ``follower'' chooses their best response to this strategy. Recent work has presented asymmetric learning updates that can be shown to converge to the \textit{differential} Stackelberg equilibria of two-player differentiable games. These updates are ``coupled'' in the sense that the leader requires some information about the follower's payoff function. Such coupled learning rules cannot be applied to \textit{ad hoc} interactive learning settings, and can be computationally impractical even in centralized training settings where the follower's payoffs are known. In this work, we present an ``uncoupled'' learning process under which each player's learning update only depends on their observations of the other's behavior. We prove that this process converges to a local Stackelberg equilibrium under similar conditions as previous coupled methods. We conclude with a discussion of the potential applications of our approach to human--AI cooperation and multi-agent reinforcement learning.
    PDPU: An Open-Source Posit Dot-Product Unit for Deep Learning Applications. (arXiv:2302.01876v1 [cs.AR] CROSS LISTED)
    Posit has been a promising alternative to the IEEE-754 floating point format for deep learning applications due to its better trade-off between dynamic range and accuracy. However, hardware implementation of posit arithmetic requires further exploration, especially for the dot-product operations dominated in deep neural networks (DNNs). It has been implemented by either the combination of multipliers and an adder tree or cascaded fused multiply-add units, leading to poor computational efficiency and excessive hardware overhead. To address this issue, we propose an open-source posit dot-product unit, namely PDPU, that facilitates resource-efficient and high-throughput dot-product hardware implementation. PDPU not only features the fused and mixed-precision architecture that eliminates redundant latency and hardware resources, but also has a fine-grained 6-stage pipeline, improving computational efficiency. A configurable PDPU generator is further developed to meet the diverse needs of various DNNs for computational accuracy. Experimental results evaluated under the 28nm CMOS process show that PDPU reduces area, latency, and power by up to 43%, 64%, and 70%, respectively, compared to the existing implementations. Hence, PDPU has great potential as the computing core of posit-based accelerators for deep learning applications.
    OPORP: One Permutation + One Random Projection. (arXiv:2302.03505v1 [stat.ML])
    Consider two $D$-dimensional data vectors (e.g., embeddings): $u, v$. In many embedding-based retrieval (EBR) applications where the vectors are generated from trained models, $D=256\sim 1024$ are common. In this paper, OPORP (one permutation + one random projection) uses a variant of the ``count-sketch'' type of data structures for achieving data reduction/compression. With OPORP, we first apply a permutation on the data vectors. A random vector $r$ is generated i.i.d. with moments: $E(r_i) = 0, E(r_i^2)=1, E(r_i^3) =0, E(r_i^4)=s$. We multiply (as dot product) $r$ with all permuted data vectors. Then we break the $D$ columns into $k$ equal-length bins and aggregate (i.e., sum) the values in each bin to obtain $k$ samples from each data vector. One crucial step is to normalize the $k$ samples to the unit $l_2$ norm. We show that the estimation variance is essentially: $(s-1)A + \frac{D-k}{D-1}\frac{1}{k}\left[ (1-\rho^2)^2 -2A\right]$, where $A\geq 0$ is a function of the data ($u,v$). This formula reveals several key properties: (1) We need $s=1$. (2) The factor $\frac{D-k}{D-1}$ can be highly beneficial in reducing variances. (3) The term $\frac{1}{k}(1-\rho^2)^2$ is actually the asymptotic variance of the classical correlation estimator. We illustrate that by letting the $k$ in OPORP to be $k=1$ and repeat the procedure $m$ times, we exactly recover the work of ``very spars random projections'' (VSRP). This immediately leads to a normalized estimator for VSRP which substantially improves the original estimator of VSRP. In summary, with OPORP, the two key steps: (i) the normalization and (ii) the fixed-length binning scheme, have considerably improved the accuracy in estimating the cosine similarity, which is a routine (and crucial) task in modern embedding-based retrieval (EBR) applications.
    Towards Robust Inductive Graph Incremental Learning via Experience Replay. (arXiv:2302.03534v1 [cs.LG])
    Inductive node-wise graph incremental learning is a challenging task due to the dynamic nature of evolving graphs and the dependencies between nodes. In this paper, we propose a novel experience replay framework, called Structure-Evolution-Aware Experience Replay (SEA-ER), that addresses these challenges by leveraging the topological awareness of GNNs and importance reweighting technique. Our framework effectively addresses the data dependency of node prediction problems in evolving graphs, with a theoretical guarantee that supports its effectiveness. Through empirical evaluation, we demonstrate that our proposed framework outperforms the current state-of-the-art GNN experience replay methods on several benchmark datasets, as measured by metrics such as accuracy and forgetting.
    Towards Better Time Series Contrastive Learning: A Dynamic Bad Pair Mining Approach. (arXiv:2302.03357v1 [cs.LG])
    Not all positive pairs are beneficial to time series contrastive learning. In this paper, we study two types of bad positive pairs that impair the quality of time series representation learned through contrastive learning ($i.e.$, noisy positive pair and faulty positive pair). We show that, with the presence of noisy positive pairs, the model tends to simply learn the pattern of noise (Noisy Alignment). Meanwhile, when faulty positive pairs arise, the model spends considerable efforts aligning non-representative patterns (Faulty Alignment). To address this problem, we propose a Dynamic Bad Pair Mining (DBPM) algorithm, which reliably identifies and suppresses bad positive pairs in time series contrastive learning. DBPM utilizes a memory module to track the training behavior of each positive pair along training process. This allows us to identify potential bad positive pairs at each epoch based on their historical training behaviors. The identified bad pairs are then down-weighted using a transformation module. Our experimental results show that DBPM effectively mitigates the negative impacts of bad pairs, and can be easily used as a plug-in to boost performance of state-of-the-art methods. Codes will be made publicly available.
    Membership Inference Attacks against Diffusion Models. (arXiv:2302.03262v1 [cs.CR])
    Diffusion models have attracted attention in recent years as innovative generative models. In this paper, we investigate whether a diffusion model is resistant to a membership inference attack, which evaluates the privacy leakage of a machine learning model. We primarily discuss the diffusion model from the standpoints of comparison with a generative adversarial network (GAN) as conventional models and hyperparameters unique to the diffusion model, i.e., time steps, sampling steps, and sampling variances. We conduct extensive experiments with DDIM as a diffusion model and DCGAN as a GAN on the CelebA and CIFAR-10 datasets in both white-box and black-box settings and then confirm if the diffusion model is comparably resistant to a membership inference attack as GAN. Next, we demonstrate that the impact of time steps is significant and intermediate steps in a noise schedule are the most vulnerable to the attack. We also found two key insights through further analysis. First, we identify that DDIM is vulnerable to the attack for small sample sizes instead of achieving a lower FID. Second, sampling steps in hyperparameters are important for resistance to the attack, whereas the impact of sampling variances is quite limited.
    Federated Variational Inference Methods for Structured Latent Variable Models. (arXiv:2302.03314v1 [stat.ML])
    Federated learning methods, that is, methods that perform model training using data situated across different sources, whilst simultaneously not having the data leave their original source, are of increasing interest in a number of fields. However, despite this interest, the classes of models for which easily-applicable and sufficiently general approaches are available is limited, excluding many structured probabilistic models. We present a general yet elegant resolution to the aforementioned issue. The approach is based on adopting structured variational inference, an approach widely used in Bayesian machine learning, to the federated setting. Additionally, a communication-efficient variant analogous to the canonical FedAvg algorithm is explored. The effectiveness of the proposed algorithms are demonstrated, and their performance is compared on Bayesian multinomial regression, topic modelling, and mixed model examples.
    Realization of Causal Representation Learning to Adjust Confounding Bias in Latent Space. (arXiv:2211.08573v2 [cs.LG] UPDATED)
    Causal graphs are usually considered in a 2D plane, but it has rarely been noticed that within multiple relatively independent timelines, which is comparatively common in causality machine learning, the individual-level differences may lead to Causal Representation Bias (CRB). More importantly, such a blind spot has brought obstacles to interdisciplinary applications. Deep Learning (DL) methods overlooking CRBs confront the trouble of models' generalizability, while statistical analytics face difficulties in modeling individual-level features without a geometric global view. In this paper, we initially discuss the Geometric Meaning of causal graphs regarding multi-dimensional timelines; and, accordingly, analyze the scheme of CRB and explicitly define causal model generalization and individualization from a geometric perspective. We also spearhead a novel framework, Causal Representation Learning (CRL), to construct a valid learning plane (in latent space) for causal graphs, propose a particular autoencoder architecture to realize it, and experimentally prove the feasibility. Involved causal data includes Electronic Healthcare Records (EHR) to estimate medical effects and a hydrology dataset to forecast the environmentally influenced streamflow.
    A unified recipe for deriving (time-uniform) PAC-Bayes bounds. (arXiv:2302.03421v1 [stat.ML])
    We present a unified framework for deriving PAC-Bayesian generalization bounds. Unlike most previous literature on this topic, our bounds are anytime-valid (i.e., time-uniform), meaning that they hold at all stopping times, not only for a fixed sample size. Our approach combines four tools in the following order: (a) nonnegative supermartingales or reverse submartingales, (b) the method of mixtures, (c) the Donsker-Varadhan formula (or other convex duality principles), and (d) Ville's inequality. We derive time-uniform generalizations of well-known classical PAC-Bayes bounds, such as those of Seeger, McAllester, Maurer, and Catoni, in addition to many recent bounds. We also present several novel bounds and, more importantly, general techniques for constructing them. Despite being anytime-valid, our extensions remain as tight as their fixed-time counterparts. Moreover, they enable us to relax traditional assumptions; in particular, we consider nonstationary loss functions and non-i.i.d. data. In sum, we unify the derivation of past bounds and ease the search for future bounds: one may simply check if our supermartingale or submartingale conditions are met and, if so, be guaranteed a (time-uniform) PAC-Bayes bound.
    Convergence rates for momentum stochastic gradient descent with noise of machine learning type. (arXiv:2302.03550v1 [math.OC])
    We consider the momentum stochastic gradient descent scheme (MSGD) and its continuous-in-time counterpart in the context of non-convex optimization. We show almost sure exponential convergence of the objective function value for target functions that are Lipschitz continuous and satisfy the Polyak-Lojasiewicz inequality on the relevant domain, and under assumptions on the stochastic noise that are motivated by overparameterized supervised learning applications. Moreover, we optimize the convergence rate over the set of friction parameters and show that the MSGD process almost surely converges.
    Towards Skilled Population Curriculum for Multi-Agent Reinforcement Learning. (arXiv:2302.03429v1 [cs.AI])
    Recent advances in multi-agent reinforcement learning (MARL) allow agents to coordinate their behaviors in complex environments. However, common MARL algorithms still suffer from scalability and sparse reward issues. One promising approach to resolving them is automatic curriculum learning (ACL). ACL involves a student (curriculum learner) training on tasks of increasing difficulty controlled by a teacher (curriculum generator). Despite its success, ACL's applicability is limited by (1) the lack of a general student framework for dealing with the varying number of agents across tasks and the sparse reward problem, and (2) the non-stationarity of the teacher's task due to ever-changing student strategies. As a remedy for ACL, we introduce a novel automatic curriculum learning framework, Skilled Population Curriculum (SPC), which adapts curriculum learning to multi-agent coordination. Specifically, we endow the student with population-invariant communication and a hierarchical skill set, allowing it to learn cooperation and behavior skills from distinct tasks with varying numbers of agents. In addition, we model the teacher as a contextual bandit conditioned by student policies, enabling a team of agents to change its size while still retaining previously acquired skills. We also analyze the inherent non-stationarity of this multi-agent automatic curriculum teaching problem and provide a corresponding regret bound. Empirical results show that our method improves the performance, scalability and sample efficiency in several MARL environments.
    Revised Conditional t-SNE: Looking Beyond the Nearest Neighbors. (arXiv:2302.03493v1 [cs.LG])
    Conditional t-SNE (ct-SNE) is a recent extension to t-SNE that allows removal of known cluster information from the embedding, to obtain a visualization revealing structure beyond label information. This is useful, for example, when one wants to factor out unwanted differences between a set of classes. We show that ct-SNE fails in many realistic settings, namely if the data is well clustered over the labels in the original high-dimensional space. We introduce a revised method by conditioning the high-dimensional similarities instead of the low-dimensional similarities and storing within- and across-label nearest neighbors separately. This also enables the use of recently proposed speedups for t-SNE, improving the scalability. From experiments on synthetic data, we find that our proposed method resolves the considered problems and improves the embedding quality. On real data containing batch effects, the expected improvement is not always there. We argue revised ct-SNE is preferable overall, given its improved scalability. The results also highlight new open questions, such as how to handle distance variations between clusters.
    AMFPMC -- An improved method of detecting multiple types of drug-drug interactions using only known drug-drug interactions. (arXiv:2302.03355v1 [cs.LG])
    Adverse drug interactions are largely preventable causes of medical accidents, which frequently result in physician and emergency room encounters. The detection of drug interactions in a lab, prior to a drug's use in medical practice, is essential, however it is costly and time-consuming. Machine learning techniques can provide an efficient and accurate means of predicting possible drug-drug interactions and combat the growing problem of adverse drug interactions. Most existing models for predicting interactions rely on the chemical properties of drugs. While such models can be accurate, the required properties are not always available.
    A Categorical Archive of ChatGPT Failures. (arXiv:2302.03494v1 [cs.CL])
    Large language models have been demonstrated to be valuable in different fields. ChatGPT, developed by OpenAI, has been trained using massive amounts of data and simulates human conversation by comprehending context and generating appropriate responses. It has garnered significant attention due to its ability to effectively answer a broad range of human inquiries, with fluent and comprehensive answers surpassing prior public chatbots in both security and usefulness. However, a comprehensive analysis of ChatGPT's failures is lacking, which is the focus of this study. Ten categories of failures, including reasoning, factual errors, math, coding, and bias, are presented and discussed. The risks, limitations, and societal implications of ChatGPT are also highlighted. The goal of this study is to assist researchers and developers in enhancing future language models and chatbots.
    Sparse GEMINI for Joint Discriminative Clustering and Feature Selection. (arXiv:2302.03391v1 [stat.ML])
    Feature selection in clustering is a hard task which involves simultaneously the discovery of relevant clusters as well as relevant variables with respect to these clusters. While feature selection algorithms are often model-based through optimised model selection or strong assumptions on $p(\pmb{x})$, we introduce a discriminative clustering model trying to maximise a geometry-aware generalisation of the mutual information called GEMINI with a simple $\ell_1$ penalty: the Sparse GEMINI. This algorithm avoids the burden of combinatorial feature subset exploration and is easily scalable to high-dimensional data and large amounts of samples while only designing a clustering model $p_\theta(y|\pmb{x})$. We demonstrate the performances of Sparse GEMINI on synthetic datasets as well as large-scale datasets. Our results show that Sparse GEMINI is a competitive algorithm and has the ability to select relevant subsets of variables with respect to the clustering without using relevance criteria or prior hypotheses.
    Mismatched estimation of non-symmetric rank-one matrices corrupted by structured noise. (arXiv:2302.03306v1 [cs.IT])
    We study the performance of a Bayesian statistician who estimates a rank-one signal corrupted by non-symmetric rotationally invariant noise with a generic distribution of singular values. As the signal-to-noise ratio and the noise structure are unknown, a Gaussian setup is incorrectly assumed. We derive the exact analytic expression for the error of the mismatched Bayes estimator and also provide the analysis of an approximate message passing (AMP) algorithm. The first result exploits the asymptotic behavior of spherical integrals for rectangular matrices and of low-rank matrix perturbations; the second one relies on the design and analysis of an auxiliary AMP. The numerical experiments show that there is a performance gap between the AMP and Bayes estimators, which is due to the incorrect estimation of the signal norm.
    Diverse Probabilistic Trajectory Forecasting with Admissibility Constraints. (arXiv:2302.03462v1 [cs.LG])
    Predicting multiple trajectories for road users is important for automated driving systems: ego-vehicle motion planning indeed requires a clear view of the possible motions of the surrounding agents. However, the generative models used for multiple-trajectory forecasting suffer from a lack of diversity in their proposals. To avoid this form of collapse, we propose a novel method for structured prediction of diverse trajectories. To this end, we complement an underlying pretrained generative model with a diversity component, based on a determinantal point process (DPP). We balance and structure this diversity with the inclusion of knowledge-based quality constraints, independent from the underlying generative model. We combine these two novel components with a gating operation, ensuring that the predictions are both diverse and within the drivable area. We demonstrate on the nuScenes driving dataset the relevance of our compound approach, which yields significant improvements in the diversity and the quality of the generated trajectories.
    Ten Lessons We Have Learned in the New "Sparseland": A Short Handbook for Sparse Neural Network Researchers. (arXiv:2302.02596v2 [cs.LG] UPDATED)
    This article does not propose any novel algorithm or new hardware for sparsity. Instead, it aims to serve the "common good" for the increasingly prosperous Sparse Neural Network (SNN) research community. We attempt to summarize some most common confusions in SNNs, that one may come across in various scenarios such as paper review/rebuttal and talks - many drawn from the authors' own bittersweet experiences! We feel that doing so is meaningful and timely, since the focus of SNN research is notably shifting from traditional pruning to more diverse and profound forms of sparsity before, during, and after training. The intricate relationships between their scopes, assumptions, and approaches lead to misunderstandings, for non-experts or even experts in SNNs. In response, we summarize ten Q\&As of SNNs from many key aspects, including dense vs. sparse, unstructured sparse vs. structured sparse, pruning vs. sparse training, dense-to-sparse training vs. sparse-to-sparse training, static sparsity vs. dynamic sparsity, before-training/during-training vs. post-training sparsity, and many more. We strive to provide proper and generically applicable answers to clarify those confusions to the best extent possible. We hope our summary provides useful general knowledge for people who want to enter and engage with this exciting community; and also provides some "mind of ease" convenience for SNN researchers to explain their work in the right contexts. At the very least (and perhaps as this article's most insignificant target functionality), if you are writing/planning to write a paper or rebuttal in the field of SNNs, we hope some of our answers could help you!
    Deep-OSG: A deep learning approach for approximating a family of operators in semigroup to model unknown autonomous systems. (arXiv:2302.03358v1 [cs.LG])
    This paper proposes a novel deep learning approach for approximating evolution operators and modeling unknown autonomous dynamical systems using time series data collected at varied time lags. It is a sequel to the previous works [T. Qin, K. Wu, and D. Xiu, J. Comput. Phys., 395:620--635, 2019], [K. Wu and D. Xiu, J. Comput. Phys., 408:109307, 2020], and [Z. Chen, V. Churchill, K. Wu, and D. Xiu, J. Comput. Phys., 449:110782, 2022], which focused on learning single evolution operator with a fixed time step. This paper aims to learn a family of evolution operators with variable time steps, which constitute a semigroup for an autonomous system. The semigroup property is very crucial and links the system's evolutionary behaviors across varying time scales, but it was not considered in the previous works. We propose for the first time a framework of embedding the semigroup property into the data-driven learning process, through a novel neural network architecture and new loss functions. The framework is very feasible, can be combined with any suitable neural networks, and is applicable to learning general autonomous ODEs and PDEs. We present the rigorous error estimates and variance analysis to understand the prediction accuracy and robustness of our approach, showing the remarkable advantages of semigroup awareness in our model. Moreover, our approach allows one to arbitrarily choose the time steps for prediction and ensures that the predicted results are well self-matched and consistent. Extensive numerical experiments demonstrate that embedding the semigroup property notably reduces the data dependency of deep learning models and greatly improves the accuracy, robustness, and stability for long-time prediction.
    Towards Meaningful Anomaly Detection: The Effect of Counterfactual Explanations on the Investigation of Anomalies in Multivariate Time Series. (arXiv:2302.03302v1 [cs.LG])
    Detecting rare events is essential in various fields, e.g., in cyber security or maintenance. Often, human experts are supported by anomaly detection systems as continuously monitoring the data is an error-prone and tedious task. However, among the anomalies detected may be events that are rare, e.g., a planned shutdown of a machine, but are not the actual event of interest, e.g., breakdowns of a machine. Therefore, human experts are needed to validate whether the detected anomalies are relevant. We propose to support this anomaly investigation by providing explanations of anomaly detection. Related work only focuses on the technical implementation of explainable anomaly detection and neglects the subsequent human anomaly investigation. To address this research gap, we conduct a behavioral experiment using records of taxi rides in New York City as a testbed. Participants are asked to differentiate extreme weather events from other anomalous events such as holidays or sporting events. Our results show that providing counterfactual explanations do improve the investigation of anomalies, indicating potential for explainable anomaly detection in general.
    Learning Discretized Neural Networks under Ricci Flow. (arXiv:2302.03390v1 [cs.LG])
    In this paper, we consider Discretized Neural Networks (DNNs) consisting of low-precision weights and activations, which suffer from either infinite or zero gradients caused by the non-differentiable discrete function in the training process. In this case, most training-based DNNs use the standard Straight-Through Estimator (STE) to approximate the gradient w.r.t. discrete value. However, the standard STE will cause the gradient mismatch problem, i.e., the approximated gradient direction may deviate from the steepest descent direction. In other words, the gradient mismatch implies the approximated gradient with perturbations. To address this problem, we introduce the duality theory to regard the perturbation of the approximated gradient as the perturbation of the metric in Linearly Nearly Euclidean (LNE) manifolds. Simultaneously, under the Ricci-DeTurck flow, we prove the dynamical stability and convergence of the LNE metric with the $L^2$-norm perturbation, which can provide a theoretical solution for the gradient mismatch problem. In practice, we also present the steepest descent gradient flow for DNNs on LNE manifolds from the viewpoints of the information geometry and mirror descent. The experimental results on various datasets demonstrate that our method achieves better and more stable performance for DNNs than other representative training-based methods.
    Ensemble Value Functions for Efficient Exploration in Multi-Agent Reinforcement Learning. (arXiv:2302.03439v1 [cs.MA])
    Cooperative multi-agent reinforcement learning (MARL) requires agents to explore to learn to cooperate. Existing value-based MARL algorithms commonly rely on random exploration, such as $\epsilon$-greedy, which is inefficient in discovering multi-agent cooperation. Additionally, the environment in MARL appears non-stationary to any individual agent due to the simultaneous training of other agents, leading to highly variant and thus unstable optimisation signals. In this work, we propose ensemble value functions for multi-agent exploration (EMAX), a general framework to extend any value-based MARL algorithm. EMAX trains ensembles of value functions for each agent to address the key challenges of exploration and non-stationarity: (1) The uncertainty of value estimates across the ensemble is used in a UCB policy to guide the exploration of agents to parts of the environment which require cooperation. (2) Average value estimates across the ensemble serve as target values. These targets exhibit lower variance compared to commonly applied target networks and we show that they lead to more stable gradients during the optimisation. We instantiate three value-based MARL algorithms with EMAX, independent DQN, VDN and QMIX, and evaluate them in 21 tasks across four environments. Using ensembles of five value functions, EMAX improves sample efficiency and final evaluation returns of these algorithms by 54%, 55%, and 844%, respectively, averaged all 21 tasks.
    Multi-Task Recommendations with Reinforcement Learning. (arXiv:2302.03328v1 [cs.IR])
    In recent years, Multi-task Learning (MTL) has yielded immense success in Recommender System (RS) applications. However, current MTL-based recommendation models tend to disregard the session-wise patterns of user-item interactions because they are predominantly constructed based on item-wise datasets. Moreover, balancing multiple objectives has always been a challenge in this field, which is typically avoided via linear estimations in existing works. To address these issues, in this paper, we propose a Reinforcement Learning (RL) enhanced MTL framework, namely RMTL, to combine the losses of different recommendation tasks using dynamic weights. To be specific, the RMTL structure can address the two aforementioned issues by (i) constructing an MTL environment from session-wise interactions and (ii) training multi-task actor-critic network structure, which is compatible with most existing MTL-based recommendation models, and (iii) optimizing and fine-tuning the MTL loss function using the weights generated by critic networks. Experiments on two real-world public datasets demonstrate the effectiveness of RMTL with a higher AUC against state-of-the-art MTL-based recommendation models. Additionally, we evaluate and validate RMTL's compatibility and transferability across various MTL models.
    Towards a User Privacy-Aware Mobile Gaming App Installation Prediction Model. (arXiv:2302.03332v1 [cs.LG])
    Over the past decade, programmatic advertising has received a great deal of attention in the online advertising industry. A real-time bidding (RTB) system is rapidly becoming the most popular method to buy and sell online advertising impressions. Within the RTB system, demand-side platforms (DSP) aim to spend advertisers' campaign budgets efficiently while maximizing profit, seeking impressions that result in high user responses, such as clicks or installs. In the current study, we investigate the process of predicting a mobile gaming app installation from the point of view of a particular DSP, while paying attention to user privacy, and exploring the trade-off between privacy preservation and model performance. There are multiple levels of potential threats to user privacy, depending on the privacy leaks associated with the data-sharing process, such as data transformation or de-anonymization. To address these concerns, privacy-preserving techniques were proposed, such as cryptographic approaches, for training privacy-aware machine-learning models. However, the ability to train a mobile gaming app installation prediction model without using user-level data, can prevent these threats and protect the users' privacy, even though the model's ability to predict may be impaired. Additionally, current laws might force companies to declare that they are collecting data, and might even give the user the option to opt out of such data collection, which might threaten companies' business models in digital advertising, which are dependent on the collection and use of user-level data. We conclude that privacy-aware models might still preserve significant capabilities, enabling companies to make better decisions, dependent on the privacy-efficacy trade-off utility function of each case.
    Decentralized Inexact Proximal Gradient Method With Network-Independent Stepsizes for Convex Composite Optimization. (arXiv:2302.03238v1 [math.OC])
    This paper considers decentralized convex composite optimization over undirected and connected networks, where the local loss function contains both smooth and nonsmooth terms. For this problem, a novel CTA (Combine-Then-Adapt)-based decentralized algorithm is proposed under uncoordinated network-independent constant stepsizes. Particularly, the proposed algorithm only needs to approximately solve a sequence of proximal mappings, which benefits the decentralized composite optimization where the proximal mappings of the nonsmooth loss functions may not have analytic solutions. For the general convex case, we prove the O(1/k) convergence rate of the proposed algorithm, which can be improved to o(1/k) if the proximal mappings are solved exactly. Moreover, with metric subregularity, we establish the linear convergence rate. Finally, the numerical experiments demonstrate the efficiency of the algorithm.
    An Informative Path Planning Framework for Active Learning in UAV-based Semantic Mapping. (arXiv:2302.03347v1 [cs.RO])
    Unmanned aerial vehicles (UAVs) are crucial for aerial mapping and general monitoring tasks. Recent progress in deep learning enabled automated semantic segmentation of imagery to facilitate the interpretation of large-scale complex environments. Commonly used supervised deep learning for segmentation relies on large amounts of pixel-wise labelled data, which is tedious and costly to annotate. The domain-specific visual appearance of aerial environments often prevents the usage of models pre-trained on a static dataset. To address this, we propose a novel general planning framework for UAVs to autonomously acquire informative training images for model re-training. We leverage multiple acquisition functions and fuse them into probabilistic terrain maps. Our framework combines the mapped acquisition function information into the UAV's planning objectives. In this way, the UAV adaptively acquires informative aerial images to be manually labelled for model re-training. Experimental results on real-world data and in a photorealistic simulation show that our framework maximises model performance and drastically reduces labelling efforts. Our map-based planners outperform state-of-the-art local planning.
    A conceptual model for leaving the data-centric approach in machine learning. (arXiv:2302.03361v1 [cs.LG])
    For a long time, machine learning (ML) has been seen as the abstract problem of learning relationships from data independent of the surrounding settings. This has recently been challenged, and methods have been proposed to include external constraints in the machine learning models. These methods usually come from application-specific fields, such as de-biasing algorithms in the field of fairness in ML or physical constraints in the fields of physics and engineering. In this paper, we present and discuss a conceptual high-level model that unifies these approaches in a common language. We hope that this will enable and foster exchange between the different fields and their different methods for including external constraints into ML models, and thus leaving purely data-centric approaches.
    Online Reinforcement Learning with Uncertain Episode Lengths. (arXiv:2302.03608v1 [cs.LG])
    Existing episodic reinforcement algorithms assume that the length of an episode is fixed across time and known a priori. In this paper, we consider a general framework of episodic reinforcement learning when the length of each episode is drawn from a distribution. We first establish that this problem is equivalent to online reinforcement learning with general discounting where the learner is trying to optimize the expected discounted sum of rewards over an infinite horizon, but where the discounting function is not necessarily geometric. We show that minimizing regret with this new general discounting is equivalent to minimizing regret with uncertain episode lengths. We then design a reinforcement learning algorithm that minimizes regret with general discounting but acts for the setting with uncertain episode lengths. We instantiate our general bound for different types of discounting, including geometric and polynomial discounting. We also show that we can obtain similar regret bounds even when the uncertainty over the episode lengths is unknown, by estimating the unknown distribution over time. Finally, we compare our learning algorithms with existing value-iteration based episodic RL algorithms in a grid-world environment.
    A Privacy-Preserving Hybrid Federated Learning Framework for Financial Crime Detection. (arXiv:2302.03654v1 [cs.LG])
    The recent decade witnessed a surge of increase in financial crimes across the public and private sectors, with an average cost of scams of \$102m to financial institutions in 2022. Developing a mechanism for battling financial crimes is an impending task that requires in-depth collaboration from multiple institutions, and yet such collaboration imposed significant technical challenges due to the privacy and security requirements of distributed financial data. For example, consider the Society for Worldwide Interbank Financial Telecommunications (SWIFT) system, which generates 42 million transactions per day across its 11,000 global institutions. Training a detection model of fraudulent transactions requires not only secured SWIFT transactions but also the private account activities of those involved in each transaction from corresponding bank systems. The distributed nature of both samples and features prevents most existing learning systems from being directly adopted to handle the data mining task. In this paper, we collectively address these challenges by proposing a hybrid federated learning system that offers secure and privacy-aware learning and inference for financial crime detection. We conduct extensive empirical studies to evaluate the proposed framework's detection performance and privacy-protection capability, evaluating its robustness against common malicious attacks of collaborative learning. We release our source code at https://github.com/illidanlab/HyFL .
    Performative Reinforcement Learning. (arXiv:2207.00046v2 [cs.LG] UPDATED)
    We introduce the framework of performative reinforcement learning where the policy chosen by the learner affects the underlying reward and transition dynamics of the environment. Following the recent literature on performative prediction~\cite{Perdomo et. al., 2020}, we introduce the concept of performatively stable policy. We then consider a regularized version of the reinforcement learning problem and show that repeatedly optimizing this objective converges to a performatively stable policy under reasonable assumptions on the transition dynamics. Our proof utilizes the dual perspective of the reinforcement learning problem and may be of independent interest in analyzing the convergence of other algorithms with decision-dependent environments. We then extend our results for the setting where the learner just performs gradient ascent steps instead of fully optimizing the objective, and for the setting where the learner has access to a finite number of trajectories from the changed environment. For both settings, we leverage the dual formulation of performative reinforcement learning and establish convergence to a stable solution. Finally, through extensive experiments on a grid-world environment, we demonstrate the dependence of convergence on various parameters e.g. regularization, smoothness, and the number of samples.
    Equivariant Representation Learning via Class-Pose Decomposition. (arXiv:2207.03116v3 [cs.LG] UPDATED)
    We introduce a general method for learning representations that are equivariant to symmetries of data. Our central idea is to decompose the latent space into an invariant factor and the symmetry group itself. The components semantically correspond to intrinsic data classes and poses respectively. The learner is trained on a loss encouraging equivariance based on supervision from relative symmetry information. The approach is motivated by theoretical results from group theory and guarantees representations that are lossless, interpretable and disentangled. We provide an empirical investigation via experiments involving datasets with a variety of symmetries. Results show that our representations capture the geometry of data and outperform other equivariant representation learning frameworks.
    Heterophily-Aware Graph Attention Network. (arXiv:2302.03228v1 [cs.LG])
    Graph Neural Networks (GNNs) have shown remarkable success in graph representation learning. Unfortunately, current weight assignment schemes in standard GNNs, such as the calculation based on node degrees or pair-wise representations, can hardly be effective in processing the networks with heterophily, in which the connected nodes usually possess different labels or features. Existing heterophilic GNNs tend to ignore the modeling of heterophily of each edge, which is also a vital part in tackling the heterophily problem. In this paper, we firstly propose a heterophily-aware attention scheme and reveal the benefits of modeling the edge heterophily, i.e., if a GNN assigns different weights to edges according to different heterophilic types, it can learn effective local attention patterns, which enable nodes to acquire appropriate information from distinct neighbors. Then, we propose a novel Heterophily-Aware Graph Attention Network (HA-GAT) by fully exploring and utilizing the local distribution as the underlying heterophily, to handle the networks with different homophily ratios. To demonstrate the effectiveness of the proposed HA-GAT, we analyze the proposed heterophily-aware attention scheme and local distribution exploration, by seeking for an interpretation from their mechanism. Extensive results demonstrate that our HA-GAT achieves state-of-the-art performances on eight datasets with different homophily ratios in both the supervised and semi-supervised node classification tasks.
    On the symmetries in the dynamics of wide two-layer neural networks. (arXiv:2211.08771v3 [cs.LG] UPDATED)
    We consider the idealized setting of gradient flow on the population risk for infinitely wide two-layer ReLU neural networks (without bias), and study the effect of symmetries on the learned parameters and predictors. We first describe a general class of symmetries which, when satisfied by the target function $f^*$ and the input distribution, are preserved by the dynamics. We then study more specific cases. When $f^*$ is odd, we show that the dynamics of the predictor reduces to that of a (non-linearly parameterized) linear predictor, and its exponential convergence can be guaranteed. When $f^*$ has a low-dimensional structure, we prove that the gradient flow PDE reduces to a lower-dimensional PDE. Furthermore, we present informal and numerical arguments that suggest that the input neurons align with the lower-dimensional structure of the problem.
    Elucidating Robust Learning with Uncertainty-Aware Corruption Pattern Estimation. (arXiv:2111.01632v2 [cs.LG] UPDATED)
    Robust learning methods aim to learn a clean target distribution from noisy and corrupted training data where a specific corruption pattern is often assumed a priori. Our proposed method can not only successfully learn the clean target distribution from a dirty dataset but also can estimate the underlying noise pattern. To this end, we leverage a mixture-of-experts model that can distinguish two different types of predictive uncertainty, aleatoric and epistemic uncertainty. We show that the ability to estimate the uncertainty plays a significant role in elucidating the corruption patterns as these two objectives are tightly intertwined. We also present a novel validation scheme for evaluating the performance of the corruption pattern estimation. Our proposed method is extensively assessed in terms of both robustness and corruption pattern estimation through a number of domains, including computer vision and natural language processing.
    Gradient-based Bi-level Optimization for Deep Learning: A Survey. (arXiv:2207.11719v3 [cs.LG] UPDATED)
    Bi-level optimization, especially the gradient-based category, has been widely used in the deep learning community including hyperparameter optimization and meta knowledge extraction. Bi-level optimization embeds one problem within another and the gradient-based category solves the outer level task by computing the hypergradient, which is much more efficient than classical methods such as the evolutionary algorithm. In this survey, we first give a formal definition of the gradient-based bi-level optimization. Secondly, we illustrate how to formulate a research problem as a bi-level optimization problem, which is of great practical use for beginners. More specifically, there are two formulations: the single-task formulation to optimize hyperparameters such as regularization parameters and the distilled data, and the multi-task formulation to extract meta knowledge such as the model initialization. With a bi-level formulation, we then discuss four bi-level optimization solvers to update the outer variable including explicit gradient update, proxy update, implicit function update, and closed-form update. Last but not least, we conclude the survey by pointing out the great potential of gradient-based bi-level optimization on science problems (AI4Science).
    Automatic Sleep Stage Classification with Cross-modal Self-supervised Features from Deep Brain Signals. (arXiv:2302.03227v1 [cs.LG])
    The detection of human sleep stages is widely used in the diagnosis and intervention of neurological and psychiatric diseases. Some patients with deep brain stimulator implanted could have their neural activities recorded from the deep brain. Sleep stage classification based on deep brain recording has great potential to provide more precise treatment for patients. The accuracy and generalizability of existing sleep stage classifiers based on local field potentials are still limited. We proposed an applicable cross-modal transfer learning method for sleep stage classification with implanted devices. This end-to-end deep learning model contained cross-modal self-supervised feature representation, self-attention, and classification framework. We tested the model with deep brain recording data from 12 patients with Parkinson's disease. The best total accuracy reached 83.2% for sleep stage classification. Results showed speech self-supervised features catch the conversion pattern of sleep stages effectively. We provide a new method on transfer learning from acoustic signals to local field potentials. This method supports an effective solution for the insufficient scale of clinical data. This sleep stage classification model could be adapted to chronic and continuous monitor sleep for Parkinson's patients in daily life, and potentially utilized for more precise treatment in deep brain-machine interfaces, such as closed-loop deep brain stimulation.
    Leveraging Demonstrations to Improve Online Learning: Quality Matters. (arXiv:2302.03319v1 [cs.LG])
    We investigate the extent to which offline demonstration data can improve online learning. It is natural to expect some improvement, but the question is how, and by how much? We show that the degree of improvement must depend on the quality of the demonstration data. To generate portable insights, we focus on Thompson sampling (TS) applied to a multi-armed bandit as a prototypical online learning algorithm and model. The demonstration data is generated by an expert with a given competence level, a notion we introduce. We propose an informed TS algorithm that utilizes the demonstration data in a coherent way through Bayes' rule and derive a prior-dependent Bayesian regret bound. This offers insight into how pretraining can greatly improve online performance and how the degree of improvement increases with the expert's competence level. We also develop a practical, approximate informed TS algorithm through Bayesian bootstrapping and show substantial empirical regret reduction through experiments.
    Attacking Cooperative Multi-Agent Reinforcement Learning by Adversarial Minority Influence. (arXiv:2302.03322v1 [cs.LG])
    Cooperative multi-agent reinforcement learning (c-MARL) offers a general paradigm for a group of agents to achieve a shared goal by taking individual decisions, yet is found to be vulnerable to adversarial attacks. Though harmful, adversarial attacks also play a critical role in evaluating the robustness and finding blind spots of c-MARL algorithms. However, existing attacks are not sufficiently strong and practical, which is mainly due to the ignorance of complex influence between agents and cooperative nature of victims in c-MARL. In this paper, we propose adversarial minority influence (AMI), the first practical attack against c-MARL by introducing an adversarial agent. AMI addresses the aforementioned problems by unilaterally influencing other cooperative victims to a targeted worst-case cooperation. Technically, to maximally deviate victim policy under complex agent-wise influence, our unilateral attack characterize and maximize the influence from adversary to victims. This is done by adapting a unilateral agent-wise relation metric derived from mutual information, which filters out the detrimental influence from victims to adversary. To fool victims into a jointly worst-case failure, our targeted attack influence victims to a long-term, cooperatively worst case by distracting each victim to a specific target. Such target is learned by a reinforcement learning agent in a trial-and-error process. Extensive experiments in simulation environments, including discrete control (SMAC), continuous control (MAMujoco) and real-world robot swarm control demonstrate the superiority of our AMI approach. Our codes are available in https://anonymous.4open.science/r/AMI.
    Phase Transitions in the Detection of Correlated Databases. (arXiv:2302.03380v1 [cs.LG])
    We study the problem of detecting the correlation between two Gaussian databases $\mathsf{X}\in\mathbb{R}^{n\times d}$ and $\mathsf{Y}^{n\times d}$, each composed of $n$ users with $d$ features. This problem is relevant in the analysis of social media, computational biology, etc. We formulate this as a hypothesis testing problem: under the null hypothesis, these two databases are statistically independent. Under the alternative, however, there exists an unknown permutation $\sigma$ over the set of $n$ users (or, row permutation), such that $\mathsf{X}$ is $\rho$-correlated with $\mathsf{Y}^\sigma$, a permuted version of $\mathsf{Y}$. We determine sharp thresholds at which optimal testing exhibits a phase transition, depending on the asymptotic regime of $n$ and $d$. Specifically, we prove that if $\rho^2d\to0$, as $d\to\infty$, then weak detection (performing slightly better than random guessing) is statistically impossible, irrespectively of the value of $n$. This compliments the performance of a simple test that thresholds the sum all entries of $\mathsf{X}^T\mathsf{Y}$. Furthermore, when $d$ is fixed, we prove that strong detection (vanishing error probability) is impossible for any $\rho<\rho^\star$, where $\rho^\star$ is an explicit function of $d$, while weak detection is again impossible as long as $\rho^2d\to0$. These results close significant gaps in current recent related studies.
    Reducing SO(3) Convolutions to SO(2) for Efficient Equivariant GNNs. (arXiv:2302.03655v1 [cs.LG])
    Graph neural networks that model 3D data, such as point clouds or atoms, are typically desired to be $SO(3)$ equivariant, i.e., equivariant to 3D rotations. Unfortunately equivariant convolutions, which are a fundamental operation for equivariant networks, increase significantly in computational complexity as higher-order tensors are used. In this paper, we address this issue by reducing the $SO(3)$ convolutions or tensor products to mathematically equivalent convolutions in $SO(2)$ . This is accomplished by aligning the node embeddings' primary axis with the edge vectors, which sparsifies the tensor product and reduces the computational complexity from $O(L^6)$ to $O(L^3)$, where $L$ is the degree of the representation. We demonstrate the potential implications of this improvement by proposing the Equivariant Spherical Channel Network (eSCN), a graph neural network utilizing our novel approach to equivariant convolutions, which achieves state-of-the-art results on the large-scale OC-20 dataset.
    Data augmentation for machine learning of chemical process flowsheets. (arXiv:2302.03379v1 [cs.LG])
    Artificial intelligence has great potential for accelerating the design and engineering of chemical processes. Recently, we have shown that transformer-based language models can learn to auto-complete chemical process flowsheets using the SFILES 2.0 string notation. Also, we showed that language translation models can be used to translate Process Flow Diagrams (PFDs) into Process and Instrumentation Diagrams (P&IDs). However, artificial intelligence methods require big data and flowsheet data is currently limited. To mitigate this challenge of limited data, we propose a new data augmentation methodology for flowsheet data that is represented in the SFILES 2.0 notation. We show that the proposed data augmentation improves the performance of artificial intelligence-based process design models. In our case study flowsheet data augmentation improved the prediction uncertainty of the flowsheet autocompletion model by 14.7%. In the future, our flowsheet data augmentation can be used for other machine learning algorithms on chemical process flowsheets that are based on SFILES notation.
    Reply to: Modern graph neural networks do worse than classical greedy algorithms in solving combinatorial optimization problems like maximum independent set. (arXiv:2302.03602v1 [cond-mat.dis-nn])
    We provide a comprehensive reply to the comment written by Chiara Angelini and Federico Ricci-Tersenghi [arXiv:2206.13211] and argue that the comment singles out one particular non-representative example problem, entirely focusing on the maximum independent set (MIS) on sparse graphs, for which greedy algorithms are expected to perform well. Conversely, we highlight the broader algorithmic development underlying our original work, and (within our original framework) provide additional numerical results showing sizable improvements over our original results, thereby refuting the comment's performance statements. We also provide results showing run-time scaling superior to the results provided by Angelini and Ricci-Tersenghi. Furthermore, we show that the proposed set of random d-regular graphs does not provide a universal set of benchmark instances, nor do greedy heuristics provide a universal algorithmic baseline. Finally, we argue that the internal (parallel) anatomy of graph neural networks is very different from the (sequential) nature of greedy algorithms and emphasize that graph neural networks have demonstrated their potential for superior scalability compared to existing heuristics such as parallel tempering. We conclude by discussing the conceptual novelty of our work and outline some potential extensions.
    A One-Size-Fits-All Solution to Conservative Bandit Problems. (arXiv:2012.07341v4 [cs.LG] UPDATED)
    In this paper, we study a family of conservative bandit problems (CBPs) with sample-path reward constraints, i.e., the learner's reward performance must be at least as well as a given baseline at any time. We propose a One-Size-Fits-All solution to CBPs and present its applications to three encompassed problems, i.e. conservative multi-armed bandits (CMAB), conservative linear bandits (CLB) and conservative contextual combinatorial bandits (CCCB). Different from previous works which consider high probability constraints on the expected reward, we focus on a sample-path constraint on the actually received reward, and achieve better theoretical guarantees ($T$-independent additive regrets instead of $T$-dependent) and empirical performance. Furthermore, we extend the results and consider a novel conservative mean-variance bandit problem (MV-CBP), which measures the learning performance with both the expected reward and variability. For this extended problem, we provide a novel algorithm with $O(1/T)$ normalized additive regrets ($T$-independent in the cumulative form) and validate this result through empirical evaluation.
    Transfer learning for process design with reinforcement learning. (arXiv:2302.03375v1 [cs.LG])
    Process design is a creative task that is currently performed manually by engineers. Artificial intelligence provides new potential to facilitate process design. Specifically, reinforcement learning (RL) has shown some success in automating process design by integrating data-driven models that learn to build process flowsheets with process simulation in an iterative design process. However, one major challenge in the learning process is that the RL agent demands numerous process simulations in rigorous process simulators, thereby requiring long simulation times and expensive computational power. Therefore, typically short-cut simulation methods are employed to accelerate the learning process. Short-cut methods can, however, lead to inaccurate results. We thus propose to utilize transfer learning for process design with RL in combination with rigorous simulation methods. Transfer learning is an established approach from machine learning that stores knowledge gained while solving one problem and reuses this information on a different target domain. We integrate transfer learning in our RL framework for process design and apply it to an illustrative case study comprising equilibrium reactions, azeotropic separation, and recycles, our method can design economically feasible flowsheets with stable interaction with DWSIM. Our results show that transfer learning enables RL to economically design feasible flowsheets with DWSIM, resulting in a flowsheet with an 8% higher revenue. And the learning time can be reduced by a factor of 2.
    Machine learning benchmarks for the classification of equivalent circuit models from solid-state electrochemical impedance spectra. (arXiv:2302.03362v1 [cs.LG])
    Analysis of Electrochemical Impedance Spectroscopy (EIS) data for electrochemical systems often consists of defining an Equivalent Circuit Model (ECM) using expert knowledge and then optimizing the model parameters to deconvolute various resistance, capacitive, inductive, or diffusion responses. For small data sets, this procedure can be conducted manually; however, it is not feasible to manually define a proper ECM for extensive data sets with a wide range of EIS responses. Automatic identification of an ECM would substantially accelerate the analysis of large sets of EIS data. Here, we showcase machine learning methods developed during the BatteryDEV hackathon to classify the ECMs of 9,300 EIS measurements provided by QuantumScape. The best-performing approach is a gradient-boosted tree model utilizing a library to automatically generate features, followed by a random forest model using the raw spectral data. A convolutional neural network using boolean images of Nyquist representations is presented as an alternative, although it achieves a lower accuracy. We publish the data and open source the associated code. The approaches described in this article can serve as benchmarks for further studies. A key remaining challenge is that the labels contain uncertainty and human bias, underlined by the performance of the trained models.
    Med-NCA: Robust and Lightweight Segmentation with Neural Cellular Automata. (arXiv:2302.03473v1 [eess.IV])
    Access to the proper infrastructure is critical when performing medical image segmentation with Deep Learning. This requirement makes it difficult to run state-of-the-art segmentation models in resource-constrained scenarios like primary care facilities in rural areas and during crises. The recently emerging field of Neural Cellular Automata (NCA) has shown that locally interacting one-cell models can achieve competitive results in tasks such as image generation or segmentations in low-resolution inputs. However, they are constrained by high VRAM requirements and the difficulty of reaching convergence for high-resolution images. To counteract these limitations we propose Med-NCA, an end-to-end NCA training pipeline for high-resolution image segmentation. Our method follows a two-step process. Global knowledge is first communicated between cells across the downscaled image. Following that, patch-based segmentation is performed. Our proposed Med-NCA outperforms the classic UNet by 2% and 3% Dice for hippocampus and prostate segmentation, respectively, while also being 500 times smaller. We also show that Med-NCA is by design invariant with respect to image scale, shape and translation, experiencing only slight performance degradation even with strong shifts; and is robust against MRI acquisition artefacts. Med-NCA enables high-resolution medical image segmentation even on a Raspberry Pi B+, arguably the smallest device able to run PyTorch and that can be powered by a standard power bank.
    Representation Theory for Geometric Quantum Machine Learning. (arXiv:2210.07980v2 [quant-ph] UPDATED)
    Recent advances in classical machine learning have shown that creating models with inductive biases encoding the symmetries of a problem can greatly improve performance. Importation of these ideas, combined with an existing rich body of work at the nexus of quantum theory and symmetry, has given rise to the field of Geometric Quantum Machine Learning (GQML). Following the success of its classical counterpart, it is reasonable to expect that GQML will play a crucial role in developing problem-specific and quantum-aware models capable of achieving a computational advantage. Despite the simplicity of the main idea of GQML -- create architectures respecting the symmetries of the data -- its practical implementation requires a significant amount of knowledge of group representation theory. We present an introduction to representation theory tools from the optics of quantum learning, driven by key examples involving discrete and continuous groups. These examples are sewn together by an exposition outlining the formal capture of GQML symmetries via "label invariance under the action of a group representation", a brief (but rigorous) tour through finite and compact Lie group representation theory, a reexamination of ubiquitous tools like Haar integration and twirling, and an overview of some successful strategies for detecting symmetries.  ( 2 min )
    Undersampling and Cumulative Class Re-decision Methods to Improve Detection of Agitation in People with Dementia. (arXiv:2302.03224v1 [cs.LG])
    Agitation is one of the most prevalent symptoms in people with dementia (PwD) that can place themselves and the caregiver's safety at risk. Developing objective agitation detection approaches is important to support health and safety of PwD living in a residential setting. In a previous study, we collected multimodal wearable sensor data from 17 participants for 600 days and developed machine learning models for predicting agitation in one-minute windows. However, there are significant limitations in the dataset, such as imbalance problem and potential imprecise labels as the occurrence of agitation is much rarer in comparison to the normal behaviours. In this paper, we first implement different undersampling methods to eliminate the imbalance problem, and come to the conclusion that only 20% of normal behaviour data are adequate to train a competitive agitation detection model. Then, we design a weighted undersampling method to evaluate the manual labeling mechanism given the ambiguous time interval (ATI) assumption. After that, the postprocessing method of cumulative class re-decision (CCR) is proposed based on the historical sequential information and continuity characteristic of agitation, improving the decision-making performance for the potential application of agitation detection system. The results show that a combination of undersampling and CCR improves best F1-score by 26.6% and other metrics to varying degrees with less training time and data used, and inspires a way to find the potential range of optimal threshold reference for clinical purpose.  ( 2 min )
    Autodecompose: A generative self-supervised model for semantic decomposition. (arXiv:2302.03124v1 [cs.LG])
    We introduce Autodecompose, a novel self-supervised generative model that decomposes data into two semantically independent properties: the desired property, which captures a specific aspect of the data (e.g. the voice in an audio signal), and the context property, which aggregates all other information (e.g. the content of the audio signal), without any labels given. Autodecompose uses two complementary augmentations, one that manipulates the context while preserving the desired property and the other that manipulates the desired property while preserving the context. The augmented variants of the data are encoded by two encoders and reconstructed by a decoder. We prove that one of the encoders embeds the desired property while the other embeds the context property. We apply Autodecompose to audio signals to encode sound source (human voice) and content. We pre-trained the model on YouTube and LibriSpeech datasets and fine-tuned in a self-supervised manner without exposing the labels. Our results showed that, using the sound source encoder of pre-trained Autodecompose, a linear classifier achieves F1 score of 97.6\% in recognizing the voice of 30 speakers using only 10 seconds of labeled samples, compared to 95.7\% for supervised models. Additionally, our experiments showed that Autodecompose is robust against overfitting even when a large model is pre-trained on a small dataset. A large Autodecompose model was pre-trained from scratch on 60 seconds of audio from 3 speakers achieved over 98.5\% F1 score in recognizing those three speakers in other unseen utterances. We finally show that the context encoder embeds information about the content of the speech and ignores the sound source information. Our sample code for training the model, as well as examples for using the pre-trained models are available here: \url{https://github.com/rezabonyadi/autodecompose}  ( 2 min )
    LUT-NN: Towards Unified Neural Network Inference by Table Lookup. (arXiv:2302.03213v1 [cs.LG])
    DNN inference requires huge effort of system development and resource cost. This drives us to propose LUT-NN, the first trial towards empowering deep neural network (DNN) inference by table lookup, to eliminate the diverse computation kernels as well as save running cost. Based on the feature similarity of each layer, LUT-NN can learn the typical features, named centroids, of each layer from the training data, precompute them with model weights, and save the results in tables. For future input, the results of the closest centroids with the input features can be directly read from the table, as the approximation of layer output. We propose the novel centroid learning technique for DNN, which enables centroid learning through backpropagation, and adapts three levels of approximation to minimize the model loss. By this technique, LUT-NN achieves comparable accuracy (<5% difference) with original models on real complex dataset, including CIFAR, ImageNet, and GLUE. LUT-NN simplifies the computing operators to only two: closest centroid search and table lookup. We implement them for Intel and ARM CPUs. The model size is reduced by up to 3.5x for CNN models and 7x for BERT. Latency-wise, the real speedup of LUT-NN is up to 7x for BERT and 2x for ResNet, much lower than theoretical results because of the current unfriendly hardware design for table lookup. We expect firstclass table lookup support in the future to unleash the potential of LUT-NN.  ( 2 min )
    Delving Deep into Simplicity Bias for Long-Tailed Image Recognition. (arXiv:2302.03264v1 [cs.CV])
    Simplicity Bias (SB) is a phenomenon that deep neural networks tend to rely favorably on simpler predictive patterns but ignore some complex features when applied to supervised discriminative tasks. In this work, we investigate SB in long-tailed image recognition and find the tail classes suffer more severely from SB, which harms the generalization performance of such underrepresented classes. We empirically report that self-supervised learning (SSL) can mitigate SB and perform in complementary to the supervised counterpart by enriching the features extracted from tail samples and consequently taking better advantage of such rare samples. However, standard SSL methods are designed without explicitly considering the inherent data distribution in terms of classes and may not be optimal for long-tailed distributed data. To address this limitation, we propose a novel SSL method tailored to imbalanced data. It leverages SSL by triple diverse levels, i.e., holistic-, partial-, and augmented-level, to enhance the learning of predictive complex patterns, which provides the potential to overcome the severe SB on tail data. Both quantitative and qualitative experimental results on five long-tailed benchmark datasets show our method can effectively mitigate SB and significantly outperform the competing state-of-the-arts.  ( 2 min )
    Self-learning Machines based on Hamiltonian Echo Backpropagation. (arXiv:2103.04992v2 [cs.LG] UPDATED)
    A physical self-learning machine can be defined as a nonlinear dynamical system that can be trained on data (similar to artificial neural networks), but where the update of the internal degrees of freedom that serve as learnable parameters happens autonomously. In this way, neither external processing and feedback nor knowledge of (and control of) these internal degrees of freedom is required. We introduce a general scheme for self-learning in any time-reversible Hamiltonian system. We illustrate the training of such a self-learning machine numerically for the case of coupled nonlinear wave fields.  ( 2 min )
    Near-Minimax-Optimal Risk-Sensitive Reinforcement Learning with CVaR. (arXiv:2302.03201v1 [cs.LG])
    In this paper, we study risk-sensitive Reinforcement Learning (RL), focusing on the objective of Conditional Value at Risk (CVaR) with risk tolerance $\tau$. Starting with multi-arm bandits (MABs), we show the minimax CVaR regret rate is $\Omega(\sqrt{\tau^{-1}AK})$, where $A$ is the number of actions and $K$ is the number of episodes, and that it is achieved by an Upper Confidence Bound algorithm with a novel Bernstein bonus. For online RL in tabular Markov Decision Processes (MDPs), we show a minimax regret lower bound of $\Omega(\sqrt{\tau^{-1}SAK})$ (with normalized cumulative rewards), where $S$ is the number of states, and we propose a novel bonus-driven Value Iteration procedure. We show that our algorithm achieves the optimal regret of $\widetilde O(\sqrt{\tau^{-1}SAK})$ under a continuity assumption and in general attains a near-optimal regret of $\widetilde O(\tau^{-1}\sqrt{SAK})$, which is minimax-optimal for constant $\tau$. This improves on the best available bounds. By discretizing rewards appropriately, our algorithms are computationally efficient.  ( 2 min )
    Scalable Gaussian process regression enables accurate prediction of protein and small molecule properties with uncertainty quantitation. (arXiv:2302.03294v1 [cs.LG])
    Gaussian process (GP) is a Bayesian model which provides several advantages for regression tasks in machine learning such as reliable quantitation of uncertainty and improved interpretability. Their adoption has been precluded by their excessive computational cost and by the difficulty in adapting them for analyzing sequences (e.g. amino acid and nucleotide sequences) and graphs (e.g. ones representing small molecules). In this study, we develop efficient and scalable approaches for fitting GP models as well as fast convolution kernels which scale linearly with graph or sequence size. We implement these improvements by building an open-source Python library called xGPR. We compare the performance of xGPR with the reported performance of various deep learning models on 20 benchmarks, including small molecule, protein sequence and tabular data. We show that xGRP achieves highly competitive performance with much shorter training time. Furthermore, we also develop new kernels for sequence and graph data and show that xGPR generally outperforms convolutional neural networks on predicting key properties of proteins and small molecules. Importantly, xGPR provides uncertainty information not available from typical deep learning models. Additionally, xGPR provides a representation of the input data that can be used for clustering and data visualization. These results demonstrate that xGPR provides a powerful and generic tool that can be broadly useful in protein engineering and drug discovery.  ( 2 min )
    Data Selection for Language Models via Importance Resampling. (arXiv:2302.03169v1 [cs.CL])
    Selecting a suitable training dataset is crucial for both general-domain (e.g., GPT-3) and domain-specific (e.g., Codex) language models (LMs). We formalize this data selection problem as selecting a subset of a large raw unlabeled dataset to match a desired target distribution, given some unlabeled target samples. Due to the large scale and dimensionality of the raw text data, existing methods use simple heuristics to select data that are similar to a high-quality reference corpus (e.g., Wikipedia), or leverage experts to manually curate data. Instead, we extend the classic importance resampling approach used in low-dimensions for LM data selection. Crucially, we work in a reduced feature space to make importance weight estimation tractable over the space of text. To determine an appropriate feature space, we first show that KL reduction, a data metric that measures the proximity between selected data and the target in a feature space, has high correlation with average accuracy on 8 downstream tasks (r=0.89) when computed with simple n-gram features. From this observation, we present Data Selection with Importance Resampling (DSIR), an efficient and scalable algorithm that estimates importance weights in a reduced feature space (e.g., n-gram features in our instantiation) and selects data with importance resampling according to these weights. When training general-domain models (target is Wikipedia + books), DSIR improves over random selection and heuristic filtering baselines by 2--2.5% on the GLUE benchmark. When performing continued pretraining towards a specific domain, DSIR performs comparably to expert curated data across 8 target distributions.  ( 2 min )
    On the Ideal Number of Groups for Isometric Gradient Propagation. (arXiv:2302.03193v1 [cs.LG])
    Recently, various normalization layers have been proposed to stabilize the training of deep neural networks. Among them, group normalization is a generalization of layer normalization and instance normalization by allowing a degree of freedom in the number of groups it uses. However, to determine the optimal number of groups, trial-and-error-based hyperparameter tuning is required, and such experiments are time-consuming. In this study, we discuss a reasonable method for setting the number of groups. First, we find that the number of groups influences the gradient behavior of the group normalization layer. Based on this observation, we derive the ideal number of groups, which calibrates the gradient scale to facilitate gradient descent optimization. Our proposed number of groups is theoretically grounded, architecture-aware, and can provide a proper value in a layer-wise manner for all layers. The proposed method exhibited improved performance over existing methods in numerous neural network architectures, tasks, and datasets.  ( 2 min )
    SDYN-GANs: Adversarial Learning Methods for Multistep Generative Models for General Order Stochastic Dynamics. (arXiv:2302.03663v1 [cs.LG])
    We introduce adversarial learning methods for data-driven generative modeling of the dynamics of $n^{th}$-order stochastic systems. Our approach builds on Generative Adversarial Networks (GANs) with generative model classes based on stable $m$-step stochastic numerical integrators. We introduce different formulations and training methods for learning models of stochastic dynamics based on observation of trajectory samples. We develop approaches using discriminators based on Maximum Mean Discrepancy (MMD), training protocols using conditional and marginal distributions, and methods for learning dynamic responses over different time-scales. We show how our approaches can be used for modeling physical systems to learn force-laws, damping coefficients, and noise-related parameters. The adversarial learning approaches provide methods for obtaining stable generative models for dynamic tasks including long-time prediction and developing simulations for stochastic systems.  ( 2 min )
    How Reliable is Your Regression Model's Uncertainty Under Real-World Distribution Shifts?. (arXiv:2302.03679v1 [cs.LG])
    Many important computer vision applications are naturally formulated as regression problems. Within medical imaging, accurate regression models have the potential to automate various tasks, helping to lower costs and improve patient outcomes. Such safety-critical deployment does however require reliable estimation of model uncertainty, also under the wide variety of distribution shifts that might be encountered in practice. Motivated by this, we set out to investigate the reliability of regression uncertainty estimation methods under various real-world distribution shifts. To that end, we propose an extensive benchmark of 8 image-based regression datasets with different types of challenging distribution shifts. We then employ our benchmark to evaluate many of the most common uncertainty estimation methods, as well as two state-of-the-art uncertainty scores from the task of out-of-distribution detection. We find that while methods are well calibrated when there is no distribution shift, they all become highly overconfident on many of the benchmark datasets. This uncovers important limitations of current uncertainty estimation methods, and the proposed benchmark therefore serves as a challenge to the research community. We hope that our benchmark will spur more work on how to develop truly reliable regression uncertainty estimation methods. Code is available at https://github.com/fregu856/regression_uncertainty.  ( 2 min )
    Making Intelligence: Ethical Values in IQ and ML Benchmarks. (arXiv:2209.00692v3 [cs.LG] UPDATED)
    In recent years, ML researchers have wrestled with defining and improving machine learning (ML) benchmarks and datasets. In parallel, some have trained a critical lens on the ethics of dataset creation and ML research. In this position paper, we highlight the entanglement of ethics with seemingly ``technical'' or ``scientific'' decisions about the design of ML benchmarks. Our starting point is the existence of multiple overlooked structural similarities between human intelligence benchmarks and ML benchmarks. Both types of benchmarks set standards for describing, evaluating, and comparing performance on tasks relevant to intelligence -- standards that many scholars of human intelligence have long recognized as value-laden. We use perspectives from feminist philosophy of science on IQ benchmarks and thick concepts in social science to argue that values need to be considered and documented when creating ML benchmarks. It is neither possible nor desirable to avoid this choice by creating value-neutral benchmarks. Finally, we outline practical recommendations for ML benchmark research ethics and ethics review.  ( 2 min )
    Mind the Gap! Bridging Explainable Artificial Intelligence and Human Understanding with Luhmann's Functional Theory of Communication. (arXiv:2302.03460v1 [cs.CY])
    Over the past decade explainable artificial intelligence has evolved from a predominantly technical discipline into a field that is deeply intertwined with social sciences. Insights such as human preference for contrastive -- more precisely, counterfactual -- explanations have played a major role in this transition, inspiring and guiding the research in computer science. Other observations, while equally important, have received much less attention. The desire of human explainees to communicate with artificial intelligence explainers through a dialogue-like interaction has been mostly neglected by the community. This poses many challenges for the effectiveness and widespread adoption of such technologies as delivering a single explanation optimised according to some predefined objectives may fail to engender understanding in its recipients and satisfy their unique needs given the diversity of human knowledge and intention. Using insights elaborated by Niklas Luhmann and, more recently, Elena Esposito we apply social systems theory to highlight challenges in explainable artificial intelligence and offer a path forward, striving to reinvigorate the technical research in this direction. This paper aims to demonstrate the potential of systems theoretical approaches to communication in understanding problems and limitations of explainable artificial intelligence.  ( 2 min )
    Can gamification reduce the burden of self-reporting in mHealth applications? Feasibility study using machine learning from smartwatch data to estimate cognitive load. (arXiv:2302.03616v1 [cs.LG])
    The effectiveness of digital treatments can be measured by requiring patients to self-report their mental and physical state through mobile applications. However, self-reporting can be overwhelming and may cause patients to disengage from the intervention. In order to address this issue, we conduct a feasibility study to explore the impact of gamification on the cognitive burden of self-reporting. Our approach involves the creation of a system to assess cognitive burden through the analysis of photoplethysmography (PPG) signals obtained from a smartwatch. The system is built by collecting PPG data during both cognitively demanding tasks and periods of rest. The obtained data is utilized to train a machine learning model to detect cognitive load (CL). Subsequently, we create two versions of health surveys: a gamified version and a traditional version. Our aim is to estimate the cognitive load experienced by participants while completing these surveys using their mobile devices. We find that CL detector performance can be enhanced via pre-training on stress detection tasks and requires capturing of a minimum 30 seconds of PPG signal to work adequately. For 10 out of 13 participants, a personalized cognitive load detector can achieve an F1 score above 0.7. We find no difference between the gamified and non-gamified mobile surveys in terms of time spent in the state of high cognitive load but participants prefer the gamified version. The average time spent on each question is 5.5 for gamified survey vs 6 seconds for the non-gamified version.  ( 2 min )
    On the relationship between multivariate splines and infinitely-wide neural networks. (arXiv:2302.03459v1 [cs.LG])
    We consider multivariate splines and show that they have a random feature expansion as infinitely wide neural networks with one-hidden layer and a homogeneous activation function which is the power of the rectified linear unit. We show that the associated function space is a Sobolev space on a Euclidean ball, with an explicit bound on the norms of derivatives. This link provides a new random feature expansion for multivariate splines that allow efficient algorithms. This random feature expansion is numerically better behaved than usual random Fourier features, both in theory and practice. In particular, in dimension one, we compare the associated leverage scores to compare the two random expansions and show a better scaling for the neural network expansion.  ( 2 min )
    Temporal Robustness against Data Poisoning. (arXiv:2302.03684v1 [cs.LG])
    Data poisoning considers cases when an adversary maliciously inserts and removes training data to manipulate the behavior of machine learning algorithms. Traditional threat models of data poisoning center around a single metric, the number of poisoned samples. In consequence, existing defenses are essentially vulnerable in practice when poisoning more samples remains a feasible option for attackers. To address this issue, we leverage timestamps denoting the birth dates of data, which are often available but neglected in the past. Benefiting from these timestamps, we propose a temporal threat model of data poisoning and derive two novel metrics, earliness and duration, which respectively measure how long an attack started in advance and how long an attack lasted. With these metrics, we define the notions of temporal robustness against data poisoning, providing a meaningful sense of protection even with unbounded amounts of poisoned samples. We present a benchmark with an evaluation protocol simulating continuous data collection and periodic deployments of updated models, thus enabling empirical evaluation of temporal robustness. Lastly, we develop and also empirically verify a baseline defense, namely temporal aggregation, offering provable temporal robustness and highlighting the potential of our temporal modeling of data poisoning.  ( 2 min )
    Averaged Method of Multipliers for Bi-Level Optimization without Lower-Level Strong Convexity. (arXiv:2302.03407v1 [math.OC])
    Gradient methods have become mainstream techniques for Bi-Level Optimization (BLO) in learning fields. The validity of existing works heavily rely on either a restrictive Lower- Level Strong Convexity (LLSC) condition or on solving a series of approximation subproblems with high accuracy or both. In this work, by averaging the upper and lower level objectives, we propose a single loop Bi-level Averaged Method of Multipliers (sl-BAMM) for BLO that is simple yet efficient for large-scale BLO and gets rid of the limited LLSC restriction. We further provide non-asymptotic convergence analysis of sl-BAMM towards KKT stationary points, and the comparative advantage of our analysis lies in the absence of strong gradient boundedness assumption, which is always required by others. Thus our theory safely captures a wider variety of applications in deep learning, especially where the upper-level objective is quadratic w.r.t. the lower-level variable. Experimental results demonstrate the superiority of our method.  ( 2 min )
    Natural Language Processing for Policymaking. (arXiv:2302.03490v1 [cs.CL])
    Language is the medium for many political activities, from campaigns to news reports. Natural language processing (NLP) uses computational tools to parse text into key information that is needed for policymaking. In this chapter, we introduce common methods of NLP, including text classification, topic modeling, event extraction, and text scaling. We then overview how these methods can be used for policymaking through four major applications including data collection for evidence-based policymaking, interpretation of political decisions, policy communication, and investigation of policy effects. Finally, we highlight some potential limitations and ethical concerns when using NLP for policymaking. This text is from Chapter 7 (pages 141-162) of the Handbook of Computational Social Science for Policy (2023). Open Access on Springer: https://doi.org/10.1007/978-3-031-16624-2
    Efficient XAI Techniques: A Taxonomic Survey. (arXiv:2302.03225v1 [cs.LG])
    Recently, there has been a growing demand for the deployment of Explainable Artificial Intelligence (XAI) algorithms in real-world applications. However, traditional XAI methods typically suffer from a high computational complexity problem, which discourages the deployment of real-time systems to meet the time-demanding requirements of real-world scenarios. Although many approaches have been proposed to improve the efficiency of XAI methods, a comprehensive understanding of the achievements and challenges is still needed. To this end, in this paper we provide a review of efficient XAI. Specifically, we categorize existing techniques of XAI acceleration into efficient non-amortized and efficient amortized methods. The efficient non-amortized methods focus on data-centric or model-centric acceleration upon each individual instance. In contrast, amortized methods focus on learning a unified distribution of model explanations, following the predictive, generative, or reinforcement frameworks, to rapidly derive multiple model explanations. We also analyze the limitations of an efficient XAI pipeline from the perspectives of the training phase, the deployment phase, and the use scenarios. Finally, we summarize the challenges of deploying XAI acceleration methods to real-world scenarios, overcoming the trade-off between faithfulness and efficiency, and the selection of different acceleration methods.
    An End-to-End Two-Phase Deep Learning-Based workflow to Segment Man-made Objects Around Reservoirs. (arXiv:2302.03282v1 [cs.CV])
    Reservoirs are fundamental infrastructures for the management of water resources. Constructions around them can negatively impact their quality. Such unauthorized constructions can be monitored by land cover mapping (LCM) remote sensing (RS) images. In this paper, we develop a new approach based on DL and image processing techniques for man-made object segmentation around the reservoirs. In order to segment man-made objects around the reservoirs in an end-to-end procedure, segmenting reservoirs and identifying the region of interest (RoI) around them are essential. In the proposed two-phase workflow, the reservoir is initially segmented using a DL model. A post-processing stage is proposed to remove errors such as floating vegetation. Next, the RoI around the reservoir (RoIaR) is identified using the proposed image processing techniques. Finally, the man-made objects in the RoIaR are segmented using a DL architecture. We trained the proposed workflow using collected Google Earth (GE) images of eight reservoirs in Brazil over two different years. The U-Net-based and SegNet-based architectures are trained to segment the reservoirs. To segment man-made objects in the RoIaR, we trained and evaluated four possible architectures, U-Net, FPN, LinkNet, and PSPNet. Although the collected data has a high diversity (for example, they belong to different states, seasons, resolutions, etc.), we achieved good performances in both phases. Furthermore, applying the proposed post-processing to the output of reservoir segmentation improves the precision in all studied reservoirs except two cases. We validated the prepared workflow with a reservoir dataset outside the training reservoirs. The results show high generalization ability of the prepared workflow.  ( 2 min )
    Unsupervised Deep Learning for IoT Time Series. (arXiv:2302.03284v1 [cs.LG])
    IoT time series analysis has found numerous applications in a wide variety of areas, ranging from health informatics to network security. Nevertheless, the complex spatial temporal dynamics and high dimensionality of IoT time series make the analysis increasingly challenging. In recent years, the powerful feature extraction and representation learning capabilities of deep learning (DL) have provided an effective means for IoT time series analysis. However, few existing surveys on time series have systematically discussed unsupervised DL-based methods. To fill this void, we investigate unsupervised deep learning for IoT time series, i.e., unsupervised anomaly detection and clustering, under a unified framework. We also discuss the application scenarios, public datasets, existing challenges, and future research directions in this area.  ( 2 min )
    IB-UQ: Information bottleneck based uncertainty quantification for neural function regression and neural operator learning. (arXiv:2302.03271v1 [math.NA])
    In this paper, a novel framework is established for uncertainty quantification via information bottleneck (IB-UQ) for scientific machine learning tasks, including deep neural network (DNN) regression and neural operator learning (DeepONet). Specifically, we first employ the General Incompressible-Flow Networks (GIN) model to learn a "wide" distribution fromnoisy observation data. Then, following the information bottleneck objective, we learn a stochastic map from input to some latent representation that can be used to predict the output. A tractable variational bound on the IB objective is constructed with a normalizing flow reparameterization. Hence, we can optimize the objective using the stochastic gradient descent method. IB-UQ can provide both mean and variance in the label prediction by explicitly modeling the representation variables. Compared to most DNN regression methods and the deterministic DeepONet, the proposed model can be trained on noisy data and provide accurate predictions with reliable uncertainty estimates on unseen noisy data. We demonstrate the capability of the proposed IB-UQ framework via several representative examples, including discontinuous function regression, real-world dataset regression and learning nonlinear operators for diffusion-reaction partial differential equation.  ( 2 min )
    Utility-based Perturbed Gradient Descent: An Optimizer for Continual Learning. (arXiv:2302.03281v1 [cs.LG])
    Modern representation learning methods may fail to adapt quickly under non-stationarity since they suffer from the problem of catastrophic forgetting and decaying plasticity. Such problems prevent learners from fast adaptation to changes since they result in increasing numbers of saturated features and forgetting useful features when presented with new experiences. Hence, these methods are rendered ineffective for continual learning. This paper proposes Utility-based Perturbed Gradient Descent (UPGD), an online representation-learning algorithm well-suited for continual learning agents with no knowledge about task boundaries. UPGD protects useful weights or features from forgetting and perturbs less useful ones based on their utilities. Our empirical results show that UPGD alleviates catastrophic forgetting and decaying plasticity, enabling modern representation learning methods to work in the continual learning setting.  ( 2 min )
    Learning to Count Isomorphisms with Graph Neural Networks. (arXiv:2302.03266v1 [cs.LG])
    Subgraph isomorphism counting is an important problem on graphs, as many graph-based tasks exploit recurring subgraph patterns. Classical methods usually boil down to a backtracking framework that needs to navigate a huge search space with prohibitive computational costs. Some recent studies resort to graph neural networks (GNNs) to learn a low-dimensional representation for both the query and input graphs, in order to predict the number of subgraph isomorphisms on the input graph. However, typical GNNs employ a node-centric message passing scheme that receives and aggregates messages on nodes, which is inadequate in complex structure matching for isomorphism counting. Moreover, on an input graph, the space of possible query graphs is enormous, and different parts of the input graph will be triggered to match different queries. Thus, expecting a fixed representation of the input graph to match diversely structured query graphs is unrealistic. In this paper, we propose a novel GNN called Count-GNN for subgraph isomorphism counting, to deal with the above challenges. At the edge level, given that an edge is an atomic unit of encoding graph structures, we propose an edge-centric message passing scheme, where messages on edges are propagated and aggregated based on the edge adjacency to preserve fine-grained structural information. At the graph level, we modulate the input graph representation conditioned on the query, so that the input graph can be adapted to each query individually to improve their matching. Finally, we conduct extensive experiments on a number of benchmark datasets to demonstrate the superior performance of Count-GNN.  ( 2 min )
    Climate Intervention Analysis using AI Model Guided by Statistical Physics Principles. (arXiv:2302.03258v1 [cs.LG])
    The availability of training data remains a significant obstacle for the implementation of machine learning in scientific applications. In particular, estimating how a system might respond to external forcings or perturbations requires specialized labeled data or targeted simulations, which may be computationally intensive to generate at scale. In this study, we propose a novel solution to this challenge by utilizing a principle from statistical physics known as the Fluctuation-Dissipation Theorem (FDT) to discover knowledge using an AI model that can rapidly produce scenarios for different external forcings. By leveraging FDT, we are able to extract information encoded in a large dataset produced by Earth System Models, which includes 8250 years of internal climate fluctuations, to estimate the climate system's response to forcings. Our model, AiBEDO, is capable of capturing the complex, multi-timescale effects of radiation perturbations on global and regional surface climate, allowing for a substantial acceleration of the exploration of the impacts of spatially-heterogenous climate forcers. To demonstrate the utility of AiBEDO, we use the example of a climate intervention technique called Marine Cloud Brightening, with the ultimate goal of optimizing the spatial pattern of cloud brightening to achieve regional climate targets and prevent known climate tipping points. While we showcase the effectiveness of our approach in the context of climate science, it is generally applicable to other scientific disciplines that are limited by the extensive computational demands of domain simulation models. Source code of AiBEDO framework is made available at https://github.com/kramea/kdd_aibedo. A sample dataset is made available at https://doi.org/10.5281/zenodo.7597027. Additional data available upon request.  ( 2 min )
    ClueGAIN: Application of Transfer Learning On Generative Adversarial Imputation Nets (GAIN). (arXiv:2302.03140v1 [cs.LG])
    Many studies have attempted to solve the problem of missing data using various approaches. Among them, Generative Adversarial Imputation Nets (GAIN) was first used to impute data with Generative Adversarial Nets (GAN) and good results were obtained. Subsequent studies have attempted to combine various approaches to address some of its limitations. ClueGAIN is first proposed in this study, which introduces transfer learning into GAIN to solve the problem of poor imputation performance in high missing rate data sets. ClueGAIN can also be used to measure the similarity between data sets to explore their potential connections.  ( 2 min )
    Spatial Functa: Scaling Functa to ImageNet Classification and Generation. (arXiv:2302.03130v1 [cs.LG])
    Neural fields, also known as implicit neural representations, have emerged as a powerful means to represent complex signals of various modalities. Based on this Dupont et al. (2022) introduce a framework that views neural fields as data, termed *functa*, and proposes to do deep learning directly on this dataset of neural fields. In this work, we show that the proposed framework faces limitations when scaling up to even moderately complex datasets such as CIFAR-10. We then propose *spatial functa*, which overcome these limitations by using spatially arranged latent representations of neural fields, thereby allowing us to scale up the approach to ImageNet-1k at 256x256 resolution. We demonstrate competitive performance to Vision Transformers (Steiner et al., 2022) on classification and Latent Diffusion (Rombach et al., 2022) on image generation respectively.  ( 2 min )
    CDANs: Temporal Causal Discovery from Autocorrelated and Non-Stationary Time Series Data. (arXiv:2302.03246v1 [cs.LG])
    This study presents a novel constraint-based causal discovery approach for autocorrelated and non-stationary time series data (CDANs). Our proposed method addresses several limitations of existing causal discovery methods for autocorrelated and non-stationary time series data, such as high dimensionality, the inability to identify lagged causal relationships, and the overlook of changing modules. Our approach identifies both lagged and instantaneous/contemporaneous causal relationships along with changing modules that vary over time. The method optimizes the conditioning sets in a constraint-based search by considering lagged parents instead of conditioning on the entire past that addresses high dimensionality. The changing modules are detected by considering both contemporaneous and lagged parents. The approach first detects the lagged adjacencies, then identifies the changing modules and contemporaneous adjacencies, and finally determines the causal direction. We extensively evaluated the proposed method using synthetic datasets and a real-world clinical dataset and compared its performance with several baseline approaches. The results demonstrate the effectiveness of the proposed method in detecting causal relationships and changing modules in autocorrelated and non-stationary time series data.  ( 2 min )
    Domain Adaptation for Time Series Under Feature and Label Shifts. (arXiv:2302.03133v1 [cs.LG])
    The transfer of models trained on labeled datasets in a source domain to unlabeled target domains is made possible by unsupervised domain adaptation (UDA). However, when dealing with complex time series models, the transferability becomes challenging due to the dynamic temporal structure that varies between domains, resulting in feature shifts and gaps in the time and frequency representations. Furthermore, tasks in the source and target domains can have vastly different label distributions, making it difficult for UDA to mitigate label shifts and recognize labels that only exist in the target domain. We present RAINCOAT, the first model for both closed-set and universal DA on complex time series. RAINCOAT addresses feature and label shifts by considering both temporal and frequency features, aligning them across domains, and correcting for misalignments to facilitate the detection of private labels. Additionally,RAINCOAT improves transferability by identifying label shifts in target domains. Our experiments with 5 datasets and 13 state-of-the-art UDA methods demonstrate that RAINCOAT can achieve an improvement in performance of up to 16.33%, and can effectively handle both closed-set and universal adaptation.  ( 2 min )
    Quantum Recurrent Neural Networks for Sequential Learning. (arXiv:2302.03244v1 [quant-ph])
    Quantum neural network (QNN) is one of the promising directions where the near-term noisy intermediate-scale quantum (NISQ) devices could find advantageous applications against classical resources. Recurrent neural networks are the most fundamental networks for sequential learning, but up to now there is still a lack of canonical model of quantum recurrent neural network (QRNN), which certainly restricts the research in the field of quantum deep learning. In the present work, we propose a new kind of QRNN which would be a good candidate as the canonical QRNN model, where, the quantum recurrent blocks (QRBs) are constructed in the hardware-efficient way, and the QRNN is built by stacking the QRBs in a staggered way that can greatly reduce the algorithm's requirement with regard to the coherent time of quantum devices. That is, our QRNN is much more accessible on NISQ devices. Furthermore, the performance of the present QRNN model is verified concretely using three different kinds of classical sequential data, i.e., meteorological indicators, stock price, and text categorization. The numerical experiments show that our QRNN achieves much better performance in prediction (classification) accuracy against the classical RNN and state-of-the-art QNN models for sequential learning, and can predict the changing details of temporal sequence data. The practical circuit structure and superior performance indicate that the present QRNN is a promising learning model to find quantum advantageous applications in the near term.  ( 2 min )
    DivBO: Diversity-aware CASH for Ensemble Learning. (arXiv:2302.03255v1 [cs.LG])
    The Combined Algorithm Selection and Hyperparameters optimization (CASH) problem is one of the fundamental problems in Automated Machine Learning (AutoML). Motivated by the success of ensemble learning, recent AutoML systems build post-hoc ensembles to output the final predictions instead of using the best single learner. However, while most CASH methods focus on searching for a single learner with the best performance, they neglect the diversity among base learners (i.e., they may suggest similar configurations to previously evaluated ones), which is also a crucial consideration when building an ensemble. To tackle this issue and further enhance the ensemble performance, we propose DivBO, a diversity-aware framework to inject explicit search of diversity into the CASH problems. In the framework, we propose to use a diversity surrogate to predict the pair-wise diversity of two unseen configurations. Furthermore, we introduce a temporary pool and a weighted acquisition function to guide the search of both performance and diversity based on Bayesian optimization. Empirical results on 15 public datasets show that DivBO achieves the best average ranks (1.82 and 1.73) on both validation and test errors among 10 compared methods, including post-hoc designs in recent AutoML systems and state-of-the-art baselines for ensemble learning on CASH problems.  ( 2 min )
    Exact Inference in High-order Structured Prediction. (arXiv:2302.03236v1 [cs.LG])
    In this paper, we study the problem of inference in high-order structured prediction tasks. In the context of Markov random fields, the goal of a high-order inference task is to maximize a score function on the space of labels, and the score function can be decomposed into sum of unary and high-order potentials. We apply a generative model approach to study the problem of high-order inference, and provide a two-stage convex optimization algorithm for exact label recovery. We also provide a new class of hypergraph structural properties related to hyperedge expansion that drives the success in general high-order inference problems. Finally, we connect the performance of our algorithm and the hyperedge expansion property using a novel hypergraph Cheeger-type inequality.  ( 2 min )
    Continual Learning of Language Models. (arXiv:2302.03241v1 [cs.CL])
    Language models (LMs) have been instrumental for the rapid advance of natural language processing. This paper studies continual learning of LMs, in particular, continual domain-adaptive pre-training (or continual DAP-training). Existing research has shown that further pre-training an LM using a domain corpus to adapt the LM to the domain can improve the end-task performance in the domain. This paper proposes a novel method to continually DAP-train an LM with a sequence of unlabeled domain corpora to adapt the LM to these domains to improve their end-task performances. The key novelty of our method is a soft-masking mechanism that directly controls the update to the LM. A novel proxy is also proposed to preserve the general knowledge in the original LM. Additionally, it contrasts the representations of the previously learned domain knowledge (including the general knowledge in the pre-trained LM) and the knowledge from the current full network to achieve knowledge integration. The method not only overcomes catastrophic forgetting, but also achieves knowledge transfer to improve end-task performances. Empirical evaluation demonstrates the effectiveness of the proposed method.  ( 2 min )
    Easy Learning from Label Proportions. (arXiv:2302.03115v1 [cs.LG])
    We consider the problem of Learning from Label Proportions (LLP), a weakly supervised classification setup where instances are grouped into "bags", and only the frequency of class labels at each bag is available. Albeit, the objective of the learner is to achieve low task loss at an individual instance level. Here we propose Easyllp: a flexible and simple-to-implement debiasing approach based on aggregate labels, which operates on arbitrary loss functions. Our technique allows us to accurately estimate the expected loss of an arbitrary model at an individual level. We showcase the flexibility of our approach by applying it to popular learning frameworks, like Empirical Risk Minimization (ERM) and Stochastic Gradient Descent (SGD) with provable guarantees on instance level performance. More concretely, we exhibit a variance reduction technique that makes the quality of LLP learning deteriorate only by a factor of k (k being bag size) in both ERM and SGD setups, as compared to full supervision. Finally, we validate our theoretical results on multiple datasets demonstrating our algorithm performs as well or better than previous LLP approaches in spite of its simplicity.  ( 2 min )
    Linear optimal partial transport embedding. (arXiv:2302.03232v1 [cs.LG])
    Optimal transport (OT) has gained popularity due to its various applications in fields such as machine learning, statistics, and signal processing. However, the balanced mass requirement limits its performance in practical problems. To address these limitations, variants of the OT problem, including unbalanced OT, Optimal partial transport (OPT), and Hellinger Kantorovich (HK), have been proposed. In this paper, we propose the Linear optimal partial transport (LOPT) embedding, which extends the (local) linearization technique on OT and HK to the OPT problem. The proposed embedding allows for faster computation of OPT distance between pairs of positive measures. Besides our theoretical contributions, we demonstrate the LOPT embedding technique in point-cloud interpolation and PCA analysis.  ( 2 min )
    Genetic Programming Based Symbolic Regression for Analytical Solutions to Differential Equations. (arXiv:2302.03175v1 [cs.LG])
    In this paper, we present a machine learning method for the discovery of analytic solutions to differential equations. The method utilizes an inherently interpretable algorithm, genetic programming based symbolic regression. Unlike conventional accuracy measures in machine learning we demonstrate the ability to recover true analytic solutions, as opposed to a numerical approximation. The method is verified by assessing its ability to recover known analytic solutions for two separate differential equations. The developed method is compared to a conventional, purely data-driven genetic programming based symbolic regression algorithm. The reliability of successful evolution of the true solution, or an algebraic equivalent, is demonstrated.  ( 2 min )
    Towards Lightweight Cross-domain Sequential Recommendation via External Attention-enhanced Graph Convolution Network. (arXiv:2302.03221v1 [cs.IR])
    Cross-domain Sequential Recommendation (CSR) is an emerging yet challenging task that depicts the evolution of behavior patterns for overlapped users by modeling their interactions from multiple domains. Existing studies on CSR mainly focus on using composite or in-depth structures that achieve significant improvement in accuracy but bring a huge burden to the model training. Moreover, to learn the user-specific sequence representations, existing works usually adopt the global relevance weighting strategy (e.g., self-attention mechanism), which has quadratic computational complexity. In this work, we introduce a lightweight external attention-enhanced GCN-based framework to solve the above challenges, namely LEA-GCN. Specifically, by only keeping the neighborhood aggregation component and using the Single-Layer Aggregating Protocol (SLAP), our lightweight GCN encoder performs more efficiently to capture the collaborative filtering signals of the items from both domains. To further alleviate the framework structure and aggregate the user-specific sequential pattern, we devise a novel dual-channel External Attention (EA) component, which calculates the correlation among all items via a lightweight linear structure. Extensive experiments are conducted on two real-world datasets, demonstrating that LEA-GCN requires a smaller volume and less training time without affecting the accuracy compared with several state-of-the-art methods.  ( 2 min )
    Optimization using Parallel Gradient Evaluations on Multiple Parameters. (arXiv:2302.03161v1 [cs.LG])
    We propose a first-order method for convex optimization, where instead of being restricted to the gradient from a single parameter, gradients from multiple parameters can be used during each step of gradient descent. This setup is particularly useful when a few processors are available that can be used in parallel for optimization. Our method uses gradients from multiple parameters in synergy to update these parameters together towards the optima. While doing so, it is ensured that the computational and memory complexity is of the same order as that of gradient descent. Empirical results demonstrate that even using gradients from as low as \textit{two} parameters, our method can often obtain significant acceleration and provide robustness to hyper-parameter settings. We remark that the primary goal of this work is less theoretical, and is instead aimed at exploring the understudied case of using multiple gradients during each step of optimization.  ( 2 min )
    Protecting Language Generation Models via Invisible Watermarking. (arXiv:2302.03162v1 [cs.CR])
    Language generation models have been an increasingly powerful enabler for many applications. Many such models offer free or affordable API access, which makes them potentially vulnerable to model extraction attacks through distillation. To protect intellectual property (IP) and ensure fair use of these models, various techniques such as lexical watermarking and synonym replacement have been proposed. However, these methods can be nullified by obvious countermeasures such as "synonym randomization". To address this issue, we propose GINSEW, a novel method to protect text generation models from being stolen through distillation. The key idea of our method is to inject secret signals into the probability vector of the decoding steps for each target token. We can then detect the secret message by probing a suspect model to tell if it is distilled from the protected one. Experimental results show that GINSEW can effectively identify instances of IP infringement with minimal impact on the generation quality of protected APIs. Our method demonstrates an absolute improvement of 19 to 29 points on mean average precision (mAP) in detecting suspects compared to previous methods against watermark removal attacks.  ( 2 min )
    On the Convergence of Federated Averaging with Cyclic Client Participation. (arXiv:2302.03109v1 [cs.LG])
    Federated Averaging (FedAvg) and its variants are the most popular optimization algorithms in federated learning (FL). Previous convergence analyses of FedAvg either assume full client participation or partial client participation where the clients can be uniformly sampled. However, in practical cross-device FL systems, only a subset of clients that satisfy local criteria such as battery status, network connectivity, and maximum participation frequency requirements (to ensure privacy) are available for training at a given time. As a result, client availability follows a natural cyclic pattern. We provide (to our knowledge) the first theoretical framework to analyze the convergence of FedAvg with cyclic client participation with several different client optimizers such as GD, SGD, and shuffled SGD. Our analysis discovers that cyclic client participation can achieve a faster asymptotic convergence rate than vanilla FedAvg with uniform client participation under suitable conditions, providing valuable insights into the design of client sampling protocols.  ( 2 min )
    Exemplars and Counterexemplars Explanations for Image Classifiers, Targeting Skin Lesion Labeling. (arXiv:2302.03033v1 [eess.IV])
    Explainable AI consists in developing mechanisms allowing for an interaction between decision systems and humans by making the decisions of the formers understandable. This is particularly important in sensitive contexts like in the medical domain. We propose a use case study, for skin lesion diagnosis, illustrating how it is possible to provide the practitioner with explanations on the decisions of a state of the art deep neural network classifier trained to characterize skin lesions from examples. Our framework consists of a trained classifier onto which an explanation module operates. The latter is able to offer the practitioner exemplars and counterexemplars for the classification diagnosis thus allowing the physician to interact with the automatic diagnosis system. The exemplars are generated via an adversarial autoencoder. We illustrate the behavior of the system on representative examples.  ( 2 min )
    Learned Accelerator Framework for Angular-Distance-Based High-Dimensional DBSCAN. (arXiv:2302.03136v1 [cs.IR])
    Density-based clustering is a commonly used tool in data science. Today many data science works are utilizing high-dimensional neural embeddings. However, traditional density-based clustering techniques like DBSCAN have a degraded performance on high-dimensional data. In this paper, we propose LAF, a generic learned accelerator framework to speed up the original DBSCAN and the sampling-based variants of DBSCAN on high-dimensional data with angular distance metric. This framework consists of a learned cardinality estimator and a post-processing module. The cardinality estimator can fast predict whether a data point is core or not to skip unnecessary range queries, while the post-processing module detects the false negative predictions and merges the falsely separated clusters. The evaluation shows our LAF-enhanced DBSCAN method outperforms the state-of-the-art efficient DBSCAN variants on both efficiency and quality.  ( 2 min )
    Fair Minimum Representation Clustering. (arXiv:2302.03151v1 [cs.LG])
    Clustering is an unsupervised learning task that aims to partition data into a set of clusters. In many applications, these clusters correspond to real-world constructs (e.g. electoral districts) whose benefit can only be attained by groups when they reach a minimum level of representation (e.g. 50\% to elect their desired candidate). This paper considers the problem of performing k-means clustering while ensuring groups (e.g. demographic groups) have that minimum level of representation in a specified number of clusters. We show that the popular $k$-means algorithm, Lloyd's algorithm, can result in unfair outcomes where certain groups lack sufficient representation past the minimum threshold in a proportional number of clusters. We formulate the problem through a mixed-integer optimization framework and present a variant of Lloyd's algorithm, called MiniReL, that directly incorporates the fairness constraints. We show that incorporating the fairness criteria leads to a NP-Hard sub-problem within Lloyd's algorithm, but we provide computational approaches that make the problem tractable for even large datasets. Numerical results show that the approach is able to create fairer clusters with practically no increase in the k-means clustering cost across standard benchmark datasets.  ( 2 min )
    Predicting Development of Chronic Obstructive Pulmonary Disease and its Risk Factor Analysis. (arXiv:2302.03137v1 [q-bio.QM])
    Chronic Obstructive Pulmonary Disease (COPD) is an irreversible airway obstruction with a high societal burden. Although smoking is known to be the biggest risk factor, additional components need to be considered. In this study, we aim to identify COPD risk factors by applying machine learning models that integrate sociodemographic, clinical, and genetic data to predict COPD development.  ( 2 min )
    State-wise Safe Reinforcement Learning: A Survey. (arXiv:2302.03122v1 [cs.LG])
    Despite the tremendous success of Reinforcement Learning (RL) algorithms in simulation environments, applying RL to real-world applications still faces many challenges. A major concern is safety, in another word, constraint satisfaction. State-wise constraints are one of the most common constraints in real-world applications and one of the most challenging constraints in Safe RL. Enforcing state-wise constraints is necessary and essential to many challenging tasks such as autonomous driving, robot manipulation. This paper provides a comprehensive review of existing approaches that address state-wise constraints in RL. Under the framework of State-wise Constrained Markov Decision Process (SCMDP), we will discuss the connections, differences, and trade-offs of existing approaches in terms of (i) safety guarantee and scalability, (ii) safety and reward performance, and (iii) safety after convergence and during training. We also summarize limitations of current methods and discuss potential future directions.  ( 2 min )
    Importance attribution in neural networks by means of persistence landscapes of time series. (arXiv:2302.03132v1 [cs.LG])
    We propose and implement a method to analyze time series with a neural network using a matrix of area-normalized persistence landscapes obtained through topological data analysis. We include a gating layer in the network's architecture that is able to identify the most relevant landscape levels for the classification task, thus working as an importance attribution system. Next, we perform a matching between the selected landscape functions and the corresponding critical points of the original time series. From this matching we are able to reconstruct an approximate shape of the time series that gives insight into the classification decision. We test this technique with input data from a dataset of electrocardiographic signals.  ( 2 min )
    Efficient and Flexible Topic Modeling using Pretrained Embeddings and Bag of Sentences. (arXiv:2302.03106v1 [cs.CL])
    Pre-trained language models have led to a new state-of-the-art in many NLP tasks. However, for topic modeling, statistical generative models such as LDA are still prevalent, which do not easily allow incorporating contextual word vectors. They might yield topics that do not align very well with human judgment. In this work, we propose a novel topic modeling and inference algorithm. We suggest a bag of sentences (BoS) approach using sentences as the unit of analysis. We leverage pre-trained sentence embeddings by combining generative process models with clustering. We derive a fast inference algorithm based on expectation maximization, hard assignments, and an annealing process. Our evaluation shows that our method yields state-of-the art results with relatively little computational demands. Our methods is more flexible compared to prior works leveraging word embeddings, since it provides the possibility to customize topic-document distributions using priors. Code is at \url{https://github.com/JohnTailor/BertSenClu}.  ( 2 min )
    One-shot Empirical Privacy Estimation for Federated Learning. (arXiv:2302.03098v1 [cs.LG])
    Privacy auditing techniques for differentially private (DP) algorithms are useful for estimating the privacy loss to compare against analytical bounds, or empirically measure privacy in settings where known analytical bounds on the DP loss are not tight. However, existing privacy auditing techniques usually make strong assumptions on the adversary (e.g., knowledge of intermediate model iterates or the training data distribution), are tailored to specific tasks and model architectures, and require retraining the model many times (typically on the order of thousands). These shortcomings make deploying such techniques at scale difficult in practice, especially in federated settings where model training can take days or weeks. In this work, we present a novel "one-shot" approach that can systematically address these challenges, allowing efficient auditing or estimation of the privacy loss of a model during the same, single training run used to fit model parameters. Our privacy auditing method for federated learning does not require a priori knowledge about the model architecture or task. We show that our method provides provably correct estimates for privacy loss under the Gaussian mechanism, and we demonstrate its performance on a well-established FL benchmark dataset under several adversarial models.  ( 2 min )
    Memory-Based Meta-Learning on Non-Stationary Distributions. (arXiv:2302.03067v1 [cs.LG])
    Memory-based meta-learning is a technique for approximating Bayes-optimal predictors. Under fairly general conditions, minimizing sequential prediction error, measured by the log loss, leads to implicit meta-learning. The goal of this work is to investigate how far this interpretation can be realized by current sequence prediction models and training regimes. The focus is on piecewise stationary sources with unobserved switching-points, which arguably capture an important characteristic of natural language and action-observation sequences in partially observable environments. We show that various types of memory-based neural models, including Transformers, LSTMs, and RNNs can learn to accurately approximate known Bayes-optimal algorithms and behave as if performing Bayesian inference over the latent switching-points and the latent parameters governing the data distribution within each segment.  ( 2 min )
    DITTO: Offline Imitation Learning with World Models. (arXiv:2302.03086v1 [cs.LG])
    We propose DITTO, an offline imitation learning algorithm which uses world models and on-policy reinforcement learning to addresses the problem of covariate shift, without access to an oracle or any additional online interactions. We discuss how world models enable offline, on-policy imitation learning, and propose a simple intrinsic reward defined in the world model latent space that induces imitation learning by reinforcement learning. Theoretically, we show that our formulation induces a divergence bound between expert and learner, in turn bounding the difference in reward. We test our method on difficult Atari environments from pixels alone, and achieve state-of-the-art performance in the offline setting.  ( 2 min )
    Five policy uses of algorithmic explainability. (arXiv:2302.03080v1 [cs.LG])
    The notion that algorithmic systems should be "explainable" is common in the many statements of consensus principles developed by governments, companies, and advocacy organizations. But what exactly do these policy and legal actors want from explainability, and how do their desiderata compare with the explainability techniques developed in the machine learning literature? We explore this question in hopes of better connecting the policy and technical communities. We outline five settings in which policymakers seek to use explainability: complying with specific requirements for explanation; helping to obtain regulatory approval in highly regulated settings; enabling or interfacing with liability; flexibly managing risk as part of a self-regulatory process; and providing model and data transparency. We illustrate each setting with an in-depth case study contextualizing the purpose and role of explanation. Drawing on these case studies, we discuss common factors limiting policymakers' use of explanation and promising ways in which explanation can be used in policy. We conclude with recommendations for researchers and policymakers.  ( 2 min )
    Evaluating Self-Supervised Learning via Risk Decomposition. (arXiv:2302.03068v1 [cs.LG])
    Self-supervised learning (SSL) pipelines differ in many design choices such as the architecture, augmentations, or pretraining data. Yet SSL is typically evaluated using a single metric: linear probing on ImageNet. This does not provide much insight into why or when a model is better, now how to improve it. To address this, we propose an SSL risk decomposition, which generalizes the classical supervised approximation-estimation decomposition by considering errors arising from the representation learning step. Our decomposition consists of four error components: approximation, representation usability, probe generalization, and encoder generalization. We provide efficient estimators for each component and use them to analyze the effect of 30 design choices on 169 SSL vision models evaluated on ImageNet. Our analysis gives valuable insights for designing and using SSL models. For example, it highlights the main sources of error and shows how to improve SSL in specific settings (full- vs few-shot) by trading off error components. All results and pretrained models are at https://github.com/YannDubs/SSL-Risk-Decomposition.  ( 2 min )
    Single Cells Are Spatial Tokens: Transformers for Spatial Transcriptomic Data Imputation. (arXiv:2302.03038v1 [q-bio.GN])
    Spatially resolved transcriptomics brings exciting breakthroughs to single-cell analysis by providing physical locations along with gene expression. However, as a cost of the extremely high spatial resolution, the cellular level spatial transcriptomic data suffer significantly from missing values. While a standard solution is to perform imputation on the missing values, most existing methods either overlook spatial information or only incorporate localized spatial context without the ability to capture long-range spatial information. Using multi-head self-attention mechanisms and positional encoding, transformer models can readily grasp the relationship between tokens and encode location information. In this paper, by treating single cells as spatial tokens, we study how to leverage transformers to facilitate spatial tanscriptomics imputation. In particular, investigate the following two key questions: (1) $\textit{how to encode spatial information of cells in transformers}$, and (2) $\textit{ how to train a transformer for transcriptomic imputation}$. By answering these two questions, we present a transformer-based imputation framework, SpaFormer, for cellular-level spatial transcriptomic data. Extensive experiments demonstrate that SpaFormer outperforms existing state-of-the-art imputation algorithms on three large-scale datasets.  ( 2 min )
    LiteVR: Interpretable and Lightweight Cybersickness Detection using Explainable AI. (arXiv:2302.03037v1 [cs.HC])
    Cybersickness is a common ailment associated with virtual reality (VR) user experiences. Several automated methods exist based on machine learning (ML) and deep learning (DL) to detect cybersickness. However, most of these cybersickness detection methods are perceived as computationally intensive and black-box methods. Thus, those techniques are neither trustworthy nor practical for deploying on standalone energy-constrained VR head-mounted devices (HMDs). In this work, we present an explainable artificial intelligence (XAI)-based framework, LiteVR, for cybersickness detection, explaining the model's outcome and reducing the feature dimensions and overall computational costs. First, we develop three cybersickness DL models based on long-term short-term memory (LSTM), gated recurrent unit (GRU), and multilayer perceptron (MLP). Then, we employed a post-hoc explanation, such as SHapley Additive Explanations (SHAP), to explain the results and extract the most dominant features of cybersickness. Finally, we retrain the DL models with the reduced number of features. Our results show that eye-tracking features are the most dominant for cybersickness detection. Furthermore, based on the XAI-based feature ranking and dimensionality reduction, we significantly reduce the model's size by up to 4.3x, training time by up to 5.6x, and its inference time by up to 3.8x, with higher cybersickness detection accuracy and low regression error (i.e., on Fast Motion Scale (FMS)). Our proposed lite LSTM model obtained an accuracy of 94% in classifying cybersickness and regressing (i.e., FMS 1-10) with a Root Mean Square Error (RMSE) of 0.30, which outperforms the state-of-the-art. Our proposed LiteVR framework can help researchers and practitioners analyze, detect, and deploy their DL-based cybersickness detection models in standalone VR HMDs.  ( 2 min )
  • Open

    Understanding Why Generalized Reweighting Does Not Improve Over ERM. (arXiv:2201.12293v4 [cs.LG] UPDATED)
    Empirical risk minimization (ERM) is known in practice to be non-robust to distributional shift where the training and the test distributions are different. A suite of approaches, such as importance weighting, and variants of distributionally robust optimization (DRO), have been proposed to solve this problem. But a line of recent work has empirically shown that these approaches do not significantly improve over ERM in real applications with distribution shift. The goal of this work is to obtain a comprehensive theoretical understanding of this intriguing phenomenon. We first posit the class of Generalized Reweighting (GRW) algorithms, as a broad category of approaches that iteratively update model parameters based on iterative reweighting of the training samples. We show that when overparameterized models are trained under GRW, the resulting models are close to that obtained by ERM. We also show that adding small regularization which does not greatly affect the empirical training accuracy does not help. Together, our results show that a broad category of what we term GRW approaches are not able to achieve distributionally robust generalization. Our work thus has the following sobering takeaway: to make progress towards distributionally robust generalization, we either have to develop non-GRW approaches, or perhaps devise novel classification/regression loss functions that are adapted to the class of GRW approaches.  ( 2 min )
    Generalization Bounds of Nonconvex-(Strongly)-Concave Stochastic Minimax Optimization. (arXiv:2205.14278v2 [math.OC] UPDATED)
    This paper takes an initial step to systematically investigate the generalization bounds of algorithms for solving nonconvex-(strongly)-concave (NC-SC/NC-C) stochastic minimax optimization measured by the stationarity of primal functions. We first establish algorithm-agnostic generalization bounds via uniform convergence between the empirical minimax problem and the population minimax problem. The sample complexities for achieving $\epsilon$-generalization are $\tilde{\mathcal{O}}(d\kappa^2\epsilon^{-2})$ and $\tilde{\mathcal{O}}(d\epsilon^{-4})$ for NC-SC and NC-C settings, respectively, where $d$ is the dimension and $\kappa$ is the condition number. We further study the algorithm-dependent generalization bounds via stability arguments of algorithms. In particular, we introduce a novel stability notion for minimax problems and build a connection between generalization bounds and the stability notion. As a result, we establish algorithm-dependent generalization bounds for stochastic gradient descent ascent (SGDA) algorithm and the more general sampling-determined algorithms.  ( 2 min )
    A nonparametric extension of randomized response for private confidence sets. (arXiv:2202.08728v2 [stat.ME] UPDATED)
    This work derives methods for performing nonparametric, nonasymptotic statistical inference for population means under the constraint of local differential privacy (LDP). Given bounded observations $(X_1, \dots, X_n)$ with mean $\mu^\star$ that are privatized into $(Z_1, \dots, Z_n)$, we present confidence intervals (CI) and time-uniform confidence sequences (CS) for $\mu^\star$ when only given access to the privatized data. To achieve this, we introduce a nonparametric and sequentially interactive generalization of Warner's famous ``randomized response'' mechanism, satisfying LDP for arbitrary bounded random variables, and then provide CIs and CSs for their means given access to the resulting privatized observations. For example, our results yield private analogues of Hoeffding's inequality in both fixed-time and time-uniform regimes. We extend these Hoeffding-type CSs to capture time-varying (non-stationary) means, and conclude by illustrating how these methods can be used to conduct private online A/B tests.  ( 2 min )
    Revised Conditional t-SNE: Looking Beyond the Nearest Neighbors. (arXiv:2302.03493v1 [cs.LG])
    Conditional t-SNE (ct-SNE) is a recent extension to t-SNE that allows removal of known cluster information from the embedding, to obtain a visualization revealing structure beyond label information. This is useful, for example, when one wants to factor out unwanted differences between a set of classes. We show that ct-SNE fails in many realistic settings, namely if the data is well clustered over the labels in the original high-dimensional space. We introduce a revised method by conditioning the high-dimensional similarities instead of the low-dimensional similarities and storing within- and across-label nearest neighbors separately. This also enables the use of recently proposed speedups for t-SNE, improving the scalability. From experiments on synthetic data, we find that our proposed method resolves the considered problems and improves the embedding quality. On real data containing batch effects, the expected improvement is not always there. We argue revised ct-SNE is preferable overall, given its improved scalability. The results also highlight new open questions, such as how to handle distance variations between clusters.  ( 2 min )
    Iterated Block Particle Filter for High-dimensional Parameter Learning: Beating the Curse of Dimensionality. (arXiv:2110.10745v3 [stat.ML] UPDATED)
    Parameter learning for high-dimensional, partially observed, and nonlinear stochastic processes is a methodological challenge. Spatiotemporal disease transmission systems provide examples of such processes giving rise to open inference problems. We propose the iterated block particle filter (IBPF) algorithm for learning high-dimensional parameters over graphical state space models with general state spaces, measures, transition densities and graph structure. Theoretical performance guarantees are obtained on beating the curse of dimensionality (COD), algorithm convergence, and likelihood maximization. Experiments on a highly nonlinear and non-Gaussian spatiotemporal model for measles transmission reveal that the iterated ensemble Kalman filter algorithm (Li et al. (2020)) is ineffective and the iterated filtering algorithm (Ionides et al. (2015)) suffers from the COD, while our IBPF algorithm beats COD consistently across various experiments with different metrics.  ( 2 min )
    On the symmetries in the dynamics of wide two-layer neural networks. (arXiv:2211.08771v3 [cs.LG] UPDATED)
    We consider the idealized setting of gradient flow on the population risk for infinitely wide two-layer ReLU neural networks (without bias), and study the effect of symmetries on the learned parameters and predictors. We first describe a general class of symmetries which, when satisfied by the target function $f^*$ and the input distribution, are preserved by the dynamics. We then study more specific cases. When $f^*$ is odd, we show that the dynamics of the predictor reduces to that of a (non-linearly parameterized) linear predictor, and its exponential convergence can be guaranteed. When $f^*$ has a low-dimensional structure, we prove that the gradient flow PDE reduces to a lower-dimensional PDE. Furthermore, we present informal and numerical arguments that suggest that the input neurons align with the lower-dimensional structure of the problem.  ( 2 min )
    Accelerated Nonnegative Tensor Completion via Integer Programming. (arXiv:2211.15770v2 [cs.LG] UPDATED)
    The problem of tensor completion has applications in healthcare, computer vision, and other domains. However, past approaches to tensor completion have faced a tension in that they either have polynomial-time computation but require exponentially more samples than the information-theoretic rate, or they use fewer samples but require solving NP-hard problems for which there are no known practical algorithms. A recent approach, based on integer programming, resolves this tension for nonnegative tensor completion. It achieves the information-theoretic sample complexity rate and deploys the Blended Conditional Gradients algorithm, which requires a linear (in numerical tolerance) number of oracle steps to converge to the global optimum. The tradeoff in this approach is that, in the worst case, the oracle step requires solving an integer linear program. Despite this theoretical limitation, numerical experiments show that this algorithm can, on certain instances, scale up to 100 million entries while running on a personal computer. The goal of this paper is to further enhance this algorithm, with the intention to expand both the breadth and scale of instances that can be solved. We explore several variants that can maintain the same theoretical guarantees as the algorithm, but offer potentially faster computation. We consider different data structures, acceleration of gradient descent steps, and the use of the Blended Pairwise Conditional Gradients algorithm. We describe the original approach and these variants, and conduct numerical experiments in order to explore various tradeoffs in these algorithmic design choices.  ( 2 min )
    Deep-OSG: A deep learning approach for approximating a family of operators in semigroup to model unknown autonomous systems. (arXiv:2302.03358v1 [cs.LG])
    This paper proposes a novel deep learning approach for approximating evolution operators and modeling unknown autonomous dynamical systems using time series data collected at varied time lags. It is a sequel to the previous works [T. Qin, K. Wu, and D. Xiu, J. Comput. Phys., 395:620--635, 2019], [K. Wu and D. Xiu, J. Comput. Phys., 408:109307, 2020], and [Z. Chen, V. Churchill, K. Wu, and D. Xiu, J. Comput. Phys., 449:110782, 2022], which focused on learning single evolution operator with a fixed time step. This paper aims to learn a family of evolution operators with variable time steps, which constitute a semigroup for an autonomous system. The semigroup property is very crucial and links the system's evolutionary behaviors across varying time scales, but it was not considered in the previous works. We propose for the first time a framework of embedding the semigroup property into the data-driven learning process, through a novel neural network architecture and new loss functions. The framework is very feasible, can be combined with any suitable neural networks, and is applicable to learning general autonomous ODEs and PDEs. We present the rigorous error estimates and variance analysis to understand the prediction accuracy and robustness of our approach, showing the remarkable advantages of semigroup awareness in our model. Moreover, our approach allows one to arbitrarily choose the time steps for prediction and ensures that the predicted results are well self-matched and consistent. Extensive numerical experiments demonstrate that embedding the semigroup property notably reduces the data dependency of deep learning models and greatly improves the accuracy, robustness, and stability for long-time prediction.  ( 2 min )
    Temporal Robustness against Data Poisoning. (arXiv:2302.03684v1 [cs.LG])
    Data poisoning considers cases when an adversary maliciously inserts and removes training data to manipulate the behavior of machine learning algorithms. Traditional threat models of data poisoning center around a single metric, the number of poisoned samples. In consequence, existing defenses are essentially vulnerable in practice when poisoning more samples remains a feasible option for attackers. To address this issue, we leverage timestamps denoting the birth dates of data, which are often available but neglected in the past. Benefiting from these timestamps, we propose a temporal threat model of data poisoning and derive two novel metrics, earliness and duration, which respectively measure how long an attack started in advance and how long an attack lasted. With these metrics, we define the notions of temporal robustness against data poisoning, providing a meaningful sense of protection even with unbounded amounts of poisoned samples. We present a benchmark with an evaluation protocol simulating continuous data collection and periodic deployments of updated models, thus enabling empirical evaluation of temporal robustness. Lastly, we develop and also empirically verify a baseline defense, namely temporal aggregation, offering provable temporal robustness and highlighting the potential of our temporal modeling of data poisoning.  ( 2 min )
    Revisiting Discriminative vs. Generative Classifiers: Theory and Implications. (arXiv:2302.02334v1 [cs.LG] CROSS LISTED)
    A large-scale deep model pre-trained on massive labeled or unlabeled data transfers well to downstream tasks. Linear evaluation freezes parameters in the pre-trained model and trains a linear classifier separately, which is efficient and attractive for transfer. However, little work has investigated the classifier in linear evaluation except for the default logistic regression. Inspired by the statistical efficiency of naive Bayes, the paper revisits the classical topic on discriminative vs. generative classifiers. Theoretically, the paper considers the surrogate loss instead of the zero-one loss in analyses and generalizes the classical results from binary cases to multiclass ones. We show that, under mild assumptions, multiclass naive Bayes requires $O(\log n)$ samples to approach its asymptotic error while the corresponding multiclass logistic regression requires $O(n)$ samples, where $n$ is the feature dimension. To establish it, we present a multiclass $\mathcal{H}$-consistency bound framework and an explicit bound for logistic loss, which are of independent interests. Simulation results on a mixture of Gaussian validate our theoretical findings. Experiments on various pre-trained deep vision models show that naive Bayes consistently converges faster as the number of data increases. Besides, naive Bayes shows promise in few-shot cases and we observe the ``two regimes'' phenomenon in pre-trained supervised models. Our code is available at https://github.com/ML-GSAI/Revisiting-Dis-vs-Gen-Classifiers.  ( 2 min )
    Representation Theory for Geometric Quantum Machine Learning. (arXiv:2210.07980v2 [quant-ph] UPDATED)
    Recent advances in classical machine learning have shown that creating models with inductive biases encoding the symmetries of a problem can greatly improve performance. Importation of these ideas, combined with an existing rich body of work at the nexus of quantum theory and symmetry, has given rise to the field of Geometric Quantum Machine Learning (GQML). Following the success of its classical counterpart, it is reasonable to expect that GQML will play a crucial role in developing problem-specific and quantum-aware models capable of achieving a computational advantage. Despite the simplicity of the main idea of GQML -- create architectures respecting the symmetries of the data -- its practical implementation requires a significant amount of knowledge of group representation theory. We present an introduction to representation theory tools from the optics of quantum learning, driven by key examples involving discrete and continuous groups. These examples are sewn together by an exposition outlining the formal capture of GQML symmetries via "label invariance under the action of a group representation", a brief (but rigorous) tour through finite and compact Lie group representation theory, a reexamination of ubiquitous tools like Haar integration and twirling, and an overview of some successful strategies for detecting symmetries.  ( 2 min )
    Matrix Estimation for Individual Fairness. (arXiv:2302.02096v1 [cs.LG] CROSS LISTED)
    In recent years, multiple notions of algorithmic fairness have arisen. One such notion is individual fairness (IF), which requires that individuals who are similar receive similar treatment. In parallel, matrix estimation (ME) has emerged as a natural paradigm for handling noisy data with missing values. In this work, we connect the two concepts. We show that pre-processing data using ME can improve an algorithm's IF without sacrificing performance. Specifically, we show that using a popular ME method known as singular value thresholding (SVT) to pre-process the data provides a strong IF guarantee under appropriate conditions. We then show that, under analogous conditions, SVT pre-processing also yields estimates that are consistent and approximately minimax optimal. As such, the ME pre-processing step does not, under the stated conditions, increase the prediction error of the base algorithm, i.e., does not impose a fairness-performance trade-off. We verify these results on synthetic and real data.  ( 2 min )
    Analyzing Tree Architectures in Ensembles via Neural Tangent Kernel. (arXiv:2205.12904v2 [cs.LG] UPDATED)
    A soft tree is an actively studied variant of a decision tree that updates splitting rules using the gradient method. Although soft trees can take various architectures, their impact is not theoretically well known. In this paper, we formulate and analyze the Neural Tangent Kernel (NTK) induced by soft tree ensembles for arbitrary tree architectures. This kernel leads to the remarkable finding that only the number of leaves at each depth is relevant for the tree architecture in ensemble learning with an infinite number of trees. In other words, if the number of leaves at each depth is fixed, the training behavior in function space and the generalization performance are exactly the same across different tree architectures, even if they are not isomorphic. We also show that the NTK of asymmetric trees like decision lists does not degenerate when they get infinitely deep. This is in contrast to the perfect binary trees, whose NTK is known to degenerate and leads to worse generalization performance for deeper trees.  ( 2 min )
    Deep Linear Networks can Benignly Overfit when Shallow Ones Do. (arXiv:2209.09315v2 [cs.LG] UPDATED)
    We bound the excess risk of interpolating deep linear networks trained using gradient flow. In a setting previously used to establish risk bounds for the minimum $\ell_2$-norm interpolant, we show that randomly initialized deep linear networks can closely approximate or even match known bounds for the minimum $\ell_2$-norm interpolant. Our analysis also reveals that interpolating deep linear models have exactly the same conditional variance as the minimum $\ell_2$-norm solution. Since the noise affects the excess risk only through the conditional variance, this implies that depth does not improve the algorithm's ability to "hide the noise". Our simulations verify that aspects of our bounds reflect typical behavior for simple data distributions. We also find that similar phenomena are seen in simulations with ReLU networks, although the situation there is more nuanced.  ( 2 min )
    Variance-Aware Sparse Linear Bandits. (arXiv:2205.13450v3 [cs.LG] UPDATED)
    It is well-known that for sparse linear bandits, when ignoring the dependency on sparsity which is much smaller than the ambient dimension, the worst-case minimax regret is $\widetilde{\Theta}\left(\sqrt{dT}\right)$ where $d$ is the ambient dimension and $T$ is the number of rounds. On the other hand, in the benign setting where there is no noise and the action set is the unit sphere, one can use divide-and-conquer to achieve $\widetilde{\mathcal O}(1)$ regret, which is (nearly) independent of $d$ and $T$. In this paper, we present the first variance-aware regret guarantee for sparse linear bandits: $\widetilde{\mathcal O}\left(\sqrt{d\sum_{t=1}^T \sigma_t^2} + 1\right)$, where $\sigma_t^2$ is the variance of the noise at the $t$-th round. This bound naturally interpolates the regret bounds for the worst-case constant-variance regime (i.e., $\sigma_t \equiv \Omega(1)$) and the benign deterministic regimes (i.e., $\sigma_t \equiv 0$). To achieve this variance-aware regret guarantee, we develop a general framework that converts any variance-aware linear bandit algorithm to a variance-aware algorithm for sparse linear bandits in a "black-box" manner. Specifically, we take two recent algorithms as black boxes to illustrate that the claimed bounds indeed hold, where the first algorithm can handle unknown-variance cases and the second one is more efficient.  ( 2 min )
    Incentive-aware Contextual Pricing with Non-parametric Market Noise. (arXiv:1911.03508v3 [cs.LG] UPDATED)
    We consider a dynamic pricing problem for repeated contextual second-price auctions with multiple strategic buyers who aim to maximize their long-term time discounted utility. The seller has limited information on buyers' overall demand curves which depends on a non-parametric market-noise distribution, and buyers may potentially submit corrupted bids (relative to true valuations) to manipulate the seller's pricing policy for more favorable reserve prices in the future. We focus on designing the seller's learning policy to set contextual reserve prices where the seller's goal is to minimize regret compared to the revenue of a benchmark clairvoyant policy that has full information of buyers' demand. We propose a policy with a phased-structure that incorporates randomized "isolation" periods, during which a buyer is randomly chosen to solely participate in the auction. We show that this design allows the seller to control the number of periods in which buyers significantly corrupt their bids. We then prove that our policy enjoys a $T$-period regret of $\widetilde{\mathcal{O}}(\sqrt{T})$ facing strategic buyers. Finally, we conduct numerical simulations to compare our proposed algorithm to standard pricing policies. Our numerical results show that our algorithm outperforms these policies under various buyer bidding behavior.  ( 2 min )
    Linear Partial Monitoring for Sequential Decision-Making: Algorithms, Regret Bounds and Applications. (arXiv:2302.03683v1 [cs.LG])
    Partial monitoring is an expressive framework for sequential decision-making with an abundance of applications, including graph-structured and dueling bandits, dynamic pricing and transductive feedback models. We survey and extend recent results on the linear formulation of partial monitoring that naturally generalizes the standard linear bandit setting. The main result is that a single algorithm, information-directed sampling (IDS), is (nearly) worst-case rate optimal in all finite-action games. We present a simple and unified analysis of stochastic partial monitoring, and further extend the model to the contextual and kernelized setting.  ( 2 min )
    A distribution-free mixed-integer optimization approach to hierarchical modelling of clustered and longitudinal data. (arXiv:2302.03157v1 [stat.ME])
    We create a mixed-integer optimization (MIO) approach for doing cluster-aware regression, i.e. linear regression that takes into account the inherent clustered structure of the data. We compare to the linear mixed effects regression (LMEM) which is the most used current method, and design simulation experiments to show superior performance to LMEM in terms of both predictive and inferential metrics in silico. Furthermore, we show how our method is formulated in a very interpretable way; LMEM cannot generalize and make cluster-informed predictions when the cluster of new data points is unknown, but we solve this problem by training an interpretable classification tree that can help decide cluster effects for new data points, and demonstrate the power of this generalizability on a real protein expression dataset.  ( 2 min )
    Identification of Power System Oscillation Modes using Blind Source Separation based on Copula Statistic. (arXiv:2302.03633v1 [eess.SP])
    The dynamics of a power system with large penetration of renewable energy resources are becoming more nonlinear due to the intermittence of these resources and the switching of their power electronic devices. Therefore, it is crucial to accurately identify the dynamical modes of oscillation of such a power system when it is subject to disturbances to initiate appropriate preventive or corrective control actions. In this paper, we propose a high-order blind source identification (HOBI) algorithm based on the copula statistic to address these non-linear dynamics in modal analysis. The method combined with Hilbert transform (HOBI-HT) and iteration procedure (HOBMI) can identify all the modes as well as the model order from the observation signals obtained from the number of channels as low as one. We access the performance of the proposed method on numerical simulation signals and recorded data from a simulation of time domain analysis on the classical 11-Bus 4-Machine test system. Our simulation results outperform the state-of-the-art method in accuracy and effectiveness.  ( 2 min )
    Optimizing Audio Recommendations for the Long-Term: A Reinforcement Learning Perspective. (arXiv:2302.03561v1 [cs.LG])
    We study the problem of optimizing a recommender system for outcomes that occur over several weeks or months. We begin by drawing on reinforcement learning to formulate a comprehensive model of users' recurring relationships with a recommender system. Measurement, attribution, and coordination challenges complicate algorithm design. We describe careful modeling -- including a new representation of user state and key conditional independence assumptions -- which overcomes these challenges and leads to simple, testable recommender system prototypes. We apply our approach to a podcast recommender system that makes personalized recommendations to hundreds of millions of listeners. A/B tests demonstrate that purposefully optimizing for long-term outcomes leads to large performance gains over conventional approaches that optimize for short-term proxies.  ( 2 min )
    Evaluating Self-Supervised Learning via Risk Decomposition. (arXiv:2302.03068v1 [cs.LG])
    Self-supervised learning (SSL) pipelines differ in many design choices such as the architecture, augmentations, or pretraining data. Yet SSL is typically evaluated using a single metric: linear probing on ImageNet. This does not provide much insight into why or when a model is better, now how to improve it. To address this, we propose an SSL risk decomposition, which generalizes the classical supervised approximation-estimation decomposition by considering errors arising from the representation learning step. Our decomposition consists of four error components: approximation, representation usability, probe generalization, and encoder generalization. We provide efficient estimators for each component and use them to analyze the effect of 30 design choices on 169 SSL vision models evaluated on ImageNet. Our analysis gives valuable insights for designing and using SSL models. For example, it highlights the main sources of error and shows how to improve SSL in specific settings (full- vs few-shot) by trading off error components. All results and pretrained models are at https://github.com/YannDubs/SSL-Risk-Decomposition.  ( 2 min )
    Breaking the Curse of Multiagents in a Large State Space: RL in Markov Games with Independent Linear Function Approximation. (arXiv:2302.03673v1 [cs.LG])
    We propose a new model, independent linear Markov game, for multi-agent reinforcement learning with a large state space and a large number of agents. This is a class of Markov games with independent linear function approximation, where each agent has its own function approximation for the state-action value functions that are marginalized by other players' policies. We design new algorithms for learning the Markov coarse correlated equilibria (CCE) and Markov correlated equilibria (CE) with sample complexity bounds that only scale polynomially with each agent's own function class complexity, thus breaking the curse of multiagents. In contrast, existing works for Markov games with function approximation have sample complexity bounds scale with the size of the \emph{joint action space} when specialized to the canonical tabular Markov game setting, which is exponentially large in the number of agents. Our algorithms rely on two key technical innovations: (1) utilizing policy replay to tackle non-stationarity incurred by multiple agents and the use of function approximation; (2) separating learning Markov equilibria and exploration in the Markov games, which allows us to use the full-information no-regret learning oracle instead of the stronger bandit-feedback no-regret learning oracle used in the tabular setting. Furthermore, we propose an iterative-best-response type algorithm that can learn pure Markov Nash equilibria in independent linear Markov potential games. In the tabular case, by adapting the policy replay mechanism for independent linear Markov games, we propose an algorithm with $\widetilde{O}(\epsilon^{-2})$ sample complexity to learn Markov CCE, which improves the state-of-the-art result $\widetilde{O}(\epsilon^{-3})$ in Daskalakis et al. 2022, where $\epsilon$ is the desired accuracy, and also significantly improves other problem parameters.  ( 3 min )
    Approximate message passing from random initialization with applications to $\mathbb{Z}_{2}$ synchronization. (arXiv:2302.03682v1 [math.ST])
    This paper is concerned with the problem of reconstructing an unknown rank-one matrix with prior structural information from noisy observations. While computing the Bayes-optimal estimator seems intractable in general due to its nonconvex nature, Approximate Message Passing (AMP) emerges as an efficient first-order method to approximate the Bayes-optimal estimator. However, the theoretical underpinnings of AMP remain largely unavailable when it starts from random initialization, a scheme of critical practical utility. Focusing on a prototypical model called $\mathbb{Z}_{2}$ synchronization, we characterize the finite-sample dynamics of AMP from random initialization, uncovering its rapid global convergence. Our theory provides the first non-asymptotic characterization of AMP in this model without requiring either an informative initialization (e.g., spectral initialization) or sample splitting.  ( 2 min )
    SDYN-GANs: Adversarial Learning Methods for Multistep Generative Models for General Order Stochastic Dynamics. (arXiv:2302.03663v1 [cs.LG])
    We introduce adversarial learning methods for data-driven generative modeling of the dynamics of $n^{th}$-order stochastic systems. Our approach builds on Generative Adversarial Networks (GANs) with generative model classes based on stable $m$-step stochastic numerical integrators. We introduce different formulations and training methods for learning models of stochastic dynamics based on observation of trajectory samples. We develop approaches using discriminators based on Maximum Mean Discrepancy (MMD), training protocols using conditional and marginal distributions, and methods for learning dynamic responses over different time-scales. We show how our approaches can be used for modeling physical systems to learn force-laws, damping coefficients, and noise-related parameters. The adversarial learning approaches provide methods for obtaining stable generative models for dynamic tasks including long-time prediction and developing simulations for stochastic systems.  ( 2 min )
    Riemannian Flow Matching on General Geometries. (arXiv:2302.03660v1 [cs.LG])
    We propose Riemannian Flow Matching (RFM), a simple yet powerful framework for training continuous normalizing flows on manifolds. Existing methods for generative modeling on manifolds either require expensive simulation, inherently cannot scale to high dimensions, or use approximations to limiting quantities that result in biased objectives. Riemannian Flow Matching bypasses these inconveniences and exhibits multiple benefits over prior approaches: It is completely simulation-free on simple geometries, it does not require divergence computation, and its target vector field is computed in closed form even on general geometries. The key ingredient behind RFM is the construction of a simple kernel function for defining per-sample vector fields, which subsumes existing Euclidean cases. Extending to general geometries, we rely on the use of spectral decompositions to efficiently compute kernel functions. Our method achieves state-of-the-art performance on real-world non-Euclidean datasets, and we showcase, for the first time, tractable training on general geometries, including on triangular meshes and maze-like manifolds with boundaries.  ( 2 min )
    Multi-Scale Message Passing Neural PDE Solvers. (arXiv:2302.03580v1 [cs.LG])
    We propose a novel multi-scale message passing neural network algorithm for learning the solutions of time-dependent PDEs. Our algorithm possesses both temporal and spatial multi-scale resolution features by incorporating multi-scale sequence models and graph gating modules in the encoder and processor, respectively. Benchmark numerical experiments are presented to demonstrate that the proposed algorithm outperforms baselines, particularly on a PDE with a range of spatial and temporal scales.  ( 2 min )
    Convergence rates for momentum stochastic gradient descent with noise of machine learning type. (arXiv:2302.03550v1 [math.OC])
    We consider the momentum stochastic gradient descent scheme (MSGD) and its continuous-in-time counterpart in the context of non-convex optimization. We show almost sure exponential convergence of the objective function value for target functions that are Lipschitz continuous and satisfy the Polyak-Lojasiewicz inequality on the relevant domain, and under assumptions on the stochastic noise that are motivated by overparameterized supervised learning applications. Moreover, we optimize the convergence rate over the set of friction parameters and show that the MSGD process almost surely converges.  ( 2 min )
    Efficient Parametric Approximations of Neural Network Function Space Distance. (arXiv:2302.03519v1 [cs.LG])
    It is often useful to compactly summarize important properties of model parameters and training data so that they can be used later without storing and/or iterating over the entire dataset. As a specific case, we consider estimating the Function Space Distance (FSD) over a training set, i.e. the average discrepancy between the outputs of two neural networks. We propose a Linearized Activation Function TRick (LAFTR) and derive an efficient approximation to FSD for ReLU neural networks. The key idea is to approximate the architecture as a linear network with stochastic gating. Despite requiring only one parameter per unit of the network, our approach outcompetes other parametric approximations with larger memory requirements. Applied to continual learning, our parametric approximation is competitive with state-of-the-art nonparametric approximations, which require storing many training examples. Furthermore, we show its efficacy in estimating influence functions accurately and detecting mislabeled examples without expensive iterations over the entire dataset.  ( 2 min )
    OPORP: One Permutation + One Random Projection. (arXiv:2302.03505v1 [stat.ML])
    Consider two $D$-dimensional data vectors (e.g., embeddings): $u, v$. In many embedding-based retrieval (EBR) applications where the vectors are generated from trained models, $D=256\sim 1024$ are common. In this paper, OPORP (one permutation + one random projection) uses a variant of the ``count-sketch'' type of data structures for achieving data reduction/compression. With OPORP, we first apply a permutation on the data vectors. A random vector $r$ is generated i.i.d. with moments: $E(r_i) = 0, E(r_i^2)=1, E(r_i^3) =0, E(r_i^4)=s$. We multiply (as dot product) $r$ with all permuted data vectors. Then we break the $D$ columns into $k$ equal-length bins and aggregate (i.e., sum) the values in each bin to obtain $k$ samples from each data vector. One crucial step is to normalize the $k$ samples to the unit $l_2$ norm. We show that the estimation variance is essentially: $(s-1)A + \frac{D-k}{D-1}\frac{1}{k}\left[ (1-\rho^2)^2 -2A\right]$, where $A\geq 0$ is a function of the data ($u,v$). This formula reveals several key properties: (1) We need $s=1$. (2) The factor $\frac{D-k}{D-1}$ can be highly beneficial in reducing variances. (3) The term $\frac{1}{k}(1-\rho^2)^2$ is actually the asymptotic variance of the classical correlation estimator. We illustrate that by letting the $k$ in OPORP to be $k=1$ and repeat the procedure $m$ times, we exactly recover the work of ``very spars random projections'' (VSRP). This immediately leads to a normalized estimator for VSRP which substantially improves the original estimator of VSRP. In summary, with OPORP, the two key steps: (i) the normalization and (ii) the fixed-length binning scheme, have considerably improved the accuracy in estimating the cosine similarity, which is a routine (and crucial) task in modern embedding-based retrieval (EBR) applications.  ( 2 min )
    On the relationship between multivariate splines and infinitely-wide neural networks. (arXiv:2302.03459v1 [cs.LG])
    We consider multivariate splines and show that they have a random feature expansion as infinitely wide neural networks with one-hidden layer and a homogeneous activation function which is the power of the rectified linear unit. We show that the associated function space is a Sobolev space on a Euclidean ball, with an explicit bound on the norms of derivatives. This link provides a new random feature expansion for multivariate splines that allow efficient algorithms. This random feature expansion is numerically better behaved than usual random Fourier features, both in theory and practice. In particular, in dimension one, we compare the associated leverage scores to compare the two random expansions and show a better scaling for the neural network expansion.  ( 2 min )
    Sparse GEMINI for Joint Discriminative Clustering and Feature Selection. (arXiv:2302.03391v1 [stat.ML])
    Feature selection in clustering is a hard task which involves simultaneously the discovery of relevant clusters as well as relevant variables with respect to these clusters. While feature selection algorithms are often model-based through optimised model selection or strong assumptions on $p(\pmb{x})$, we introduce a discriminative clustering model trying to maximise a geometry-aware generalisation of the mutual information called GEMINI with a simple $\ell_1$ penalty: the Sparse GEMINI. This algorithm avoids the burden of combinatorial feature subset exploration and is easily scalable to high-dimensional data and large amounts of samples while only designing a clustering model $p_\theta(y|\pmb{x})$. We demonstrate the performances of Sparse GEMINI on synthetic datasets as well as large-scale datasets. Our results show that Sparse GEMINI is a competitive algorithm and has the ability to select relevant subsets of variables with respect to the clustering without using relevance criteria or prior hypotheses.  ( 2 min )
    Federated Variational Inference Methods for Structured Latent Variable Models. (arXiv:2302.03314v1 [stat.ML])
    Federated learning methods, that is, methods that perform model training using data situated across different sources, whilst simultaneously not having the data leave their original source, are of increasing interest in a number of fields. However, despite this interest, the classes of models for which easily-applicable and sufficiently general approaches are available is limited, excluding many structured probabilistic models. We present a general yet elegant resolution to the aforementioned issue. The approach is based on adopting structured variational inference, an approach widely used in Bayesian machine learning, to the federated setting. Additionally, a communication-efficient variant analogous to the canonical FedAvg algorithm is explored. The effectiveness of the proposed algorithms are demonstrated, and their performance is compared on Bayesian multinomial regression, topic modelling, and mixed model examples.  ( 2 min )
    A unified recipe for deriving (time-uniform) PAC-Bayes bounds. (arXiv:2302.03421v1 [stat.ML])
    We present a unified framework for deriving PAC-Bayesian generalization bounds. Unlike most previous literature on this topic, our bounds are anytime-valid (i.e., time-uniform), meaning that they hold at all stopping times, not only for a fixed sample size. Our approach combines four tools in the following order: (a) nonnegative supermartingales or reverse submartingales, (b) the method of mixtures, (c) the Donsker-Varadhan formula (or other convex duality principles), and (d) Ville's inequality. We derive time-uniform generalizations of well-known classical PAC-Bayes bounds, such as those of Seeger, McAllester, Maurer, and Catoni, in addition to many recent bounds. We also present several novel bounds and, more importantly, general techniques for constructing them. Despite being anytime-valid, our extensions remain as tight as their fixed-time counterparts. Moreover, they enable us to relax traditional assumptions; in particular, we consider nonstationary loss functions and non-i.i.d. data. In sum, we unify the derivation of past bounds and ease the search for future bounds: one may simply check if our supermartingale or submartingale conditions are met and, if so, be guaranteed a (time-uniform) PAC-Bayes bound.  ( 2 min )
    Leveraging Demonstrations to Improve Online Learning: Quality Matters. (arXiv:2302.03319v1 [cs.LG])
    We investigate the extent to which offline demonstration data can improve online learning. It is natural to expect some improvement, but the question is how, and by how much? We show that the degree of improvement must depend on the quality of the demonstration data. To generate portable insights, we focus on Thompson sampling (TS) applied to a multi-armed bandit as a prototypical online learning algorithm and model. The demonstration data is generated by an expert with a given competence level, a notion we introduce. We propose an informed TS algorithm that utilizes the demonstration data in a coherent way through Bayes' rule and derive a prior-dependent Bayesian regret bound. This offers insight into how pretraining can greatly improve online performance and how the degree of improvement increases with the expert's competence level. We also develop a practical, approximate informed TS algorithm through Bayesian bootstrapping and show substantial empirical regret reduction through experiments.  ( 2 min )
    Algorithmically Designed Artificial Neural Networks (ADANNs): Higher order deep operator learning for parametric partial differential equations. (arXiv:2302.03286v1 [math.NA])
    In this article we propose a new deep learning approach to solve parametric partial differential equations (PDEs) approximately. In particular, we introduce a new strategy to design specific artificial neural network (ANN) architectures in conjunction with specific ANN initialization schemes which are tailor-made for the particular scientific computing approximation problem under consideration. In the proposed approach we combine efficient classical numerical approximation techniques such as higher-order Runge-Kutta schemes with sophisticated deep (operator) learning methodologies such as the recently introduced Fourier neural operators (FNOs). Specifically, we introduce customized adaptions of existing standard ANN architectures together with specialized initializations for these ANN architectures so that at initialization we have that the ANNs closely mimic a chosen efficient classical numerical algorithm for the considered approximation problem. The obtained ANN architectures and their initialization schemes are thus strongly inspired by numerical algorithms as well as by popular deep learning methodologies from the literature and in that sense we refer to the introduced ANNs in conjunction with their tailor-made initialization schemes as Algorithmically Designed Artificial Neural Networks (ADANNs). We numerically test the proposed ADANN approach in the case of some parametric PDEs. In the tested numerical examples the ADANN approach significantly outperforms existing traditional approximation algorithms as well as existing deep learning methodologies from the literature.  ( 2 min )
    Near-Minimax-Optimal Risk-Sensitive Reinforcement Learning with CVaR. (arXiv:2302.03201v1 [cs.LG])
    In this paper, we study risk-sensitive Reinforcement Learning (RL), focusing on the objective of Conditional Value at Risk (CVaR) with risk tolerance $\tau$. Starting with multi-arm bandits (MABs), we show the minimax CVaR regret rate is $\Omega(\sqrt{\tau^{-1}AK})$, where $A$ is the number of actions and $K$ is the number of episodes, and that it is achieved by an Upper Confidence Bound algorithm with a novel Bernstein bonus. For online RL in tabular Markov Decision Processes (MDPs), we show a minimax regret lower bound of $\Omega(\sqrt{\tau^{-1}SAK})$ (with normalized cumulative rewards), where $S$ is the number of states, and we propose a novel bonus-driven Value Iteration procedure. We show that our algorithm achieves the optimal regret of $\widetilde O(\sqrt{\tau^{-1}SAK})$ under a continuity assumption and in general attains a near-optimal regret of $\widetilde O(\tau^{-1}\sqrt{SAK})$, which is minimax-optimal for constant $\tau$. This improves on the best available bounds. By discretizing rewards appropriately, our algorithms are computationally efficient.  ( 2 min )
    Exact Inference in High-order Structured Prediction. (arXiv:2302.03236v1 [cs.LG])
    In this paper, we study the problem of inference in high-order structured prediction tasks. In the context of Markov random fields, the goal of a high-order inference task is to maximize a score function on the space of labels, and the score function can be decomposed into sum of unary and high-order potentials. We apply a generative model approach to study the problem of high-order inference, and provide a two-stage convex optimization algorithm for exact label recovery. We also provide a new class of hypergraph structural properties related to hyperedge expansion that drives the success in general high-order inference problems. Finally, we connect the performance of our algorithm and the hyperedge expansion property using a novel hypergraph Cheeger-type inequality.  ( 2 min )
    Easy Learning from Label Proportions. (arXiv:2302.03115v1 [cs.LG])
    We consider the problem of Learning from Label Proportions (LLP), a weakly supervised classification setup where instances are grouped into "bags", and only the frequency of class labels at each bag is available. Albeit, the objective of the learner is to achieve low task loss at an individual instance level. Here we propose Easyllp: a flexible and simple-to-implement debiasing approach based on aggregate labels, which operates on arbitrary loss functions. Our technique allows us to accurately estimate the expected loss of an arbitrary model at an individual level. We showcase the flexibility of our approach by applying it to popular learning frameworks, like Empirical Risk Minimization (ERM) and Stochastic Gradient Descent (SGD) with provable guarantees on instance level performance. More concretely, we exhibit a variance reduction technique that makes the quality of LLP learning deteriorate only by a factor of k (k being bag size) in both ERM and SGD setups, as compared to full supervision. Finally, we validate our theoretical results on multiple datasets demonstrating our algorithm performs as well or better than previous LLP approaches in spite of its simplicity.  ( 2 min )
    Memory-Based Meta-Learning on Non-Stationary Distributions. (arXiv:2302.03067v1 [cs.LG])
    Memory-based meta-learning is a technique for approximating Bayes-optimal predictors. Under fairly general conditions, minimizing sequential prediction error, measured by the log loss, leads to implicit meta-learning. The goal of this work is to investigate how far this interpretation can be realized by current sequence prediction models and training regimes. The focus is on piecewise stationary sources with unobserved switching-points, which arguably capture an important characteristic of natural language and action-observation sequences in partially observable environments. We show that various types of memory-based neural models, including Transformers, LSTMs, and RNNs can learn to accurately approximate known Bayes-optimal algorithms and behave as if performing Bayesian inference over the latent switching-points and the latent parameters governing the data distribution within each segment.  ( 2 min )

  • Open

    I want to train a CNN to perform Stool analysis - where can I find the datatset?
    ^ submitted by /u/Ok_Gur_2696 [link] [comments]  ( 40 min )
    Best Books to Learn Neural Networks in 2023 for Beginners
    submitted by /u/Lakshmireddys [link] [comments]  ( 40 min )
    Understanding Attention Mechanism in Transformer Neural Networks
    submitted by /u/keghn [link] [comments]  ( 40 min )
    Neural networks with non-instant neuron firing?
    Wondering if you guys could give me some terms to Google. Biological neurons have a refractory period after firing in which they cannot fire again. Additionally, charges/neurotransmitters take a certain amount of time to build up in the target cell. Firings tend to follow this voltage over time graph. Is there a type of neural network with a more time-delayed or first/second order time-derivative firing behavior? One benefit I can think of is that if you've got a more ad-hoc network structure, a traditional NN may have two neurons feeding into each other which would result in a 1 "tick" oscillation or cascading activations since there is no limit on firing rate. EDIT: Finally found it. They're called "Spiking Neural Networks" (SNNs). submitted by /u/JDude13 [link] [comments]  ( 41 min )
  • Open

    How do I use Thompson Sampling with non-binary rewards?
    Any suggestions and/or resources to understand and implement this? submitted by /u/Blasphemer666 [link] [comments]  ( 41 min )
    I Was At The Legendary Man Vs Machine Chess Match - It Was Similar To The AlphaGo AI Match. (Podcast Clip)
    submitted by /u/joemurray1994 [link] [comments]  ( 41 min )
    "An Invitation to Imitation", Bagnell 2015 (tutorial on imitation learning, DAGGer etc)
    submitted by /u/gwern [link] [comments]  ( 41 min )
    Does a bigger model or inclusion of an specialized preprocessing unit result in a more stable learning losses?
    Hello guys, I am trying to fit a DQN on price data. I know its virtually impossible and not profitable in live trading. BUT, the model I am training is currently plagued with rather unstable profits, after like 5 hours of training on an A100. It's clear that is learning something, but the profits are still rather unpredictable. I wanted to know which remedies you recommend to improving its stability? Larger network? Or an auto encoder or something like that for data preprocessing? Thank you submitted by /u/Kiizmod0 [link] [comments]  ( 41 min )
    Motivation behind asynchronous updates in A3C
    I'm currently reading Mnih's monumental paper on asynchronous updates in on policy RL algorithms. When he introduces it, he states that the worker agents experience copies of the environment and this parallelism De-correlates the agent's data into a more stationary process, since at any time step the agent would be experiencing a wide variety of states. I only have an intuition of what this could mean but I'm not sure if I'm correct. I believe that the decorrelation has to do with how updates are made in an on policy algorithm such as REINFORCE. We end up increasing the probability of actions based on the return. In subsequent updates, we are more likely to take this action and this would only increase the probability of taking it further. As a result the updates are correlated. I'm not sure how multiple parallel experiences are connected to stationarity. If we have a global policy that is evolving and all of the experiences are initialised with the global policy, don't they all have the same, shifting distribution? I'd appreciate any insight on this or anything I mentioned. Thanks! submitted by /u/theanswerisnt42 [link] [comments]  ( 43 min )
  • Open

    Will you be switching to Bing for its chatGPT integration?
    View Poll submitted by /u/arnolds112 [link] [comments]  ( 41 min )
    Using A.I. to create Audio documents/books
    Hello, I have no idea where else to ask this. I'm not very tech savvy either but I've listened to the recent A.I. voice recordings and heard that they use input text to work (if I'm not mistaken?). So with that in mind would it be possible to copy-paste a files text in it and have the A.I. read it into a recording? The reason I want to do this is because often I have to read a lot for my projects and would sometimes like to listen to information rather than being on my screen for 10 hours locked inside my home so if I could self create these form of audio files it would benefit me extremely. submitted by /u/Googlefriend1 [link] [comments]  ( 42 min )
    Athene AI show
    Is there like a public version of what he uses? Or something similar it’s so interesting submitted by /u/x2hunter [link] [comments]  ( 41 min )
    Awesome No/low-code AI App Builder
    ​ https://reddit.com/link/10xbrxo/video/5bvwl3qv91ha1/player Hey everyone! Wanted to share a platform a few others & myself have been working on: an easy way to build apps on top of AI Large Language Models such as GPT-3. This video shows the simplest app you can build -- just take one of your great ChatGPT prompts and quickly turn it into a shareable app with an interface anyone can use. You can also chain LLMs together, feed in data from outside sources, and a whole lot more. We have found a lot of value in this community, and so wanted to give you all early access before we release it on a wider scale: Agent Beta Access. Would love any feedback! Btw, here is the app in the video (it's a fun one!): King James Bible Verse Generator. submitted by /u/leopuli [link] [comments]  ( 41 min )
    An oddity while generating celebrity images.
    So, I was testing out how different AI image generators make famous people(and human faces in general). So I tried many free ones like DallE mini, and Imagine, etc as well as some premium ones like Midjourney, Wombo Dream, Stable Diffusion, DallE 2 (Granted Dall E didn't allow it most of the time). I generated peole like RDJ, Michael Jackson, Scarlett Johansson, Madison Beer, and even some you tubers. All of them turned out fine even with different art styles. With the exception of one person... Billie Eilish. Every time I generated her face, I laughed hard. In every image generator, she looked like someone with both High function autism and extreme downs syndrome. Not saying that that you should laugh at those people. But those pictures were funny af. Any clue as to why it happens with her face? submitted by /u/offensivepanzerIV [link] [comments]  ( 41 min )
    The Threat of AI-Powered Chatbots in Spreading Misinformation: New Concerns Raised
    submitted by /u/Flaky_Preparation_50 [link] [comments]  ( 40 min )
    Regulating AI tools like ChatGPT could be “problematic”
    General purpose artificial intelligence tools will provide new challenges for regulators, which they may struggle to meet. I spoke to several experts to get a better idea of how they think it should be regulated. The general feeling is that for the most part it should be “light touch” and based on risk - so asking medical questions is high risk and tightly regulated, asking for a poem is low risk. There is also concerns GDPR could come into play and a company like OpenAI could be forced to delete its model and training data if enough people complain. https://techmonitor.ai/technology/ai-and-automation/ai-regulation-chatgpt-bard submitted by /u/upyourego [link] [comments]  ( 42 min )
    AI Dream 131 - Transcendence Visualized by AI - Weird Wednesday
    submitted by /u/LordPewPew777 [link] [comments]  ( 40 min )
    Best Artificial Intelligence courses for Healthcare You should learn
    submitted by /u/Lakshmireddys [link] [comments]  ( 40 min )
    How Can I Become an Early Adopter of AI?
    I've been keeping tabs on what's going on in the AI space for the last year probably and have been fascinated by the possibilities for ChatGPT and Bard. I'm not a huge techie and can't program, but I am intelligent and know my way around a computer. How could someone like me start using AI to make their life easier? What uses does it have in my job, my day to day life, or how could I use it to streamline things I already do? I've been searching online for answers to these questions but really haven't found any. I see huge potential here (as does almost everyone in the world), I'm just not sure what I can do with it at this point in time. submitted by /u/Hot_Ropes_Of_Gum [link] [comments]  ( 42 min )
    Discover and learn faster with Perplexity AI. For any question you have, AI will now link to new topics you might be curious about next. Whether you’re reading about people, places, or products, you can now explore and read more simply by clicking.
    submitted by /u/rafs2006 [link] [comments]  ( 41 min )
    Google won't penalize AI content, introduces new AI content guidelines
    submitted by /u/henlo_there_fren [link] [comments]  ( 40 min )
    Anyone know a programm that finds a location of a picture through AI?
    submitted by /u/Pierruno [link] [comments]  ( 40 min )
    Create an AI powered Gmail Add-on
    AI powered Gmail addon that helps you write 10X faster email saving your time for more important things. I worked in corporate and had to deal with unhealthy amount of emails everyday. This left me no time to work and I would end up working overtime. We have built this Gmail addon to help people who face similar problems to mine. We are also offering pay per use payment model to help you keep track of where you spend your money. Here's the download link : https://workspace.google.com/marketplace/app/buzz_mail/650469784389 submitted by /u/Spiritual-Fact-4059 [link] [comments]  ( 41 min )
    Bing And Edge Get An Upgrade With ChatGPT AI
    submitted by /u/liquidocelotYT [link] [comments]  ( 40 min )
    Is there a Local Text Generation AI?
    Greetings! I started out testing AI with the Crayon AI Image generator but I really had a lot of fun when I downloaded Stable Diffusion locally, using the Web-UI to generate images. Now I am having a lot of fun testing ChatGPT, mostly seeing what it can do writing creative short stories, but it can be a pain with the high capacity it goes through, and I really do like having things save and store locally as I generate them. Is this a thing? Anything on the level of ChatGPT at all? Thanks in advance! submitted by /u/Aldrnarii [link] [comments]  ( 41 min )
    I Was At The Legendary Man Vs Machine Chess Match - It Was Similar To The AlphaGo AI Match. (Podcast Clip)
    submitted by /u/joemurray1994 [link] [comments]  ( 40 min )
    2,000+ ChatGPT/AI tools. https://www.aisearchtool.com/
    AI Tools Directory submitted by /u/HornOkPlease9596 [link] [comments]  ( 40 min )
    Predict who will have the best AI search in 1 year's time:
    View Poll submitted by /u/tlkop123 [link] [comments]  ( 42 min )
    Health Care Bias Is Dangerous. But So Are ‘Fairness’ Algorithms
    submitted by /u/Fantomas77 [link] [comments]  ( 40 min )
    How Artificial Intelligence Will Help Find Your Purpose
    submitted by /u/derstarkerwille [link] [comments]  ( 41 min )
    Could AI Search Engines cause a chain reaction that results in the loss of hundreds of thousands of websites?
    So if you search something now you usually end up on a website that runs ads to pay for servers, editors etc. However, if ChatGPT and Bard will be fully implemented into Bing and Google, there is no need to visit these websites anymore because you get the answer you need right away. Wouldn't that result in a shitton of websites closing down, which then results in AIs having a harder time to get their hand on correct information? submitted by /u/neco_dota [link] [comments]  ( 48 min )
    Generate quizzes from any text in one click using Artificial Intelligence inside Google Form
    submitted by /u/theindianappguy [link] [comments]  ( 41 min )
    Google Bard, ChatGPT competitor with Artificial Intelligence, available soon to everyone
    submitted by /u/nickkgar [link] [comments]  ( 40 min )
  • Open

    Optimize your machine learning deployments with auto scaling on Amazon SageMaker
    Machine learning (ML) has become ubiquitous. Our customers are employing ML in every aspect of their business, including the products and services they build, and for drawing insights about their customers. To build an ML-based application, you have to first build the ML model that serves your business requirement. Building ML models involves preparing the […]  ( 15 min )
  • Open

    Unsupervised and semi-supervised anomaly detection with data-centric ML
    Posted by Jinsung Yoon and Sercan O. Arik, Research Scientists, Google Research, Cloud AI Team Anomaly detection (AD), the task of distinguishing anomalies from normal data, plays a vital role in many real-world applications, such as detecting faulty products from vision sensors in manufacturing, fraudulent behaviors in financial transactions, or network security threats. Depending on the availability of the type of data — negative (normal) vs. positive (anomalous) and the availability of their labels — the task of AD involves different challenges. (a) Fully supervised anomaly detection, (b) normal-only anomaly detection, (c, d, e) semi-supervised anomaly detection, (f) unsupervised anomaly detection. While most previous works were shown to be effective for cases with fully-lab…  ( 92 min )
  • Open

    [R] pix2pixzero - Zero-shot Image-to-Image Translation
    submitted by /u/gecko39 [link] [comments]  ( 43 min )
    [Discussion] Cognitive science inspired AI research
    I came across a few comments on this community about researchers developing AI algorithms inspired by ideas from neuroscience/cognition. I'd like to know how successful this approach has been in terms of coming up with new perspectives on problems. What are some of the key issues researchers are trying to address this way? What are some future directions in which research may progress? I have a rough idea that this could be one way to inspire sample efficient RL but I'd love to hear about other work that goes on in this area submitted by /u/theanswerisnt42 [link] [comments]  ( 43 min )
    [D] List of RL Papers
    Hi, I want to open a thread about RL (non-deep and deep) What are the papers/books that are "must read" to have strong foundation? submitted by /u/C_l3b [link] [comments]  ( 43 min )
    [P] Scripts/Programs to collect Baseline Logs
    Abit of a weird question. So I'm required to make & collect some clean (baseline) logs and dirty (malicious) logs for some mini-ML project I'm doing. So my question is, is there any scripts or programs out there, Linux or Windows, that allows the automation of mimicking an office staff doing work (ie. opening Outlook, sending emails, surfing the web, watching YouTube, opening and editing Word/Excel files, etc.) for the purpose of collecting baseline logs? I'm relative new to this kind of thing, if you guys have better suggestion on a more better/efficient way to do this, feel free to suggest! submitted by /u/Dweeberbob [link] [comments]  ( 43 min )
    [N] New Book on Synthetic Data​: Version 3.0 Just Released
    The book has considerably grown since version 1.0. It started with synthetic data as one of the main components, but also diving into explainable AI, intuitive / interpretable machine learning, and generative AI. Now with 272 pages (up from 156 in the first version), the focus is clearly on synthetic data. Of course, I still discuss explainable and generative AI: these concepts are strongly related to data synthetization. Agent-based modeling in action However many new chapters have been added, covering various aspects of synthetic data — in particular working with more diversified real datasets, how to synthetize them, how to generate high quality random numbers with a very fast algorithm based on digits of irrational numbers, with visual illustrations and Python code in all chapters. I…  ( 47 min )
  • Open

    Research Focus: Week of February 6, 2023
    Welcome to Research Focus, a new series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft. Behind the Tech podcast with Tobi Lütke: CEO and Founder, Shopify In the latest episode of Behind the Tech, Microsoft CTO Kevin Scott is joined by Tobi […] The post Research Focus: Week of February 6, 2023 appeared first on Microsoft Research.  ( 9 min )
  • Open

    New NVIDIA Studio Laptops Powered by GeForce RTX 4090, 4080 Laptop GPUs Unleash Creativity
    The first NVIDIA Studio laptops powered by GeForce RTX 40 Series Laptop GPUs are now available, starting with systems from MSI and Razer — with many more to come.  ( 8 min )
  • Open

    Hénon’s dynamical system
    This post will reproduce a three plots from a paper of Hénon on dynamical systems from 1969 [1]. Let α be a constant, and pick some starting point in the plane, (x0, y0), then update x and y according to xn+1 = xn cos α − (yn − xn²) sin α yn+1 = xn sin […] Hénon’s dynamical system first appeared on John D. Cook.  ( 6 min )
  • Open

    Robust Fine-Tuning of Deep Neural Networks with Hessian-based Generalization Guarantees. (arXiv:2206.02659v4 [cs.LG] UPDATED)
    We consider transfer learning approaches that fine-tune a pretrained deep neural network on a target task. We study the generalization properties of fine-tuning to understand the problem of overfitting, which commonly occurs in practice. Previous works have shown that constraining the distance from the initialization of fine-tuning improves generalization. Using a PAC-Bayesian analysis, we observe that besides distance from initialization, Hessians affect generalization through the noise stability of deep neural networks against noise injections. Motivated by the observation, we develop Hessian distance-based generalization bounds for a wide range of fine-tuning methods. Additionally, we study the robustness of fine-tuning in the presence of noisy labels. We design an algorithm incorporating consistent losses and distance-based regularization for fine-tuning, along with a generalization error guarantee under class conditional independent noise in the training set labels. We perform a detailed empirical study of our algorithm on various noisy environments and architectures. On six image classification tasks whose training labels are generated with programmatic labeling, we find a 3.26% accuracy gain over prior fine-tuning methods. Meanwhile, the Hessian distance measure of the fine-tuned model decreases by six times more than existing approaches.  ( 2 min )
    Representational dissimilarity metric spaces for stochastic neural networks. (arXiv:2211.11665v2 [cs.LG] UPDATED)
    Quantifying similarity between neural representations -- e.g. hidden layer activation vectors -- is a perennial problem in deep learning and neuroscience research. Existing methods compare deterministic responses (e.g. artificial networks that lack stochastic layers) or averaged responses (e.g., trial-averaged firing rates in biological data). However, these measures of _deterministic_ representational similarity ignore the scale and geometric structure of noise, both of which play important roles in neural computation. To rectify this, we generalize previously proposed shape metrics (Williams et al. 2021) to quantify differences in _stochastic_ representations. These new distances satisfy the triangle inequality, and thus can be used as a rigorous basis for many supervised and unsupervised analyses. Leveraging this novel framework, we find that the stochastic geometries of neurobiological representations of oriented visual gratings and naturalistic scenes respectively resemble untrained and trained deep network representations. Further, we are able to more accurately predict certain network attributes (e.g. training hyperparameters) from its position in stochastic (versus deterministic) shape space.  ( 2 min )
    Robust Active Distillation. (arXiv:2210.01213v2 [cs.LG] UPDATED)
    Distilling knowledge from a large teacher model to a lightweight one is a widely successful approach for generating compact, powerful models in the semi-supervised learning setting where a limited amount of labeled data is available. In large-scale applications, however, the teacher tends to provide a large number of incorrect soft-labels that impairs student performance. The sheer size of the teacher additionally constrains the number of soft-labels that can be queried due to prohibitive computational and/or financial costs. The difficulty in achieving simultaneous \emph{efficiency} (i.e., minimizing soft-label queries) and \emph{robustness} (i.e., avoiding student inaccuracies due to incorrect labels) hurts the widespread application of knowledge distillation to many modern tasks. In this paper, we present a parameter-free approach with provable guarantees to query the soft-labels of points that are simultaneously informative and correctly labeled by the teacher. At the core of our work lies a game-theoretic formulation that explicitly considers the inherent trade-off between the informativeness and correctness of input instances. We establish bounds on the expected performance of our approach that hold even in worst-case distillation instances. We present empirical evaluations on popular benchmarks that demonstrate the improved distillation performance enabled by our work relative to that of state-of-the-art active learning and active distillation methods.  ( 2 min )
    Weakly-Supervised 3D Medical Image Segmentation using Geometric Prior and Contrastive Similarity. (arXiv:2302.02125v1 [eess.IV])
    Medical image segmentation is almost the most important pre-processing procedure in computer-aided diagnosis but is also a very challenging task due to the complex shapes of segments and various artifacts caused by medical imaging, (i.e., low-contrast tissues, and non-homogenous textures). In this paper, we propose a simple yet effective segmentation framework that incorporates the geometric prior and contrastive similarity into the weakly-supervised segmentation framework in a loss-based fashion. The proposed geometric prior built on point cloud provides meticulous geometry to the weakly-supervised segmentation proposal, which serves as better supervision than the inherent property of the bounding-box annotation (i.e., height and width). Furthermore, we propose contrastive similarity to encourage organ pixels to gather around in the contrastive embedding space, which helps better distinguish low-contrast tissues. The proposed contrastive embedding space can make up for the poor representation of the conventionally-used gray space. Extensive experiments are conducted to verify the effectiveness and the robustness of the proposed weakly-supervised segmentation framework. The proposed framework is superior to state-of-the-art weakly-supervised methods on the following publicly accessible datasets: LiTS 2017 Challenge, KiTS 2021 Challenge, and LPBA40. We also dissect our method and evaluate the performance of each component.  ( 2 min )
    Augmenting Interpretable Knowledge Tracing by Ability Attribute and Attention Mechanism. (arXiv:2302.02146v1 [cs.LG])
    Knowledge tracing aims to model students' past answer sequences to track the change in their knowledge acquisition during exercise activities and to predict their future learning performance. Most existing approaches ignore the fact that students' abilities are constantly changing or vary between individuals, and lack the interpretability of model predictions. To this end, in this paper, we propose a novel model based on ability attributes and attention mechanism. We first segment the interaction sequences and captures students' ability attributes, then dynamically assign students to groups with similar abilities, and quantify the relevance of the exercises to the skill by calculating the attention weights between the exercises and the skill to enhance the interpretability of the model. We conducted extensive experiments and evaluate real online education datasets. The results confirm that the proposed model is better at predicting performance than five well-known representative knowledge tracing models, and the model prediction results are explained through an inference path.  ( 2 min )
    Online Learning via Offline Greedy Algorithms: Applications in Market Design and Optimization. (arXiv:2102.11050v4 [cs.LG] UPDATED)
    Motivated by online decision-making in time-varying combinatorial environments, we study the problem of transforming offline algorithms to their online counterparts. We focus on offline combinatorial problems that are amenable to a constant factor approximation using a greedy algorithm that is robust to local errors. For such problems, we provide a general framework that efficiently transforms offline robust greedy algorithms to online ones using Blackwell approachability. We show that the resulting online algorithms have $O(\sqrt{T})$ (approximate) regret under the full information setting. We further introduce a bandit extension of Blackwell approachability that we call Bandit Blackwell approachability. We leverage this notion to transform greedy robust offline algorithms into a $O(T^{2/3})$ (approximate) regret in the bandit setting. Demonstrating the flexibility of our framework, we apply our offline-to-online transformation to several problems at the intersection of revenue management, market design, and online optimization, including product ranking optimization in online platforms, reserve price optimization in auctions, and submodular maximization. We also extend our reduction to greedy-like first order methods used in continuous optimization, such as those used for maximizing continuous strong DR monotone submodular functions subject to convex constraints. We show that our transformation, when applied to these applications, leads to new regret bounds or improves the current known bounds. We complement our theoretical studies by conducting numerical simulations for two of our applications, in both of which we observe that the numerical performance of our transformations outperforms the theoretical guarantees in practical instances.  ( 3 min )
    Model Stitching and Visualization How GAN Generators can Invert Networks in Real-Time. (arXiv:2302.02181v1 [cs.CV])
    Critical applications, such as in the medical field, require the rapid provision of additional information to interpret decisions made by deep learning methods. In this work, we propose a fast and accurate method to visualize activations of classification and semantic segmentation networks by stitching them with a GAN generator utilizing convolutions. We test our approach on images of animals from the AFHQ wild dataset and real-world digital pathology scans of stained tissue samples. Our method provides comparable results to established gradient descent methods on these datasets while running about two orders of magnitude faster.  ( 2 min )
    Equivariance with Learned Canonicalization Functions. (arXiv:2211.06489v2 [cs.LG] UPDATED)
    Symmetry-based neural networks often constrain the architecture in order to achieve invariance or equivariance to a group of transformations. In this paper, we propose an alternative that avoids this architectural constraint by learning to produce a canonical representation of the data. These canonicalization functions can readily be plugged into non-equivariant backbone architectures. We offer explicit ways to implement them for many groups of interest. We show that this approach enjoys universality while providing interpretable insights. Our main hypothesis is that learning a neural network to perform canonicalization is better than using predefined heuristics. Our results show that learning the canonicalization function indeed leads to better results and that the approach achieves excellent performance in practice.  ( 2 min )
    VR-LENS: Super Learning-based Cybersickness Detection and Explainable AI-Guided Deployment in Virtual Reality. (arXiv:2302.01985v1 [cs.LG])
    A plethora of recent research has proposed several automated methods based on machine learning (ML) and deep learning (DL) to detect cybersickness in Virtual reality (VR). However, these detection methods are perceived as computationally intensive and black-box methods. Thus, those techniques are neither trustworthy nor practical for deploying on standalone VR head-mounted displays (HMDs). This work presents an explainable artificial intelligence (XAI)-based framework VR-LENS for developing cybersickness detection ML models, explaining them, reducing their size, and deploying them in a Qualcomm Snapdragon 750G processor-based Samsung A52 device. Specifically, we first develop a novel super learning-based ensemble ML model for cybersickness detection. Next, we employ a post-hoc explanation method, such as SHapley Additive exPlanations (SHAP), Morris Sensitivity Analysis (MSA), Local Interpretable Model-Agnostic Explanations (LIME), and Partial Dependence Plot (PDP) to explain the expected results and identify the most dominant features. The super learner cybersickness model is then retrained using the identified dominant features. Our proposed method identified eye tracking, player position, and galvanic skin/heart rate response as the most dominant features for the integrated sensor, gameplay, and bio-physiological datasets. We also show that the proposed XAI-guided feature reduction significantly reduces the model training and inference time by 1.91X and 2.15X while maintaining baseline accuracy. For instance, using the integrated sensor dataset, our reduced super learner model outperforms the state-of-the-art works by classifying cybersickness into 4 classes (none, low, medium, and high) with an accuracy of 96% and regressing (FMS 1-10) with a Root Mean Square Error (RMSE) of 0.03.  ( 2 min )
    The Predictive Forward-Forward Algorithm. (arXiv:2301.01452v2 [cs.LG] UPDATED)
    We propose the predictive forward-forward (PFF) algorithm for conducting credit assignment in neural systems. Specifically, we design a novel, dynamic recurrent neural system that learns a directed generative circuit jointly and simultaneously with a representation circuit, integrating learnable lateral competition and elements of predictive coding, an emerging and viable neurobiological process theory of cortical function, with the forward-forward (FF) adaptation scheme. Furthermore, PFF efficiently learns to propagate learning signals and updates synapses with forward passes only, eliminating key structural and computational constraints imposed by a backpropagation-based scheme. Besides computational advantages, the PFF process could prove useful for understanding the learning mechanisms behind biological neurons that use local signals despite missing feedback connections. We run experiments on image data and demonstrate that the PFF procedure works as well as backpropagation of errors, offering a promising brain-inspired learning algorithm for classifying, reconstructing, and synthesizing data patterns.  ( 2 min )
    Locally Constrained Policy Optimization for Online Reinforcement Learning in Non-Stationary Input-Driven Environments. (arXiv:2302.02182v1 [cs.LG])
    We study online Reinforcement Learning (RL) in non-stationary input-driven environments, where a time-varying exogenous input process affects the environment dynamics. Online RL is challenging in such environments due to catastrophic forgetting (CF). The agent tends to forget prior knowledge as it trains on new experiences. Prior approaches to mitigate this issue assume task labels (which are often not available in practice) or use off-policy methods that can suffer from instability and poor performance. We present Locally Constrained Policy Optimization (LCPO), an on-policy RL approach that combats CF by anchoring policy outputs on old experiences while optimizing the return on current experiences. To perform this anchoring, LCPO locally constrains policy optimization using samples from experiences that lie outside of the current input distribution. We evaluate LCPO in two gym and computer systems environments with a variety of synthetic and real input traces, and find that it outperforms state-of-the-art on-policy and off-policy RL methods in the online setting, while achieving results on-par with an offline agent pre-trained on the whole input trace.  ( 2 min )
    LExecutor: Learning-Guided Execution. (arXiv:2302.02343v1 [cs.SE])
    Executing code is essential for various program analysis tasks, e.g., to detect bugs that manifest through exceptions or to obtain execution traces for further dynamic analysis. However, executing an arbitrary piece of code is often difficult in practice, e.g., because of missing variable definitions, missing user inputs, and missing third-party dependencies. This paper presents LExecutor, a learning-guided approach for executing arbitrary code snippets in an underconstrained way. The key idea is to let a neural model predict missing values that otherwise would cause the program to get stuck, and to inject these values into the execution. For example, LExecutor injects likely values for otherwise undefined variables and likely return values of calls to otherwise missing functions. We evaluate the approach on Python code from popular open-source projects and on code snippets extracted from Stack Overflow. The neural model predicts realistic values with an accuracy between 80.1% and 94.2%, allowing LExecutor to closely mimic real executions. As a result, the approach successfully executes significantly more code than any available technique, such as simply executing the code as-is. For example, executing the open-source code snippets as-is covers only 4.1% of all lines, because the code crashes early on, whereas LExecutor achieves a coverage of 50.1%.  ( 2 min )
    Efficient Gradient Approximation Method for Constrained Bilevel Optimization. (arXiv:2302.01970v1 [cs.LG])
    Bilevel optimization has been developed for many machine learning tasks with large-scale and high-dimensional data. This paper considers a constrained bilevel optimization problem, where the lower-level optimization problem is convex with equality and inequality constraints and the upper-level optimization problem is non-convex. The overall objective function is non-convex and non-differentiable. To solve the problem, we develop a gradient-based approach, called gradient approximation method, which determines the descent direction by computing several representative gradients of the objective function inside a neighborhood of the current estimate. We show that the algorithm asymptotically converges to the set of Clarke stationary points, and demonstrate the efficacy of the algorithm by the experiments on hyperparameter optimization and meta-learning.  ( 2 min )
    Knowledge Distillation $\approx$ Label Smoothing: Fact or Fallacy?. (arXiv:2301.12609v2 [cs.LG] UPDATED)
    Contrary to its original interpretation as a facilitator of knowledge transfer from one model to another, some recent studies have suggested that knowledge distillation (KD) is instead a form of regularization. Perhaps the strongest support of all for this claim is found in its apparent similarities with label smoothing (LS). This paper investigates the stated equivalence of these two methods by examining the predictive uncertainties of the models they train. Experiments on four text classification tasks involving teachers and students of different capacities show that: (a) In most settings, KD and LS drive model uncertainty (entropy) in completely opposite directions, and (b) In KD, the student's predictive uncertainty is a direct function of that of its teacher, reinforcing the knowledge transfer view.  ( 2 min )
    FastFold: Reducing AlphaFold Training Time from 11 Days to 67 Hours. (arXiv:2203.00854v3 [cs.LG] UPDATED)
    Protein structure prediction helps to understand gene translation and protein function, which is of growing interest and importance in structural biology. The AlphaFold model, which used transformer architecture to achieve atomic-level accuracy in protein structure prediction, was a significant breakthrough. However, training and inference of the AlphaFold model are challenging due to its high computation and memory cost. In this work, we present FastFold, an efficient implementation of AlphaFold for both training and inference. We propose Dynamic Axial Parallelism and Duality Async Operations to improve the scaling efficiency of model parallelism. Besides, AutoChunk is proposed to reduce memory cost by over 80% during inference by automatically determining the chunk strategy. Experimental results show that FastFold reduces overall training time from 11 days to 67 hours and achieves 7.5X - 9.5X speedup for long-sequence inference. Furthermore, we scale FastFold to 512 GPUs and achieve an aggregate throughput of 6.02 PetaFLOP/s with 90.1% parallel efficiency.
    Bayesian Optimization-based Combinatorial Assignment. (arXiv:2208.14698v4 [cs.LG] UPDATED)
    We study the combinatorial assignment domain, which includes combinatorial auctions and course allocation. The main challenge in this domain is that the bundle space grows exponentially in the number of items. To address this, several papers have recently proposed machine learning-based preference elicitation algorithms that aim to elicit only the most important information from agents. However, the main shortcoming of this prior work is that it does not model a mechanism's uncertainty over values for not yet elicited bundles. In this paper, we address this shortcoming by presenting a Bayesian optimization-based combinatorial assignment (BOCA) mechanism. Our key technical contribution is to integrate a method for capturing model uncertainty into an iterative combinatorial auction mechanism. Concretely, we design a new method for estimating an upper uncertainty bound that can be used to define an acquisition function to determine the next query to the agents. This enables the mechanism to properly explore (and not just exploit) the bundle space during its preference elicitation phase. We run computational experiments in several spectrum auction domains to evaluate BOCA's performance. Our results show that BOCA achieves higher allocative efficiency than state-of-the-art approaches.
    A Concept Knowledge Graph for User Next Intent Prediction at Alipay. (arXiv:2301.00503v2 [cs.CL] UPDATED)
    This paper illustrates the technologies of user next intent prediction with a concept knowledge graph. The system has been deployed on the Web at Alipay, serving more than 100 million daily active users. To explicitly characterize user intent, we propose \textbf{AlipayKG}, which is an offline concept knowledge graph in the Life-Service domain modeling the historical behaviors of users, the rich content interacted by users and the relations between them. We further introduce a Transformer-based model which integrates expert rules from the knowledge graph to infer the online user's next intent. Experimental results demonstrate that the proposed system can effectively enhance the performance of the downstream tasks while retaining explainability.
    Neural Time Series Analysis with Fourier Transform: A Survey. (arXiv:2302.02173v1 [cs.LG])
    Recently, Fourier transform has been widely introduced into deep neural networks to further advance the state-of-the-art regarding both accuracy and efficiency of time series analysis. The advantages of the Fourier transform for time series analysis, such as efficiency and global view, have been rapidly explored and exploited, exhibiting a promising deep learning paradigm for time series analysis. However, although increasing attention has been attracted and research is flourishing in this emerging area, there lacks a systematic review of the variety of existing studies in the area. To this end, in this paper, we provide a comprehensive review of studies on neural time series analysis with Fourier transform. We aim to systematically investigate and summarize the latest research progress. Accordingly, we propose a novel taxonomy to categorize existing neural time series analysis methods from four perspectives, including characteristics, usage paradigms, network design, and applications. We also share some new research directions in this vibrant area.  ( 2 min )
    SE(3) diffusion model with application to protein backbone generation. (arXiv:2302.02277v1 [cs.LG])
    The design of novel protein structures remains a challenge in protein engineering for applications across biomedicine and chemistry. In this line of work, a diffusion model over rigid bodies in 3D (referred to as frames) has shown success in generating novel, functional protein backbones that have not been observed in nature. However, there exists no principled methodological framework for diffusion on SE(3), the space of orientation preserving rigid motions in R3, that operates on frames and confers the group invariance. We address these shortcomings by developing theoretical foundations of SE(3) invariant diffusion models on multiple frames followed by a novel framework, FrameDiff, for learning the SE(3) equivariant score over multiple frames. We apply FrameDiff on monomer backbone generation and find it can generate designable monomers up to 500 amino acids without relying on a pretrained protein structure prediction network that has been integral to previous methods. We find our samples are capable of generalizing beyond any known protein structure.
    TAP: The Attention Patch for Cross-Modal Knowledge Transfer from Unlabeled Data. (arXiv:2302.02224v1 [cs.LG])
    This work investigates the intersection of cross modal learning and semi supervised learning, where we aim to improve the supervised learning performance of the primary modality by borrowing missing information from an unlabeled modality. We investigate this problem from a Nadaraya Watson (NW) kernel regression perspective and show that this formulation implicitly leads to a kernelized cross attention module. To this end, we propose The Attention Patch (TAP), a simple neural network plugin that allows data level knowledge transfer from the unlabeled modality. We provide numerical simulations on three real world datasets to examine each aspect of TAP and show that a TAP integration in a neural network can improve generalization performance using the unlabeled modality.  ( 2 min )
    Detecting Security Patches via Behavioral Data in Code Repositories. (arXiv:2302.02112v1 [cs.CR])
    The absolute majority of software today is developed collaboratively using collaborative version control tools such as Git. It is a common practice that once a vulnerability is detected and fixed, the developers behind the software issue a Common Vulnerabilities and Exposures or CVE record to alert the user community of the security hazard and urge them to integrate the security patch. However, some companies might not disclose their vulnerabilities and just update their repository. As a result, users are unaware of the vulnerability and may remain exposed. In this paper, we present a system to automatically identify security patches using only the developer behavior in the Git repository without analyzing the code itself or the remarks that accompanied the fix (commit message). We showed we can reveal concealed security patches with an accuracy of 88.3% and F1 Score of 89.8%. This is the first time that a language-oblivious solution for this problem is presented.  ( 2 min )
    Machine Learning Methods for Evaluating Public Crisis: Meta-Analysis. (arXiv:2302.02267v1 [cs.LG])
    This study examines machine learning methods used in crisis management. Analyzing detected patterns from a crisis involves the collection and evaluation of historical or near-real-time datasets through automated means. This paper utilized the meta-review method to analyze scientific literature that utilized machine learning techniques to evaluate human actions during crises. Selected studies were condensed into themes and emerging trends using a systematic literature evaluation of published works accessed from three scholarly databases. Results show that data from social media was prominent in the evaluated articles with 27% usage, followed by disaster management, health (COVID) and crisis informatics, amongst many other themes. Additionally, the supervised machine learning method, with an application of 69% across the board, was predominant. The classification technique stood out among other machine learning tasks with 41% usage. The algorithms that played major roles were the Support Vector Machine, Neural Networks, Naive Bayes, and Random Forest, with 23%, 16%, 15%, and 12% contributions, respectively.
    Learn Proportional Derivative Controllable Latent Space from Pixels. (arXiv:2110.08239v2 [cs.LG] UPDATED)
    Recent advances in latent space dynamics model from pixels show promising progress in vision-based model predictive control (MPC). However, executing MPC in real time can be challenging due to its intensive computational cost in each timestep. We propose to introduce additional learning objectives to enforce that the learned latent space is proportional derivative controllable. In execution time, the simple PD-controller can be applied directly to the latent space encoded from pixels, to produce simple and effective control to systems with visual observations. We show that our method outperforms baseline methods to produce robust goal reaching and trajectory tracking in various environments.
    Neighboring state-based RL Exploration. (arXiv:2212.10712v2 [cs.LG] UPDATED)
    Reinforcement Learning is a powerful tool to model decision-making processes. However, it relies on an exploration-exploitation trade-off that remains an open challenge for many tasks. In this work, we study neighboring state-based, model-free exploration led by the intuition that, for an early-stage agent, considering actions derived from a bounded region of nearby states may lead to better actions when exploring. We propose two algorithms that choose exploratory actions based on a survey of nearby states, and find that one of our methods, ${\rho}$-explore, consistently outperforms the Double DQN baseline in an discrete environment by 49\% in terms of Eval Reward Return.
    Neural Optimal Transport with General Cost Functionals. (arXiv:2205.15403v2 [cs.LG] UPDATED)
    Neural optimal transport techniques mostly use Euclidean cost functions, such as $\ell^1$ or $\ell^2$. These costs are suitable for translation tasks between related domains, but they are hardly applicable to problems where a specific non-Euclidean optimality of the mapping is required such as dataset transfer. To tackle this issue, we introduce a novel neural network-based algorithm to compute optimal transport plans and maps for general cost functionals. Such functionals provide more flexibility and allow using auxiliary information, such as class labels, to construct the required transport map. Our method is based on a saddle point reformulation of the optimal transport problem and generalizes prior methods for weak and strong transport cost functionals. As an application, we construct a functional to map data distributions with preserving the class-wise structure of data.
    Sample Dropout: A Simple yet Effective Variance Reduction Technique in Deep Policy Optimization. (arXiv:2302.02299v1 [cs.LG])
    Recent success in Deep Reinforcement Learning (DRL) methods has shown that policy optimization with respect to an off-policy distribution via importance sampling is effective for sample reuse. In this paper, we show that the use of importance sampling could introduce high variance in the objective estimate. Specifically, we show in a principled way that the variance of importance sampling estimate grows quadratically with importance ratios and the large ratios could consequently jeopardize the effectiveness of surrogate objective optimization. We then propose a technique called sample dropout to bound the estimation variance by dropping out samples when their ratio deviation is too high. We instantiate this sample dropout technique on representative policy optimization algorithms, including TRPO, PPO, and ESPO, and demonstrate that it consistently boosts the performance of those DRL algorithms on both continuous and discrete action controls, including MuJoCo, DMControl and Atari video games. Our code is open-sourced at \url{https://github.com/LinZichuan/sdpo.git}.
    Using Intermediate Forward Iterates for Intermediate Generator Optimization. (arXiv:2302.02336v1 [cs.LG])
    Score-based models have recently been introduced as a richer framework to model distributions in high dimensions and are generally more suitable for generative tasks. In score-based models, a generative task is formulated using a parametric model (such as a neural network) to directly learn the gradient of such high dimensional distributions, instead of the density functions themselves, as is done traditionally. From the mathematical point of view, such gradient information can be utilized in reverse by stochastic sampling to generate diverse samples. However, from a computational perspective, existing score-based models can be efficiently trained only if the forward or the corruption process can be computed in closed form. By using the relationship between the process and layers in a feed-forward network, we derive a backpropagation-based procedure which we call Intermediate Generator Optimization to utilize intermediate iterates of the process with negligible computational overhead. The main advantage of IGO is that it can be incorporated into any standard autoencoder pipeline for the generative task. We analyze the sample complexity properties of IGO to solve downstream tasks like Generative PCA. We show applications of the IGO on two dense predictive tasks viz., image extrapolation, and point cloud denoising. Our experiments indicate that obtaining an ensemble of generators for various time points is possible using first-order methods.
    Multi-Center Federated Learning: Clients Clustering for Better Personalization. (arXiv:2005.01026v3 [cs.LG] UPDATED)
    Federated learning has received great attention for its capability to train a large-scale model in a decentralized manner without needing to access user data directly. It helps protect the users' private data from centralized collecting. Unlike distributed machine learning, federated learning aims to tackle non-IID data from heterogeneous sources in various real-world applications, such as those on smartphones. Existing federated learning approaches usually adopt a single global model to capture the shared knowledge of all users by aggregating their gradients, regardless of the discrepancy between their data distributions. However, due to the diverse nature of user behaviors, assigning users' gradients to different global models (i.e., centers) can better capture the heterogeneity of data distributions across users. Our paper proposes a novel multi-center aggregation mechanism for federated learning, which learns multiple global models from the non-IID user data and simultaneously derives the optimal matching between users and centers. We formulate the problem as a joint optimization that can be efficiently solved by a stochastic expectation maximization (EM) algorithm. Our experimental results on benchmark datasets show that our method outperforms several popular federated learning methods.
    Learning in quantum games. (arXiv:2302.02333v1 [cs.GT])
    In this paper, we introduce a class of learning dynamics for general quantum games, that we call "follow the quantum regularized leader" (FTQL), in reference to the classical "follow the regularized leader" (FTRL) template for learning in finite games. We show that the induced quantum state dynamics decompose into (i) a classical, commutative component which governs the dynamics of the system's eigenvalues in a way analogous to the evolution of mixed strategies under FTRL; and (ii) a non-commutative component for the system's eigenvectors which has no classical counterpart. Despite the complications that this non-classical component entails, we find that the FTQL dynamics incur no more than constant regret in all quantum games. Moreover, adjusting classical notions of stability to account for the nonlinear geometry of the state space of quantum games, we show that only pure quantum equilibria can be stable and attracting under FTQL while, as a partial converse, pure equilibria that satisfy a certain "variational stability" condition are always attracting. Finally, we show that the FTQL dynamics are Poincar\'e recurrent in quantum min-max games, extending in this way a very recent result for the quantum replicator dynamics.
    DiGress: Discrete Denoising diffusion for graph generation. (arXiv:2209.14734v3 [cs.LG] UPDATED)
    This work introduces DiGress, a discrete denoising diffusion model for generating graphs with categorical node and edge attributes. Our model utilizes a discrete diffusion process that progressively edits graphs with noise, through the process of adding or removing edges and changing the categories. A graph transformer network is trained to revert this process, simplifying the problem of distribution learning over graphs into a sequence of node and edge classification tasks. We further improve sample quality by introducing a Markovian noise model that preserves the marginal distribution of node and edge types during diffusion, and by incorporating auxiliary graph-theoretic features. A procedure for conditioning the generation on graph-level features is also proposed. DiGress achieves state-of-the-art performance on molecular and non-molecular datasets, with up to 3x validity improvement on a planar graph dataset. It is also the first model to scale to the large GuacaMol dataset containing 1.3M drug-like molecules without the use of molecule-specific representations.
    The Missing Indicator Method: From Low to High Dimensions. (arXiv:2211.09259v2 [cs.LG] UPDATED)
    Missing data is common in applied data science, particularly for tabular data sets found in healthcare, social sciences, and natural sciences. Most supervised learning methods only work on complete data, thus requiring preprocessing such as missing value imputation to work on incomplete data sets. However, imputation alone does not encode useful information about the missing values themselves. For data sets with informative missing patterns, the Missing Indicator Method (MIM), which adds indicator variables to indicate the missing pattern, can be used in conjunction with imputation to improve model performance. While commonly used in data science, MIM is surprisingly understudied from an empirical and especially theoretical perspective. In this paper, we show empirically and theoretically that MIM improves performance for informative missing values, and we prove that MIM does not hurt linear models asymptotically for uninformative missing values. Additionally, we find that for high-dimensional data sets with many uninformative indicators, MIM can induce model overfitting and thus test performance. To address this issue, we introduce Selective MIM (SMIM), a novel MIM extension that adds missing indicators only for features that have informative missing patterns. We show empirically that SMIM performs at least as well as MIM in general, and improves MIM for high-dimensional data. Lastly, to demonstrate the utility of MIM on real-world data science tasks, we demonstrate the effectiveness of MIM and SMIM on clinical tasks generated from the MIMIC-III database of electronic health records.
    Revisiting Discriminative vs. Generative Classifiers: Theory and Implications. (arXiv:2302.02334v1 [cs.LG])
    A large-scale deep model pre-trained on massive labeled or unlabeled data transfers well to downstream tasks. Linear evaluation freezes parameters in the pre-trained model and trains a linear classifier separately, which is efficient and attractive for transfer. However, little work has investigated the classifier in linear evaluation except for the default logistic regression. Inspired by the statistical efficiency of naive Bayes, the paper revisits the classical topic on discriminative vs. generative classifiers. Theoretically, the paper considers the surrogate loss instead of the zero-one loss in analyses and generalizes the classical results from binary cases to multiclass ones. We show that, under mild assumptions, multiclass naive Bayes requires $O(\log n)$ samples to approach its asymptotic error while the corresponding multiclass logistic regression requires $O(n)$ samples, where $n$ is the feature dimension. To establish it, we present a multiclass $\mathcal{H}$-consistency bound framework and an explicit bound for logistic loss, which are of independent interests. Simulation results on a mixture of Gaussian validate our theoretical findings. Experiments on various pre-trained deep vision models show that naive Bayes consistently converges faster as the number of data increases. Besides, naive Bayes shows promise in few-shot cases and we observe the ``two regimes'' phenomenon in pre-trained supervised models. Our code is available at https://github.com/ML-GSAI/Revisiting-Dis-vs-Gen-Classifiers.
    Heterogeneous Federated Knowledge Graph Embedding Learning and Unlearning. (arXiv:2302.02069v1 [cs.LG])
    Federated Learning (FL) recently emerges as a paradigm to train a global machine learning model across distributed clients without sharing raw data. Knowledge Graph (KG) embedding represents KGs in a continuous vector space, serving as the backbone of many knowledge-driven applications. As a promising combination, federated KG embedding can fully take advantage of knowledge learned from different clients while preserving the privacy of local data. However, realistic problems such as data heterogeneity and knowledge forgetting still remain to be concerned. In this paper, we propose FedLU, a novel FL framework for heterogeneous KG embedding learning and unlearning. To cope with the drift between local optimization and global convergence caused by data heterogeneity, we propose mutual knowledge distillation to transfer local knowledge to global, and absorb global knowledge back. Moreover, we present an unlearning method based on cognitive neuroscience, which combines retroactive interference and passive decay to erase specific knowledge from local clients and propagate to the global model by reusing knowledge distillation. We construct new datasets for assessing realistic performance of the state-of-the-arts. Extensive experiments show that FedLU achieves superior results in both link prediction and knowledge forgetting.
    How Many and Which Training Points Would Need to be Removed to Flip this Prediction?. (arXiv:2302.02169v1 [cs.LG])
    We consider the problem of identifying a minimal subset of training data $\mathcal{S}_t$ such that if the instances comprising $\mathcal{S}_t$ had been removed prior to training, the categorization of a given test point $x_t$ would have been different. Identifying such a set may be of interest for a few reasons. First, the cardinality of $\mathcal{S}_t$ provides a measure of robustness (if $|\mathcal{S}_t|$ is small for $x_t$, we might be less confident in the corresponding prediction), which we show is correlated with but complementary to predicted probabilities. Second, interrogation of $\mathcal{S}_t$ may provide a novel mechanism for contesting a particular model prediction: If one can make the case that the points in $\mathcal{S}_t$ are wrongly labeled or irrelevant, this may argue for overturning the associated prediction. Identifying $\mathcal{S}_t$ via brute-force is intractable. We propose comparatively fast approximation methods to find $\mathcal{S}_t$ based on influence functions, and find that -- for simple convex text classification models -- these approaches can often successfully identify relatively small sets of training examples which, if removed, would flip the prediction. To our knowledge, this is the first work in to investigate the problem of identifying a minimal training set necessary to flip a given prediction in the context of machine learning.
    Unsupervised Learning for Pilot-free Transmission in 3GPP MIMO Systems. (arXiv:2302.02191v1 [cs.IT])
    Reference signals overhead reduction has recently evolved as an effective solution for improving the system spectral efficiency. This paper introduces a new downlink data structure that is free from demodulation reference signals (DM-RS), and hence does not require any channel estimation at the receiver. The new proposed data transmission structure involves a simple repetition step of part of the user data across the different sub-bands. Exploiting the repetition structure at the user side, it is shown that reliable recovery is possible via canonical correlation analysis. This paper also proposes two effective mechanisms for boosting the CCA performance in OFDM systems; one for repetition pattern selection and another to deal with the severe frequency selectivity issues. The proposed approach exhibits favorable complexity-performance tradeoff, rendering it appealing for practical implementation. Numerical results, using a 3GPP link-level testbench, demonstrate the superiority of the proposed approach relative to the state-of-the-art methods.
    Euclidean-Norm-Induced Schatten-p Quasi-Norm Regularization for Low-Rank Tensor Completion and Tensor Robust Principal Component Analysis. (arXiv:2012.03436v4 [cs.LG] UPDATED)
    The nuclear norm and Schatten-$p$ quasi-norm are popular rank proxies in low-rank matrix recovery. However, computing the nuclear norm or Schatten-$p$ quasi-norm of a tensor is hard in both theory and practice, hindering their application to low-rank tensor completion (LRTC) and tensor robust principal component analysis (TRPCA). In this paper, we propose a new class of tensor rank regularizers based on the Euclidean norms of the CP component vectors of a tensor and show that these regularizers are monotonic transformations of tensor Schatten-$p$ quasi-norm. This connection enables us to minimize the Schatten-$p$ quasi-norm in LRTC and TRPCA implicitly via the component vectors. The method scales to big tensors and provides an arbitrarily sharper rank proxy for low-rank tensor recovery compared to the nuclear norm. On the other hand, we study the generalization abilities of LRTC with the Schatten-$p$ quasi-norm regularizer and LRTC with the proposed regularizers. The theorems show that a relatively sharper regularizer leads to a tighter error bound, which is consistent with our numerical results. Particularly, we prove that for LRTC with Schatten-$p$ quasi-norm regularizer on $d$-order tensors, $p=1/d$ is always better than any $p>1/d$ in terms of the generalization ability. We also provide a recovery error bound to verify the usefulness of small $p$ in the Schatten-$p$ quasi-norm for TRPCA. Numerical results on synthetic data and real data demonstrate the effectiveness of the regularization methods and theorems.
    Efficient Human-in-the-loop System for Guiding DNNs Attention. (arXiv:2206.05981v3 [cs.CV] UPDATED)
    Attention guidance is an approach to addressing dataset bias in deep learning, where the model relies on incorrect features to make decisions. Focusing on image classification tasks, we propose an efficient human-in-the-loop system to interactively direct the attention of classifiers to the regions specified by users, thereby reducing the influence of co-occurrence bias and improving the transferability and interpretability of a DNN. Previous approaches for attention guidance require the preparation of pixel-level annotations and are not designed as interactive systems. We present a new interactive method to allow users to annotate images with simple clicks, and study a novel active learning strategy to significantly reduce the number of annotations. We conducted both a numerical evaluation and a user study to evaluate the proposed system on multiple datasets. Compared to the existing non-active-learning approach which usually relies on huge amounts of polygon-based segmentation masks to fine-tune or train the DNNs, our system can save lots of labor and money and obtain a fine-tuned network that works better even when the dataset is biased. The experiment results indicate that the proposed system is efficient, reasonable, and reliable.
    Global Optimization with Parametric Function Approximation. (arXiv:2211.09100v2 [cs.LG] UPDATED)
    We consider the problem of global optimization with noisy zeroth order oracles - a well-motivated problem useful for various applications ranging from hyper-parameter tuning for deep learning to new material design. Existing work relies on Gaussian processes or other non-parametric family, which suffers from the curse of dimensionality. In this paper, we propose a new algorithm GO-UCB that leverages a parametric family of functions (e.g., neural networks) instead. Under a realizable assumption and a few other mild geometric conditions, we show that GO-UCB achieves a cumulative regret of $\tilde{O}(\sqrt{T})$ where $T$ is the time horizon. At the core of GO-UCB is a carefully designed uncertainty set over parameters based on gradients that allows optimistic exploration. Synthetic and real-world experiments illustrate GO-UCB works better than Bayesian optimization approaches in high dimensional cases, even if the model is misspecified.
    Predicting the power grid frequency of European islands. (arXiv:2209.15414v2 [stat.AP] UPDATED)
    Modelling, forecasting and overall understanding of the dynamics of the power grid and its frequency are essential for the safe operation of existing and future power grids. Much previous research was focused on large continental areas, while small systems, such as islands are less well-studied. These natural island systems are ideal testing environments for microgrid proposals and artificially islanded grid operation. In the present paper, we utilize measurements of the power grid frequency obtained in European islands: the Faroe Islands, Ireland, the Balearic Islands and Iceland and investigate how their frequency can be predicted, compared to the Nordic power system, acting as a reference. The Balearic islands are found to be particularly deterministic and easy to predict in contrast to hard-to-predict Iceland. Furthermore, we show that typically 2-4 weeks of data are needed to improve prediction performance beyond simple benchmarks.
    Learning Solution Manifolds for Control Problems via Energy Minimization. (arXiv:2203.03432v2 [cs.RO] UPDATED)
    A variety of control tasks such as inverse kinematics (IK), trajectory optimization (TO), and model predictive control (MPC) are commonly formulated as energy minimization problems. Numerical solutions to such problems are well-established. However, these are often too slow to be used directly in real-time applications. The alternative is to learn solution manifolds for control problems in an offline stage. Although this distillation process can be trivially formulated as a behavioral cloning (BC) problem in an imitation learning setting, our experiments highlight a number of significant shortcomings arising due to incompatible local minima, interpolation artifacts, and insufficient coverage of the state space. In this paper, we propose an alternative to BC that is efficient and numerically robust. We formulate the learning of solution manifolds as a minimization of the energy terms of a control objective integrated over the space of problems of interest. We minimize this energy integral with a novel method that combines Monte Carlo-inspired adaptive sampling strategies with the derivatives used to solve individual instances of the control task. We evaluate the performance of our formulation on a series of robotic control problems of increasing complexity, and we highlight its benefits through comparisons against traditional methods such as behavioral cloning and Dataset aggregation (Dagger).
    EGC2: Enhanced Graph Classification with Easy Graph Compression. (arXiv:2107.07737v2 [cs.LG] UPDATED)
    Graph classification is crucial in network analyses. Networks face potential security threats, such as adversarial attacks. Some defense methods may trade off the algorithm complexity for robustness, such as adversarial training, whereas others may trade off clean example performance, such as smoothingbased defense. Most suffer from high complexity or low transferability. To address this problem, we proposed EGC2, an enhanced graph classification model with easy graph compression. EGC2 captures the relationship between the features of different nodes by constructing feature graphs and improving the aggregation of the node-level representations. To achieve lower-complexity defense applied to graph classification models, EGC2 utilizes a centrality-based edge-importance index to compress the graphs, filtering out trivial structures and adversarial perturbations in the input graphs, thus improving the model's robustness. Experiments on ten benchmark datasets demonstrate that the proposed feature read-out and graph compression mechanisms enhance the robustness of multiple basic models, resulting in a state-of-the-art performance in terms of accuracy and robustness against various adversarial attacks.
    On the complexity of nonsmooth automatic differentiation. (arXiv:2206.01730v2 [math.NA] UPDATED)
    Using the notion of conservative gradient, we provide a simple model to estimate the computational costs of the backward and forward modes of algorithmic differentiation for a wide class of nonsmooth programs. The overhead complexity of the backward mode turns out to be independent of the dimension when using programs with locally Lipschitz semi-algebraic or definable elementary functions. This considerably extends Baur-Strassen's smooth cheap gradient principle. We illustrate our results by establishing fast backpropagation results of conservative gradients through feedforward neural networks with standard activation and loss functions. Nonsmooth backpropagation's cheapness contrasts with concurrent forward approaches, which have, to this day, dimensional-dependent worst-case overhead estimates. We provide further results suggesting the superiority of backward propagation of conservative gradients. Indeed, we relate the complexity of computing a large number of directional derivatives to that of matrix multiplication, and we show that finding two subgradients in the Clarke subdifferential of a function is an NP-hard problem.
    Selecting the Best Optimizers for Deep Learning based Medical Image Segmentation. (arXiv:2302.02289v1 [eess.IV])
    The goal of this work is to identify the best optimizers for deep learning in the context of cardiac image segmentation and to provide guidance on how to design segmentation networks with effective optimization strategies. Adaptive learning helps with fast convergence by starting with a larger learning rate (LR) and gradually decreasing it. Momentum optimizers are particularly effective at quickly optimizing neural networks within the accelerated schemes category. By revealing the potential interplay between these two types of algorithms (LR and momentum optimizers or momentum rate (MR) in short), in this article, we explore the two variants of SGD algorithms in a single setting. We suggest using cyclic learning as the base optimizer and integrating optimal values of learning rate and momentum rate. We investigated the relationship of LR and MR under an important problem of medical image segmentation of cardiac structures from MRI and CT scans. We conducted experiments using the cardiac imaging dataset from the ACDC challenge of MICCAI 2017, and four different architectures shown to be successful for cardiac image segmentation problems. Our comprehensive evaluations demonstrated that the proposed optimizer achieved better results (over a 2\% improvement in the dice metric) than other optimizers in deep learning literature with similar or lower computational cost in both single and multi-object segmentation settings. We hypothesized that combination of accelerated and adaptive optimization methods can have a drastic effect in medical image segmentation performances. To this end, we proposed a new cyclic optimization method (\textit{CLMR}) to address the efficiency and accuracy problems in deep learning based medical image segmentation. The proposed strategy yielded better generalization in comparison to adaptive optimizers.
    Proper Learning of Linear Dynamical Systems as a Non-Commutative Polynomial Optimisation Problem. (arXiv:2002.01444v4 [math.OC] UPDATED)
    There has been much recent progress in forecasting the next observation of a linear dynamical system (LDS), which is known as the improper learning, as well as in the estimation of its system matrices, which is known as the proper learning of LDS. We present an approach to proper learning of LDS, which in spite of the non-convexity of the problem, guarantees global convergence of numerical solutions to a least-squares estimator. We present promising computational results.
    Exploring validation metrics for offline model-based optimisation. (arXiv:2211.10747v2 [stat.ML] UPDATED)
    In offline model-based optimisation (MBO) we are interested in using machine learning to design candidates that maximise some measure of desirability through an expensive but real-world scoring process. Offline MBO tries to approximate this expensive scoring function and use that to evaluate generated designs, however evaluation is non-exact because one approximation is being evaluated with another. Instead, we ask ourselves: if we did have the real world scoring function at hand, what cheap-to-compute validation metrics would correlate best with this? Since the real-world scoring function is available for simulated MBO datasets, insights obtained from this can be transferred over to real-world offline MBO tasks where the real-world scoring function is expensive to compute. To address this, we propose a conceptual evaluation framework that is amenable to measuring extrapolation, and apply this to conditional denoising diffusion models. Empirically, we find that two validation metrics -- agreement and Frechet distance -- correlate quite well with the ground truth. When there is high variability in conditional generation, feedback is required in the form of an approximated version of the real-world scoring function. Furthermore, we find that generating high-scoring samples may require heavily weighting the generative model in favour of sample quality, potentially at the cost of sample diversity.
    Reducing Nearest Neighbor Training Sets Optimally and Exactly. (arXiv:2302.02132v1 [cs.CG])
    In nearest-neighbor classification, a training set $P$ of points in $\mathbb{R}^d$ with given classification is used to classify every point in $\mathbb{R}^d$: Every point gets the same classification as its nearest neighbor in $P$. Recently, Eppstein [SOSA'22] developed an algorithm to detect the relevant training points, those points $p\in P$, such that $P$ and $P\setminus\{p\}$ induce different classifications. We investigate the problem of finding the minimum cardinality reduced training set $P'\subseteq P$ such that $P$ and $P'$ induce the same classification. We show that the set of relevant points is such a minimum cardinality reduced training set if $P$ is in general position. Furthermore, we show that finding a minimum cardinality reduced training set for possibly degenerate $P$ is in P for $d=1$, and NP-complete for $d\geq 2$.
    Maximizing Global Model Appeal in Federated Learning. (arXiv:2205.14840v2 [cs.LG] UPDATED)
    Federated learning typically considers collaboratively training a global model using local data at edge clients. Clients may have their own individual requirements, such as having a minimal training loss threshold, which they expect to be met by the global model. However, due to client heterogeneity, the global model may not meet each client's requirements, and only a small subset may find the global model appealing. In this work, we explore the problem of the global model lacking appeal to the clients due to not being able to satisfy local requirements. We propose MaxFL, which aims to maximize the number of clients that find the global model appealing. We show that having a high global model appeal is important to maintain an adequate pool of clients for training, and can directly improve the test accuracy on both seen and unseen clients. We provide convergence guarantees for MaxFL and show that MaxFL achieves a $22$-$40\%$ and $18$-$50\%$ test accuracy improvement for the training clients and unseen clients respectively, compared to a wide range of FL modeling approaches, including those that tackle data heterogeneity, aim to incentivize clients, and learn personalized or fair models.
    Robust Empirical Risk Minimization with Tolerance. (arXiv:2210.00635v2 [cs.LG] UPDATED)
    Developing simple, sample-efficient learning algorithms for robust classification is a pressing issue in today's tech-dominated world, and current theoretical techniques requiring exponential sample complexity and complicated improper learning rules fall far from answering the need. In this work we study the fundamental paradigm of (robust) $\textit{empirical risk minimization}$ (RERM), a simple process in which the learner outputs any hypothesis minimizing its training error. RERM famously fails to robustly learn VC classes (Montasser et al., 2019a), a bound we show extends even to `nice' settings such as (bounded) halfspaces. As such, we study a recent relaxation of the robust model called $\textit{tolerant}$ robust learning (Ashtiani et al., 2022) where the output classifier is compared to the best achievable error over slightly larger perturbation sets. We show that under geometric niceness conditions, a natural tolerant variant of RERM is indeed sufficient for $\gamma$-tolerant robust learning VC classes over $\mathbb{R}^d$, and requires only $\tilde{O}\left( \frac{VC(H)d\log \frac{D}{\gamma\delta}}{\epsilon^2}\right)$ samples for robustness regions of (maximum) diameter $D$.
    Improving Fair Training under Correlation Shifts. (arXiv:2302.02323v1 [cs.LG])
    Model fairness is an essential element for Trustworthy AI. While many techniques for model fairness have been proposed, most of them assume that the training and deployment data distributions are identical, which is often not true in practice. In particular, when the bias between labels and sensitive groups changes, the fairness of the trained model is directly influenced and can worsen. We make two contributions for solving this problem. First, we analytically show that existing in-processing fair algorithms have fundamental limits in accuracy and group fairness. We introduce the notion of correlation shifts, which can explicitly capture the change of the above bias. Second, we propose a novel pre-processing step that samples the input data to reduce correlation shifts and thus enables the in-processing approaches to overcome their limitations. We formulate an optimization problem for adjusting the data ratio among labels and sensitive groups to reflect the shifted correlation. A key benefit of our approach lies in decoupling the roles of pre- and in-processing approaches: correlation adjustment via pre-processing and unfairness mitigation on the processed data via in-processing. Experiments show that our framework effectively improves existing in-processing fair algorithms w.r.t. accuracy and fairness, both on synthetic and real datasets.
    A Characteristic Function for Shapley-Value-Based\\Attribution of Anomaly Scores. (arXiv:2004.04464v2 [cs.LG] UPDATED)
    In anomaly detection, the degree of irregularity is often summarized as a real-valued anomaly score. We address the problem of attributing such anomaly scores to input features for interpreting the results of anomaly detection. We particularly investigate the use of the Shapley value for attributing anomaly scores of semi-supervised detection methods. We propose a characteristic function specifically designed for attributing anomaly scores. The idea is to approximate the absence of some features by locally minimizing the anomaly score with regard to the to-be-absent features. We examine the applicability of the proposed characteristic function and other general approaches for interpreting anomaly scores on multiple datasets and multiple anomaly detection methods. The results indicate the potential utility of the attribution methods including the proposed one.
    Hierarchical Sliced Wasserstein Distance. (arXiv:2209.13570v5 [stat.ML] UPDATED)
    Sliced Wasserstein (SW) distance has been widely used in different application scenarios since it can be scaled to a large number of supports without suffering from the curse of dimensionality. The value of sliced Wasserstein distance is the average of transportation cost between one-dimensional representations (projections) of original measures that are obtained by Radon Transform (RT). Despite its efficiency in the number of supports, estimating the sliced Wasserstein requires a relatively large number of projections in high-dimensional settings. Therefore, for applications where the number of supports is relatively small compared with the dimension, e.g., several deep learning applications where the mini-batch approaches are utilized, the complexities from matrix multiplication of Radon Transform become the main computational bottleneck. To address this issue, we propose to derive projections by linearly and randomly combining a smaller number of projections which are named bottleneck projections. We explain the usage of these projections by introducing Hierarchical Radon Transform (HRT) which is constructed by applying Radon Transform variants recursively. We then formulate the approach into a new metric between measures, named Hierarchical Sliced Wasserstein (HSW) distance. By proving the injectivity of HRT, we derive the metricity of HSW. Moreover, we investigate the theoretical properties of HSW including its connection to SW variants and its computational and sample complexities. Finally, we compare the computational cost and generative quality of HSW with the conventional SW on the task of deep generative modeling using various benchmark datasets including CIFAR10, CelebA, and Tiny ImageNet.
    Linear-Time Gromov Wasserstein Distances using Low Rank Couplings and Costs. (arXiv:2106.01128v2 [cs.LG] UPDATED)
    The ability to align points across two related yet incomparable point clouds (e.g. living in different spaces) plays an important role in machine learning. The Gromov-Wasserstein (GW) framework provides an increasingly popular answer to such problems, by seeking a low-distortion, geometry-preserving assignment between these points. As a non-convex, quadratic generalization of optimal transport (OT), GW is NP-hard. While practitioners often resort to solving GW approximately as a nested sequence of entropy-regularized OT problems, the cubic complexity (in the number $n$ of samples) of that approach is a roadblock. We show in this work how a recent variant of the OT problem that restricts the set of admissible couplings to those having a low-rank factorization is remarkably well suited to the resolution of GW: when applied to GW, we show that this approach is not only able to compute a stationary point of the GW problem in time $O(n^2)$, but also uniquely positioned to benefit from the knowledge that the initial cost matrices are low-rank, to yield a linear time $O(n)$ GW approximation. Our approach yields similar results, yet orders of magnitude faster computation than the SoTA entropic GW approaches, on both simulated and real data.
    Deep Graph-Level Clustering Using Pseudo-Label-Guided Mutual Information Maximization Network. (arXiv:2302.02369v1 [cs.LG])
    In this work, we study the problem of partitioning a set of graphs into different groups such that the graphs in the same group are similar while the graphs in different groups are dissimilar. This problem was rarely studied previously, although there have been a lot of work on node clustering and graph classification. The problem is challenging because it is difficult to measure the similarity or distance between graphs. One feasible approach is using graph kernels to compute a similarity matrix for the graphs and then performing spectral clustering, but the effectiveness of existing graph kernels in measuring the similarity between graphs is very limited. To solve the problem, we propose a novel method called Deep Graph-Level Clustering (DGLC). DGLC utilizes a graph isomorphism network to learn graph-level representations by maximizing the mutual information between the representations of entire graphs and substructures, under the regularization of a clustering module that ensures discriminative representations via pseudo labels. DGLC achieves graph-level representation learning and graph-level clustering in an end-to-end manner. The experimental results on six benchmark datasets of graphs show that our DGLC has state-of-the-art performance in comparison to many baselines.
    Sparse GCA and Thresholded Gradient Descent. (arXiv:2107.00371v2 [stat.ML] UPDATED)
    Generalized correlation analysis (GCA) is concerned with uncovering linear relationships across multiple datasets. It generalizes canonical correlation analysis that is designed for two datasets. We study sparse GCA when there are potentially multiple generalized correlation tuples in data and the loading matrix has a small number of nonzero rows. It includes sparse CCA and sparse PCA of correlation matrices as special cases. We first formulate sparse GCA as generalized eigenvalue problems at both population and sample levels via a careful choice of normalization constraints. Based on a Lagrangian form of the sample optimization problem, we propose a thresholded gradient descent algorithm for estimating GCA loading vectors and matrices in high dimensions. We derive tight estimation error bounds for estimators generated by the algorithm with proper initialization. We also demonstrate the prowess of the algorithm on a number of synthetic datasets.
    CCSL: A Causal Structure Learning Method from Multiple Unknown Environments. (arXiv:2111.09666v2 [cs.LG] UPDATED)
    Most existing causal structure learning methods assume data collected from one environment and independent and identically distributed (i.i.d.). In some cases, data are collected from different subjects from multiple environments, which provides more information but might make the data non-identical or non-independent distribution. Some previous efforts try to learn causal structure from this type of data in two independent stages, i.e., first discovering i.i.d. groups from non-i.i.d. samples, then learning the causal structures from different groups. This straightforward solution ignores the intrinsic connections between the two stages, that is both the clustering stage and the learning stage should be guided by the same causal mechanism. Towards this end, we propose a unified Causal Cluster Structures Learning (named CCSL) method for causal discovery from non-i.i.d. data. This method simultaneously integrates the following two tasks: 1) clustering samples of the subjects with the same causal mechanism into different groups; 2) learning causal structures from the samples within the group. Specifically, for the former, we provide a Causality-related Chinese Restaurant Process to cluster samples based on the similarity of the causal structure; for the latter, we introduce a variational-inference-based approach to learn the causal structures. Theoretical results provide identification of the causal model and the clustering model under the linear non-Gaussian assumption. Experimental results on both simulated and real-world data further validate the correctness and effectiveness of the proposed method.
    Human alignment of neural network representations. (arXiv:2211.01201v3 [cs.CV] UPDATED)
    Today's computer vision models achieve human or near-human level performance across a wide variety of vision tasks. However, their architectures, data, and learning algorithms differ in numerous ways from those that give rise to human vision. In this paper, we investigate the factors that affect the alignment between the representations learned by neural networks and human mental representations inferred from behavioral responses. We find that model scale and architecture have essentially no effect on the alignment with human behavioral responses, whereas the training dataset and objective function both have a much larger impact. These findings are consistent across three datasets of human similarity judgments collected using two different tasks. Linear transformations of neural network representations learned from behavioral responses from one dataset substantially improve alignment with human similarity judgments on the other two datasets. In addition, we find that some human concepts such as food and animals are well-represented by neural networks whereas others such as royal or sports-related objects are not. Overall, although models trained on larger, more diverse datasets achieve better alignment with humans than models trained on ImageNet alone, our results indicate that scaling alone is unlikely to be sufficient to train neural networks with conceptual representations that match those used by humans.
    Bayesian Fixed-Budget Best-Arm Identification. (arXiv:2211.08572v2 [cs.LG] UPDATED)
    Fixed-budget best-arm identification (BAI) is a bandit problem where the agent maximizes the probability of identifying the optimal arm within a fixed budget of observations. In this work, we study this problem in the Bayesian setting. We propose a Bayesian elimination algorithm and derive an upper bound on its probability of misidentifying the optimal arm. The bound reflects the quality of the prior and is the first distribution-dependent bound in this setting. We prove it using a frequentist-like argument, where we carry the prior through, and then integrate out the bandit instance at the end. We also provide the first lower bound on the probability of misidentification in a $2$-armed Bayesian bandit and show that our upper bound (almost) matches the lower bound. Our experiments show that Bayesian elimination is superior to frequentist methods and competitive with the state-of-the-art Bayesian algorithms that have no guarantees in our setting.
    Equi-Tuning: Group Equivariant Fine-Tuning of Pretrained Models. (arXiv:2210.06475v2 [cs.LG] UPDATED)
    We introduce equi-tuning, a novel fine-tuning method that transforms (potentially non-equivariant) pretrained models into group equivariant models while incurring minimum $L_2$ loss between the feature representations of the pretrained and the equivariant models. Large pretrained models can be equi-tuned for different groups to satisfy the needs of various downstream tasks. Equi-tuned models benefit from both group equivariance as an inductive bias and semantic priors from pretrained models. We provide applications of equi-tuning on three different tasks: image classification, compositional generalization in language, and fairness in natural language generation (NLG). We also provide a novel group-theoretic definition for fairness in NLG. The effectiveness of this definition is shown by testing it against a standard empirical method of fairness in NLG. We provide experimental results for equi-tuning using a variety of pretrained models: Alexnet, Resnet, VGG, and Densenet for image classification; RNNs, GRUs, and LSTMs for compositional generalization; and GPT2 for fairness in NLG. We test these models on benchmark datasets across all considered tasks to show the generality and effectiveness of the proposed method.
    Multi-Domain Long-Tailed Learning by Augmenting Disentangled Representations. (arXiv:2210.14358v2 [cs.LG] UPDATED)
    There is an inescapable long-tailed class-imbalance issue in many real-world classification problems. Existing long-tailed classification methods focus on the single-domain setting, where all examples are drawn from the same distribution. However, real-world scenarios often involve multiple domains with distinct imbalanced class distributions. We study this multi-domain long-tailed learning problem and aim to produce a model that generalizes well across all classes and domains. Towards that goal, we introduce TALLY, which produces invariant predictors by balanced augmenting hidden representations over domains and classes. Built upon a proposed selective balanced sampling strategy, TALLY achieves this by mixing the semantic representation of one example with the domain-associated nuisances of another, producing a new representation for use as data augmentation. To improve the disentanglement of semantic representations, TALLY further utilizes a domain-invariant class prototype that averages out domain-specific effects. We evaluate TALLY on four long-tailed variants of classical domain generalization benchmarks and two real-world imbalanced multi-domain datasets. The results indicate that TALLY consistently outperforms other state-of-the-art methods in both subpopulation shift and domain shift.
    Global Context Vision Transformers. (arXiv:2206.09959v4 [cs.CV] UPDATED)
    We propose global context vision transformer (GC ViT), a novel architecture that enhances parameter and compute utilization for computer vision tasks. The core of the novel model are global context self-attention modules, joint with standard local self-attention, to effectively yet efficiently model both long and short-range spatial interactions, as an alternative to complex operations such as an attention masks or local windows shifting. While the local self-attention modules are responsible for modeling short-range information, the global query tokens are shared across all global self-attention modules to interact with local key and values. In addition, we address the lack of inductive bias in ViTs and improve the modeling of inter-channel dependencies by proposing a novel downsampler which leverages a parameter-efficient fused inverted residual block. The proposed GC ViT achieves new state-of-the-art performance across image classification, object detection and semantic segmentation tasks. On ImageNet-1K dataset for classification, GC ViT models with 51M, 90M and 201M parameters achieve 84.3%, 84.9% and 85.6% Top-1 accuracy, respectively, surpassing comparably-sized prior art such as CNN-based ConvNeXt and ViT-based Swin Transformer. Pre-trained GC ViT backbones in downstream tasks of object detection, instance segmentation, and semantic segmentation on MS COCO and ADE20K datasets outperform prior work consistently, sometimes by large margins.
    Gender Bias in Fake News: An Analysis. (arXiv:2209.11984v3 [cs.CY] UPDATED)
    Data science research into fake news has gathered much momentum in recent years, arguably facilitated by the emergence of large public benchmark datasets. While it has been well-established within media studies that gender bias is an issue that pervades news media, there has been very little exploration into the relationship between gender bias and fake news. In this work, we provide the first empirical analysis of gender bias vis-a-vis fake news, leveraging simple and transparent lexicon-based methods over public benchmark datasets. Our analysis establishes the increased prevalance of gender bias in fake news across three facets viz., abundance, affect and proximal words. The insights from our analysis provide a strong argument that gender bias needs to be an important consideration in research into fake news.
    A Permutation-free Kernel Two-Sample Test. (arXiv:2211.14908v2 [stat.ME] UPDATED)
    The kernel Maximum Mean Discrepancy~(MMD) is a popular multivariate distance metric between distributions that has found utility in two-sample testing. The usual kernel-MMD test statistic is a degenerate U-statistic under the null, and thus it has an intractable limiting distribution. Hence, to design a level-$\alpha$ test, one usually selects the rejection threshold as the $(1-\alpha)$-quantile of the permutation distribution. The resulting nonparametric test has finite-sample validity but suffers from large computational cost, since every permutation takes quadratic time. We propose the cross-MMD, a new quadratic-time MMD test statistic based on sample-splitting and studentization. We prove that under mild assumptions, the cross-MMD has a limiting standard Gaussian distribution under the null. Importantly, we also show that the resulting test is consistent against any fixed alternative, and when using the Gaussian kernel, it has minimax rate-optimal power against local alternatives. For large sample sizes, our new cross-MMD provides a significant speedup over the MMD, for only a slight loss in power.
    Deep Latent State Space Models for Time-Series Generation. (arXiv:2212.12749v3 [stat.ML] UPDATED)
    Methods based on ordinary differential equations (ODEs) are widely used to build generative models of time-series. In addition to high computational overhead due to explicitly computing hidden states recurrence, existing ODE-based models fall short in learning sequence data with sharp transitions - common in many real-world systems - due to numerical challenges during optimization. In this work, we propose LS4, a generative model for sequences with latent variables evolving according to a state space ODE to increase modeling capacity. Inspired by recent deep state space models (S4), we achieve speedups by leveraging a convolutional representation of LS4 which bypasses the explicit evaluation of hidden states. We show that LS4 significantly outperforms previous continuous-time generative models in terms of marginal distribution, classification, and prediction scores on real-world datasets in the Monash Forecasting Repository, and is capable of modeling highly stochastic data with sharp temporal transitions. LS4 sets state-of-the-art for continuous-time latent generative models, with significant improvement of mean squared error and tighter variational lower bounds on irregularly-sampled datasets, while also being x100 faster than other baselines on long sequences.
    Uncovering Adversarial Risks of Test-Time Adaptation. (arXiv:2301.12576v2 [cs.LG] UPDATED)
    Recently, test-time adaptation (TTA) has been proposed as a promising solution for addressing distribution shifts. It allows a base model to adapt to an unforeseen distribution during inference by leveraging the information from the batch of (unlabeled) test data. However, we uncover a novel security vulnerability of TTA based on the insight that predictions on benign samples can be impacted by malicious samples in the same batch. To exploit this vulnerability, we propose Distribution Invading Attack (DIA), which injects a small fraction of malicious data into the test batch. DIA causes models using TTA to misclassify benign and unperturbed test data, providing an entirely new capability for adversaries that is infeasible in canonical machine learning pipelines. Through comprehensive evaluations, we demonstrate the high effectiveness of our attack on multiple benchmarks across six TTA methods. In response, we investigate two countermeasures to robustify the existing insecure TTA implementations, following the principle of "security by design". Together, we hope our findings can make the community aware of the utility-security tradeoffs in deploying TTA and provide valuable insights for developing robust TTA approaches.
    Self-supervised Semi-implicit Graph Variational Auto-encoders with Masking. (arXiv:2301.12458v2 [cs.LG] UPDATED)
    Generative graph self-supervised learning (SSL) aims to learn node representations by reconstructing the input graph data. However, most existing methods focus on unsupervised learning tasks only and very few work has shown its superiority over the state-of-the-art graph contrastive learning (GCL) models, especially on the classification task. While a very recent model has been proposed to bridge the gap, its performance on unsupervised learning tasks is still unknown. In this paper, to comprehensively enhance the performance of generative graph SSL against other GCL models on both unsupervised and supervised learning tasks, we propose the SeeGera model, which is based on the family of self-supervised variational graph auto-encoder (VGAE). Specifically, SeeGera adopts the semi-implicit variational inference framework, a hierarchical variational framework, and mainly focuses on feature reconstruction and structure/feature masking. On the one hand, SeeGera co-embeds both nodes and features in the encoder and reconstructs both links and features in the decoder. Since feature embeddings contain rich semantic information on features, they can be combined with node embeddings to provide fine-grained knowledge for feature reconstruction. On the other hand, SeeGera adds an additional layer for structure/feature masking to the hierarchical variational framework, which boosts the model generalizability. We conduct extensive experiments comparing SeeGera with 9 other state-of-the-art competitors. Our results show that SeeGera can compare favorably against other state-of-the-art GCL methods in a variety of unsupervised and supervised learning tasks.
    Beyond NaN: Resiliency of Optimization Layers in The Face of Infeasibility. (arXiv:2202.06242v2 [cs.LG] UPDATED)
    Prior work has successfully incorporated optimization layers as the last layer in neural networks for various problems, thereby allowing joint learning and planning in one neural network forward pass. In this work, we identify a weakness in such a set-up where inputs to the optimization layer lead to undefined output of the neural network. Such undefined decision outputs can lead to possible catastrophic outcomes in critical real time applications. We show that an adversary can cause such failures by forcing rank deficiency on the matrix fed to the optimization layer which results in the optimization failing to produce a solution. We provide a defense for the failure cases by controlling the condition number of the input matrix. We study the problem in the settings of synthetic data, Jigsaw Sudoku, and in speed planning for autonomous driving, building on top of prior frameworks in end-to-end learning and optimization. We show that our proposed defense effectively prevents the framework from failing with undefined output. Finally, we surface a number of edge cases which lead to serious bugs in popular equation and optimization solvers which can be abused as well.
    FairMILE: A Multi-Level Framework for Fair and Scalable Graph Representation Learning. (arXiv:2211.09925v2 [cs.LG] UPDATED)
    Graph representation learning models have been deployed for making decisions in multiple high-stakes scenarios. It is therefore critical to ensure that these models are fair. Prior research has shown that graph neural networks can inherit and reinforce the bias present in graph data. Researchers have begun to examine ways to mitigate the bias in such models. However, existing efforts are restricted by their inefficiency, limited applicability, and the constraints they place on sensitive attributes. To address these issues, we present FairMILE a general framework for fair and scalable graph representation learning. FairMILE is a multi-level framework that allows contemporary unsupervised graph embedding methods to scale to large graphs in an agnostic manner. FairMILE learns both fair and high-quality node embeddings where the fairness constraints are incorporated in each phase of the framework. Our experiments across two distinct tasks demonstrate that FairMILE can learn node representations that often achieve superior fairness scores and high downstream performance while significantly outperforming all the baselines in terms of efficiency.
    Benchmarking optimality of time series classification methods in distinguishing diffusions. (arXiv:2301.13112v2 [stat.ML] UPDATED)
    Performance benchmarking is a crucial component of time series classification (TSC) algorithm design, and a fast-growing number of datasets have been established for empirical benchmarking. However, the empirical benchmarks are costly and do not guarantee statistical optimality. This study proposes to benchmark the optimality of TSC algorithms in distinguishing diffusion processes by the likelihood ratio test (LRT). The LRT is optimal in the sense of the Neyman-Pearson lemma: it has the smallest false positive rate among classifiers with a controlled level of false negative rate. The LRT requires the likelihood ratio of the time series to be computable. The diffusion processes from stochastic differential equations provide such time series and are flexible in design for generating linear or nonlinear time series. We demonstrate the benchmarking with three scalable state-of-the-art TSC algorithms: random forest, ResNet, and ROCKET. Test results show that they can achieve LRT optimality for univariate time series and multivariate Gaussian processes. However, these model-agnostic algorithms are suboptimal in classifying nonlinear multivariate time series from high-dimensional stochastic interacting particle systems. Additionally, the LRT benchmark provides tools to analyze the dependence of classification accuracy on the time length, dimension, temporal sampling frequency, and randomness of the time series. Thus, the LRT with diffusion processes can systematically and efficiently benchmark the optimality of TSC algorithms and may guide their future improvements.
    Harnessing Simulation for Molecular Embeddings. (arXiv:2302.02055v1 [cs.LG])
    While deep learning has unlocked advances in computational biology once thought to be decades away, extending deep learning techniques to the molecular domain has proven challenging, as labeled data is scarce and the benefit from self-supervised learning can be negligible in many cases. In this work, we explore a different approach. Inspired by methods in deep reinforcement learning and robotics, we explore harnessing physics-based molecular simulation to develop molecular embeddings. By fitting a Graph Neural Network to simulation data, molecules that display similar interactions with biological targets under simulation develop similar representations in the embedding space. These embeddings can then be used to initialize the feature space of down-stream models trained on real-world data to encode information learned during simulation into a molecular prediction task. Our experimental findings indicate this approach improves the performance of existing deep learning models on real-world molecular prediction tasks by as much as 38% with minimal modification to the downstream model and no hyperparameter tuning.
    Neural Optimal Transport. (arXiv:2201.12220v2 [cs.LG] UPDATED)
    We present a novel neural-networks-based algorithm to compute optimal transport maps and plans for strong and weak transport costs. To justify the usage of neural networks, we prove that they are universal approximators of transport plans between probability distributions. We evaluate the performance of our optimal transport algorithm on toy examples and on the unpaired image-to-image translation.
    Introspective Experience Replay: Look Back When Surprised. (arXiv:2206.03171v4 [cs.LG] UPDATED)
    In reinforcement learning (RL), experience replay-based sampling techniques play a crucial role in promoting convergence by eliminating spurious correlations. However, widely used methods such as uniform experience replay (UER) and prioritized experience replay (PER) have been shown to have sub-optimal convergence and high seed sensitivity respectively. To address these issues, we propose a novel approach called IntrospectiveExperience Replay (IER) that selectively samples batches of data points prior to surprising events. Our method builds upon the theoretically sound reverse experience replay (RER) technique, which has been shown to reduce bias in the output of Q-learning-type algorithms with linear function approximation. However, this approach is not always practical or reliable when using neural function approximation. Through empirical evaluations, we demonstrate that IER with neural function approximation yields reliable and superior performance compared toUER, PER, and hindsight experience replay (HER) across most tasks.
    Confidence-Ranked Reconstruction of Census Microdata from Published Statistics. (arXiv:2211.03128v2 [cs.CY] UPDATED)
    A reconstruction attack on a private dataset $D$ takes as input some publicly accessible information about the dataset and produces a list of candidate elements of $D$. We introduce a new class of data reconstruction attacks based on randomized methods for non-convex optimization. We empirically demonstrate that our attacks can not only reconstruct full rows of $D$ from aggregate query statistics $Q(D)\in \mathbb{R}^m$, but can do so in a way that reliably ranks reconstructed rows by their odds of appearing in the private data, providing a signature that could be used for prioritizing reconstructed rows for further actions such as identify theft or hate crime. We also design a sequence of baselines for evaluating reconstruction attacks. Our attacks significantly outperform those that are based only on access to a public distribution or population from which the private dataset $D$ was sampled, demonstrating that they are exploiting information in the aggregate statistics $Q(D)$, and not simply the overall structure of the distribution. In other words, the queries $Q(D)$ are permitting reconstruction of elements of this dataset, not the distribution from which $D$ was drawn. These findings are established both on 2010 U.S. decennial Census data and queries and Census-derived American Community Survey datasets. Taken together, our methods and experiments illustrate the risks in releasing numerically precise aggregate statistics of a large dataset, and provide further motivation for the careful application of provably private techniques such as differential privacy.
    Reliable Conditioning of Behavioral Cloning for Offline Reinforcement Learning. (arXiv:2210.05158v2 [cs.LG] UPDATED)
    Behavioral cloning (BC) provides a straightforward solution to offline RL by mimicking offline trajectories via supervised learning. Recent advances (Chen et al., 2021; Janner et al., 2021; Emmons et al., 2021) have shown that by conditioning on desired future returns, BC can perform competitively to their value-based counterparts, while enjoying much more simplicity and training stability. While promising, we show that these methods can be unreliable, as their performance may degrade significantly when conditioned on high, out-of-distribution (ood) returns. This is crucial in practice, as we often expect the policy to perform better than the offline dataset by conditioning on an ood value. We show that this unreliability arises from both the suboptimality of training data and model architectures. We propose ConserWeightive Behavioral Cloning (CWBC), a simple and effective method for improving the reliability of conditional BC with two key components: trajectory weighting and conservative regularization. Trajectory weighting upweights the high-return trajectories to reduce the train-test gap for BC methods, while conservative regularizer encourages the policy to stay close to the data distribution for ood conditioning. We study CWBC in the context of RvS (Emmons et al., 2021) and Decision Transformers (Chen et al., 2021), and show that CWBC significantly boosts their performance on various benchmarks.
    Re-parameterizing Your Optimizers rather than Architectures. (arXiv:2205.15242v3 [cs.LG] UPDATED)
    The well-designed structures in neural networks reflect the prior knowledge incorporated into the models. However, though different models have various priors, we are used to training them with model-agnostic optimizers such as SGD. In this paper, we propose to incorporate model-specific prior knowledge into optimizers by modifying the gradients according to a set of model-specific hyper-parameters. Such a methodology is referred to as Gradient Re-parameterization, and the optimizers are named RepOptimizers. For the extreme simplicity of model structure, we focus on a VGG-style plain model and showcase that such a simple model trained with a RepOptimizer, which is referred to as RepOpt-VGG, performs on par with or better than the recent well-designed models. From a practical perspective, RepOpt-VGG is a favorable base model because of its simple structure, high inference speed and training efficiency. Compared to Structural Re-parameterization, which adds priors into models via constructing extra training-time structures, RepOptimizers require no extra forward/backward computations and solve the problem of quantization. We hope to spark further research beyond the realms of model structure design. Code and models \url{https://github.com/DingXiaoH/RepOptimizers}.
    Flow-matching -- efficient coarse-graining of molecular dynamics without forces. (arXiv:2203.11167v4 [physics.comp-ph] UPDATED)
    Coarse-grained (CG) molecular simulations have become a standard tool to study molecular processes on time- and length-scales inaccessible to all-atom simulations. Parameterizing CG force fields to match all-atom simulations has mainly relied on force-matching or relative entropy minimization, which require many samples from costly simulations with all-atom or CG resolutions, respectively. Here we present flow-matching, a new training method for CG force fields that combines the advantages of both methods by leveraging normalizing flows, a generative deep learning method. Flow-matching first trains a normalizing flow to represent the CG probability density, which is equivalent to minimizing the relative entropy without requiring iterative CG simulations. Subsequently, the flow generates samples and forces according to the learned distribution in order to train the desired CG free energy model via force matching. Even without requiring forces from the all-atom simulations, flow-matching outperforms classical force-matching by an order of magnitude in terms of data efficiency, and produces CG models that can capture the folding and unfolding transitions of small proteins.
    Tensor Decomposition of Large-scale Clinical EEGs Reveals Interpretable Patterns of Brain Physiology. (arXiv:2211.13793v2 [eess.SP] UPDATED)
    Identifying abnormal patterns in electroencephalography (EEG) remains the cornerstone of diagnosing several neurological diseases. The current clinical EEG review process relies heavily on expert visual review, which is unscalable and error-prone. In an effort to augment the expert review process, there is a significant interest in mining population-level EEG patterns using unsupervised approaches. Current approaches rely either on two-dimensional decompositions (e.g., principal and independent component analyses) or deep representation learning (e.g., auto-encoders, self-supervision). However, most approaches do not leverage the natural multi-dimensional structure of EEGs and lack interpretability. In this study, we propose a tensor decomposition approach using the canonical polyadic decomposition to discover a parsimonious set of population-level EEG patterns, retaining the natural multi-dimensional structure of EEGs (time x space x frequency). We then validate their clinical value using a cohort of patients including varying stages of cognitive impairment. Our results show that the discovered patterns reflect physiologically meaningful features and accurately classify the stages of cognitive impairment (healthy vs mild cognitive impairment vs Alzheimer's dementia) with substantially fewer features compared to classical and deep learning-based baselines. We conclude that the decomposition of population-level EEG tensors recovers expert-interpretable EEG patterns that can aid in the study of smaller specialized clinical cohorts.
    Individual Privacy Accounting for Differentially Private Stochastic Gradient Descent. (arXiv:2206.02617v5 [cs.LG] UPDATED)
    Differentially private stochastic gradient descent (DP-SGD) is the workhorse algorithm for recent advances in private deep learning. It provides a single privacy guarantee to all datapoints in the dataset. We propose output-specific $(\varepsilon,\delta)$-DP to characterize privacy guarantees for individual examples when releasing models trained by DP-SGD. We also design an efficient algorithm to investigate individual privacy across a number of datasets. We find that most examples enjoy stronger privacy guarantees than the worst-case bound. We further discover that the training loss and the privacy parameter of an example are well-correlated. This implies groups that are underserved in terms of model utility simultaneously experience weaker privacy guarantees. For example, on CIFAR-10, the average $\varepsilon$ of the class with the lowest test accuracy is 44.2% higher than that of the class with the highest accuracy.
    Continuous Forecasting via Neural Eigen Decomposition. (arXiv:2202.00117v3 [cs.LG] UPDATED)
    Neural differential equations predict the derivative of a stochastic process. This allows irregular forecasting with arbitrary time-steps. However, the expressive temporal flexibility often comes with a high sensitivity to noise. In addition, current methods model measurements and control together, limiting generalization to different control policies. These properties severely limit applicability to medical treatment problems, which require reliable forecasting given high noise, limited data and changing treatment policies. We introduce the Neural Eigen-SDE algorithm (NESDE), which relies on piecewise linear dynamics modeling with spectral representation. NESDE provides control over the expressiveness level; decoupling of control from measurements; and closed-form continuous prediction in inference. NESDE is demonstrated to provide robust forecasting in both synthetic and real high-noise medical problems. Finally, we use the learned dynamics models to publish simulated medical gym environments.
    Projecting Non-Fungible Token (NFT) Collections: A Contextual Generative Approach. (arXiv:2210.15493v2 [q-fin.CP] UPDATED)
    Non-fungible tokens (NFTs) are digital assets stored on a blockchain representing real-world objects such as art or collectibles. An NFT collection comprises numerous tokens; each token can be transacted multiple times. It is a multibillion-dollar market where the number of collections has more than doubled in 2022. In this paper, we want to obtain a generative model that, given the early transactions history (first quarter Q1) of a newly minted collection, generates subsequent transactions (quarters Q2, Q3, Q4), where the generative model is trained using the transaction history of a few mature collections. The goal is to use the generated transactions to project the potential market value of this newly minted collection over the next few quarters. A technical challenge exists in that different collections have diverse characteristics, and the generative model should generate based on the appropriate "contexts" of the collection. Our method takes a two-step approach. First, it employs unsupervised learning on the early transactions to extract characteristics (which we call contexts) of NFT collections. Next, it generates future transactions of each token based on these contexts and the early transactions, projecting the target collection's potential market value. Comprehensive experiments demonstrate our contextual generative approach's NFT projection capabilities.
    Achieve the Minimum Width of Neural Networks for Universal Approximation. (arXiv:2209.11395v2 [cs.LG] UPDATED)
    The universal approximation property (UAP) of neural networks is fundamental for deep learning, and it is well known that wide neural networks are universal approximators of continuous functions within both the $L^p$ norm and the continuous/uniform norm. However, the exact minimum width, $w_{\min}$, for the UAP has not been studied thoroughly. Recently, using a decoder-memorizer-encoder scheme, \citet{Park2021Minimum} found that $w_{\min} = \max(d_x+1,d_y)$ for both the $L^p$-UAP of ReLU networks and the $C$-UAP of ReLU+STEP networks, where $d_x,d_y$ are the input and output dimensions, respectively. In this paper, we consider neural networks with an arbitrary set of activation functions. We prove that both $C$-UAP and $L^p$-UAP for functions on compact domains share a universal lower bound of the minimal width; that is, $w^*_{\min} = \max(d_x,d_y)$. In particular, the critical width, $w^*_{\min}$, for $L^p$-UAP can be achieved by leaky-ReLU networks, provided that the input or output dimension is larger than one. Our construction is based on the approximation power of neural ordinary differential equations and the ability to approximate flow maps by neural networks. The nonmonotone or discontinuous activation functions case and the one-dimensional case are also discussed.
    Offline Equilibrium Finding. (arXiv:2207.05285v2 [cs.AI] UPDATED)
    Offline reinforcement learning (offline RL) is an emerging field that has recently begun gaining attention across various application domains due to its ability to learn strategies from earlier collected datasets. Offline RL proved very successful, paving a path to solving previously intractable real-world problems, and we aim to generalize this paradigm to a multiplayer-game setting. To this end, we introduce a problem of offline equilibrium finding (OEF) and construct multiple types of datasets across a wide range of games using several established methods. To solve the OEF problem, we design a model-based framework that can directly apply any online equilibrium finding algorithm to the OEF setting while making minimal changes. The three most prominent contemporary online equilibrium finding algorithms are adapted to the context of OEF, creating three model-based variants: OEF-PSRO and OEF-CFR, which generalize the widely-used algorithms PSRO and Deep CFR to compute Nash equilibria (NEs), and OEF-JPSRO, which generalizes the JPSRO to calculate (Coarse) Correlated equilibria ((C)CEs). We also combine the behavior cloning policy with the model-based policy to further improve the performance and provide a theoretical guarantee of the solution quality. Extensive experimental results demonstrate the superiority of our approach over offline RL algorithms and the importance of using model-based methods for OEF problems. We hope our work will contribute to advancing research in large-scale equilibrium finding.
    Efficient Adaptive Activation Rounding for Post-Training Quantization. (arXiv:2208.11945v2 [cs.LG] UPDATED)
    Post-training quantization (PTQ) attracts increasing attention due to its convenience in deploying quantized neural networks. Rounding is the primary source of quantization error, for which previous works adopt the rounding-to-nearest scheme with a constant border of 0.5. This work demonstrates that optimizing rounding schemes can improve model accuracy. By replacing the constant border with a simple border function, we can obtain the minimal error for multiplying two numbers and eliminate the bias of its expected value, which further benefits model accuracy. Based on this insight, we approximate the border function to make the incurred overhead negligible. We also jointly optimize propagated errors and global errors. We finally propose our AQuant framework, which can learn the border function automatically. Extensive experiments show that AQuant achieves noticeable improvements compared with state-of-the-art works and pushes the accuracy of ResNet-18 up to 60.31% under the 2-bit weight and activation post-training quantization.
    Bidirectional Language Models Are Also Few-shot Learners. (arXiv:2209.14500v2 [cs.LG] UPDATED)
    Large language models such as GPT-3 (Brown et al., 2020) can perform arbitrary tasks without undergoing fine-tuning after being prompted with only a few labeled examples. An arbitrary task can be reformulated as a natural language prompt, and a language model can be asked to generate the completion, indirectly performing the task in a paradigm known as prompt-based learning. To date, emergent prompt-based learning capabilities have mainly been demonstrated for unidirectional language models. However, bidirectional language models pre-trained on denoising objectives such as masked language modeling produce stronger learned representations for transfer learning. This motivates the possibility of prompting bidirectional models, but their pre-training objectives have made them largely incompatible with the existing prompting paradigm. We present SAP (Sequential Autoregressive Prompting), a technique that enables the prompting of bidirectional models. Utilizing the machine translation task as a case study, we prompt the bidirectional mT5 model (Xue et al., 2021) with SAP and demonstrate its few-shot and zero-shot translations outperform the few-shot translations of unidirectional models like GPT-3 and XGLM (Lin et al., 2021), despite mT5's approximately 50% fewer parameters. We further show SAP is effective on question answering and summarization. For the first time, our results demonstrate prompt-based learning is an emergent property of a broader class of language models, rather than only unidirectional models.
    Can Stochastic Gradient Langevin Dynamics Provide Differential Privacy for Deep Learning?. (arXiv:2110.05057v5 [cs.LG] UPDATED)
    Bayesian learning via Stochastic Gradient Langevin Dynamics (SGLD) has been suggested for differentially private learning. While previous research provides differential privacy bounds for SGLD at the initial steps of the algorithm or when close to convergence, the question of what differential privacy guarantees can be made in between remains unanswered. This interim region is of great importance, especially for Bayesian neural networks, as it is hard to guarantee convergence to the posterior. This paper shows that using SGLD might result in unbounded privacy loss for this interim region, even when sampling from the posterior is as differentially private as desired.
    On Best-Arm Identification with a Fixed Budget in Non-Parametric Multi-Armed Bandits. (arXiv:2210.00895v2 [cs.LG] UPDATED)
    We lay the foundations of a non-parametric theory of best-arm identification in multi-armed bandits with a fixed budget T. We consider general, possibly non-parametric, models D for distributions over the arms; an overarching example is the model D = P(0,1) of all probability distributions over [0,1]. We propose upper bounds on the average log-probability of misidentifying the optimal arm based on information-theoretic quantities that correspond to infima over Kullback-Leibler divergences between some distributions in D and a given distribution. This is made possible by a refined analysis of the successive-rejects strategy of Audibert, Bubeck, and Munos (2010). We finally provide lower bounds on the same average log-probability, also in terms of the same new information-theoretic quantities; these lower bounds are larger when the (natural) assumptions on the considered strategies are stronger. All these new upper and lower bounds generalize existing bounds based, e.g., on gaps between distributions.
    Adaptive Perturbation-Based Gradient Estimation for Discrete Latent Variable Models. (arXiv:2209.04862v2 [cs.LG] UPDATED)
    The integration of discrete algorithmic components in deep learning architectures has numerous applications. Recently, Implicit Maximum Likelihood Estimation (IMLE, Niepert, Minervini, and Franceschi 2021), a class of gradient estimators for discrete exponential family distributions, was proposed by combining implicit differentiation through perturbation with the path-wise gradient estimator. However, due to the finite difference approximation of the gradients, it is especially sensitive to the choice of the finite difference step size, which needs to be specified by the user. In this work, we present Adaptive IMLE (AIMLE), the first adaptive gradient estimator for complex discrete distributions: it adaptively identifies the target distribution for IMLE by trading off the density of gradient information with the degree of bias in the gradient estimates. We empirically evaluate our estimator on synthetic examples, as well as on Learning to Explain, Discrete Variational Auto-Encoders, and Neural Relational Inference tasks. In our experiments, we show that our adaptive gradient estimator can produce faithful estimates while requiring orders of magnitude fewer samples than other gradient estimators.
    Run-Off Election: Improved Provable Defense against Data Poisoning Attacks. (arXiv:2302.02300v1 [cs.LG])
    In data poisoning attacks, an adversary tries to change a model's prediction by adding, modifying, or removing samples in the training data. Recently, ensemble-based approaches for obtaining provable defenses against data poisoning have been proposed where predictions are done by taking a majority vote across multiple base models. In this work, we show that merely considering the majority vote in ensemble defenses is wasteful as it does not effectively utilize available information in the logits layers of the base models. Instead, we propose Run-Off Election (ROE), a novel aggregation method based on a two-round election across the base models: In the first round, models vote for their preferred class and then a second, Run-Off election is held between the top two classes in the first round. Based on this approach, we propose DPA+ROE and FA+ROE defense methods based on Deep Partition Aggregation (DPA) and Finite Aggregation (FA) approaches from prior work. We show how to obtain robustness for these methods using ideas inspired by dynamic programming and duality. We evaluate our methods on MNIST, CIFAR-10, and GTSRB and obtain improvements in certified accuracy by up to 4.73%, 3.63%, and 3.54%, respectively, establishing a new state-of-the-art in (pointwise) certified robustness against data poisoning. In many cases, our approach outperforms the state-of-the-art, even when using 32 times less computational power.
    Transferable E(3) equivariant parameterization for Hamiltonian of molecules and solids. (arXiv:2210.16190v2 [physics.comp-ph] UPDATED)
    Using the message-passing mechanism in machine learning (ML) instead of self-consistent iterations to directly build the mapping from structures to electronic Hamiltonian matrices will greatly improve the efficiency of density functional theory (DFT) calculations. In this work, we proposed a general analytic Hamiltonian representation in an E(3) equivariant framework, which can fit the ab initio Hamiltonian of molecules and solids by a complete data-driven method and are equivariant under rotation, space inversion, and time reversal operations. Our model reached state-of-the-art precision in the benchmark test and accurately predicted the electronic Hamiltonian matrices and related properties of various periodic and aperiodic systems, showing high transferability and generalization ability. This framework provides a general transferable model that can be used to accelerate the electronic structure calculations on different large systems with the same network weights trained on small structures.
    Representation Learning in Continuous-Time Dynamic Signed Networks. (arXiv:2207.03408v3 [cs.SI] UPDATED)
    Signed networks allow us to model conflicting relationships and interactions, such as friend/enemy and support/oppose. These signed interactions happen in real-time. Modeling such dynamics of signed networks is crucial to understanding the evolution of polarization in the network and enabling effective prediction of the signed structure (i.e., link signs and signed weights) in the future. However, existing works have modeled either (static) signed networks or dynamic (unsigned) networks but not dynamic signed networks. Since both sign and dynamics inform the graph structure in different ways, it is non-trivial to model how to combine the two features. In this work, we propose a new Graph Neural Network (GNN)-based approach to model dynamic signed networks, named SEMBA: Signed link's Evolution using Memory modules and Balanced Aggregation. Here, the idea is to incorporate the signs of temporal interactions using separate modules guided by balance theory and to evolve the embeddings from a higher-order neighborhood. Experiments on 4 real-world datasets and 4 different tasks demonstrate that SEMBA consistently and significantly outperforms the baselines by up to $80\%$ on the tasks of predicting signs of future links while matching the state-of-the-art performance on predicting the existence of these links in the future. We find that this improvement is due specifically to the superior performance of SEMBA on the minority negative class.
    GRANDE: a neural model over directed multigraphs with application to anti-money laundering. (arXiv:2302.02101v1 [cs.LG])
    The application of graph representation learning techniques to the area of financial risk management (FRM) has attracted significant attention recently. However, directly modeling transaction networks using graph neural models remains challenging: Firstly, transaction networks are directed multigraphs by nature, which could not be properly handled with most of the current off-the-shelf graph neural networks (GNN). Secondly, a crucial problem in FRM scenarios like anti-money laundering (AML) is to identify risky transactions and is most naturally cast into an edge classification problem with rich edge-level features, which are not fully exploited by the prevailing GNN design that follows node-centric message passing protocols. In this paper, we present a systematic investigation of design aspects of neural models over directed multigraphs and develop a novel GNN protocol that overcomes the above challenges via efficiently incorporating directional information, as well as proposing an enhancement that targets edge-related tasks using a novel message passing scheme over an extension of edge-to-node dual graph. A concrete GNN architecture called GRANDE is derived using the proposed protocol, with several further improvements and generalizations to temporal dynamic graphs. We apply the GRANDE model to both a real-world anti-money laundering task and public datasets. Experimental evaluations show the superiority of the proposed GRANDE architecture over recent state-of-the-art models on dynamic graph modeling and directed graph modeling.
    ReDi: Efficient Learning-Free Diffusion Inference via Trajectory Retrieval. (arXiv:2302.02285v1 [cs.CV])
    Diffusion models show promising generation capability for a variety of data. Despite their high generation quality, the inference for diffusion models is still time-consuming due to the numerous sampling iterations required. To accelerate the inference, we propose ReDi, a simple yet learning-free Retrieval-based Diffusion sampling framework. From a precomputed knowledge base, ReDi retrieves a trajectory similar to the partially generated trajectory at an early stage of generation, skips a large portion of intermediate steps, and continues sampling from a later step in the retrieved trajectory. We theoretically prove that the generation performance of ReDi is guaranteed. Our experiments demonstrate that ReDi improves the model inference efficiency by 2x speedup. Furthermore, ReDi is able to generalize well in zero-shot cross-domain image generation such as image stylization.
    Polynomial-time sparse measure recovery. (arXiv:2204.07879v3 [cs.LG] UPDATED)
    Many problems in computer science reduce to the recovery of an $n$-sparse measure from its (generalized) moments. Sparse measure recovery has been the research focus in super-resolution, tensor decomposition, and learning neural networks. The existing methods use either convex relaxations or overparameterization for recovery. Here, we propose recovery with non-convex optimization without overparameterization. Our algorithm is a (sub)gradient descent method optimizing a non-convex energy function studied in physics. We establish the global convergence of gradient descent on the energy function. This result enables us to solve super-resolution in $O(n^2)$ time, which significantly improves upon $O(n^3)$ time for solving convex relaxations. For a particular neural network, we prove the global convergence of subgradient descent on the population loss without overparameterization. The studied network has zero-one activations, and inputs drawn from the unit sphere.
    MTEB: Massive Text Embedding Benchmark. (arXiv:2210.07316v2 [cs.CL] UPDATED)
    Text embeddings are commonly evaluated on a small set of datasets from a single task not covering their possible applications to other tasks. It is unclear whether state-of-the-art embeddings on semantic textual similarity (STS) can be equally well applied to other tasks like clustering or reranking. This makes progress in the field difficult to track, as various models are constantly being proposed without proper evaluation. To solve this problem, we introduce the Massive Text Embedding Benchmark (MTEB). MTEB spans 8 embedding tasks covering a total of 58 datasets and 112 languages. Through the benchmarking of 33 models on MTEB, we establish the most comprehensive benchmark of text embeddings to date. We find that no particular text embedding method dominates across all tasks. This suggests that the field has yet to converge on a universal text embedding method and scale it up sufficiently to provide state-of-the-art results on all embedding tasks. MTEB comes with open-source code and a public leaderboard at https://github.com/embeddings-benchmark/mteb.
    NeuRI: Diversifying DNN Generation via Inductive Rule Inference. (arXiv:2302.02261v1 [cs.SE])
    Deep Learning (DL) is prevalently used in various industries to improve decision-making and automate processes, driven by the ever-evolving DL libraries and compilers. The correctness of DL systems is crucial for trust in DL applications. As such, the recent wave of research has been studying the automated synthesis of test-cases (i.e., DNN models and their inputs) for fuzzing DL systems. However, existing model generators only subsume a limited number of operators, for lacking the ability to pervasively model operator constraints. To address this challenge, we propose NeuRI, a fully automated approach for generating valid and diverse DL models composed of hundreds of types of operators. NeuRI adopts a three-step process: (i) collecting valid and invalid API traces from various sources; (ii) applying inductive program synthesis over the traces to infer the constraints for constructing valid models; and (iii) performing hybrid model generation by incorporating both symbolic and concrete operators concolically. Our evaluation shows that NeuRI improves branch coverage of TensorFlow and PyTorch by 51% and 15% over the state-of-the-art. Within four months, NeuRI finds 87 new bugs for PyTorch and TensorFlow, with 64 already fixed or confirmed, and 8 high-priority bugs labeled by PyTorch, constituting 10% of all high-priority bugs of the period. Additionally, open-source developers regard error-inducing models reported by us as "high-quality" and "common in practice".
    Recurrence With Correlation Network for Medical Image Registration. (arXiv:2302.02283v1 [cs.CV])
    We present Recurrence with Correlation Network (RWCNet), a medical image registration network with multi-scale features and a cost volume layer. We demonstrate that these architectural features improve medical image registration accuracy in two image registration datasets prepared for the MICCAI 2022 Learn2Reg Workshop Challenge. On the large-displacement National Lung Screening Test (NLST) dataset, RWCNet is able to achieve a total registration error (TRE) of 2.11mm between corresponding keypoints without instance fine-tuning. On the OASIS brain MRI dataset, RWCNet is able to achieve an average dice overlap of 81.7% for 35 different anatomical labels. It outperforms another multi-scale network, the Laplacian Image Registration Network (LapIRN), on both datasets. Ablation experiments are performed to highlight the contribution of the various architectural features. While multi-scale features improved validation accuracy for both datasets, the cost volume layer and number of recurrent steps only improved performance on the large-displacement NLST dataset. This result suggests that cost volume layer and iterative refinement using RNN provide good support for optimization and generalization in large-displacement medical image registration. The code for RWCNet is available at https://github.com/vigsivan/optimization-based-registration.
    Clustering with Neural Network and Index. (arXiv:2212.03853v2 [cs.LG] UPDATED)
    A new model called Clustering with Neural Network and Index (CNNI) is introduced. CNNI uses a Neural Network to cluster data points. Training of the Neural Network mimics supervised learning, with an internal clustering evaluation index acting as the loss function. An experiment is conducted to test the feasibility of the new model, and compared with results of other clustering models like K-means and Gaussian Mixture Model (GMM). The result shows CNNI can work properly for clustering data; CNNI equipped with MMJ-SC, achieves the first parametric (inductive) clustering model that can deal with non-convex shaped (non-flat geometry) data.
    Decentralized Differentially Private Without-Replacement Stochastic Gradient Descent. (arXiv:1809.02727v4 [cs.LG] UPDATED)
    While machine learning has achieved remarkable results in a wide variety of domains, the training of models often requires large datasets that may need to be collected from different individuals. As sensitive information may be contained in the individual's dataset, sharing training data may lead to severe privacy concerns. Therefore, there is a compelling need to develop privacy-aware machine learning methods, for which one effective approach is to leverage the generic framework of differential privacy. Considering that stochastic gradient descent (SGD) is one of the most commonly adopted methods for large-scale machine learning problems, a decentralized differentially private SGD algorithm is proposed in this work. Particularly, we focus on SGD without replacement due to its favorable structure for practical implementation. Both privacy and convergence analysis are provided for the proposed algorithm. Finally, extensive experiments are performed to demonstrate the effectiveness of the proposed method.
    Input Invex Neural Network. (arXiv:2106.08748v3 [cs.LG] UPDATED)
    Connected decision boundaries are useful in several tasks like image segmentation, clustering, alpha-shape or defining a region in nD-space. However, the machine learning literature lacks methods for generating connected decision boundaries using neural networks. Thresholding an invex function, a generalization of a convex function, generates such decision boundaries. This paper presents two methods for constructing invex functions using neural networks. The first approach is based on constraining a neural network with Gradient Clipped-Gradient Penality (GCGP), where we clip and penalise the gradients. In contrast, the second one is based on the relationship of the invex function to the composition of invertible and convex functions. We employ connectedness as a basic interpretation method and create connected region-based classifiers. We show that multiple connected set based classifiers can approximate any classification function. In the experiments section, we use our methods for classification tasks using an ensemble of 1-vs-all models as well as using a single multiclass model on larger-scale datasets. The experiments show that connected set-based classifiers do not pose any disadvantage over ordinary neural network classifiers, but rather, enhance their interpretability. We also did an extensive study on the properties of invex function and connected sets for interpretability and network morphism with experiments on simulated and real-world data sets. Our study suggests that invex function is fundamental to understanding and applying locality and connectedness of input space which is useful for various downstream tasks.
    Direct Advantage Estimation. (arXiv:2109.06093v3 [cs.LG] UPDATED)
    The predominant approach in reinforcement learning is to assign credit to actions based on the expected return. However, we show that the return may depend on the policy in a way which could lead to excessive variance in value estimation and slow down learning. Instead, we show that the advantage function can be interpreted as causal effects and shares similar properties with causal representations. Based on this insight, we propose Direct Advantage Estimation (DAE), a novel method that can model the advantage function and estimate it directly from on-policy data while simultaneously minimizing the variance of the return without requiring the (action-)value function. We also relate our method to Temporal Difference methods by showing how value functions can be seamlessly integrated into DAE. The proposed method is easy to implement and can be readily adapted by modern actor-critic methods. We evaluate DAE empirically on three discrete control domains and show that it can outperform generalized advantage estimation (GAE), a strong baseline for advantage estimation, on a majority of the environments when applied to policy optimization.
    Translate First Reorder Later: Leveraging Monotonicity in Semantic Parsing. (arXiv:2210.04878v2 [cs.CL] UPDATED)
    Prior work in semantic parsing has shown that conventional seq2seq models fail at compositional generalization tasks. This limitation led to a resurgence of methods that model alignments between sentences and their corresponding meaning representations, either implicitly through latent variables or explicitly by taking advantage of alignment annotations. We take the second direction and propose TPOL, a two-step approach that first translates input sentences monotonically and then reorders them to obtain the correct output. This is achieved with a modular framework comprising a Translator and a Reorderer component. We test our approach on two popular semantic parsing datasets. Our experiments show that by means of the monotonic translations, TPOL can learn reliable lexico-logical patterns from aligned data, significantly improving compositional generalization both over conventional seq2seq models, as well as over other approaches that exploit gold alignments.
    Synthesising Realistic Calcium Traces of Neuronal Populations Using GAN. (arXiv:2009.02707v3 [q-bio.NC] UPDATED)
    Calcium imaging has become a powerful and popular technique to monitor the activity of large populations of neurons in vivo. However, for ethical considerations and despite recent technical developments, recordings are still constrained to a limited number of trials and animals. This limits the amount of data available from individual experiments and hinders the development of analysis techniques and models for more realistic sizes of neuronal populations. The ability to artificially synthesize realistic neuronal calcium signals could greatly alleviate this problem by scaling up the number of trials. Here, we propose a Generative Adversarial Network (GAN) model to generate realistic calcium signals as seen in neuronal somata with calcium imaging. To this end, we propose CalciumGAN, a model based on the WaveGAN architecture and train it on calcium fluorescent signals with the Wasserstein distance. We test the model on artificial data with known ground-truth and show that the distribution of the generated signals closely resembles the underlying data distribution. Then, we train the model on real calcium traces recorded from the primary visual cortex of behaving mice and confirm that the deconvolved spike trains match the statistics of the recorded data. Together, these results demonstrate that our model can successfully generate realistic calcium traces, thereby providing the means to augment existing datasets of neuronal activity for enhanced data exploration and modelling.
    Boosting Exploration in Multi-Task Reinforcement Learning using Adversarial Networks. (arXiv:2201.11783v3 [cs.LG] UPDATED)
    Advancements in reinforcement learning (RL) have been remarkable in recent years. However, the limitations of traditional training methods have become increasingly evident, particularly in meta-RL settings where agents face new, unseen tasks. Conventional training approaches are susceptible to failure in such situations as they need more robustness to adversity. Our proposed adversarial training regime for Multi-Task Reinforcement Learning (MT-RL) addresses the limitations of conventional training methods in RL, especially in meta-RL environments where the agent faces new tasks. The adversarial component challenges the agent, forcing it to improve its decision-making abilities in dynamic and unpredictable situations. This component operates without relying on manual intervention or domain-specific knowledge, making it a highly versatile solution. Experiments conducted in multiple MT-RL environments demonstrate that adversarial training leads to better exploration and a deeper understanding of the environment. The adversarial training regime for MT-RL presents a new perspective on training and development for RL agents and is a valuable contribution to the field.
    Transformation-Based Models of Video Sequences. (arXiv:1701.08435v3 [cs.LG] UPDATED)
    In this work we propose a simple unsupervised approach for next frame prediction in video. Instead of directly predicting the pixels in a frame given past frames, we predict the transformations needed for generating the next frame in a sequence, given the transformations of the past frames. This leads to sharper results, while using a smaller prediction model. In order to enable a fair comparison between different video frame prediction models, we also propose a new evaluation protocol. We use generated frames as input to a classifier trained with ground truth sequences. This criterion guarantees that models scoring high are those producing sequences which preserve discriminative features, as opposed to merely penalizing any deviation, plausible or not, from the ground truth. Our proposed approach compares favourably against more sophisticated ones on the UCF-101 data set, while also being more efficient in terms of the number of parameters and computational cost.
    Multiscale Graph Comparison via the Embedded Laplacian Discrepancy. (arXiv:2201.12064v2 [stat.ML] UPDATED)
    Laplacian eigenvectors capture natural community structures on graphs and are widely used in spectral clustering and manifold learning. The use of Laplacian eigenvectors as embeddings for the purpose of multiscale graph comparison has however been limited. Here we propose the Embedded Laplacian Discrepancy (ELD) as a simple and fast approach to compare graphs (of potentially different sizes) based on the similarity of the graphs' community structures. The ELD operates by representing graphs as point clouds in a common, low-dimensional space, on which a natural Wasserstein-based distance can be efficiently computed. A main challenge in comparing graphs through any eigenvector-based approaches is the potential ambiguity that could arise due to sign-flips and basis symmetries. The ELD leverages a simple symmetrization trick to bypass any sign ambiguities. For comparing graphs that do not have any ambiguities due to basis symmetries (i.e. the spectrums are simple), we show that the ELD becomes a natural pseudo-metric that enjoys nice properties such as invariance under graph isomorphism. For comparing graphs with non-simple spectrums, we propose a procedure to approximate the ELD via a simple perturbation technique to resolve any ambiguity from basis symmetries. We show that such perturbations are stable using matrix perturbation theory under mild assumptions that are straightforward to verify in practice. We demonstrate the excellent applicability of the ELD approach on both simulated and real datasets.
    Strong Consistency and Rate of Convergence of Switched Least Squares System Identification for Autonomous Markov Jump Linear Systems. (arXiv:2112.10753v2 [cs.LG] UPDATED)
    In this paper, we investigate the problem of system identification for autonomous Markov jump linear systems (MJS) with complete state observations. We propose switched least squares method for identification of MJS, show that this method is strongly consistent, and derive data-dependent and data-independent rates of convergence. In particular, our data-independent rate of convergence shows that, almost surely, the system identification error is $\mathcal{O}\big(\sqrt{\log(T)/T} \big)$ where $T$ is the time horizon. These results show that switched least squares method for MJS has the same rate of convergence as least squares method for autonomous linear systems. We derive our results by imposing a general stability assumption on the model called stability in the average sense. We show that stability in the average sense is a weaker form of stability compared to the stability assumptions commonly imposed in the literature. We present numerical examples to illustrate the performance of the proposed method.
    IGRF-RFE: A Hybrid Feature Selection Method for MLP-based Network Intrusion Detection on UNSW-NB15 Dataset. (arXiv:2203.16365v2 [cs.LG] UPDATED)
    The effectiveness of machine learning models is significantly affected by the size of the dataset and the quality of features as redundant and irrelevant features can radically degrade the performance. This paper proposes IGRF-RFE: a hybrid feature selection method tasked for multi-class network anomalies using a Multilayer perceptron (MLP) network. IGRF-RFE can be considered as a feature reduction technique based on both the filter feature selection method and the wrapper feature selection method. In our proposed method, we use the filter feature selection method, which is the combination of Information Gain and Random Forest Importance, to reduce the feature subset search space. Then, we apply recursive feature elimination(RFE) as a wrapper feature selection method to further eliminate redundant features recursively on the reduced feature subsets. Our experimental results obtained based on the UNSW-NB15 dataset confirm that our proposed method can improve the accuracy of anomaly detection while reducing the feature dimension. The results show that the feature dimension is reduced from 42 to 23 while the multi-classification accuracy of MLP is improved from 82.25% to 84.24%.
    Open Problems and Modern Solutions for Deep Reinforcement Learning. (arXiv:2302.02298v1 [cs.LG])
    Deep Reinforcement Learning (DRL) has achieved great success in solving complicated decision-making problems. Despite the successes, DRL is frequently criticized for many reasons, e.g., data inefficient, inflexible and intractable reward design. In this paper, we review two publications that investigate the mentioned issues of DRL and propose effective solutions. One designs the reward for human-robot collaboration by combining the manually designed extrinsic reward with a parameterized intrinsic reward function via the deterministic policy gradient, which improves the task performance and guarantees a stronger obstacle avoidance. The other one applies selective attention and particle filters to rapidly and flexibly attend to and select crucial pre-learned features for DRL using approximate inference instead of backpropagation, thereby improving the efficiency and flexibility of DRL. Potential avenues for future work in both domains are discussed in this paper.
    Approximate Newton policy gradient algorithms. (arXiv:2110.02398v4 [cs.LG] UPDATED)
    Policy gradient algorithms have been widely applied to Markov decision processes and reinforcement learning problems in recent years. Regularization with various entropy functions is often used to encourage exploration and improve stability. This paper proposes an approximate Newton method for the policy gradient algorithm with entropy regularization. In the case of Shannon entropy, the resulting algorithm reproduces the natural policy gradient algorithm. For other entropy functions, this method results in brand-new policy gradient algorithms. We prove that all these algorithms enjoy Newton-type quadratic convergence and that the corresponding gradient flow converges globally to the optimal solution. Using synthetic and industrial-scale examples, we demonstrate that the proposed approximate Newton method typically converges in single-digit iterations, often orders of magnitude faster than other state-of-the-art algorithms.
    A Theoretical Framework for AI Models Explainability. (arXiv:2212.14447v3 [cs.AI] UPDATED)
    EXplainable Artificial Intelligence (XAI) is a vibrant research topic in the artificial intelligence community, with growing interest across methods and domains. Much has been written about the subject, yet XAI still lacks shared terminology and a framework capable of providing structural soundness to explanations. In our work, we address these issues by proposing a novel definition of explanation that is a synthesis of what can be found in the literature. We recognize that explanations are not atomic but the combination of evidence stemming from the model and its input-output mapping, and the human interpretation of this evidence. Furthermore, we fit explanations into the properties of faithfulness (i.e., the explanation being a true description of the model's inner workings and decision-making process) and plausibility (i.e., how much the explanation looks convincing to the user). Using our proposed theoretical framework simplifies how these properties are operationalized and it provides new insight into common explanation methods that we analyze as case studies.
    Joint Reasoning on Hybrid-knowledge sources for Task-Oriented Dialog. (arXiv:2210.07295v2 [cs.CL] UPDATED)
    Traditional systems designed for task oriented dialog utilize knowledge present only in structured knowledge sources to generate responses. However, relevant information required to generate responses may also reside in unstructured sources, such as documents. Recent state of the art models such as HyKnow and SeKnow aimed at overcoming these challenges make limiting assumptions about the knowledge sources. For instance, these systems assume that certain types of information, such as a phone number, is always present in a structured knowledge base (KB) while information about aspects such as entrance ticket prices, would always be available in documents. In this paper, we create a modified version of the MutliWOZ-based dataset prepared by SeKnow to demonstrate how current methods have significant degradation in performance when strict assumptions about the source of information are removed. Then, in line with recent work exploiting pre-trained language models, we fine-tune a BART based model using prompts for the tasks of querying knowledge sources, as well as, for response generation, without making assumptions about the information present in each knowledge source. Through a series of experiments, we demonstrate that our model is robust to perturbations to knowledge modality (source of information), and that it can fuse information from structured as well as unstructured knowledge to generate responses.
    Adversarial Bandits with Knapsacks. (arXiv:1811.11881v10 [cs.DS] UPDATED)
    We consider Bandits with Knapsacks (henceforth, BwK), a general model for multi-armed bandits under supply/budget constraints. In particular, a bandit algorithm needs to solve a well-known knapsack problem: find an optimal packing of items into a limited-size knapsack. The BwK problem is a common generalization of numerous motivating examples, which range from dynamic pricing to repeated auctions to dynamic ad allocation to network routing and scheduling. While the prior work on BwK focused on the stochastic version, we pioneer the other extreme in which the outcomes can be chosen adversarially. This is a considerably harder problem, compared to both the stochastic version and the "classic" adversarial bandits, in that regret minimization is no longer feasible. Instead, the objective is to minimize the competitive ratio: the ratio of the benchmark reward to the algorithm's reward. We design an algorithm with competitive ratio O(log T) relative to the best fixed distribution over actions, where T is the time horizon; we also prove a matching lower bound. The key conceptual contribution is a new perspective on the stochastic version of the problem. We suggest a new algorithm for the stochastic version, which builds on the framework of regret minimization in repeated games and admits a substantially simpler analysis compared to prior work. We then analyze this algorithm for the adversarial version and use it as a subroutine to solve the latter.
    DeepPSL: End-to-end perception and reasoning. (arXiv:2109.13662v4 [eess.SY] UPDATED)
    We introduce DeepPSL a variant of probabilistic soft logic (PSL) to produce an end-to-end trainable system that integrates reasoning and perception. PSL represents first-order logic in terms of a convex graphical model -- hinge-loss Markov random fields (HL-MRFs). PSL stands out among probabilistic logic frameworks due to its tractability having been applied to systems of more than 1 billion ground rules. The key to our approach is to represent predicates in first-order logic using deep neural networks and then to approximately back-propagate through the HL-MRF and thus train every aspect of the first-order system being represented. We believe that this approach represents an interesting direction for the integration of deep learning and reasoning techniques with applications to knowledge base learning, multi-task learning, and explainability. Evaluation on three different tasks demonstrates that DeepPSL significantly outperforms state-of-the-art neuro-symbolic methods on scalability while achieving comparable or better accuracy.
    Random Forest Weighted Local Fr\'echet Regression. (arXiv:2202.04912v2 [stat.ML] UPDATED)
    Statistical analysis is increasingly confronted with complex data from metric spaces. Petersen and M\"uller (2019) established a general paradigm of Fr\'echet regression with complex metric space valued responses and Euclidean predictors. However, the local approach therein involves nonparametric kernel smoothing and suffers from the curse of dimensionality. To address this issue, we in this paper propose a novel random forest weighted local Fr\'echet regression paradigm. The main mechanism of our approach relies on a locally adaptive kernel generated by random forests. Our first method utilizes these weights as the local average to solve the conditional Fr\'echet mean, while the second method performs local linear Fr\'echet regression, both significantly improving existing Fr\'echet regression methods. Based on the theory of infinite order U-processes and infinite order Mmn -estimator, we establish the consistency, rate of convergence, and asymptotic normality for our local constant estimator, which covers the current large sample theory of random forests with Euclidean responses as a special case. Numerical studies show the superiority of our methods with several commonly encountered types of responses such as distribution functions, symmetric positive-definite matrices, and sphere data. The practical merits of our proposals are also demonstrated through the application to human mortality distribution data.
    PIXEL: Physics-Informed Cell Representations for Fast and Accurate PDE Solvers. (arXiv:2207.12800v2 [cs.LG] UPDATED)
    With the increases in computational power and advances in machine learning, data-driven learning-based methods have gained significant attention in solving PDEs. Physics-informed neural networks (PINNs) have recently emerged and succeeded in various forward and inverse PDE problems thanks to their excellent properties, such as flexibility, mesh-free solutions, and unsupervised training. However, their slower convergence speed and relatively inaccurate solutions often limit their broader applicability in many science and engineering domains. This paper proposes a new kind of data-driven PDEs solver, physics-informed cell representations (PIXEL), elegantly combining classical numerical methods and learning-based approaches. We adopt a grid structure from the numerical methods to improve accuracy and convergence speed and overcome the spectral bias presented in PINNs. Moreover, the proposed method enjoys the same benefits in PINNs, e.g., using the same optimization frameworks to solve both forward and inverse PDE problems and readily enforcing PDE constraints with modern automatic differentiation techniques. We provide experimental results on various challenging PDEs that the original PINNs have struggled with and show that PIXEL achieves fast convergence speed and high accuracy. Project page: https://namgyukang.github.io/PIXEL/
    PGNAA Spectral Classification of Metal with Density Estimations. (arXiv:2208.13836v2 [cs.LG] UPDATED)
    For environmental, sustainable economic and political reasons, recycling processes are becoming increasingly important, aiming at a much higher use of secondary raw materials. Currently, for the copper and aluminium industries, no method for the non-destructive online analysis of heterogeneous materials are available. The Prompt Gamma Neutron Activation Analysis (PGNAA) has the potential to overcome this challenge. A difficulty when using PGNAA for online classification arises from the small amount of noisy data, due to short-term measurements. In this case, classical evaluation methods using detailed peak by peak analysis fail. Therefore, we propose to view spectral data as probability distributions. Then, we can classify material using maximum log-likelihood with respect to kernel density estimation and use discrete sampling to optimize hyperparameters. For measurements of pure aluminium alloys we achieve near perfect classification of aluminium alloys under 0.25 second.
    Performance and utility trade-off in interpretable sleep staging. (arXiv:2211.03282v3 [eess.SP] UPDATED)
    Recent advances in deep learning have led to the development of models approaching the human level of accuracy. However, healthcare remains an area lacking in widespread adoption. The safety-critical nature of healthcare results in a natural reticence to put these black-box deep learning models into practice. This paper explores interpretable methods for a clinical decision support system called sleep staging, an essential step in diagnosing sleep disorders. Clinical sleep staging is an arduous process requiring manual annotation for each 30s of sleep using physiological signals such as electroencephalogram (EEG). Recent work has shown that sleep staging using simple models and an exhaustive set of features can perform nearly as well as deep learning approaches but only for some specific datasets. Moreover, the utility of those features from a clinical standpoint is ambiguous. On the other hand, the proposed framework, NormIntSleep demonstrates exceptional performance across different datasets by representing deep learning embeddings using normalized features. NormIntSleep performs 4.5% better than the exhaustive feature-based approach and 1.5% better than other representation learning approaches. An empirical comparison between the utility of the interpretations of these models highlights the improved alignment with clinical expectations when performance is traded-off slightly. NormIntSleep paired with a clinically meaningful set of features can best balance this trade-off by providing reliable, clinically relevant interpretation with robust performance.
    Learning Variational Models with Unrolling and Bilevel Optimization. (arXiv:2209.12651v3 [stat.ML] UPDATED)
    In this paper we consider the problem learning of variational models in the context of supervised learning via risk minimization. Our goal is to provide a deeper understanding of the two approaches of learning of variational models via bilevel optimization and via algorithm unrolling. The former considers the variational model as a lower level optimization problem below the risk minimization problem, while the latter replaces the lower level optimization problem by an algorithm that solves said problem approximately. Both approaches are used in practice, but, unrolling is much simpler from a computational point of view. To analyze and compare the two approaches, we consider a simple toy model, and compute all risks and the respective estimators explicitly. We show that unrolling can be better than the bilevel optimization approach, but also that the performance of unrolling can depend significantly on further parameters, sometimes in unexpected ways: While the stepsize of the unrolled algorithm matters a lot, the number of unrolled iterations only matters if the number is even or odd, and these two cases are notably different.
    Fruit Ripeness Classification: a Survey. (arXiv:2212.14441v2 [cs.CV] UPDATED)
    Fruit is a key crop in worldwide agriculture feeding millions of people. The standard supply chain of fruit products involves quality checks to guarantee freshness, taste, and, most of all, safety. An important factor that determines fruit quality is its stage of ripening. This is usually manually classified by experts in the field, which makes it a labor-intensive and error-prone process. Thus, there is an arising need for automation in the process of fruit ripeness classification. Many automatic methods have been proposed that employ a variety of feature descriptors for the food item to be graded. Machine learning and deep learning techniques dominate the top-performing methods. Furthermore, deep learning can operate on raw data and thus relieve the users from having to compute complex engineered features, which are often crop-specific. In this survey, we review the latest methods proposed in the literature to automatize fruit ripeness classification, highlighting the most common feature descriptors they operate on.
    Side Effects of Learning from Low-dimensional Data Embedded in a Euclidean Space. (arXiv:2203.00614v5 [cs.LG] UPDATED)
    The low-dimensional manifold hypothesis posits that the data found in many applications, such as those involving natural images, lie (approximately) on low-dimensional manifolds embedded in a high-dimensional Euclidean space. In this setting, a typical neural network defines a function that takes a finite number of vectors in the embedding space as input. However, one often needs to consider evaluating the optimized network at points outside the training distribution. This paper considers the case in which the training data is distributed in a linear subspace of $\mathbb R^d$. We derive estimates on the variation of the learning function, defined by a neural network, in the direction transversal to the subspace. We study the potential regularization effects associated with the network's depth and noise in the codimension of the data manifold. We also present additional side effects in training due to the presence of noise.
    Physics-Guided, Physics-Informed, and Physics-Encoded Neural Networks in Scientific Computing. (arXiv:2211.07377v2 [cs.LG] UPDATED)
    Recent breakthroughs in computing power have made it feasible to use machine learning and deep learning to advance scientific computing in many fields, including fluid mechanics, solid mechanics, materials science, etc. Neural networks, in particular, play a central role in this hybridization. Due to their intrinsic architecture, conventional neural networks cannot be successfully trained and scoped when data is sparse, which is the case in many scientific and engineering domains. Nonetheless, neural networks provide a solid foundation to respect physics-driven or knowledge-based constraints during training. Generally speaking, there are three distinct neural network frameworks to enforce the underlying physics: (i) physics-guided neural networks (PgNNs), (ii) physics-informed neural networks (PiNNs), and (iii) physics-encoded neural networks (PeNNs). These methods provide distinct advantages for accelerating the numerical modeling of complex multiscale multi-physics phenomena. In addition, the recent developments in neural operators (NOs) add another dimension to these new simulation paradigms, especially when the real-time prediction of complex multi-physics systems is required. All these models also come with their own unique drawbacks and limitations that call for further fundamental research. This study aims to present a review of the four neural network frameworks (i.e., PgNNs, PiNNs, PeNNs, and NOs) used in scientific computing research. The state-of-the-art architectures and their applications are reviewed, limitations are discussed, and future research opportunities in terms of improving algorithms, considering causalities, expanding applications, and coupling scientific and deep learning solvers are presented. This critical review provides researchers and engineers with a solid starting point to comprehend how to integrate different layers of physics into neural networks.
    Convolutional Neural Generative Coding: Scaling Predictive Coding to Natural Images. (arXiv:2211.12047v2 [cs.CV] UPDATED)
    In this work, we develop convolutional neural generative coding (Conv-NGC), a generalization of predictive coding to the case of convolution/deconvolution-based computation. Specifically, we concretely implement a flexible neurobiologically-motivated algorithm that progressively refines latent state feature maps in order to dynamically form a more accurate internal representation/reconstruction model of natural images. The performance of the resulting sensory processing system is evaluated on complex datasets such as Color-MNIST, CIFAR-10, and Street House View Numbers (SVHN). We study the effectiveness of our brain-inspired model on the tasks of reconstruction and image denoising and find that it is competitive with convolutional auto-encoding systems trained by backpropagation of errors and outperforms them with respect to out-of-distribution reconstruction (including the full 90k CINIC-10 test set).
    Topology-aware Generalization of Decentralized SGD. (arXiv:2206.12680v4 [cs.LG] UPDATED)
    This paper studies the algorithmic stability and generalizability of decentralized stochastic gradient descent (D-SGD). We prove that the consensus model learned by D-SGD is $\mathcal{O}{(N^{-1}+m^{-1} +\lambda^2)}$-stable in expectation in the non-convex non-smooth setting, where $N$ is the total sample size, $m$ is the worker number, and $1+\lambda$ is the spectral gap that measures the connectivity of the communication topology. These results then deliver an $\mathcal{O}{(N^{-(1+\alpha)/2}+ m^{-(1+\alpha)/2}+\lambda^{1+\alpha} + \phi_{\mathcal{S}})}$ in-average generalization bound, which is non-vacuous even when $\lambda$ is closed to $1$, in contrast to vacuous as suggested by existing literature on the projected version of D-SGD. Our theory indicates that the generalizability of D-SGD is positively correlated with the spectral gap, and can explain why consensus control in initial training phase can ensure better generalization. Experiments of VGG-11 and ResNet-18 on CIFAR-10, CIFAR-100 and Tiny-ImageNet justify our theory. To our best knowledge, this is the first work on the topology-aware generalization of vanilla D-SGD. Code is available at https://github.com/Raiden-Zhu/Generalization-of-DSGD.
    CHIMLE: Conditional Hierarchical IMLE for Multimodal Conditional Image Synthesis. (arXiv:2211.14286v2 [cs.CV] UPDATED)
    A persistent challenge in conditional image synthesis has been to generate diverse output images from the same input image despite only one output image being observed per input image. GAN-based methods are prone to mode collapse, which leads to low diversity. To get around this, we leverage Implicit Maximum Likelihood Estimation (IMLE) which can overcome mode collapse fundamentally. IMLE uses the same generator as GANs but trains it with a different, non-adversarial objective which ensures each observed image has a generated sample nearby. Unfortunately, to generate high-fidelity images, prior IMLE-based methods require a large number of samples, which is expensive. In this paper, we propose a new method to get around this limitation, which we dub Conditional Hierarchical IMLE (CHIMLE), which can generate high-fidelity images without requiring many samples. We show CHIMLE significantly outperforms the prior best IMLE, GAN and diffusion-based methods in terms of image fidelity and mode coverage across four tasks, namely night-to-day, 16x single image super-resolution, image colourization and image decompression. Quantitatively, our method improves Fr\'echet Inception Distance (FID) by 36.9% on average compared to the prior best IMLE-based method, and by 27.5% on average compared to the best non-IMLE-based general-purpose methods.
    Real-to-Sim: Predicting Residual Errors of Robotic Systems with Sparse Data using a Learning-based Unscented Kalman Filter. (arXiv:2209.03210v2 [cs.RO] UPDATED)
    Achieving highly accurate dynamic or simulator models that are close to the real robot can facilitate model-based controls (e.g., model predictive control or linear-quadradic regulators), model-based trajectory planning (e.g., trajectory optimization), and decrease the amount of learning time necessary for reinforcement learning methods. Thus, the objective of this work is to learn the residual errors between a dynamic and/or simulator model and the real robot. This is achieved using a neural network, where the parameters of a neural network are updated through an Unscented Kalman Filter (UKF) formulation. Using this method, we model these residual errors with only small amounts of data -- a necessity as we improve the simulator/dynamic model by learning directly from real-world operation. We demonstrate our method on robotic hardware (e.g., manipulator arm, and a wheeled robot), and show that with the learned residual errors, we can further close the reality gap between dynamic models, simulations, and actual hardware.
    The Dual PC Algorithm and the Role of Gaussianity for Structure Learning of Bayesian Networks. (arXiv:2112.09036v4 [stat.ML] UPDATED)
    Learning the graphical structure of Bayesian networks is key to describing data-generating mechanisms in many complex applications but poses considerable computational challenges. Observational data can only identify the equivalence class of the directed acyclic graph underlying a Bayesian network model, and a variety of methods exist to tackle the problem. Under certain assumptions, the popular PC algorithm can consistently recover the correct equivalence class by reverse-engineering the conditional independence (CI) relationships holding in the variable distribution. The dual PC algorithm is a novel scheme to carry out the CI tests within the PC algorithm by leveraging the inverse relationship between covariance and precision matrices. By exploiting block matrix inversions we can simultaneously perform tests on partial correlations of complementary (or dual) conditioning sets. The multiple CI tests of the dual PC algorithm proceed by first considering marginal and full-order CI relationships and progressively moving to central-order ones. Simulation studies show that the dual PC algorithm outperforms the classic PC algorithm both in terms of run time and in recovering the underlying network structure, even in the presence of deviations from Gaussianity. Additionally, we show that the dual PC algorithm applies for Gaussian copula models, and demonstrate its performance in that setting.
    Asymmetric Certified Robustness via Feature-Convex Neural Networks. (arXiv:2302.01961v1 [cs.LG])
    Recent works have introduced input-convex neural networks (ICNNs) as learning models with advantageous training, inference, and generalization properties linked to their convex structure. In this paper, we propose a novel feature-convex neural network architecture as the composition of an ICNN with a Lipschitz feature map in order to achieve adversarial robustness. We consider the asymmetric binary classification setting with one "sensitive" class, and for this class we prove deterministic, closed-form, and easily-computable certified robust radii for arbitrary $\ell_p$-norms. We theoretically justify the use of these models by characterizing their decision region geometry, extending the universal approximation theorem for ICNN regression to the classification setting, and proving a lower bound on the probability that such models perfectly fit even unstructured uniformly distributed data in sufficiently high dimensions. Experiments on Malimg malware classification and subsets of MNIST, Fashion-MNIST, and CIFAR-10 datasets show that feature-convex classifiers attain state-of-the-art certified $\ell_1$-radii as well as substantial $\ell_2$- and $\ell_{\infty}$-radii while being far more computationally efficient than any competitive baseline.
    Dynamical Equations With Bottom-up Self-Organizing Properties Learn Accurate Dynamical Hierarchies Without Any Loss Function. (arXiv:2302.02140v1 [cs.AI])
    Self-organization is ubiquitous in nature and mind. However, machine learning and theories of cognition still barely touch the subject. The hurdle is that general patterns are difficult to define in terms of dynamical equations and designing a system that could learn by reordering itself is still to be seen. Here, we propose a learning system, where patterns are defined within the realm of nonlinear dynamics with positive and negative feedback loops, allowing attractor-repeller pairs to emerge for each pattern observed. Experiments reveal that such a system can map temporal to spatial correlation, enabling hierarchical structures to be learned from sequential data. The results are accurate enough to surpass state-of-the-art unsupervised learning algorithms in seven out of eight experiments as well as two real-world problems. Interestingly, the dynamic nature of the system makes it inherently adaptive, giving rise to phenomena similar to phase transitions in chemistry/thermodynamics when the input structure changes. Thus, the work here sheds light on how self-organization can allow for pattern recognition and hints at how intelligent behavior might emerge from simple dynamic equations without any objective/loss function.
    PandA: Unsupervised Learning of Parts and Appearances in the Feature Maps of GANs. (arXiv:2206.00048v2 [cs.CV] UPDATED)
    Recent advances in the understanding of Generative Adversarial Networks (GANs) have led to remarkable progress in visual editing and synthesis tasks, capitalizing on the rich semantics that are embedded in the latent spaces of pre-trained GANs. However, existing methods are often tailored to specific GAN architectures and are limited to either discovering global semantic directions that do not facilitate localized control, or require some form of supervision through manually provided regions or segmentation masks. In this light, we present an architecture-agnostic approach that jointly discovers factors representing spatial parts and their appearances in an entirely unsupervised fashion. These factors are obtained by applying a semi-nonnegative tensor factorization on the feature maps, which in turn enables context-aware local image editing with pixel-level control. In addition, we show that the discovered appearance factors correspond to saliency maps that localize concepts of interest, without using any labels. Experiments on a wide range of GAN architectures and datasets show that, in comparison to the state of the art, our method is far more efficient in terms of training time and, most importantly, provides much more accurate localized control. Our code is available at: https://github.com/james-oldfield/PandA.
    PubGraph: A Large Scale Scientific Temporal Knowledge Graph. (arXiv:2302.02231v1 [cs.AI])
    Research publications are the primary vehicle for sharing scientific progress in the form of new discoveries, methods, techniques, and insights. Publications have been studied from the perspectives of both content analysis and bibliometric structure, but a barrier to more comprehensive studies of scientific research is a lack of publicly accessible large-scale data and resources. In this paper, we present PubGraph, a new resource for studying scientific progress that takes the form of a large-scale temporal knowledge graph (KG). It contains more than 432M nodes and 15.49B edges mapped to the popular Wikidata ontology. We extract three KGs with varying sizes from PubGraph to allow experimentation at different scales. Using these KGs, we introduce a new link prediction benchmark for transductive and inductive settings with temporally-aligned training, validation, and testing partitions. Moreover, we develop two new inductive learning methods better suited to PubGraph, operating on unseen nodes without explicit features, scaling to large KGs, and outperforming existing models. Our results demonstrate that structural features of past citations are sufficient to produce high-quality predictions about new publications. We also identify new challenges for KG models, including an adversarial community-based link prediction setting, zero-shot inductive learning, and large-scale learning.
    A Singular Woodbury and Pseudo-Determinant Matrix Identities and Application to Gaussian Process Regression. (arXiv:2207.08038v2 [math.ST] UPDATED)
    We study a matrix that arises from a singular form of the Woodbury matrix identity. We present generalized inverse and pseudo-determinant identities for this matrix, which have direct applications for Gaussian process regression, specifically its likelihood representation and precision matrix. We extend the definition of the precision matrix to the Bott-Duffin inverse of the covariance matrix, preserving properties related to conditional independence, conditional precision, and marginal precision. We also provide an efficient algorithm and numerical analysis for the presented determinant identities and demonstrate their advantages under specific conditions relevant to computing log-determinant terms in likelihood functions of Gaussian process regression.
    Improved Regret Analysis for Variance-Adaptive Linear Bandits and Horizon-Free Linear Mixture MDPs. (arXiv:2111.03289v4 [stat.ML] UPDATED)
    In online learning problems, exploiting low variance plays an important role in obtaining tight performance guarantees yet is challenging because variances are often not known a priori. Recently, considerable progress has been made by Zhang et al. (2021) where they obtain a variance-adaptive regret bound for linear bandits without knowledge of the variances and a horizon-free regret bound for linear mixture Markov decision processes (MDPs). In this paper, we present novel analyses that improve their regret bounds significantly. For linear bandits, we achieve $\tilde O(\min\{d\sqrt{K}, d^{1.5}\sqrt{\sum_{k=1}^K \sigma_k^2}\} + d^2)$ where $d$ is the dimension of the features, $K$ is the time horizon, and $\sigma_k^2$ is the noise variance at time step $k$, and $\tilde O$ ignores polylogarithmic dependence, which is a factor of $d^3$ improvement. For linear mixture MDPs with the assumption of maximum cumulative reward in an episode being in $[0,1]$, we achieve a horizon-free regret bound of $\tilde O(d \sqrt{K} + d^2)$ where $d$ is the number of base models and $K$ is the number of episodes. This is a factor of $d^{3.5}$ improvement in the leading term and $d^7$ in the lower order term. Our analysis critically relies on a novel peeling-based regret analysis that leverages the elliptical potential `count' lemma.
    Inorganic synthesis recommendation by machine learning materials similarity from scientific literature. (arXiv:2302.02303v1 [cond-mat.mtrl-sci])
    Synthesis prediction is a key accelerator for the rapid design of advanced materials. However, determining synthesis variables such as the choice of precursor materials, operations, and conditions is challenging for inorganic materials because the sequence of reactions during heating is not well understood. In this work, we use a knowledge base of 29,900 solid-state synthesis recipes, text-mined from the scientific literature, to automatically learn which precursors to recommend for the synthesis of a novel target material. The data-driven approach learns chemical similarity of materials and refers the synthesis of a new target to precedent synthesis procedures of similar materials, mimicking human synthesis design. When proposing five precursor sets for each of 2,654 unseen test target materials, the recommendation strategy achieves a success rate of at least 82%. Our approach captures decades of heuristic synthesis data in a mathematical form, making it accessible for use in recommendation engines and autonomous laboratories.
    Temporally Layered Architecture for Adaptive, Distributed and Continuous Control. (arXiv:2301.00723v2 [cs.NE] UPDATED)
    We present temporally layered architecture (TLA), a biologically inspired system for temporally adaptive distributed control. TLA layers a fast and a slow controller together to achieve temporal abstraction that allows each layer to focus on a different time-scale. Our design is biologically inspired and draws on the architecture of the human brain which executes actions at different timescales depending on the environment's demands. Such distributed control design is widespread across biological systems because it increases survivability and accuracy in certain and uncertain environments. We demonstrate that TLA can provide many advantages over existing approaches, including persistent exploration, adaptive control, explainable temporal behavior, compute efficiency and distributed control. We present two different algorithms for training TLA: (a) Closed-loop control, where the fast controller is trained over a pre-trained slow controller, allowing better exploration for the fast controller and closed-loop control where the fast controller decides whether to "act-or-not" at each timestep; and (b) Partially open loop control, where the slow controller is trained over a pre-trained fast controller, allowing for open loop-control where the slow controller picks a temporally extended action or defers the next n-actions to the fast controller. We evaluated our method on a suite of continuous control tasks and demonstrate the advantages of TLA over several strong baselines.
    Learning to Shape Rewards using a Game of Two Partners. (arXiv:2103.09159v5 [cs.LG] UPDATED)
    Reward shaping (RS) is a powerful method in reinforcement learning (RL) for overcoming the problem of sparse or uninformative rewards. However, RS typically relies on manually engineered shaping-reward functions whose construction is time-consuming and error-prone. It also requires domain knowledge which runs contrary to the goal of autonomous learning. We introduce Reinforcement Learning Optimising Shaping Algorithm (ROSA), an automated reward shaping framework in which the shaping-reward function is constructed in a Markov game between two agents. A reward-shaping agent (Shaper) uses switching controls to determine which states to add shaping rewards for more efficient learning while the other agent (Controller) learns the optimal policy for the task using these shaped rewards. We prove that ROSA, which adopts existing RL algorithms, learns to construct a shaping-reward function that is beneficial to the task thus ensuring efficient convergence to high performance policies. We demonstrate ROSA's properties in three didactic experiments and show its superior performance against state-of-the-art RS algorithms in challenging sparse reward environments.
    Certified Robust Control under Adversarial Perturbations. (arXiv:2302.02208v1 [cs.LG])
    Autonomous systems increasingly rely on machine learning techniques to transform high-dimensional raw inputs into predictions that are then used for decision-making and control. However, it is often easy to maliciously manipulate such inputs and, as a result, predictions. While effective techniques have been proposed to certify the robustness of predictions to adversarial input perturbations, such techniques have been disembodied from control systems that make downstream use of the predictions. We propose the first approach for composing robustness certification of predictions with respect to raw input perturbations with robust control to obtain certified robustness of control to adversarial input perturbations. We use a case study of adaptive vehicle control to illustrate our approach and show the value of the resulting end-to-end certificates through extensive experiments.
    Koopman Operator Learning: Sharp Spectral Rates and Spurious Eigenvalues. (arXiv:2302.02004v1 [cs.LG])
    Non-linear dynamical systems can be handily described by the associated Koopman operator, whose action evolves every observable of the system forward in time. Learning the Koopman operator from data is enabled by a number of algorithms. In this work we present nonasymptotic learning bounds for the Koopman eigenvalues and eigenfunctions estimated by two popular algorithms: Extended Dynamic Mode Decomposition (EDMD) and Reduced Rank Regression (RRR). We focus on time-reversal-invariant Markov chains, implying that the Koopman operator is self-adjoint. This includes important examples of stochastic dynamical systems, notably Langevin dynamics. Our spectral learning bounds are driven by the simultaneous control of the operator norm risk of the estimators and a metric distortion associated to the corresponding eigenfunctions. Our analysis indicates that both algorithms have similar variance, but EDMD suffers from a larger bias which might be detrimental to its learning rate. We further argue that a large metric distortion may lead to spurious eigenvalues, a phenomenon which has been empirically observed, and note that metric distortion can be estimated from data. Numerical experiments complement the theoretical findings.
    Towards Improving the Generation Quality of Autoregressive Slot VAEs. (arXiv:2206.01370v2 [cs.CV] UPDATED)
    Unconditional scene inference and generation are challenging to learn jointly with a single compositional model. Despite encouraging progress on models that extract object-centric representations ("slots") from images, unconditional generation of scenes from slots has received less attention. This is primarily because learning the multi-object relations necessary to imagine coherent scenes is difficult. We hypothesize that most existing slot-based models have a limited ability to learn object correlations. We propose two improvements that strengthen slot correlation learning. The first is to condition the slots on a global, scene-level variable that captures higher-order correlations between slots. Second, we address the fundamental lack of a canonical order for objects by proposing to learn a consistent order to use for the autoregressive generation of scene objects. Specifically, we train an autoregressive slot prior to sequentially generate scene objects following the learned order. Slot inference entails estimating a randomly ordered set of slots using existing approaches for extracting slots from images, then aligning those slots to ordered slots generated autoregressively with the prior. Our experiments across three multi-object environments demonstrate clear gains in scene generation quality. Detailed ablation studies are also provided that validate the two proposed improvements.
    Sequential pattern mining in educational data: The application context, potential, strengths, and limitations. (arXiv:2302.01932v1 [cs.LG])
    Increasingly, researchers have suggested the benefits of temporal analysis to improve our understanding of the learning process. Sequential pattern mining (SPM), as a pattern recognition technique, has the potential to reveal the temporal aspects of learning and can be a valuable tool in educational data science. However, its potential is not well understood and exploited. This chapter addresses this gap by reviewing work that utilizes sequential pattern mining in educational contexts. We identify that SPM is suitable for mining learning behaviors, analyzing and enriching educational theories, evaluating the efficacy of instructional interventions, generating features for prediction models, and building educational recommender systems. SPM can contribute to these purposes by discovering similarities and differences in learners' activities and revealing the temporal change in learning behaviors. As a sequential analysis method, SPM can reveal unique insights about learning processes and be powerful for self-regulated learning research. It is more flexible in capturing the relative arrangement of learning events than the other sequential analysis methods. Future research may improve its utility in educational data science by developing tools for counting pattern occurrences as well as identifying and removing unreliable patterns. Future work needs to establish a systematic guideline for data preprocessing, parameter setting, and interpreting sequential patterns.
    Joint Learning of Reward Machines and Policies in Environments with Partially Known Semantics. (arXiv:2204.11833v2 [cs.LG] UPDATED)
    We study the problem of reinforcement learning for a task encoded by a reward machine. The task is defined over a set of properties in the environment, called atomic propositions, and represented by Boolean variables. One unrealistic assumption commonly used in the literature is that the truth values of these propositions are accurately known. In real situations, however, these truth values are uncertain since they come from sensors that suffer from imperfections. At the same time, reward machines can be difficult to model explicitly, especially when they encode complicated tasks. We develop a reinforcement-learning algorithm that infers a reward machine that encodes the underlying task while learning how to execute it, despite the uncertainties of the propositions' truth values. In order to address such uncertainties, the algorithm maintains a probabilistic estimate about the truth value of the atomic propositions; it updates this estimate according to new sensory measurements that arrive from the exploration of the environment. Additionally, the algorithm maintains a hypothesis reward machine, which acts as an estimate of the reward machine that encodes the task to be learned. As the agent explores the environment, the algorithm updates the hypothesis reward machine according to the obtained rewards and the estimate of the atomic propositions' truth value. Finally, the algorithm uses a Q-learning procedure for the states of the hypothesis reward machine to determine the policy that accomplishes the task. We prove that the algorithm successfully infers the reward machine and asymptotically learns a policy that accomplishes the respective task.
    Efficient Variational Bayes Learning of Graphical Models with Smooth Structural Changes. (arXiv:2009.07703v3 [stat.ML] UPDATED)
    Estimating time-varying graphical models are of paramount importance in various social, financial, biological, and engineering systems, since the evolution of such networks can be utilized for example to spot trends, detect anomalies, predict vulnerability, and evaluate the impact of interventions. Existing methods require extensive tuning of parameters that control the graph sparsity and temporal smoothness. Furthermore, these methods are computationally burdensome with time complexity $O(NP^3)$ for $P$ variables and $N$ time points. As a remedy, we propose a low-complexity tuning-free Bayesian approach, named BASS. Specifically, we impose temporally-dependent spike-and-slab priors on the graphs such that they are sparse and varying smoothly across time. A variational inference algorithm is then derived to learn the graph structures from the data automatically. Owning to the pseudo-likelihood and the mean-field approximation, the time complexity of BASS is only $O(NP^2)$. Additionally, by identifying the frequency-domain resemblance to the time-varying graphical models, we show that BASS can be extended to learning frequency-varying inverse spectral density matrices, and yields graphical models for multivariate stationary time series. Numerical results on both synthetic and real data show that that BASS can better recover the underlying true graphs, while being more efficient than the existing methods, especially for high-dimensional cases.
    SMGRL: Scalable Multi-resolution Graph Representation Learning. (arXiv:2201.12670v2 [cs.LG] UPDATED)
    Graph convolutional networks (GCNs) allow us to learn topologically-aware node embeddings, which can be useful for classification or link prediction. However, they are unable to capture long-range dependencies without adding additional layers -- which in turn leads to over-smoothing and increased time and space complexity. Further, the complex dependencies between nodes make mini-batching challenging, limiting their applicability to large graphs. We propose a Scalable Multi-resolution Graph Representation Learning (SMGRL) framework that enables us to learn multi-resolution node embeddings efficiently. Our framework is model-agnostic and can be applied to any existing GCN model. We dramatically reduce training costs by training only on a reduced-dimension coarsening of the original graph, then exploit self-similarity to apply the resulting algorithm at multiple resolutions. The resulting multi-resolution embeddings can be aggregated to yield high-quality node embeddings that capture both long- and short-range dependencies. Our experiments show that this leads to improved classification accuracy, without incurring high computational costs.
    Censored Quantile Regression Neural Networks for Distribution-Free Survival Analysis. (arXiv:2205.13496v4 [stat.ML] UPDATED)
    This paper considers doing quantile regression on censored data using neural networks (NNs). This adds to the survival analysis toolkit by allowing direct prediction of the target variable, along with a distribution-free characterisation of uncertainty, using a flexible function approximator. We begin by showing how an algorithm popular in linear models can be applied to NNs. However, the resulting procedure is inefficient, requiring sequential optimisation of an individual NN at each desired quantile. Our major contribution is a novel algorithm that simultaneously optimises a grid of quantiles output by a single NN. To offer theoretical insight into our algorithm, we show firstly that it can be interpreted as a form of expectation-maximisation, and secondly that it exhibits a desirable `self-correcting' property. Experimentally, the algorithm produces quantiles that are better calibrated than existing methods on 10 out of 12 real datasets.
    Fair Spatial Indexing: A paradigm for Group Spatial Fairness. (arXiv:2302.02306v1 [cs.LG])
    Machine learning (ML) is playing an increasing role in decision-making tasks that directly affect individuals, e.g., loan approvals, or job applicant screening. Significant concerns arise that, without special provisions, individuals from under-privileged backgrounds may not get equitable access to services and opportunities. Existing research studies fairness with respect to protected attributes such as gender, race or income, but the impact of location data on fairness has been largely overlooked. With the widespread adoption of mobile apps, geospatial attributes are increasingly used in ML, and their potential to introduce unfair bias is significant, given their high correlation with protected attributes. We propose techniques to mitigate location bias in machine learning. Specifically, we consider the issue of miscalibration when dealing with geospatial attributes. We focus on spatial group fairness and we propose a spatial indexing algorithm that accounts for fairness. Our KD-tree inspired approach significantly improves fairness while maintaining high learning accuracy, as shown by extensive experimental results on real data.
    Feature-based Individual Fairness in k-Clustering. (arXiv:2109.04554v2 [cs.LG] UPDATED)
    Ensuring fairness in machine learning algorithms is a challenging and essential task. We consider the problem of clustering a set of points while satisfying fairness constraints. While there have been several attempts to capture group fairness in the $k$-clustering problem, fairness at an individual level is relatively less explored. We introduce a new notion of individual fairness in $k$-clustering based on features not necessarily used for clustering. We show that this problem is NP-hard and does not admit a constant factor approximation. Therefore, we design a randomized algorithm that guarantees approximation both in terms of minimizing the clustering distance objective and individual fairness under natural restrictions on the distance metric and fairness constraints. Finally, our experimental results against six competing baselines validate that our algorithm produces individually fairer clusters than the fairest baseline by 12.5% on average while also being less costly in terms of the clustering objective than the best baseline by 34.5% on average.
    HardSATGEN: Understanding the Difficulty of Hard SAT Formula Generation and A Strong Structure-Hardness-Aware Baseline. (arXiv:2302.02104v1 [cs.AI])
    Industrial SAT formula generation is a critical yet challenging task for heuristic development and the surging learning-based methods in practical SAT applications. Existing SAT generation approaches can hardly simultaneously capture the global structural properties and maintain plausible computational hardness, which can be hazardous for the various downstream engagements. To this end, we first present an in-depth analysis for the limitation of previous learning methods in reproducing the computational hardness of original instances, which may stem from the inherent homogeneity in their adopted split-merge procedure. On top of the observations that industrial formulae exhibit clear community structure and oversplit substructures lead to the difficulty in semantic formation of logical structures, we propose HardSATGEN, which introduces a fine-grained control mechanism to the neural split-merge paradigm for SAT formula generation to better recover the structural and computational properties of the industrial benchmarks. Experimental results including evaluations on private corporate data and hyperparameter tuning over solvers in practical use show the significant superiority of HardSATGEN being the only method to successfully augments formulae maintaining similar computational hardness and capturing the global structural properties simultaneously. Compared to the best previous methods to our best knowledge, the average performance gains achieve 38.5% in structural statistics, 88.4% in computational metrics, and over 140.7% in the effectiveness of guiding solver development tuned by our generated instances.
    Optimal lower bounds for Quantum Learning via Information Theory. (arXiv:2301.02227v2 [quant-ph] UPDATED)
    Although a concept class may be learnt more efficiently using quantum samples as compared with classical samples in certain scenarios, Arunachalam and de Wolf (JMLR, 2018) proved that quantum learners are asymptotically no more efficient than classical ones in the quantum PAC and Agnostic learning models. They established lower bounds on sample complexity via quantum state identification and Fourier analysis. In this paper, we derive optimal lower bounds for quantum sample complexity in both the PAC and agnostic models via an information-theoretic approach. The proofs are arguably simpler, and the same ideas can potentially be used to derive optimal bounds for other problems in quantum learning theory. We then turn to a quantum analogue of the Coupon Collector problem, a classic problem from probability theory also of importance in the study of PAC learning. Arunachalam, Belovs, Childs, Kothari, Rosmanis, and de Wolf (TQC, 2020) characterized the quantum sample complexity of this problem up to constant factors. First, we show that the information-theoretic approach mentioned above provably does not yield the optimal lower bound. As a by-product, we get a natural ensemble of pure states in arbitrarily high dimensions which are not easily (simultaneously) distinguishable, while the ensemble has close to maximal Holevo information. Second, we discover that the information-theoretic approach yields an asymptotically optimal bound for an approximation variant of the problem. Finally, we derive a sharp lower bound for the Quantum Coupon Collector problem, with the exact leading order term, via the generalized Holevo-Curlander bounds on the distinguishability of an ensemble. All the aspects of the Quantum Coupon Collector problem we study rest on properties of the spectrum of the associated Gram matrix, which may be of independent interest.
    Domain Adaptation via Rebalanced Sub-domain Alignment. (arXiv:2302.02009v1 [cs.LG])
    Unsupervised domain adaptation (UDA) is a technique used to transfer knowledge from a labeled source domain to a different but related unlabeled target domain. While many UDA methods have shown success in the past, they often assume that the source and target domains must have identical class label distributions, which can limit their effectiveness in real-world scenarios. To address this limitation, we propose a novel generalization bound that reweights source classification error by aligning source and target sub-domains. We prove that our proposed generalization bound is at least as strong as existing bounds under realistic assumptions, and we empirically show that it is much stronger on real-world data. We then propose an algorithm to minimize this novel generalization bound. We demonstrate by numerical experiments that this approach improves performance in shifted class distribution scenarios compared to state-of-the-art methods.
    Multivariate Time Series Anomaly Detection via Dynamic Graph Forecasting. (arXiv:2302.02051v1 [cs.LG])
    Anomalies in univariate time series often refer to abnormal values and deviations from the temporal patterns from majority of historical observations. In multivariate time series, anomalies also refer to abnormal changes in the inter-series relationship, such as correlation, over time. Existing studies have been able to model such inter-series relationships through graph neural networks. However, most works settle on learning a static graph globally or within a context window to assist a time series forecasting task or a reconstruction task, whose objective is not tailored to explicitly detect the abnormal relationship. Some other works detect anomalies based on reconstructing or forecasting a list of inter-series graphs, which inadvertently weakens their power to capture temporal patterns within the data due to the discrete nature of graphs. In this study, we propose DyGraphAD, a multivariate time series anomaly detection framework based upon a list of dynamic inter-series graphs. The core idea is to detect anomalies based on the deviation of inter-series relationships and intra-series temporal patterns from normal to anomalous states, by leveraging the evolving nature of the graphs in order to assist a graph forecasting task and a time series forecasting task simultaneously. Our numerical experiments on real-world datasets demonstrate that DyGraphAD has superior performance than baseline anomaly detection approaches.
    Unsupervised Ensemble Methods for Anomaly Detection in PLC-based Process Control. (arXiv:2302.02097v1 [cs.LG])
    Programmable logic controller (PLC) based industrial control systems (ICS) are used to monitor and control critical infrastructure. Integration of communication networks and an Internet of Things approach in ICS has increased ICS vulnerability to cyber-attacks. This work proposes novel unsupervised machine learning ensemble methods for anomaly detection in PLC-based ICS. The work presents two broad approaches to anomaly detection: a weighted voting ensemble approach with a learning algorithm based on coefficient of determination and a stacking-based ensemble approach using isolation forest meta-detector. The two ensemble methods were analyzed via an open-source PLC-based ICS subjected to multiple attack scenarios as a case study. The work considers four different learning models for the weighted voting ensemble method. Comparative performance analyses of five ensemble methods driven diverse base detectors are presented. Results show that stacking-based ensemble method using isolation forest meta-detector achieves superior performance to previous work on all performance metrics. Results also suggest that effective unsupervised ensemble methods, such as stacking-based ensemble having isolation forest meta-detector, can robustly detect anomalies in arbitrary ICS datasets. Finally, the presented results were validated by using statistical hypothesis tests.
    AUTOLYCUS: Exploiting Explainable AI (XAI) for Model Extraction Attacks against Decision Tree Models. (arXiv:2302.02162v1 [cs.LG])
    Model extraction attack is one of the most prominent adversarial techniques to target machine learning models along with membership inference attack and model inversion attack. On the other hand, Explainable Artificial Intelligence (XAI) is a set of techniques and procedures to explain the decision making process behind AI. XAI is a great tool to understand the reasoning behind AI models but the data provided for such revelation creates security and privacy vulnerabilities. In this poster, we propose AUTOLYCUS, a model extraction attack that exploits the explanations provided by LIME to infer the decision boundaries of decision tree models and create extracted surrogate models that behave similar to a target model.
    RRNet: Towards ReLU-Reduced Neural Network for Two-party Computation Based Private Inference. (arXiv:2302.02292v1 [cs.CR])
    The proliferation of deep learning (DL) has led to the emergence of privacy and security concerns. To address these issues, secure Two-party computation (2PC) has been proposed as a means of enabling privacy-preserving DL computation. However, in practice, 2PC methods often incur high computation and communication overhead, which can impede their use in large-scale systems. To address this challenge, we introduce RRNet, a systematic framework that aims to jointly reduce the overhead of MPC comparison protocols and accelerate computation through hardware acceleration. Our approach integrates the hardware latency of cryptographic building blocks into the DNN loss function, resulting in improved energy efficiency, accuracy, and security guarantees. Furthermore, we propose a cryptographic hardware scheduler and corresponding performance model for Field Programmable Gate Arrays (FPGAs) to further enhance the efficiency of our framework. Experiments show RRNet achieved a much higher ReLU reduction performance than all SOTA works on CIFAR-10 dataset.
    REaLTabFormer: Generating Realistic Relational and Tabular Data using Transformers. (arXiv:2302.02041v1 [cs.LG])
    Tabular data is a common form of organizing data. Multiple models are available to generate synthetic tabular datasets where observations are independent, but few have the ability to produce relational datasets. Modeling relational data is challenging as it requires modeling both a "parent" table and its relationships across tables. We introduce REaLTabFormer (Realistic Relational and Tabular Transformer), a tabular and relational synthetic data generation model. It first creates a parent table using an autoregressive GPT-2 model, then generates the relational dataset conditioned on the parent table using a sequence-to-sequence (Seq2Seq) model. We implement target masking to prevent data copying and propose the $Q_{\delta}$ statistic and statistical bootstrapping to detect overfitting. Experiments using real-world datasets show that REaLTabFormer captures the relational structure better than a baseline model. REaLTabFormer also achieves state-of-the-art results on prediction tasks, "out-of-the-box", for large non-relational datasets without needing fine-tuning.
    How Bad is Top-$K$ Recommendation under Competing Content Creators?. (arXiv:2302.01971v1 [cs.GT])
    Content creators compete for exposure on recommendation platforms, and such strategic behavior leads to a dynamic shift over the content distribution. However, how the creators' competition impacts user welfare and how the relevance-driven recommendation influences the dynamics in the long run are still largely unknown. This work provides theoretical insights into these research questions. We model the creators' competition under the assumptions that: 1) the platform employs an innocuous top-$K$ recommendation policy; 2) user decisions follow the Random Utility model; 3) content creators compete for user engagement and, without knowing their utility function in hindsight, apply arbitrary no-regret learning algorithms to update their strategies. We study the user welfare guarantee through the lens of Price of Anarchy and show that the fraction of user welfare loss due to creator competition is always upper bounded by a small constant depending on $K$ and randomness in user decisions; we also prove the tightness of this bound. Our result discloses an intrinsic merit of the myopic approach to the recommendation, i.e., relevance-driven matching performs reasonably well in the long run, as long as users' decisions involve randomness and the platform provides reasonably many alternatives to its users.
    Conformalized semi-supervised random forest for classification and abnormality detection. (arXiv:2302.02237v1 [cs.LG])
    Traditional classifiers infer labels under the premise that the training and test samples are generated from the same distribution. This assumption can be problematic for safety-critical applications such as medical diagnosis and network attack detection. In this paper, we consider the multi-class classification problem when the training data and the test data may have different distributions. We propose conformalized semi-supervised random forest (CSForest), which constructs set-valued predictions $C(x)$ to include the correct class label with desired probability while detecting outliers efficiently. We compare the proposed method to other state-of-art methods in both a synthetic example and a real data application to demonstrate the strength of our proposal.
    Hierarchical Graph Neural Networks for Causal Discovery and Root Cause Localization. (arXiv:2302.01987v1 [cs.LG])
    In this paper, we propose REASON, a novel framework that enables the automatic discovery of both intra-level (i.e., within-network) and inter-level (i.e., across-network) causal relationships for root cause localization. REASON consists of Topological Causal Discovery and Individual Causal Discovery. The Topological Causal Discovery component aims to model the fault propagation in order to trace back to the root causes. To achieve this, we propose novel hierarchical graph neural networks to construct interdependent causal networks by modeling both intra-level and inter-level non-linear causal relations. Based on the learned interdependent causal networks, we then leverage random walks with restarts to model the network propagation of a system fault. The Individual Causal Discovery component focuses on capturing abrupt change patterns of a single system entity. This component examines the temporal patterns of each entity's metric data (i.e., time series), and estimates its likelihood of being a root cause based on the Extreme Value theory. Combining the topological and individual causal scores, the top K system entities are identified as root causes. Extensive experiments on three real-world datasets with case studies demonstrate the effectiveness and superiority of the proposed framework.
    On the Analysis of Correlation Between Nominal Data and Numerical Data. (arXiv:2302.02007v1 [cs.LG])
    The article investigates the possibility of measuring the strength of a linear correlation relationship between nominal data and numerical data. Correlation coefficients for variables coded with real numbers as well as for variables coded with complex numbers were studied. For variables coded with real numbers, unambiguous measures of real linear correlation were obtained. In the case of complex coding, it has been observed that the obtained complex correlation coefficients change with the permutation of the phases in the complex numbers used to code classes of elements with equal cardinalities. It was found that a necessary condition for linear correlation is the possibility of linear ordering of a set with data. Since linear order is not possible in the set of complex numbers, complex correlation coefficients cannot be used as a measure of linear correlation. In the event of such a situation, a substitute action was suggested that would prevent equal cardinality of classes of identical elements contained in the set with nominal data. This action would consist in the correction of data, analogous to the correction during preprocessing or cleaning of data containing missing or outlier values.
    Reinforcement Learning with History-Dependent Dynamic Contexts. (arXiv:2302.02061v1 [cs.LG])
    We introduce Dynamic Contextual Markov Decision Processes (DCMDPs), a novel reinforcement learning framework for history-dependent environments that generalizes the contextual MDP framework to handle non-Markov environments, where contexts change over time. We consider special cases of the model, with a focus on logistic DCMDPs, which break the exponential dependence on history length by leveraging aggregation functions to determine context transitions. This special structure allows us to derive an upper-confidence-bound style algorithm for which we establish regret bounds. Motivated by our theoretical results, we introduce a practical model-based algorithm for logistic DCMDPs that plans in a latent space and uses optimism over history-dependent features. We demonstrate the efficacy of our approach on a recommendation task (using MovieLens data) where user behavior dynamics evolve in response to recommendations.
    Semantic-Guided Image Augmentation with Pre-trained Models. (arXiv:2302.02070v1 [cs.CV])
    Image augmentation is a common mechanism to alleviate data scarcity in computer vision. Existing image augmentation methods often apply pre-defined transformations or mixup to augment the original image, but only locally vary the image. This makes them struggle to find a balance between maintaining semantic information and improving the diversity of augmented images. In this paper, we propose a Semantic-guided Image augmentation method with Pre-trained models (SIP). Specifically, SIP constructs prompts with image labels and captions to better guide the image-to-image generation process of the pre-trained Stable Diffusion model. The semantic information contained in the original images can be well preserved, and the augmented images still maintain diversity. Experimental results show that SIP can improve two commonly used backbones, i.e., ResNet-50 and ViT, by 12.60% and 2.07% on average over seven datasets, respectively. Moreover, SIP not only outperforms the best image augmentation baseline RandAugment by 4.46% and 1.23% on two backbones, but also further improves the performance by integrating naturally with the baseline. A detailed analysis of SIP is presented, including the diversity of augmented images, an ablation study on textual prompts, and a case study on the generated images.
    Federated Temporal Difference Learning with Linear Function Approximation under Environmental Heterogeneity. (arXiv:2302.02212v1 [cs.LG])
    We initiate the study of federated reinforcement learning under environmental heterogeneity by considering a policy evaluation problem. Our setup involves $N$ agents interacting with environments that share the same state and action space but differ in their reward functions and state transition kernels. Assuming agents can communicate via a central server, we ask: Does exchanging information expedite the process of evaluating a common policy? To answer this question, we provide the first comprehensive finite-time analysis of a federated temporal difference (TD) learning algorithm with linear function approximation, while accounting for Markovian sampling, heterogeneity in the agents' environments, and multiple local updates to save communication. Our analysis crucially relies on several novel ingredients: (i) deriving perturbation bounds on TD fixed points as a function of the heterogeneity in the agents' underlying Markov decision processes (MDPs); (ii) introducing a virtual MDP to closely approximate the dynamics of the federated TD algorithm; and (iii) using the virtual MDP to make explicit connections to federated optimization. Putting these pieces together, we rigorously prove that in a low-heterogeneity regime, exchanging model estimates leads to linear convergence speedups in the number of agents.
    Reinforcement Learning in Low-Rank MDPs with Density Features. (arXiv:2302.02252v1 [cs.LG])
    MDPs with low-rank transitions -- that is, the transition matrix can be factored into the product of two matrices, left and right -- is a highly representative structure that enables tractable learning. The left matrix enables expressive function approximation for value-based learning and has been studied extensively. In this work, we instead investigate sample-efficient learning with density features, i.e., the right matrix, which induce powerful models for state-occupancy distributions. This setting not only sheds light on leveraging unsupervised learning in RL, but also enables plug-in solutions for convex RL. In the offline setting, we propose an algorithm for off-policy estimation of occupancies that can handle non-exploratory data. Using this as a subroutine, we further devise an online algorithm that constructs exploratory data distributions in a level-by-level manner. As a central technical challenge, the additive error of occupancy estimation is incompatible with the multiplicative definition of data coverage. In the absence of strong assumptions like reachability, this incompatibility easily leads to exponential error blow-up, which we overcome via novel technical tools. Our results also readily extend to the representation learning setting, when the density features are unknown and must be learned from an exponentially large candidate set.
    Measuring The Impact Of Programming Language Distribution. (arXiv:2302.01973v1 [cs.LG])
    Current benchmarks for evaluating neural code models focus on only a small subset of programming languages, excluding many popular languages such as Go or Rust. To ameliorate this issue, we present the BabelCode framework for execution-based evaluation of any benchmark in any language. BabelCode enables new investigations into the qualitative performance of models' memory, runtime, and individual test case results. Additionally, we present a new code translation dataset called Translating Python Programming Puzzles (TP3) from the Python Programming Puzzles (Schuster et al. 2021) benchmark that involves translating expert-level python functions to any language. With both BabelCode and the TP3 benchmark, we investigate if balancing the distributions of 14 languages in a training dataset improves a large language model's performance on low-resource languages. Training a model on a balanced corpus results in, on average, 12.34% higher $pass@k$ across all tasks and languages compared to the baseline. We find that this strategy achieves 66.48% better $pass@k$ on low-resource languages at the cost of only a 12.94% decrease to high-resource languages. In our three translation tasks, this strategy yields, on average, 30.77% better low-resource $pass@k$ while having 19.58% worse high-resource $pass@k$.
    Directed Acyclic Graphs With Tears. (arXiv:2302.02160v1 [cs.AI])
    Bayesian network is a frequently-used method for fault detection and diagnosis in industrial processes. The basis of Bayesian network is structure learning which learns a directed acyclic graph (DAG) from data. However, the search space will scale super-exponentially with the increase of process variables, which makes the data-driven structure learning a challenging problem. To this end, the DAGs with NOTEARs methods are being well studied not only for their conversion of the discrete optimization into continuous optimization problem but also their compatibility with deep learning framework. Nevertheless, there still remain challenges for NOTEAR-based methods: 1) the infeasible solution results from the gradient descent-based optimization paradigm; 2) the truncation operation to promise the learned graph acyclic. In this work, the reason for challenge 1) is analyzed theoretically, and a novel method named DAGs with Tears method is proposed based on mix-integer programming to alleviate challenge 2). In addition, prior knowledge is able to incorporate into the new proposed method, making structure learning more practical and useful in industrial processes. Finally, a numerical example and an industrial example are adopted as case studies to demonstrate the superiority of the developed method.
    A neural operator-based surrogate solver for free-form electromagnetic inverse design. (arXiv:2302.01934v1 [physics.comp-ph])
    Neural operators have emerged as a powerful tool for solving partial differential equations in the context of scientific machine learning. Here, we implement and train a modified Fourier neural operator as a surrogate solver for electromagnetic scattering problems and compare its data efficiency to existing methods. We further demonstrate its application to the gradient-based nanophotonic inverse design of free-form, fully three-dimensional electromagnetic scatterers, an area that has so far eluded the application of deep learning techniques.
    SPARLING: Learning Latent Representations with Extremely Sparse Activations. (arXiv:2302.01976v1 [cs.LG])
    Real-world processes often contain intermediate state that can be modeled as an extremely sparse tensor. We introduce Sparling, a new kind of informational bottleneck that explicitly models this state by enforcing extreme activation sparsity. We additionally demonstrate that this technique can be used to learn the true intermediate representation with no additional supervision (i.e., from only end-to-end labeled examples), and thus improve the interpretability of the resulting models. On our DigitCircle domain, we are able to get an intermediate state prediction accuracy of 98.84%, even as we only train end-to-end.
    Decision-Aware Conditional GANs for Time Series Data. (arXiv:2009.12682v4 [cs.LG] UPDATED)
    We introduce the decision-aware time-series conditional generative adversarial network (DAT-CGAN) as a method for time-series generation. The framework adopts a multi-Wasserstein loss on structured decision-related quantities, capturing the heterogeneity of decision-related data and providing new effectiveness in supporting the decision processes of end users. We improve sample efficiency through an overlapped block-sampling method, and provide a theoretical characterization of the generalization properties of DAT-CGAN. The framework is demonstrated on financial time series for a multi-time-step portfolio choice problem. We demonstrate better generative quality in regard to underlying data and different decision-related quantities than strong, GAN-based baselines.
    Interpolation for Robust Learning: Data Augmentation on Geodesics. (arXiv:2302.02092v1 [cs.LG])
    We propose to study and promote the robustness of a model as per its performance through the interpolation of training data distributions. Specifically, (1) we augment the data by finding the worst-case Wasserstein barycenter on the geodesic connecting subpopulation distributions of different categories. (2) We regularize the model for smoother performance on the continuous geodesic path connecting subpopulation distributions. (3) Additionally, we provide a theoretical guarantee of robustness improvement and investigate how the geodesic location and the sample size contribute, respectively. Experimental validations of the proposed strategy on four datasets, including CIFAR-100 and ImageNet, establish the efficacy of our method, e.g., our method improves the baselines' certifiable robustness on CIFAR10 up to $7.7\%$, with $16.8\%$ on empirical robustness on CIFAR-100. Our work provides a new perspective of model robustness through the lens of Wasserstein geodesic-based interpolation with a practical off-the-shelf strategy that can be combined with existing robust training methods.
    MOMA:Distill from Self-Supervised Teachers. (arXiv:2302.02089v1 [cs.CV])
    Contrastive Learning and Masked Image Modelling have demonstrated exceptional performance on self-supervised representation learning, where Momentum Contrast (i.e., MoCo) and Masked AutoEncoder (i.e., MAE) are the state-of-the-art, respectively. In this work, we propose MOMA to distill from pre-trained MoCo and MAE in a self-supervised manner to collaborate the knowledge from both paradigms. We introduce three different mechanisms of knowledge transfer in the propsoed MOMA framework. : (1) Distill pre-trained MoCo to MAE. (2) Distill pre-trained MAE to MoCo (3) Distill pre-trained MoCo and MAE to a random initialized student. During the distillation, the teacher and the student are fed with original inputs and masked inputs, respectively. The learning is enabled by aligning the normalized representations from the teacher and the projected representations from the student. This simple design leads to efficient computation with extremely high mask ratio and dramatically reduced training epochs, and does not require extra considerations on the distillation target. The experiments show MOMA delivers compact student models with comparable performance to existing state-of-the-art methods, combining the power of both self-supervised learning paradigms. It presents competitive results against different benchmarks in computer vision. We hope our method provides an insight on transferring and adapting the knowledge from large-scale pre-trained models in a computationally efficient way.
    Multi-Source Diffusion Models for Simultaneous Music Generation and Separation. (arXiv:2302.02257v1 [cs.SD])
    In this work, we define a diffusion-based generative model capable of both music synthesis and source separation by learning the score of the joint probability density of sources sharing a context. Alongside the classic total inference tasks (i.e. generating a mixture, separating the sources), we also introduce and experiment on the partial inference task of source imputation, where we generate a subset of the sources given the others (e.g., play a piano track that goes well with the drums). Additionally, we introduce a novel inference method for the separation task. We train our model on Slakh2100, a standard dataset for musical source separation, provide qualitative results in the generation settings, and showcase competitive quantitative results in the separation setting. Our method is the first example of a single model that can handle both generation and separation tasks, thus representing a step toward general audio models.
    FedSpectral+: Spectral Clustering using Federated Learning. (arXiv:2302.02137v1 [cs.LG])
    Clustering in graphs has been a well-known research problem, particularly because most Internet and social network data is in the form of graphs. Organizations widely use spectral clustering algorithms to find clustering in graph datasets. However, applying spectral clustering to a large dataset is challenging due to computational overhead. While the distributed spectral clustering algorithm exists, they face the problem of data privacy and increased communication costs between the clients. Thus, in this paper, we propose a spectral clustering algorithm using federated learning (FL) to overcome these issues. FL is a privacy-protecting algorithm that accumulates model parameters from each local learner rather than collecting users' raw data, thus providing both scalability and data privacy. We developed two approaches: FedSpectral and FedSpectral+. FedSpectral is a baseline approach that uses local spectral clustering labels to aggregate the global spectral clustering by creating a similarity graph. FedSpectral+, a state-of-the-art approach, uses the power iteration method to learn the global spectral embedding by incorporating the entire graph data without access to the raw information distributed among the clients. We further designed our own similarity metric to check the clustering quality of the distributed approach to that of the original/non-FL clustering. The proposed approach FedSpectral+ obtained a similarity of 98.85% and 99.8%, comparable to that of global clustering on the ego-Facebook and email-Eu-core dataset.
    Dual Self-Awareness Value Decomposition Framework without Individual Global Max for Cooperative Multi-Agent Reinforcement Learning. (arXiv:2302.02180v1 [cs.MA])
    Value decomposition methods have gradually become popular in the cooperative multi-agent reinforcement learning field. However, almost all value decomposition methods follow the Individual Global Max (IGM) principle or its variants, which restricts the range of issues that value decomposition methods can resolve. Inspired by the notion of dual self-awareness in psychology, we propose a dual self-awareness value decomposition framework that entirely rejects the IGM premise. Each agent consists of an ego policy that carries out actions and an alter ego value function that takes part in credit assignment. The value function factorization can ignore the IGM assumption by using an explicit search procedure. We also suggest a novel anti-ego exploration mechanism to avoid the algorithm becoming stuck in a local optimum. As the first fully IGM-free value decomposition method, our proposed framework achieves desirable performance in various cooperative tasks.
    Diversity Induced Environment Design via Self-Play. (arXiv:2302.02119v1 [cs.AI])
    Recent work on designing an appropriate distribution of environments has shown promise for training effective generally capable agents. Its success is partly because of a form of adaptive curriculum learning that generates environment instances (or levels) at the frontier of the agent's capabilities. However, such an environment design framework often struggles to find effective levels in challenging design spaces and requires costly interactions with the environment. In this paper, we aim to introduce diversity in the Unsupervised Environment Design (UED) framework. Specifically, we propose a task-agnostic method to identify observed/hidden states that are representative of a given level. The outcome of this method is then utilized to characterize the diversity between two levels, which as we show can be crucial to effective performance. In addition, to improve sampling efficiency, we incorporate the self-play technique that allows the environment generator to automatically generate environments that are of great benefit to the training agent. Quantitatively, our approach, Diversity-induced Environment Design via Self-Play (DivSP), shows compelling performance over existing methods.
    Robust Budget Pacing with a Single Sample. (arXiv:2302.02006v1 [cs.LG])
    Major Internet advertising platforms offer budget pacing tools as a standard service for advertisers to manage their ad campaigns. Given the inherent non-stationarity in an advertiser's value and also competing advertisers' values over time, a commonly used approach is to learn a target expenditure plan that specifies a target spend as a function of time, and then run a controller that tracks this plan. This raises the question: how many historical samples are required to learn a good expenditure plan? We study this question by considering an advertiser repeatedly participating in $T$ second-price auctions, where the tuple of her value and the highest competing bid is drawn from an unknown time-varying distribution. The advertiser seeks to maximize her total utility subject to her budget constraint. Prior work has shown the sufficiency of $T\log T$ samples per distribution to achieve the optimal $O(\sqrt{T})$-regret. We dramatically improve this state-of-the-art and show that just one sample per distribution is enough to achieve the near-optimal $\tilde O(\sqrt{T})$-regret, while still being robust to noise in the sampling distributions.
    Structural Explanations for Graph Neural Networks using HSIC. (arXiv:2302.02139v1 [cs.LG])
    Graph neural networks (GNNs) are a type of neural model that tackle graphical tasks in an end-to-end manner. Recently, GNNs have been receiving increased attention in machine learning and data mining communities because of the higher performance they achieve in various tasks, including graph classification, link prediction, and recommendation. However, the complicated dynamics of GNNs make it difficult to understand which parts of the graph features contribute more strongly to the predictions. To handle the interpretability issues, recently, various GNN explanation methods have been proposed. In this study, a flexible model agnostic explanation method is proposed to detect significant structures in graphs using the Hilbert-Schmidt independence criterion (HSIC), which captures the nonlinear dependency between two variables through kernels. More specifically, we extend the GraphLIME method for node explanation with a group lasso and a fused lasso-based node explanation method. The group and fused regularization with GraphLIME enables the interpretation of GNNs in substructure units. Then, we show that the proposed approach can be used for the explanation of sequential graph classification tasks. Through experiments, it is demonstrated that our method can identify crucial structures in a target graph in various settings.
    Oscillation-free Quantization for Low-bit Vision Transformers. (arXiv:2302.02210v1 [cs.CV])
    Weight oscillation is an undesirable side effect of quantization-aware training, in which quantized weights frequently jump between two quantized levels, resulting in training instability and a sub-optimal final model. We discover that the learnable scaling factor, a widely-used $\textit{de facto}$ setting in quantization aggravates weight oscillation. In this study, we investigate the connection between the learnable scaling factor and quantized weight oscillation and use ViT as a case driver to illustrate the findings and remedies. In addition, we also found that the interdependence between quantized weights in $\textit{query}$ and $\textit{key}$ of a self-attention layer makes ViT vulnerable to oscillation. We, therefore, propose three techniques accordingly: statistical weight quantization ($\rm StatsQ$) to improve quantization robustness compared to the prevalent learnable-scale-based method; confidence-guided annealing ($\rm CGA$) that freezes the weights with $\textit{high confidence}$ and calms the oscillating weights; and $\textit{query}$-$\textit{key}$ reparameterization ($\rm QKR$) to resolve the query-key intertwined oscillation and mitigate the resulting gradient misestimation. Extensive experiments demonstrate that these proposed techniques successfully abate weight oscillation and consistently achieve substantial accuracy improvement on ImageNet. Specifically, our 2-bit DeiT-T/DeiT-S algorithms outperform the previous state-of-the-art by 9.8% and 7.7%, respectively. The code is included in the supplementary material and will be released.
    Zero-Shot Robot Manipulation from Passive Human Videos. (arXiv:2302.02011v1 [cs.RO])
    Can we learn robot manipulation for everyday tasks, only by watching videos of humans doing arbitrary tasks in different unstructured settings? Unlike widely adopted strategies of learning task-specific behaviors or direct imitation of a human video, we develop a a framework for extracting agent-agnostic action representations from human videos, and then map it to the agent's embodiment during deployment. Our framework is based on predicting plausible human hand trajectories given an initial image of a scene. After training this prediction model on a diverse set of human videos from the internet, we deploy the trained model zero-shot for physical robot manipulation tasks, after appropriate transformations to the robot's embodiment. This simple strategy lets us solve coarse manipulation tasks like opening and closing drawers, pushing, and tool use, without access to any in-domain robot manipulation trajectories. Our real-world deployment results establish a strong baseline for action prediction information that can be acquired from diverse arbitrary videos of human activities, and be useful for zero-shot robotic manipulation in unseen scenes.
    Backdoor Attacks on Time Series: A Generative Approach. (arXiv:2211.07915v5 [cs.LG] UPDATED)
    Backdoor attacks have emerged as one of the major security threats to deep learning models as they can easily control the model's test-time predictions by pre-injecting a backdoor trigger into the model at training time. While backdoor attacks have been extensively studied on images, few works have investigated the threat of backdoor attacks on time series data. To fill this gap, in this paper we present a novel generative approach for time series backdoor attacks against deep learning based time series classifiers. Backdoor attacks have two main goals: high stealthiness and high attack success rate. We find that, compared to images, it can be more challenging to achieve the two goals on time series. This is because time series have fewer input dimensions and lower degrees of freedom, making it hard to achieve a high attack success rate without compromising stealthiness. Our generative approach addresses this challenge by generating trigger patterns that are as realistic as real-time series patterns while achieving a high attack success rate without causing a significant drop in clean accuracy. We also show that our proposed attack is resistant to potential backdoor defenses. Furthermore, we propose a novel universal generator that can poison any type of time series with a single generator that allows universal attacks without the need to fine-tune the generative model for new time series datasets.
    Federated deep transfer learning for EEG decoding using multiple BCI tasks. (arXiv:2211.10976v3 [eess.SP] UPDATED)
    Deep learning has been successful in BCI decoding. However, it is very data-hungry and requires pooling data from multiple sources. EEG data from various sources decrease the decoding performance due to negative transfer. Recently, transfer learning for EEG decoding has been suggested as a remedy and become subject to recent BCI competitions (e.g. BEETL), but there are two complications in combining data from many subjects. First, privacy is not protected as highly personal brain data needs to be shared (and copied across increasingly tight information governance boundaries). Moreover, BCI data are collected from different sources and are often based on different BCI tasks, which has been thought to limit their reusability. Here, we demonstrate a federated deep transfer learning technique, the Multi-dataset Federated Separate-Common-Separate Network (MF-SCSN) based on our previous work of SCSN, which integrates privacy-preserving properties into deep transfer learning to utilise data sets with different tasks. This framework trains a BCI decoder using different source data sets obtained from different imagery tasks (e.g. some data sets with hands and feet, vs others with single hands and tongue, etc). Therefore, by introducing privacy-preserving transfer learning techniques, we unlock the reusability and scalability of existing BCI data sets. We evaluated our federated transfer learning method on the NeurIPS 2021 BEETL competition BCI task. The proposed architecture outperformed the baseline decoder by 3%. Moreover, compared with the baseline and other transfer learning algorithms, our method protects the privacy of the brain data from different data centres.
    Interaction Order Prediction for Temporal Graphs. (arXiv:2302.02128v1 [cs.SI])
    Link prediction in graphs is a task that has been widely investigated. It has been applied in various domains such as knowledge graph completion, content/item recommendation, social network recommendations and so on. The initial focus of most research was on link prediction in static graphs. However, there has recently been abundant work on modeling temporal graphs, and consequently one of the tasks that has been researched is link prediction in temporal graphs. However, most of the existing work does not focus on the order of link formation, and only predicts the existence of links. In this study, we aim to predict the order of node interactions.
    Representation Deficiency in Masked Language Modeling. (arXiv:2302.02060v1 [cs.CL])
    Masked Language Modeling (MLM) has been one of the most prominent approaches for pretraining bidirectional text encoders due to its simplicity and effectiveness. One notable concern about MLM is that the special $\texttt{[MASK]}$ symbol causes a discrepancy between pretraining data and downstream data as it is present only in pretraining but not in fine-tuning. In this work, we offer a new perspective on the consequence of such a discrepancy: We demonstrate empirically and theoretically that MLM pretraining allocates some model dimensions exclusively for representing $\texttt{[MASK]}$ tokens, resulting in a representation deficiency for real tokens and limiting the pretrained model's expressiveness when it is adapted to downstream data without $\texttt{[MASK]}$ tokens. Motivated by the identified issue, we propose MAE-LM, which pretrains the Masked Autoencoder architecture with MLM where $\texttt{[MASK]}$ tokens are excluded from the encoder. Empirically, we show that MAE-LM improves the utilization of model dimensions for real token representations, and MAE-LM consistently outperforms MLM-pretrained models across different pretraining settings and model sizes when fine-tuned on the GLUE and SQuAD benchmarks.
    Transformers as Algorithms: Generalization and Stability in In-context Learning. (arXiv:2301.07067v2 [cs.LG] UPDATED)
    In-context learning (ICL) is a type of prompting where a transformer model operates on a sequence of (input, output) examples and performs inference on-the-fly. In this work, we formalize in-context learning as an algorithm learning problem where a transformer model implicitly constructs a hypothesis function at inference-time. We first explore the statistical aspects of this abstraction through the lens of multitask learning: We obtain generalization bounds for ICL when the input prompt is (1) a sequence of i.i.d. (input, label) pairs or (2) a trajectory arising from a dynamical system. The crux of our analysis is relating the excess risk to the stability of the algorithm implemented by the transformer. We characterize when transformer/attention architecture provably obeys the stability condition and also provide empirical verification. For generalization on unseen tasks, we identify an inductive bias phenomenon in which the transfer learning risk is governed by the task complexity and the number of MTL tasks in a highly predictable manner. Finally, we provide numerical evaluations that (1) demonstrate transformers can indeed implement near-optimal algorithms on classical regression problems with i.i.d. and dynamic data, (2) provide insights on stability, and (3) verify our theoretical predictions.
    A Non-monotonic Self-terminating Language Model. (arXiv:2210.00660v2 [cs.LG] UPDATED)
    Recent large-scale neural autoregressive sequence models have shown impressive performances on a variety of natural language generation tasks. However, their generated sequences often exhibit degenerate properties such as non-termination, undesirable repetition, and premature termination, when generated with decoding algorithms such as greedy search, beam search, top-$k$ sampling, and nucleus sampling. In this paper, we focus on the problem of non-terminating sequences resulting from an incomplete decoding algorithm. We first define an incomplete probable decoding algorithm which includes greedy search, top-$k$ sampling, and nucleus sampling, beyond the incomplete decoding algorithm originally put forward by Welleck et al. (2020). We then propose a non-monotonic self-terminating language model, which significantly relaxes the constraint of monotonically increasing termination probability in the originally proposed self-terminating language model by Welleck et al. (2020), to address the issue of non-terminating sequences when using incomplete probable decoding algorithms. We prove that our proposed model prevents non-terminating sequences when using not only incomplete probable decoding algorithms but also beam search. We empirically validate our model on sequence completion tasks with various architectures.
    This Intestine Does Not Exist: Multiscale Residual Variational Autoencoder for Realistic Wireless Capsule Endoscopy Image Generation. (arXiv:2302.02150v1 [cs.CV])
    Medical image synthesis has emerged as a promising solution to address the limited availability of annotated medical data needed for training machine learning algorithms in the context of image-based Clinical Decision Support (CDS) systems. To this end, Generative Adversarial Networks (GANs) have been mainly applied to support the algorithm training process by generating synthetic images for data augmentation. However, in the field of Wireless Capsule Endoscopy (WCE), the limited content diversity and size of existing publicly available annotated datasets, adversely affect both the training stability and synthesis performance of GANs. Aiming to a viable solution for WCE image synthesis, a novel Variational Autoencoder architecture is proposed, namely "This Intestine Does not Exist" (TIDE). The proposed architecture comprises multiscale feature extraction convolutional blocks and residual connections, which enable the generation of high-quality and diverse datasets even with a limited number of training images. Contrary to the current approaches, which are oriented towards the augmentation of the available datasets, this study demonstrates that using TIDE, real WCE datasets can be fully substituted by artificially generated ones, without compromising classification performance. Furthermore, qualitative and user evaluation studies by experienced WCE specialists, validate from a medical viewpoint that both the normal and abnormal WCE images synthesized by TIDE are sufficiently realistic.
    Hierarchical Learning with Unsupervised Skill Discovery for Highway Merging Applications. (arXiv:2302.02179v1 [cs.LG])
    Driving in dense traffic with human and autonomous drivers is a challenging task that requires high level planning and reasoning along with the ability to react quickly to changes in a dynamic environment. In this study, we propose a hierarchical learning approach that uses learned motion primitives as actions. Motion primitives are obtained using unsupervised skill discovery without a predetermined reward function, allowing them to be reused in different scenarios. This can reduce the total training time for applications that need to obtain multiple models with varying behavior. Simulation results demonstrate that the proposed approach yields driver models that achieve higher performance with less training compared to baseline reinforcement learning methods.
    Model-Aware Contrastive Learning: Towards Escaping the Dilemmas. (arXiv:2207.07874v3 [cs.LG] UPDATED)
    Contrastive learning (CL) continuously achieves significant breakthroughs across multiple domains. However, the most common InfoNCE based methods suffer from some existing dilemmas, e.g., uniformity-tolerance dilemma (UTD) and the gradient reduction. It has been identified that UTD can lead to unexpected performance degradation. We argue that the fixity of temperature is to blame for UTD. To tackle this challenge, we enrich the CL loss family by presenting a Model-Aware Contrastive Learning (MACL) strategy, whose temperature is adaptive to the magnitude of alignment that reflects the basic confidence of the instance discrimination task, then enables CL loss to adjust the penalty strength for hard negatives adaptively. Regarding another dilemma, the gradient reduction issue, we derive the limits of an involved gradient scaling factor, which allows us to explain from a unified perspective why some recent approaches are effective with fewer negative samples, and summarily present a gradient reweighting to escape this dilemma. Extensive remarkable empirical results in vision, sentence, and graph modality validate our approach's general improvement for representation learning and downstream tasks.
    Extracting the gamma-ray source-count distribution below the Fermi-LAT detection limit with deep learning. (arXiv:2302.01947v1 [astro-ph.CO])
    We reconstruct the extra-galactic gamma-ray source-count distribution, or $dN/dS$, of resolved and unresolved sources by adopting machine learning techniques. Specifically, we train a convolutional neural network on synthetic 2-dimensional sky-maps, which are built by varying parameters of underlying source-counts models and incorporate the Fermi-LAT instrumental response functions. The trained neural network is then applied to the Fermi-LAT data, from which we estimate the source count distribution down to flux levels a factor of 50 below the Fermi-LAT threshold. We perform our analysis using 14 years of data collected in the $(1,10)$ GeV energy range. The results we obtain show a source count distribution which, in the resolved regime, is in excellent agreement with the one derived from catalogued sources, and then extends as $dN/dS \sim S^{-2}$ in the unresolved regime, down to fluxes of $5 \cdot 10^{-12}$ cm$^{-2}$ s$^{-1}$. The neural network architecture and the devised methodology have the flexibility to enable future analyses to study the energy dependence of the source-count distribution.
    An Asymptotically Optimal Algorithm for the One-Dimensional Convex Hull Feasibility Problem. (arXiv:2302.02033v1 [stat.ML])
    This work studies the pure-exploration setting for the convex hull feasibility (CHF) problem where one aims to efficiently and accurately determine if a given point lies in the convex hull of means of a finite set of distributions. We give a complete characterization of the sample complexity of the CHF problem in the one-dimensional setting. We present the first asymptotically optimal algorithm called Thompson-CHF, whose modular design consists of a stopping rule and a sampling rule. In addition, we provide an extension of the algorithm that generalizes several important problems in the multi-armed bandit literature. Finally, we further investigate the Gaussian bandit case with unknown variances and address how the Thompson-CHF algorithm can be adjusted to be asymptotically optimal in this setting.
    Matrix Estimation for Individual Fairness. (arXiv:2302.02096v1 [cs.LG])
    In recent years, multiple notions of algorithmic fairness have arisen. One such notion is individual fairness (IF), which requires that individuals who are similar receive similar treatment. In parallel, matrix estimation (ME) has emerged as a natural paradigm for handling noisy data with missing values. In this work, we connect the two concepts. We show that pre-processing data using ME can improve an algorithm's IF without sacrificing performance. Specifically, we show that using a popular ME method known as singular value thresholding (SVT) to pre-process the data provides a strong IF guarantee under appropriate conditions. We then show that, under analogous conditions, SVT pre-processing also yields estimates that are consistent and approximately minimax optimal. As such, the ME pre-processing step does not, under the stated conditions, increase the prediction error of the base algorithm, i.e., does not impose a fairness-performance trade-off. We verify these results on synthetic and real data.
    GAN-based federated learning for label protection in binary classification. (arXiv:2302.02245v1 [cs.LG])
    As an emerging technique, vertical federated learning collaborates with different data sources to jointly train a machine learning model without data exchange. However, federated learning is computationally expensive and inefficient in modeling due to complex encryption algorithms and secure computation protocols. Split learning offers an alternative solution to circumvent these challenges. Despite this, vanilla split learning still suffers privacy leakage. Here, we propose the Generative Adversarial Federated Model (GAFM), which integrates the vanilla split learning framework with the Generative Adversarial Network (GAN) for protection against label leakage from gradients in binary classification tasks. We compare our proposal to existing models, including Marvell, Max Norm, and SplitNN, on three publicly available datasets, where GAFM shows significant improvement regarding the trade-off between classification accuracy and label privacy protection. We also provide heuristic justification for why GAFM can improve over baselines and demonstrate that GAFM offers label protection through gradient perturbation compared to SplitNN.
    Mean-field analysis for heavy ball methods: Dropout-stability, connectivity, and global convergence. (arXiv:2210.06819v2 [cs.LG] UPDATED)
    The stochastic heavy ball method (SHB), also known as stochastic gradient descent (SGD) with Polyak's momentum, is widely used in training neural networks. However, despite the remarkable success of such algorithm in practice, its theoretical characterization remains limited. In this paper, we focus on neural networks with two and three layers and provide a rigorous understanding of the properties of the solutions found by SHB: \emph{(i)} stability after dropping out part of the neurons, \emph{(ii)} connectivity along a low-loss path, and \emph{(iii)} convergence to the global optimum. To achieve this goal, we take a mean-field view and relate the SHB dynamics to a certain partial differential equation in the limit of large network widths. This mean-field perspective has inspired a recent line of work focusing on SGD while, in contrast, our paper considers an algorithm with momentum. More specifically, after proving existence and uniqueness of the limit differential equations, we show convergence to the global optimum and give a quantitative bound between the mean-field limit and the SHB dynamics of a finite-width network. Armed with this last bound, we are able to establish the dropout-stability and connectivity of SHB solutions.
    A System for Morphology-Task Generalization via Unified Representation and Behavior Distillation. (arXiv:2211.14296v2 [cs.LG] UPDATED)
    The rise of generalist large-scale models in natural language and vision has made us expect that a massive data-driven approach could achieve broader generalization in other domains such as continuous control. In this work, we explore a method for learning a single policy that manipulates various forms of agents to solve various tasks by distilling a large amount of proficient behavioral data. In order to align input-output (IO) interface among multiple tasks and diverse agent morphologies while preserving essential 3D geometric relations, we introduce morphology-task graph, which treats observations, actions and goals/task in a unified graph representation. We also develop MxT-Bench for fast large-scale behavior generation, which supports procedural generation of diverse morphology-task combinations with a minimal blueprint and hardware-accelerated simulator. Through efficient representation and architecture selection on MxT-Bench, we find out that a morphology-task graph representation coupled with Transformer architecture improves the multi-task performances compared to other baselines including recent discrete tokenization, and provides better prior knowledge for zero-shot transfer or sample efficiency in downstream multi-task imitation learning. Our work suggests large diverse offline datasets, unified IO representation, and policy representation and architecture selection through supervised learning form a promising approach for studying and advancing morphology-task generalization.
    GEDI: GEnerative and DIscriminative Training for Self-Supervised Learning. (arXiv:2212.13425v3 [cs.LG] UPDATED)
    Self-supervised learning is a popular and powerful method for utilizing large amounts of unlabeled data, for which a wide variety of training objectives have been proposed in the literature. In this study, we perform a Bayesian analysis of state-of-the-art self-supervised learning objectives and propose a unified formulation based on likelihood learning. Our analysis suggests a simple method for integrating self-supervised learning with generative models, allowing for the joint training of these two seemingly distinct approaches. We refer to this combined framework as GEDI, which stands for GEnerative and DIscriminative training. Additionally, we demonstrate an instantiation of the GEDI framework by integrating an energy-based model with a cluster-based self-supervised learning model. Through experiments on synthetic and real-world data, including SVHN, CIFAR10, and CIFAR100, we show that GEDI outperforms existing self-supervised learning strategies in terms of clustering performance by a wide margin. We also demonstrate that GEDI can be integrated into a neural-symbolic framework to address tasks in the small data regime, where it can use logical constraints to further improve clustering and classification performance.
    Physics-informed Neural Networks approach to solve the Blasius function. (arXiv:2301.00106v2 [cs.LG] UPDATED)
    Deep learning techniques with neural networks have been used effectively in computational fluid dynamics (CFD) to obtain solutions to nonlinear differential equations. This paper presents a physics-informed neural network (PINN) approach to solve the Blasius function. This method eliminates the process of changing the non-linear differential equation to an initial value problem. Also, it tackles the convergence issue arising in the conventional series solution. It is seen that this method produces results that are at par with the numerical and conventional methods. The solution is extended to the negative axis to show that PINNs capture the singularity of the function at $\eta=-5.69$
    A survey on knowledge-enhanced multimodal learning. (arXiv:2211.12328v2 [cs.LG] UPDATED)
    Multimodal learning has been a field of increasing interest, aiming to combine various modalities in a single joint representation. Especially in the area of visiolinguistic (VL) learning multiple models and techniques have been developed, targeting a variety of tasks that involve images and text. VL models have reached unprecedented performances by extending the idea of Transformers, so that both modalities can learn from each other. Massive pre-training procedures enable VL models to acquire a certain level of real-world understanding, although many gaps can be identified: the limited comprehension of commonsense, factual, temporal and other everyday knowledge aspects questions the extendability of VL tasks. Knowledge graphs and other knowledge sources can fill those gaps by explicitly providing missing information, unlocking novel capabilities of VL models. In the same time, knowledge graphs enhance explainability, fairness and validity of decision making, issues of outermost importance for such complex implementations. The current survey aims to unify the fields of VL representation learning and knowledge graphs, and provides a taxonomy and analysis of knowledge-enhanced VL models.
    Inferencing the earth moving equipment-environment interaction in open pit mining. (arXiv:2302.02130v1 [cs.LG])
    In mining, grade control generally focuses on blast hole sampling and the estimation of ore control block models with little or no attention given to how the materials are being excavated from the ground. In the process of loading trucks, the underlying variability of the individual bucket load will determine the variability of truck payload. Hence, accurate material movement demands a good knowledge of the excavation process and the buckets interaction with the environment. However, equipment frequently goes into off nominal states due to unexpected delays, disturbances or faults. The large amount of such disturbances causes information loss that reduces the statistical power and biases estimates, leading to increased uncertainty in the production. A reliable method that inferences the missing knowledge about the interaction between the machine and the environment from the available data sources, is vital to accurately model the material movement. In this study, a twostep method was implemented that performed unsupervised clustering and then predicted the missing information. The first method is DBSCAN based spatial clustering which divides the diggers and buckets positional data into connected loading segments. Clear patterns of segmented bucket dig positions were observed. The second model utilized Gaussian process regression which was trained with the clustered data and the model was then used to infer the mean locations of the test clusters. Bucket dig locations were then simulated at the inferred mean locations for different durations and compared against the known bucket dig locations. This method was tested at an open pit mine in the Pilbara of Western Australia. The results demonstrate the advantage of the proposed method in inferencing the missing information of bucket environment interactions and therefore enables miners to continuously track the material movement.
    LAP: An Attention-Based Module for Faithful Interpretation and Knowledge Injection in Convolutional Neural Networks. (arXiv:2201.11808v4 [cs.CV] UPDATED)
    Despite the state-of-the-art performance of deep convolutional neural networks, they are susceptible to bias and malfunction in unseen situations. The complex computation behind their reasoning is not sufficiently human-understandable to develop trust. External explainer methods have tried to interpret the network decisions in a human-understandable way, but they are accused of fallacies due to their assumptions and simplifications. On the other side, the inherent self-interpretability of models, while being more robust to the mentioned fallacies, cannot be applied to the already trained models. In this work, we propose a new attention-based pooling layer, called Local Attention Pooling (LAP), that accomplishes self-interpretability and the possibility for knowledge injection while improving the model's performance. Moreover, several weakly-supervised knowledge injection methodologies are provided to enhance the process of training. We verified our claims by evaluating several LAP-extended models on three different datasets, including Imagenet. The proposed framework offers more valid human-understandable and more faithful-to-the-model interpretations than the commonly used white-box explainer methods.
    CECT: Controllable Ensemble CNN and Transformer for COVID-19 image classification by capturing both local and global image features. (arXiv:2302.02314v1 [eess.IV])
    Purpose: Most computer vision models are developed based on either convolutional neural network (CNN) or transformer, while the former (latter) method captures local (global) features. To relieve model performance limitations due to the lack of global (local) features, we develop a novel classification network named CECT by controllable ensemble CNN and transformer. Methods: The proposed CECT is composed of a CNN-based encoder block, a deconvolution-ensemble decoder block, and a transformer-based classification block. Different from conventional CNN- or transformer-based methods, our CECT can capture features at both multi-local and global scales, and the contribution of local features at different scales can be controlled with the proposed ensemble coefficients. Results: We evaluate CECT on two public COVID-19 datasets and it outperforms other state-of-the-art methods on all evaluation metrics. Conclusion: With remarkable feature capture ability, we believe CECT can also be used in other medical image classification scenarios to assist the diagnosis.
    A Theory of Link Prediction via Relational Weisfeiler-Leman. (arXiv:2302.02209v1 [cs.LG])
    Graph neural networks are prominent models for representation learning over graph-structured data. While the capabilities and limitations of these models are well-understood for simple graphs, our understanding remains highly incomplete in the context of knowledge graphs. The goal of this work is to provide a systematic understanding of the landscape of graph neural networks for knowledge graphs pertaining the prominent task of link prediction. Our analysis entails a unifying perspective on seemingly unrelated models, and unlocks a series of other models. The expressive power of various models is characterized via a corresponding relational Weisfeiler-Leman algorithm with different initialization regimes. This analysis is extended to provide a precise logical characterization of the class of functions captured by a class of graph neural networks. Our theoretical findings explain the benefits of some widely employed practical design choices, which are validated empirically.
    An Uncertainty-aware Loss Function for Training Neural Networks with Calibrated Predictions. (arXiv:2110.03260v2 [cs.LG] UPDATED)
    Uncertainty quantification of machine learning and deep learning methods plays an important role in enhancing trust to the obtained result. In recent years, a numerous number of uncertainty quantification methods have been introduced. Monte Carlo dropout (MC-Dropout) is one of the most well-known techniques to quantify uncertainty in deep learning methods. In this study, we propose two new loss functions by combining cross entropy with Expected Calibration Error (ECE) and Predictive Entropy (PE). The obtained results clearly show that the new proposed loss functions lead to having a calibrated MC-Dropout method. Our results confirmed the great impact of the new hybrid loss functions for minimising the overlap between the distributions of uncertainty estimates for correct and incorrect predictions without sacrificing the model's overall performance.
    Extending Bootstrap AMG for Clustering of Attributed Graphs. (arXiv:2109.09367v4 [cs.LG] UPDATED)
    In this paper we propose a new approach to detect clusters in undirected graphs with attributed vertices. We incorporate structural and attribute similarities between the vertices in an augmented graph by creating additional vertices and edges as proposed in [1, 2]. The augmented graph is then embedded in a Euclidean space associated to its Laplacian and we cluster vertices via a modified K-means algorithm, using a new vector-valued distance in the embedding space. Main novelty of our method, which can be classified as an early fusion method, i.e., a method in which additional information on vertices are fused to the structure information before applying clustering, is the interpretation of attributes as new realizations of graph vertices, which can be dealt with as coordinate vectors in a related Euclidean space. This allows us to extend a scalable generalized spectral clustering procedure which substitutes graph Laplacian eigenvectors with some vectors, named algebraically smooth vectors, obtained by a linear-time complexity Algebraic MultiGrid (AMG) method. We discuss the performance of our proposed clustering method by comparison with recent literature approaches and public available results. Extensive experiments on different types of synthetic datasets and real-world attributed graphs show that our new algorithm, embedding attributes information in the clustering, outperforms structure-only-based methods, when the attributed network has an ambiguous structure. Furthermore, our new method largely outperforms the method which originally proposed the graph augmentation, showing that our embedding strategy and vector-valued distance are very effective in taking advantages from the augmented-graph representation.
    Parallelizing Contextual Bandits. (arXiv:2105.10590v2 [stat.ML] UPDATED)
    Standard approaches to decision-making under uncertainty focus on sequential exploration of the space of decisions. However, \textit{simultaneously} proposing a batch of decisions, which leverages available resources for parallel experimentation, has the potential to rapidly accelerate exploration. We present a family of (parallel) contextual bandit algorithms applicable to problems with bounded eluder dimension whose regret is nearly identical to their perfectly sequential counterparts -- given access to the same total number of oracle queries -- up to a lower-order ``burn-in" term. We further show these algorithms can be specialized to the class of linear reward functions where we introduce and analyze several new linear bandit algorithms which explicitly introduce diversity into their action selection. Finally, we also present an empirical evaluation of these parallel algorithms in several domains, including materials discovery and biological sequence design problems, to demonstrate the utility of parallelized bandits in practical settings.
    Adversarial Learning Data Augmentation for Graph Contrastive Learning in Recommendation. (arXiv:2302.02317v1 [cs.IR])
    Recently, Graph Neural Networks (GNNs) achieve remarkable success in Recommendation. To reduce the influence of data sparsity, Graph Contrastive Learning (GCL) is adopted in GNN-based CF methods for enhancing performance. Most GCL methods consist of data augmentation and contrastive loss (e.g., InfoNCE). GCL methods construct the contrastive pairs by hand-crafted graph augmentations and maximize the agreement between different views of the same node compared to that of other nodes, which is known as the InfoMax principle. However, improper data augmentation will hinder the performance of GCL. InfoMin principle, that the good set of views shares minimal information and gives guidelines to design better data augmentation. In this paper, we first propose a new data augmentation (i.e., edge-operating including edge-adding and edge-dropping). Then, guided by InfoMin principle, we propose a novel theoretical guiding contrastive learning framework, named Learnable Data Augmentation for Graph Contrastive Learning (LDA-GCL). Our methods include data augmentation learning and graph contrastive learning, which follow the InfoMin and InfoMax principles, respectively. In implementation, our methods optimize the adversarial loss function to learn data augmentation and effective representations of users and items. Extensive experiments on four public benchmark datasets demonstrate the effectiveness of LDA-GCL.
    Bag-of-Vectors Autoencoders for Unsupervised Conditional Text Generation. (arXiv:2110.07002v2 [cs.CL] UPDATED)
    Text autoencoders are often used for unsupervised conditional text generation by applying mappings in the latent space to change attributes to the desired values. Recently, Mai et al. (2020) proposed Emb2Emb, a method to learn these mappings in the embedding space of an autoencoder. However, their method is restricted to autoencoders with a single-vector embedding, which limits how much information can be retained. We address this issue by extending their method to Bag-of-Vectors Autoencoders (BoV-AEs), which encode the text into a variable-size bag of vectors that grows with the size of the text, as in attention-based models. This allows to encode and reconstruct much longer texts than standard autoencoders. Analogous to conventional autoencoders, we propose regularization techniques that facilitate learning meaningful operations in the latent space. Finally, we adapt Emb2Emb for a training scheme that learns to map an input bag to an output bag, including a novel loss function and neural architecture. Our empirical evaluations on unsupervised sentiment transfer show that our method performs substantially better than a standard autoencoder.
    Counterfactual Risk Assessments under Unmeasured Confounding. (arXiv:2212.09844v2 [econ.EM] UPDATED)
    Statistical risk assessments inform consequential decisions, such as pretrial release in criminal justice and loan approvals in consumer finance, by counterfactually predicting an outcome under a proposed decision (e.g., would the applicant default if we approved this loan?). There may, however, have been unmeasured confounders that jointly affected decisions and outcomes in the historical data. We propose a mean outcome sensitivity model that bounds the extent to which unmeasured confounders could affect outcomes on average. The mean outcome sensitivity model partially identifies the conditional likelihood of the outcome under the proposed decision, popular predictive performance metrics, and predictive disparities. We derive their identified sets and develop procedures for the confounding-robust learning and evaluation of statistical risk assessments. We propose a nonparametric regression procedure for the bounds on the conditional likelihood of the outcome under the proposed decision, and estimators for the bounds on predictive performance and disparities. Applying our methods to a real-world credit-scoring task, we show how varying assumptions on unmeasured confounding lead to substantive changes in the credit score's predictions and evaluations of its predictive disparities.
    On a continuous time model of gradient descent dynamics and instability in deep learning. (arXiv:2302.01952v1 [stat.ML])
    The recipe behind the success of deep learning has been the combination of neural networks and gradient-based optimization. Understanding the behavior of gradient descent however, and particularly its instability, has lagged behind its empirical success. To add to the theoretical tools available to study gradient descent we propose the principal flow (PF), a continuous time flow that approximates gradient descent dynamics. To our knowledge, the PF is the only continuous flow that captures the divergent and oscillatory behaviors of gradient descent, including escaping local minima and saddle points. Through its dependence on the eigendecomposition of the Hessian the PF sheds light on the recently observed edge of stability phenomena in deep learning. Using our new understanding of instability we propose a learning rate adaptation method which enables us to control the trade-off between training stability and test set evaluation performance.
    Guaranteed Tensor Recovery Fused Low-rankness and Smoothness. (arXiv:2302.02155v1 [cs.LG])
    The tensor data recovery task has thus attracted much research attention in recent years. Solving such an ill-posed problem generally requires to explore intrinsic prior structures underlying tensor data, and formulate them as certain forms of regularization terms for guiding a sound estimate of the restored tensor. Recent research have made significant progress by adopting two insightful tensor priors, i.e., global low-rankness (L) and local smoothness (S) across different tensor modes, which are always encoded as a sum of two separate regularization terms into the recovery models. However, unlike the primary theoretical developments on low-rank tensor recovery, these joint L+S models have no theoretical exact-recovery guarantees yet, making the methods lack reliability in real practice. To this crucial issue, in this work, we build a unique regularization term, which essentially encodes both L and S priors of a tensor simultaneously. Especially, by equipping this single regularizer into the recovery models, we can rigorously prove the exact recovery guarantees for two typical tensor recovery tasks, i.e., tensor completion (TC) and tensor robust principal component analysis (TRPCA). To the best of our knowledge, this should be the first exact-recovery results among all related L+S methods for tensor recovery. Significant recovery accuracy improvements over many other SOTA methods in several TC and TRPCA tasks with various kinds of visual tensor data are observed in extensive experiments. Typically, our method achieves a workable performance when the missing rate is extremely large, e.g., 99.5%, for the color image inpainting task, while all its peers totally fail in such challenging case.
    Dueling RL: Reinforcement Learning with Trajectory Preferences. (arXiv:2111.04850v3 [cs.LG] UPDATED)
    We consider the problem of preference based reinforcement learning (PbRL), where, unlike traditional reinforcement learning, an agent receives feedback only in terms of a 1 bit (0/1) preference over a trajectory pair instead of absolute rewards for them. The success of the traditional RL framework crucially relies on the underlying agent-reward model, which, however, depends on how accurately a system designer can express an appropriate reward function and often a non-trivial task. The main novelty of our framework is the ability to learn from preference-based trajectory feedback that eliminates the need to hand-craft numeric reward models. This paper sets up a formal framework for the PbRL problem with non-markovian rewards, where the trajectory preferences are encoded by a generalized linear model of dimension $d$. Assuming the transition model is known, we then propose an algorithm with almost optimal regret guarantee of $\tilde {\mathcal{O}}\left( SH d \log (T / \delta) \sqrt{T} \right)$. We further, extend the above algorithm to the case of unknown transition dynamics, and provide an algorithm with near optimal regret guarantee $\widetilde{\mathcal{O}}((\sqrt{d} + H^2 + |\mathcal{S}|)\sqrt{dT} +\sqrt{|\mathcal{S}||\mathcal{A}|TH} )$. To the best of our knowledge, our work is one of the first to give tight regret guarantees for preference based RL problems with trajectory preferences.
    IoT Botnet Detection Using an Economic Deep Learning Model. (arXiv:2302.02013v1 [cs.CR])
    The rapid progress in technology innovation usage and distribution has increased in the last decade. The rapid growth of the Internet of Things (IoT) systems worldwide has increased network security challenges created by malicious third parties. Thus, reliable intrusion detection and network forensics systems that consider security concerns and IoT systems limitations are essential to protect such systems. IoT botnet attacks are one of the significant threats to enterprises and individuals. Thus, this paper proposed an economic deep learning-based model for detecting IoT botnet attacks along with different types of attacks. The proposed model achieved higher accuracy than the state-of-the-art detection models using a smaller implementation budget and accelerating the training and detecting processes.
    Self-Supervised Transformer Architecture for Change Detection in Radio Access Networks. (arXiv:2302.02025v1 [cs.LG])
    Radio Access Networks (RANs) for telecommunications represent large agglomerations of interconnected hardware consisting of hundreds of thousands of transmitting devices (cells). Such networks undergo frequent and often heterogeneous changes caused by network operators, who are seeking to tune their system parameters for optimal performance. The effects of such changes are challenging to predict and will become even more so with the adoption of 5G/6G networks. Therefore, RAN monitoring is vital for network operators. We propose a self-supervised learning framework that leverages self-attention and self-distillation for this task. It works by detecting changes in Performance Measurement data, a collection of time-varying metrics which reflect a set of diverse measurements of the network performance at the cell level. Experimental results show that our approach outperforms the state of the art by 4% on a real-world based dataset consisting of about hundred thousands timeseries. It also has the merits of being scalable and generalizable. This allows it to provide deep insight into the specifics of mode of operation changes while relying minimally on expert knowledge.
    Generalization of Deep Reinforcement Learning for Jammer-Resilient Frequency and Power Allocation. (arXiv:2302.02250v1 [cs.NI])
    We tackle the problem of joint frequency and power allocation while emphasizing the generalization capability of a deep reinforcement learning model. Most of the existing methods solve reinforcement learning-based wireless problems for a specific pre-determined wireless network scenario. The performance of a trained agent tends to be very specific to the network and deteriorates when used in a different network operating scenario (e.g., different in size, neighborhood, and mobility, among others). We demonstrate our approach to enhance training to enable a higher generalization capability during inference of the deployed model in a distributed multi-agent setting in a hostile jamming environment. With all these, we show the improved training and inference performance of the proposed methods when tested on previously unseen simulated wireless networks of different sizes and architectures. More importantly, to prove practical impact, the end-to-end solution was implemented on the embedded software-defined radio and validated using over-the-air evaluation.
    Counterfactual Identifiability of Bijective Causal Models. (arXiv:2302.02228v1 [stat.ML])
    We study counterfactual identifiability in causal models with bijective generation mechanisms (BGM), a class that generalizes several widely-used causal models in the literature. We establish their counterfactual identifiability for three common causal structures with unobserved confounding, and propose a practical learning method that casts learning a BGM as structured generative modeling. Learned BGMs enable efficient counterfactual estimation and can be obtained using a variety of deep conditional generative models. We evaluate our techniques in a visual task and demonstrate its application in a real-world video streaming simulation task.
    Cross-Frequency Time Series Meta-Forecasting. (arXiv:2302.02077v1 [cs.LG])
    Meta-forecasting is a newly emerging field which combines meta-learning and time series forecasting. The goal of meta-forecasting is to train over a collection of source time series and generalize to new time series one-at-a-time. Previous approaches in meta-forecasting achieve competitive performance, but with the restriction of training a separate model for each sampling frequency. In this work, we investigate meta-forecasting over different sampling frequencies, and introduce a new model, the Continuous Frequency Adapter (CFA), specifically designed to learn frequency-invariant representations. We find that CFA greatly improves performance when generalizing to unseen frequencies, providing a first step towards forecasting over larger multi-frequency datasets.
    Augmenting Rule-based DNS Censorship Detection at Scale with Machine Learning. (arXiv:2302.02031v1 [cs.LG])
    The proliferation of global censorship has led to the development of a plethora of measurement platforms to monitor and expose it. Censorship of the domain name system (DNS) is a key mechanism used across different countries. It is currently detected by applying heuristics to samples of DNS queries and responses (probes) for specific destinations. These heuristics, however, are both platform-specific and have been found to be brittle when censors change their blocking behavior, necessitating a more reliable automated process for detecting censorship. In this paper, we explore how machine learning (ML) models can (1) help streamline the detection process, (2) improve the usability of large-scale datasets for censorship detection, and (3) discover new censorship instances and blocking signatures missed by existing heuristic methods. Our study shows that supervised models, trained using expert-derived labels on instances of known anomalies and possible censorship, can learn the detection heuristics employed by different measurement platforms. More crucially, we find that unsupervised models, trained solely on uncensored instances, can identify new instances and variations of censorship missed by existing heuristics. Moreover, both methods demonstrate the capability to uncover a substantial number of new DNS blocking signatures, i.e., injected fake IP addresses overlooked by existing heuristics. These results are underpinned by an important methodological finding: comparing the outputs of models trained using the same probes but with labels arising from independent processes allows us to more reliably detect cases of censorship in the absence of ground-truth labels of censorship.
    On Representation Knowledge Distillation for Graph Neural Networks. (arXiv:2111.04964v4 [cs.LG] UPDATED)
    Knowledge distillation is a learning paradigm for boosting resource-efficient graph neural networks (GNNs) using more expressive yet cumbersome teacher models. Past work on distillation for GNNs proposed the Local Structure Preserving loss (LSP), which matches local structural relationships defined over edges across the student and teacher's node embeddings. This paper studies whether preserving the global topology of how the teacher embeds graph data can be a more effective distillation objective for GNNs, as real-world graphs often contain latent interactions and noisy edges. We propose Graph Contrastive Representation Distillation (G-CRD), which uses contrastive learning to implicitly preserve global topology by aligning the student node embeddings to those of the teacher in a shared representation space. Additionally, we introduce an expanded set of benchmarks on large-scale real-world datasets where the performance gap between teacher and student GNNs is non-negligible. Experiments across 4 datasets and 14 heterogeneous GNN architectures show that G-CRD consistently boosts the performance and robustness of lightweight GNNs, outperforming LSP (and a global structure preserving variant of LSP) as well as baselines from 2D computer vision. An analysis of the representational similarity among teacher and student embedding spaces reveals that G-CRD balances preserving local and global relationships, while structure preserving approaches are best at preserving one or the other. Our code is available at https://github.com/chaitjo/efficient-gnns
    Support Vector Regression: Risk Quadrangle Framework. (arXiv:2212.09178v3 [stat.ML] UPDATED)
    This paper investigates Support Vector Regression (SVR) in the context of the fundamental risk quadrangle paradigm. It is shown that both formulations of SVR, $\varepsilon$-SVR and $\nu$-SVR, correspond to the minimization of equivalent regular error measures (Vapnik error and superquantile (CVaR) norm, respectively) with a regularization penalty. These error measures, in turn, give rise to corresponding risk quadrangles. By constructing the fundamental risk quadrangle, which corresponds to SVR, we show that SVR is the asymptotically unbiased estimator of the average of two symmetric conditional quantiles. Furthermore, the technique used for the construction of quadrangles serves as a powerful tool in proving the equivalence between $\varepsilon$-SVR and $\nu$-SVR. Additionally, SVR is formulated as a regular deviation minimization problem with a regularization penalty by invoking Error Shaping Decomposition of Regression and the dual formulation of SVR in the risk quadrangle framework is derived.
    Fixed-kinetic Neural Hamiltonian Flows for enhanced interpretability and reduced complexity. (arXiv:2302.01955v1 [cs.LG])
    Normalizing Flows (NF) are Generative models which are particularly robust and allow for exact sampling of the learned distribution. They however require the design of an invertible mapping, whose Jacobian determinant has to be computable. Recently introduced, Neural Hamiltonian Flows (NHF) are based on Hamiltonian dynamics-based Flows, which are continuous, volume-preserving and invertible and thus make for natural candidates for robust NF architectures. In particular, their similarity to classical Mechanics could lead to easier interpretability of the learned mapping. However, despite being Physics-inspired architectures, the originally introduced NHF architecture still poses a challenge to interpretability. For this reason, in this work, we introduce a fixed kinetic energy version of the NHF model. Inspired by physics, our approach improves interpretability and requires less parameters than previously proposed architectures. We then study the robustness of the NHF architectures to the choice of hyperparameters. We analyze the impact of the number of leapfrog steps, the integration time and the number of neurons per hidden layer, as well as the choice of prior distribution, on sampling a multimodal 2D mixture. The NHF architecture is robust to these choices, especially the fixed-kinetic energy model. Finally, we adapt NHF to the context of Bayesian inference and illustrate our method on sampling the posterior distribution of two cosmological parameters knowing type Ia supernovae observations.
  • Open

    Proper Learning of Linear Dynamical Systems as a Non-Commutative Polynomial Optimisation Problem. (arXiv:2002.01444v4 [math.OC] UPDATED)
    There has been much recent progress in forecasting the next observation of a linear dynamical system (LDS), which is known as the improper learning, as well as in the estimation of its system matrices, which is known as the proper learning of LDS. We present an approach to proper learning of LDS, which in spite of the non-convexity of the problem, guarantees global convergence of numerical solutions to a least-squares estimator. We present promising computational results.  ( 2 min )
    Toward Large Kernel Models. (arXiv:2302.02605v1 [cs.LG])
    Recent studies indicate that kernel machines can often perform similarly or better than deep neural networks (DNNs) on small datasets. The interest in kernel machines has been additionally bolstered by the discovery of their equivalence to wide neural networks in certain regimes. However, a key feature of DNNs is their ability to scale the model size and training data size independently, whereas in traditional kernel machines model size is tied to data size. Because of this coupling, scaling kernel machines to large data has been computationally challenging. In this paper, we provide a way forward for constructing large-scale general kernel models, which are a generalization of kernel machines that decouples the model and data, allowing training on large datasets. Specifically, we introduce EigenPro 3.0, an algorithm based on projected dual preconditioned SGD and show scaling to model and data sizes which have not been possible with existing kernel methods.  ( 2 min )
    Sampling-Based Accuracy Testing of Posterior Estimators for General Inference. (arXiv:2302.03026v1 [stat.ML])
    Parameter inference, i.e. inferring the posterior distribution of the parameters of a statistical model given some data, is a central problem to many scientific disciplines. Posterior inference with generative models is an alternative to methods such as Markov Chain Monte Carlo, both for likelihood-based and simulation-based inference. However, assessing the accuracy of posteriors encoded in generative models is not straightforward. In this paper, we introduce `distance to random point' (DRP) coverage testing as a method to estimate coverage probabilities of generative posterior estimators. Our method differs from previously-existing coverage-based methods, which require posterior evaluations. We prove that our approach is necessary and sufficient to show that a posterior estimator is optimal. We demonstrate the method on a variety of synthetic examples, and show that DRP can be used to test the results of posterior inference analyses in high-dimensional spaces. We also show that our method can detect non-optimal inferences in cases where existing methods fail.  ( 2 min )
    Identifiability of latent-variable and structural-equation models: from linear to nonlinear. (arXiv:2302.02672v1 [stat.ML])
    An old problem in multivariate statistics is that linear Gaussian models are often unidentifiable, i.e. some parameters cannot be uniquely estimated. In factor analysis, an orthogonal rotation of the factors is unidentifiable, while in linear regression, the direction of effect cannot be identified. For such linear models, non-Gaussianity of the (latent) variables has been shown to provide identifiability. In the case of factor analysis, this leads to independent component analysis, while in the case of the direction of effect, non-Gaussian versions of structural equation modelling solve the problem. More recently, we have shown how even general nonparametric nonlinear versions of such models can be estimated. Non-Gaussianity is not enough in this case, but assuming we have time series, or that the distributions are suitably modulated by some observed auxiliary variables, the models are identifiable. This paper reviews the identifiability theory for the linear and nonlinear cases, considering both factor analytic models and structural equation models.
    Counterfactual Identifiability of Bijective Causal Models. (arXiv:2302.02228v1 [stat.ML])
    We study counterfactual identifiability in causal models with bijective generation mechanisms (BGM), a class that generalizes several widely-used causal models in the literature. We establish their counterfactual identifiability for three common causal structures with unobserved confounding, and propose a practical learning method that casts learning a BGM as structured generative modeling. Learned BGMs enable efficient counterfactual estimation and can be obtained using a variety of deep conditional generative models. We evaluate our techniques in a visual task and demonstrate its application in a real-world video streaming simulation task.
    Structural Explanations for Graph Neural Networks using HSIC. (arXiv:2302.02139v1 [cs.LG])
    Graph neural networks (GNNs) are a type of neural model that tackle graphical tasks in an end-to-end manner. Recently, GNNs have been receiving increased attention in machine learning and data mining communities because of the higher performance they achieve in various tasks, including graph classification, link prediction, and recommendation. However, the complicated dynamics of GNNs make it difficult to understand which parts of the graph features contribute more strongly to the predictions. To handle the interpretability issues, recently, various GNN explanation methods have been proposed. In this study, a flexible model agnostic explanation method is proposed to detect significant structures in graphs using the Hilbert-Schmidt independence criterion (HSIC), which captures the nonlinear dependency between two variables through kernels. More specifically, we extend the GraphLIME method for node explanation with a group lasso and a fused lasso-based node explanation method. The group and fused regularization with GraphLIME enables the interpretation of GNNs in substructure units. Then, we show that the proposed approach can be used for the explanation of sequential graph classification tasks. Through experiments, it is demonstrated that our method can identify crucial structures in a target graph in various settings.
    On Representation Knowledge Distillation for Graph Neural Networks. (arXiv:2111.04964v4 [cs.LG] UPDATED)
    Knowledge distillation is a learning paradigm for boosting resource-efficient graph neural networks (GNNs) using more expressive yet cumbersome teacher models. Past work on distillation for GNNs proposed the Local Structure Preserving loss (LSP), which matches local structural relationships defined over edges across the student and teacher's node embeddings. This paper studies whether preserving the global topology of how the teacher embeds graph data can be a more effective distillation objective for GNNs, as real-world graphs often contain latent interactions and noisy edges. We propose Graph Contrastive Representation Distillation (G-CRD), which uses contrastive learning to implicitly preserve global topology by aligning the student node embeddings to those of the teacher in a shared representation space. Additionally, we introduce an expanded set of benchmarks on large-scale real-world datasets where the performance gap between teacher and student GNNs is non-negligible. Experiments across 4 datasets and 14 heterogeneous GNN architectures show that G-CRD consistently boosts the performance and robustness of lightweight GNNs, outperforming LSP (and a global structure preserving variant of LSP) as well as baselines from 2D computer vision. An analysis of the representational similarity among teacher and student embedding spaces reveals that G-CRD balances preserving local and global relationships, while structure preserving approaches are best at preserving one or the other. Our code is available at https://github.com/chaitjo/efficient-gnns
    A Fast Bootstrap Algorithm for Causal Inference with Large Data. (arXiv:2302.02859v1 [stat.ME])
    Estimating causal effects from large experimental and observational data has become increasingly prevalent in both industry and research. The bootstrap is an intuitive and powerful technique used to construct standard errors and confidence intervals of estimators. Its application however can be prohibitively demanding in settings involving large data. In addition, modern causal inference estimators based on machine learning and optimization techniques exacerbate the computational burden of the bootstrap. The bag of little bootstraps has been proposed in non-causal settings for large data but has not yet been applied to evaluate the properties of estimators of causal effects. In this paper, we introduce a new bootstrap algorithm called causal bag of little bootstraps for causal inference with large data. The new algorithm significantly improves the computational efficiency of the traditional bootstrap while providing consistent estimates and desirable confidence interval coverage. We describe its properties, provide practical considerations, and evaluate the performance of the proposed algorithm in terms of bias, coverage of the true 95% confidence intervals, and computational time in a simulation study. We apply it in the evaluation of the effect of hormone therapy on the average time to coronary heart disease using a large observational data set from the Women's Health Initiative.
    Exploring validation metrics for offline model-based optimisation. (arXiv:2211.10747v2 [stat.ML] UPDATED)
    In offline model-based optimisation (MBO) we are interested in using machine learning to design candidates that maximise some measure of desirability through an expensive but real-world scoring process. Offline MBO tries to approximate this expensive scoring function and use that to evaluate generated designs, however evaluation is non-exact because one approximation is being evaluated with another. Instead, we ask ourselves: if we did have the real world scoring function at hand, what cheap-to-compute validation metrics would correlate best with this? Since the real-world scoring function is available for simulated MBO datasets, insights obtained from this can be transferred over to real-world offline MBO tasks where the real-world scoring function is expensive to compute. To address this, we propose a conceptual evaluation framework that is amenable to measuring extrapolation, and apply this to conditional denoising diffusion models. Empirically, we find that two validation metrics -- agreement and Frechet distance -- correlate quite well with the ground truth. When there is high variability in conditional generation, feedback is required in the form of an approximated version of the real-world scoring function. Furthermore, we find that generating high-scoring samples may require heavily weighting the generative model in favour of sample quality, potentially at the cost of sample diversity.
    Nonparametric Density Estimation under Distribution Drift. (arXiv:2302.02460v1 [cs.LG])
    We study nonparametric density estimation in non-stationary drift settings. Given a sequence of independent samples taken from a distribution that gradually changes in time, the goal is to compute the best estimate for the current distribution. We prove tight minimax risk bounds for both discrete and continuous smooth densities, where the minimum is over all possible estimates and the maximum is over all possible distributions that satisfy the drift constraints. Our technique handles a broad class of drift models, and generalizes previous results on agnostic learning under drift.
    Robust Fine-Tuning of Deep Neural Networks with Hessian-based Generalization Guarantees. (arXiv:2206.02659v4 [cs.LG] UPDATED)
    We consider transfer learning approaches that fine-tune a pretrained deep neural network on a target task. We study the generalization properties of fine-tuning to understand the problem of overfitting, which commonly occurs in practice. Previous works have shown that constraining the distance from the initialization of fine-tuning improves generalization. Using a PAC-Bayesian analysis, we observe that besides distance from initialization, Hessians affect generalization through the noise stability of deep neural networks against noise injections. Motivated by the observation, we develop Hessian distance-based generalization bounds for a wide range of fine-tuning methods. Additionally, we study the robustness of fine-tuning in the presence of noisy labels. We design an algorithm incorporating consistent losses and distance-based regularization for fine-tuning, along with a generalization error guarantee under class conditional independent noise in the training set labels. We perform a detailed empirical study of our algorithm on various noisy environments and architectures. On six image classification tasks whose training labels are generated with programmatic labeling, we find a 3.26% accuracy gain over prior fine-tuning methods. Meanwhile, the Hessian distance measure of the fine-tuned model decreases by six times more than existing approaches.
    Topology-aware Generalization of Decentralized SGD. (arXiv:2206.12680v4 [cs.LG] UPDATED)
    This paper studies the algorithmic stability and generalizability of decentralized stochastic gradient descent (D-SGD). We prove that the consensus model learned by D-SGD is $\mathcal{O}{(N^{-1}+m^{-1} +\lambda^2)}$-stable in expectation in the non-convex non-smooth setting, where $N$ is the total sample size, $m$ is the worker number, and $1+\lambda$ is the spectral gap that measures the connectivity of the communication topology. These results then deliver an $\mathcal{O}{(N^{-(1+\alpha)/2}+ m^{-(1+\alpha)/2}+\lambda^{1+\alpha} + \phi_{\mathcal{S}})}$ in-average generalization bound, which is non-vacuous even when $\lambda$ is closed to $1$, in contrast to vacuous as suggested by existing literature on the projected version of D-SGD. Our theory indicates that the generalizability of D-SGD is positively correlated with the spectral gap, and can explain why consensus control in initial training phase can ensure better generalization. Experiments of VGG-11 and ResNet-18 on CIFAR-10, CIFAR-100 and Tiny-ImageNet justify our theory. To our best knowledge, this is the first work on the topology-aware generalization of vanilla D-SGD. Code is available at https://github.com/Raiden-Zhu/Generalization-of-DSGD.
    Multiscale Graph Comparison via the Embedded Laplacian Discrepancy. (arXiv:2201.12064v2 [stat.ML] UPDATED)
    Laplacian eigenvectors capture natural community structures on graphs and are widely used in spectral clustering and manifold learning. The use of Laplacian eigenvectors as embeddings for the purpose of multiscale graph comparison has however been limited. Here we propose the Embedded Laplacian Discrepancy (ELD) as a simple and fast approach to compare graphs (of potentially different sizes) based on the similarity of the graphs' community structures. The ELD operates by representing graphs as point clouds in a common, low-dimensional space, on which a natural Wasserstein-based distance can be efficiently computed. A main challenge in comparing graphs through any eigenvector-based approaches is the potential ambiguity that could arise due to sign-flips and basis symmetries. The ELD leverages a simple symmetrization trick to bypass any sign ambiguities. For comparing graphs that do not have any ambiguities due to basis symmetries (i.e. the spectrums are simple), we show that the ELD becomes a natural pseudo-metric that enjoys nice properties such as invariance under graph isomorphism. For comparing graphs with non-simple spectrums, we propose a procedure to approximate the ELD via a simple perturbation technique to resolve any ambiguity from basis symmetries. We show that such perturbations are stable using matrix perturbation theory under mild assumptions that are straightforward to verify in practice. We demonstrate the excellent applicability of the ELD approach on both simulated and real datasets.
    Generalization Bounds with Data-dependent Fractal Dimensions. (arXiv:2302.02766v1 [stat.ML])
    Providing generalization guarantees for modern neural networks has been a crucial task in statistical learning. Recently, several studies have attempted to analyze the generalization error in such settings by using tools from fractal geometry. While these works have successfully introduced new mathematical tools to apprehend generalization, they heavily rely on a Lipschitz continuity assumption, which in general does not hold for neural networks and might make the bounds vacuous. In this work, we address this issue and prove fractal geometry-based generalization bounds without requiring any Lipschitz assumption. To achieve this goal, we build up on a classical covering argument in learning theory and introduce a data-dependent fractal dimension. Despite introducing a significant amount of technical complications, this new notion lets us control the generalization error (over either fixed or random hypothesis spaces) along with certain mutual information (MI) terms. To provide a clearer interpretation to the newly introduced MI terms, as a next step, we introduce a notion of "geometric stability" and link our bounds to the prior art. Finally, we make a rigorous connection between the proposed data-dependent dimension and topological data analysis tools, which then enables us to compute the dimension in a numerically efficient way. We support our theory with experiments conducted on various settings.
    Variational Information Pursuit for Interpretable Predictions. (arXiv:2302.02876v1 [cs.LG])
    There is a growing interest in the machine learning community in developing predictive algorithms that are "interpretable by design". Towards this end, recent work proposes to make interpretable decisions by sequentially asking interpretable queries about data until a prediction can be made with high confidence based on the answers obtained (the history). To promote short query-answer chains, a greedy procedure called Information Pursuit (IP) is used, which adaptively chooses queries in order of information gain. Generative models are employed to learn the distribution of query-answers and labels, which is in turn used to estimate the most informative query. However, learning and inference with a full generative model of the data is often intractable for complex tasks. In this work, we propose Variational Information Pursuit (V-IP), a variational characterization of IP which bypasses the need for learning generative models. V-IP is based on finding a query selection strategy and a classifier that minimizes the expected cross-entropy between true and predicted labels. We then demonstrate that the IP strategy is the optimal solution to this problem. Therefore, instead of learning generative models, we can use our optimal strategy to directly pick the most informative query given any history. We then develop a practical algorithm by defining a finite-dimensional parameterization of our strategy and classifier using deep networks and train them end-to-end using our objective. Empirically, V-IP is 10-100x faster than IP on different Vision and NLP tasks with competitive performance. Moreover, V-IP finds much shorter query chains when compared to reinforcement learning which is typically used in sequential-decision-making problems. Finally, we demonstrate the utility of V-IP on challenging tasks like medical diagnosis where the performance is far superior to the generative modelling approach.
    Optimal Transport Guided Unsupervised Learning for Enhancing low-quality Retinal Images. (arXiv:2302.02991v1 [eess.IV])
    Real-world non-mydriatic retinal fundus photography is prone to artifacts, imperfections and low-quality when certain ocular or systemic co-morbidities exist. Artifacts may result in inaccuracy or ambiguity in clinical diagnoses. In this paper, we proposed a simple but effective end-to-end framework for enhancing poor-quality retinal fundus images. Leveraging the optimal transport theory, we proposed an unpaired image-to-image translation scheme for transporting low-quality images to their high-quality counterparts. We theoretically proved that a Generative Adversarial Networks (GAN) model with a generator and discriminator is sufficient for this task. Furthermore, to mitigate the inconsistency of information between the low-quality images and their enhancements, an information consistency mechanism was proposed to maximally maintain structural consistency (optical discs, blood vessels, lesions) between the source and enhanced domains. Extensive experiments were conducted on the EyeQ dataset to demonstrate the superiority of our proposed method perceptually and quantitatively.
    Robust Empirical Risk Minimization with Tolerance. (arXiv:2210.00635v2 [cs.LG] UPDATED)
    Developing simple, sample-efficient learning algorithms for robust classification is a pressing issue in today's tech-dominated world, and current theoretical techniques requiring exponential sample complexity and complicated improper learning rules fall far from answering the need. In this work we study the fundamental paradigm of (robust) $\textit{empirical risk minimization}$ (RERM), a simple process in which the learner outputs any hypothesis minimizing its training error. RERM famously fails to robustly learn VC classes (Montasser et al., 2019a), a bound we show extends even to `nice' settings such as (bounded) halfspaces. As such, we study a recent relaxation of the robust model called $\textit{tolerant}$ robust learning (Ashtiani et al., 2022) where the output classifier is compared to the best achievable error over slightly larger perturbation sets. We show that under geometric niceness conditions, a natural tolerant variant of RERM is indeed sufficient for $\gamma$-tolerant robust learning VC classes over $\mathbb{R}^d$, and requires only $\tilde{O}\left( \frac{VC(H)d\log \frac{D}{\gamma\delta}}{\epsilon^2}\right)$ samples for robustness regions of (maximum) diameter $D$.
    Asymptotically Minimax Optimal Fixed-Budget Best Arm Identification for Expected Simple Regret Minimization. (arXiv:2302.02988v1 [cs.LG])
    We investigate fixed-budget best arm identification (BAI) for expected simple regret minimization. In each round of an adaptive experiment, a decision maker draws one of multiple treatment arms based on past observations and subsequently observes the outcomes of the chosen arm. After the experiment, the decision maker recommends a treatment arm with the highest projected outcome. We evaluate this decision in terms of the expected simple regret, a difference between the expected outcomes of the best and recommended treatment arms. Due to the inherent uncertainty, we evaluate the regret using the minimax criterion. For distributions with fixed variances (location-shift models), such as Gaussian distributions, we derive asymptotic lower bounds for the worst-case expected simple regret. Then, we show that the Random Sampling (RS)-Augmented Inverse Probability Weighting (AIPW) strategy proposed by Kato et al. (2022) is asymptotically minimax optimal in the sense that the leading factor of its worst-case expected simple regret asymptotically matches our derived worst-case lower bound. Our result indicates that, for location-shift models, the optimal RS-AIPW strategy draws treatment arms with varying probabilities based on their variances. This result contrasts with the results of Bubeck et al. (2011), which shows that drawing each treatment arm with an equal ratio is minimax optimal in a bounded outcome setting.
    OTRE: Where Optimal Transport Guided Unpaired Image-to-Image Translation Meets Regularization by Enhancing. (arXiv:2302.03003v1 [eess.IV])
    Non-mydriatic retinal color fundus photography (CFP) is widely available due to the advantage of not requiring pupillary dilation, however, is prone to poor quality due to operators, systemic imperfections, or patient-related causes. Optimal retinal image quality is mandated for accurate medical diagnoses and automated analyses. Herein, we leveraged the \emph{Optimal Transport (OT)} theory to propose an unpaired image-to-image translation scheme for mapping low-quality retinal CFPs to high-quality counterparts. Furthermore, to improve the flexibility, robustness, and applicability of our image enhancement pipeline in the clinical practice, we generalized a state-of-the-art model-based image reconstruction method, regularization by denoising, by plugging in priors learned by our OT-guided image-to-image translation network. We named it as \emph{regularization by enhancing (RE)}. We validated the integrated framework, OTRE, on three publicly available retinal image datasets by assessing the quality after enhancement and their performance on various downstream tasks, including diabetic retinopathy grading, vessel segmentation, and diabetic lesion segmentation. The experimental results demonstrated the superiority of our proposed framework over some state-of-the-art unsupervised competitors and a state-of-the-art supervised method.
    Mean-field analysis for heavy ball methods: Dropout-stability, connectivity, and global convergence. (arXiv:2210.06819v2 [cs.LG] UPDATED)
    The stochastic heavy ball method (SHB), also known as stochastic gradient descent (SGD) with Polyak's momentum, is widely used in training neural networks. However, despite the remarkable success of such algorithm in practice, its theoretical characterization remains limited. In this paper, we focus on neural networks with two and three layers and provide a rigorous understanding of the properties of the solutions found by SHB: \emph{(i)} stability after dropping out part of the neurons, \emph{(ii)} connectivity along a low-loss path, and \emph{(iii)} convergence to the global optimum. To achieve this goal, we take a mean-field view and relate the SHB dynamics to a certain partial differential equation in the limit of large network widths. This mean-field perspective has inspired a recent line of work focusing on SGD while, in contrast, our paper considers an algorithm with momentum. More specifically, after proving existence and uniqueness of the limit differential equations, we show convergence to the global optimum and give a quantitative bound between the mean-field limit and the SHB dynamics of a finite-width network. Armed with this last bound, we are able to establish the dropout-stability and connectivity of SHB solutions.
    On Best-Arm Identification with a Fixed Budget in Non-Parametric Multi-Armed Bandits. (arXiv:2210.00895v2 [cs.LG] UPDATED)
    We lay the foundations of a non-parametric theory of best-arm identification in multi-armed bandits with a fixed budget T. We consider general, possibly non-parametric, models D for distributions over the arms; an overarching example is the model D = P(0,1) of all probability distributions over [0,1]. We propose upper bounds on the average log-probability of misidentifying the optimal arm based on information-theoretic quantities that correspond to infima over Kullback-Leibler divergences between some distributions in D and a given distribution. This is made possible by a refined analysis of the successive-rejects strategy of Audibert, Bubeck, and Munos (2010). We finally provide lower bounds on the same average log-probability, also in terms of the same new information-theoretic quantities; these lower bounds are larger when the (natural) assumptions on the considered strategies are stronger. All these new upper and lower bounds generalize existing bounds based, e.g., on gaps between distributions.
    $z$-SignFedAvg: A Unified Stochastic Sign-based Compression for Federated Learning. (arXiv:2302.02589v1 [cs.LG])
    Federated Learning (FL) is a promising privacy-preserving distributed learning paradigm but suffers from high communication cost when training large-scale machine learning models. Sign-based methods, such as SignSGD \cite{bernstein2018signsgd}, have been proposed as a biased gradient compression technique for reducing the communication cost. However, sign-based algorithms could diverge under heterogeneous data, which thus motivated the development of advanced techniques, such as the error-feedback method and stochastic sign-based compression, to fix this issue. Nevertheless, these methods still suffer from slower convergence rates. Besides, none of them allows multiple local SGD updates like FedAvg \cite{mcmahan2017communication}. In this paper, we propose a novel noisy perturbation scheme with a general symmetric noise distribution for sign-based compression, which not only allows one to flexibly control the tradeoff between gradient bias and convergence performance, but also provides a unified viewpoint to existing stochastic sign-based methods. More importantly, the unified noisy perturbation scheme enables the development of the very first sign-based FedAvg algorithm ($z$-SignFedAvg) to accelerate the convergence. Theoretically, we show that $z$-SignFedAvg achieves a faster convergence rate than existing sign-based methods and, under the uniformly distributed noise, can enjoy the same convergence rate as its uncompressed counterpart. Extensive experiments are conducted to demonstrate that the $z$-SignFedAvg can achieve competitive empirical performance on real datasets and outperforms existing schemes.
    In Search of Insights, Not Magic Bullets: Towards Demystification of the Model Selection Dilemma in Heterogeneous Treatment Effect Estimation. (arXiv:2302.02923v1 [stat.ML])
    Personalized treatment effect estimates are often of interest in high-stakes applications -- thus, before deploying a model estimating such effects in practice, one needs to be sure that the best candidate from the ever-growing machine learning toolbox for this task was chosen. Unfortunately, due to the absence of counterfactual information in practice, it is usually not possible to rely on standard validation metrics for doing so, leading to a well-known model selection dilemma in the treatment effect estimation literature. While some solutions have recently been investigated, systematic understanding of the strengths and weaknesses of different model selection criteria is still lacking. In this paper, instead of attempting to declare a global `winner', we therefore empirically investigate success- and failure modes of different selection criteria. We highlight that there is a complex interplay between selection strategies, candidate estimators and the DGP used for testing, and provide interesting insights into the relative (dis)advantages of different criteria alongside desiderata for the design of further illuminating empirical studies in this context.
    The SSL Interplay: Augmentations, Inductive Bias, and Generalization. (arXiv:2302.02774v1 [stat.ML])
    Self-supervised learning (SSL) has emerged as a powerful framework to learn representations from raw data without supervision. Yet in practice, engineers face issues such as instability in tuning optimizers and collapse of representations during training. Such challenges motivate the need for a theory to shed light on the complex interplay between the choice of data augmentation, network architecture, and training algorithm. We study such an interplay with a precise analysis of generalization performance on both pretraining and downstream tasks in a theory friendly setup, and highlight several insights for SSL practitioners that arise from our theory.
    Causal Shift-Response Functions with Neural Networks: The Health Benefits of Lowering Air Quality Standards in the US. (arXiv:2302.02560v1 [cs.LG])
    Policymakers are required to evaluate the health benefits of reducing the National Ambient Air Quality Standards (NAAQS; i.e., the safety standards) for fine particulate matter PM 2.5 before implementing new policies. We formulate this objective as a shift-response function (SRF) and develop methods to analyze the problem using methods for causal inference, specifically under the stochastic interventions framework. SRFs model the average change in an outcome of interest resulting from a hypothetical shift in the observed exposure distribution. We propose a new broadly applicable doubly-robust method to learn SRFs using targeted regularization with neural networks. We evaluate our proposed method under various benchmarks specific for marginal estimates as a function of continuous exposure. Finally, we implement our estimator in the motivating application that considers the potential reduction in deaths from lowering the NAAQS from the current level of 12 $\mu g/m^3$ to levels that are recently proposed by the Environmental Protection Agency in the US (10, 9, and 8 $\mu g/m^3$).
    Uncertainty Calibration and its Application to Object Detection. (arXiv:2302.02622v1 [cs.CV])
    Image-based environment perception is an important component especially for driver assistance systems or autonomous driving. In this scope, modern neuronal networks are used to identify multiple objects as well as the according position and size information within a single frame. The performance of such an object detection model is important for the overall performance of the whole system. However, a detection model might also predict these objects under a certain degree of uncertainty. [...] In this work, we examine the semantic uncertainty (which object type?) as well as the spatial uncertainty (where is the object and how large is it?). We evaluate if the predicted uncertainties of an object detection model match with the observed error that is achieved on real-world data. In the first part of this work, we introduce the definition for confidence calibration of the semantic uncertainty in the context of object detection, instance segmentation, and semantic segmentation. We integrate additional position information in our examinations to evaluate the effect of the object's position on the semantic calibration properties. Besides measuring calibration, it is also possible to perform a post-hoc recalibration of semantic uncertainty that might have turned out to be miscalibrated. [...] The second part of this work deals with the spatial uncertainty obtained by a probabilistic detection model. [...] We review and extend common calibration methods so that it is possible to obtain parametric uncertainty distributions for the position information in a more flexible way. In the last part, we demonstrate a possible use-case for our derived calibration methods in the context of object tracking. [...] We integrate our previously proposed calibration techniques and demonstrate the usefulness of semantic and spatial uncertainty calibration in a subsequent process. [...]
    Interpolation for Robust Learning: Data Augmentation on Geodesics. (arXiv:2302.02092v1 [cs.LG])
    We propose to study and promote the robustness of a model as per its performance through the interpolation of training data distributions. Specifically, (1) we augment the data by finding the worst-case Wasserstein barycenter on the geodesic connecting subpopulation distributions of different categories. (2) We regularize the model for smoother performance on the continuous geodesic path connecting subpopulation distributions. (3) Additionally, we provide a theoretical guarantee of robustness improvement and investigate how the geodesic location and the sample size contribute, respectively. Experimental validations of the proposed strategy on four datasets, including CIFAR-100 and ImageNet, establish the efficacy of our method, e.g., our method improves the baselines' certifiable robustness on CIFAR10 up to $7.7\%$, with $16.8\%$ on empirical robustness on CIFAR-100. Our work provides a new perspective of model robustness through the lens of Wasserstein geodesic-based interpolation with a practical off-the-shelf strategy that can be combined with existing robust training methods.
    A Log-Linear Non-Parametric Online Changepoint Detection Algorithm based on Functional Pruning. (arXiv:2302.02718v1 [stat.ME])
    Online changepoint detection aims to detect anomalies and changes in real-time in high-frequency data streams, sometimes with limited available computational resources. This is an important task that is rooted in many real-world applications, including and not limited to cybersecurity, medicine and astrophysics. While fast and efficient online algorithms have been recently introduced, these rely on parametric assumptions which are often violated in practical applications. Motivated by data streams from the telecommunications sector, we build a flexible nonparametric approach to detect a change in the distribution of a sequence. Our procedure, NP-FOCuS, builds a sequential likelihood ratio test for a change in a set of points of the empirical cumulative density function of our data. This is achieved by keeping track of the number of observations above or below those points. Thanks to functional pruning ideas, NP-FOCuS has a computational cost that is log-linear in the number of observations and is suitable for high-frequency data streams. In terms of detection power, NP-FOCuS is seen to outperform current nonparametric online changepoint techniques in a variety of settings. We demonstrate the utility of the procedure on both simulated and real data.
    High-dimensional Location Estimation via Norm Concentration for Subgamma Vectors. (arXiv:2302.02497v1 [math.ST])
    In location estimation, we are given $n$ samples from a known distribution $f$ shifted by an unknown translation $\lambda$, and want to estimate $\lambda$ as precisely as possible. Asymptotically, the maximum likelihood estimate achieves the Cram\'er-Rao bound of error $\mathcal N(0, \frac{1}{n\mathcal I})$, where $\mathcal I$ is the Fisher information of $f$. However, the $n$ required for convergence depends on $f$, and may be arbitrarily large. We build on the theory using \emph{smoothed} estimators to bound the error for finite $n$ in terms of $\mathcal I_r$, the Fisher information of the $r$-smoothed distribution. As $n \to \infty$, $r \to 0$ at an explicit rate and this converges to the Cram\'er-Rao bound. We (1) improve the prior work for 1-dimensional $f$ to converge for constant failure probability in addition to high probability, and (2) extend the theory to high-dimensional distributions. In the process, we prove a new bound on the norm of a high-dimensional random variable whose 1-dimensional projections are subgamma, which may be of independent interest.
    Efficient Variational Bayes Learning of Graphical Models with Smooth Structural Changes. (arXiv:2009.07703v3 [stat.ML] UPDATED)
    Estimating time-varying graphical models are of paramount importance in various social, financial, biological, and engineering systems, since the evolution of such networks can be utilized for example to spot trends, detect anomalies, predict vulnerability, and evaluate the impact of interventions. Existing methods require extensive tuning of parameters that control the graph sparsity and temporal smoothness. Furthermore, these methods are computationally burdensome with time complexity $O(NP^3)$ for $P$ variables and $N$ time points. As a remedy, we propose a low-complexity tuning-free Bayesian approach, named BASS. Specifically, we impose temporally-dependent spike-and-slab priors on the graphs such that they are sparse and varying smoothly across time. A variational inference algorithm is then derived to learn the graph structures from the data automatically. Owning to the pseudo-likelihood and the mean-field approximation, the time complexity of BASS is only $O(NP^2)$. Additionally, by identifying the frequency-domain resemblance to the time-varying graphical models, we show that BASS can be extended to learning frequency-varying inverse spectral density matrices, and yields graphical models for multivariate stationary time series. Numerical results on both synthetic and real data show that that BASS can better recover the underlying true graphs, while being more efficient than the existing methods, especially for high-dimensional cases.
    Decision-Aware Conditional GANs for Time Series Data. (arXiv:2009.12682v4 [cs.LG] UPDATED)
    We introduce the decision-aware time-series conditional generative adversarial network (DAT-CGAN) as a method for time-series generation. The framework adopts a multi-Wasserstein loss on structured decision-related quantities, capturing the heterogeneity of decision-related data and providing new effectiveness in supporting the decision processes of end users. We improve sample efficiency through an overlapped block-sampling method, and provide a theoretical characterization of the generalization properties of DAT-CGAN. The framework is demonstrated on financial time series for a multi-time-step portfolio choice problem. We demonstrate better generative quality in regard to underlying data and different decision-related quantities than strong, GAN-based baselines.
    Strong Consistency and Rate of Convergence of Switched Least Squares System Identification for Autonomous Markov Jump Linear Systems. (arXiv:2112.10753v2 [cs.LG] UPDATED)
    In this paper, we investigate the problem of system identification for autonomous Markov jump linear systems (MJS) with complete state observations. We propose switched least squares method for identification of MJS, show that this method is strongly consistent, and derive data-dependent and data-independent rates of convergence. In particular, our data-independent rate of convergence shows that, almost surely, the system identification error is $\mathcal{O}\big(\sqrt{\log(T)/T} \big)$ where $T$ is the time horizon. These results show that switched least squares method for MJS has the same rate of convergence as least squares method for autonomous linear systems. We derive our results by imposing a general stability assumption on the model called stability in the average sense. We show that stability in the average sense is a weaker form of stability compared to the stability assumptions commonly imposed in the literature. We present numerical examples to illustrate the performance of the proposed method.
    Pre-screening breast cancer with machine learning and deep learning. (arXiv:2302.02406v1 [stat.ML])
    We suggest that deep learning can be used for pre-screening cancer by analyzing demographic and anthropometric information of patients, as well as biological markers obtained from routine blood samples and relative risks obtained from meta-analysis and international databases. We applied feature selection algorithms to a database of 116 women, including 52 healthy women and 64 women diagnosed with breast cancer, to identify the best pre-screening predictors of cancer. We utilized the best predictors to perform k-fold Monte Carlo cross-validation experiments that compare deep learning against traditional machine learning algorithms. Our results indicate that a deep learning model with an input-layer architecture that is fine-tuned using feature selection can effectively distinguish between patients with and without cancer. Additionally, compared to machine learning, deep learning has the lowest uncertainty in its predictions. These findings suggest that deep learning algorithms applied to cancer pre-screening offer a radiation-free, non-invasive, and affordable complement to screening methods based on imagery. The implementation of deep learning algorithms in cancer pre-screening offer opportunities to identify individuals who may require imaging-based screening, can encourage self-examination, and decrease the psychological externalities associated with false positives in cancer screening. The integration of deep learning algorithms for both screening and pre-screening will ultimately lead to earlier detection of malignancy, reducing the healthcare and societal burden associated to cancer treatment.
    Parallelizing Contextual Bandits. (arXiv:2105.10590v2 [stat.ML] UPDATED)
    Standard approaches to decision-making under uncertainty focus on sequential exploration of the space of decisions. However, \textit{simultaneously} proposing a batch of decisions, which leverages available resources for parallel experimentation, has the potential to rapidly accelerate exploration. We present a family of (parallel) contextual bandit algorithms applicable to problems with bounded eluder dimension whose regret is nearly identical to their perfectly sequential counterparts -- given access to the same total number of oracle queries -- up to a lower-order ``burn-in" term. We further show these algorithms can be specialized to the class of linear reward functions where we introduce and analyze several new linear bandit algorithms which explicitly introduce diversity into their action selection. Finally, we also present an empirical evaluation of these parallel algorithms in several domains, including materials discovery and biological sequence design problems, to demonstrate the utility of parallelized bandits in practical settings.
    Learning Variational Models with Unrolling and Bilevel Optimization. (arXiv:2209.12651v3 [stat.ML] UPDATED)
    In this paper we consider the problem learning of variational models in the context of supervised learning via risk minimization. Our goal is to provide a deeper understanding of the two approaches of learning of variational models via bilevel optimization and via algorithm unrolling. The former considers the variational model as a lower level optimization problem below the risk minimization problem, while the latter replaces the lower level optimization problem by an algorithm that solves said problem approximately. Both approaches are used in practice, but, unrolling is much simpler from a computational point of view. To analyze and compare the two approaches, we consider a simple toy model, and compute all risks and the respective estimators explicitly. We show that unrolling can be better than the bilevel optimization approach, but also that the performance of unrolling can depend significantly on further parameters, sometimes in unexpected ways: While the stepsize of the unrolled algorithm matters a lot, the number of unrolled iterations only matters if the number is even or odd, and these two cases are notably different.
    Reinforcement Learning in Low-Rank MDPs with Density Features. (arXiv:2302.02252v1 [cs.LG])
    MDPs with low-rank transitions -- that is, the transition matrix can be factored into the product of two matrices, left and right -- is a highly representative structure that enables tractable learning. The left matrix enables expressive function approximation for value-based learning and has been studied extensively. In this work, we instead investigate sample-efficient learning with density features, i.e., the right matrix, which induce powerful models for state-occupancy distributions. This setting not only sheds light on leveraging unsupervised learning in RL, but also enables plug-in solutions for convex RL. In the offline setting, we propose an algorithm for off-policy estimation of occupancies that can handle non-exploratory data. Using this as a subroutine, we further devise an online algorithm that constructs exploratory data distributions in a level-by-level manner. As a central technical challenge, the additive error of occupancy estimation is incompatible with the multiplicative definition of data coverage. In the absence of strong assumptions like reachability, this incompatibility easily leads to exponential error blow-up, which we overcome via novel technical tools. Our results also readily extend to the representation learning setting, when the density features are unknown and must be learned from an exponentially large candidate set.
    Sequential change detection via backward confidence sequences. (arXiv:2302.02544v1 [math.ST])
    We present a simple reduction from sequential estimation to sequential changepoint detection (SCD). In short, suppose we are interested in detecting changepoints in some parameter or functional $\theta$ of the underlying distribution. We demonstrate that if we can construct a confidence sequence (CS) for $\theta$, then we can also successfully perform SCD for $\theta$. This is accomplished by checking if two CSs -- one forwards and the other backwards -- ever fail to intersect. Since the literature on CSs has been rapidly evolving recently, the reduction provided in this paper immediately solves several old and new change detection problems. Further, our "backward CS", constructed by reversing time, is new and potentially of independent interest. We provide strong nonasymptotic guarantees on the frequency of false alarms and detection delay, and demonstrate numerical effectiveness on several problems.
    A Characteristic Function for Shapley-Value-Based\\Attribution of Anomaly Scores. (arXiv:2004.04464v2 [cs.LG] UPDATED)
    In anomaly detection, the degree of irregularity is often summarized as a real-valued anomaly score. We address the problem of attributing such anomaly scores to input features for interpreting the results of anomaly detection. We particularly investigate the use of the Shapley value for attributing anomaly scores of semi-supervised detection methods. We propose a characteristic function specifically designed for attributing anomaly scores. The idea is to approximate the absence of some features by locally minimizing the anomaly score with regard to the to-be-absent features. We examine the applicability of the proposed characteristic function and other general approaches for interpreting anomaly scores on multiple datasets and multiple anomaly detection methods. The results indicate the potential utility of the attribution methods including the proposed one.
    On Private and Robust Bandits. (arXiv:2302.02526v1 [cs.LG])
    We study private and robust multi-armed bandits (MABs), where the agent receives Huber's contaminated heavy-tailed rewards and meanwhile needs to ensure differential privacy. We first present its minimax lower bound, characterizing the information-theoretic limit of regret with respect to privacy budget, contamination level and heavy-tailedness. Then, we propose a meta-algorithm that builds on a private and robust mean estimation sub-routine \texttt{PRM} that essentially relies on reward truncation and the Laplace mechanism only. For two different heavy-tailed settings, we give specific schemes of \texttt{PRM}, which enable us to achieve nearly-optimal regret. As by-products of our main results, we also give the first minimax lower bound for private heavy-tailed MABs (i.e., without contamination). Moreover, our two proposed truncation-based \texttt{PRM} achieve the optimal trade-off between estimation accuracy, privacy and robustness. Finally, we support our theoretical results with experimental studies.
    Random Forests for time-fixed and time-dependent predictors: The DynForest R package. (arXiv:2302.02670v1 [stat.ML])
    The R package DynForest implements random forests for predicting a categorical or a (multiple causes) time-to-event outcome based on time-fixed and time-dependent predictors. Through the random forests, the time-dependent predictors can be measured with error at subject-specific times, and they can be endogeneous (i.e., impacted by the outcome process). They are modeled internally using flexible linear mixed models (thanks to lcmm package) with time-associations pre-specified by the user. DynForest computes dynamic predictions that take into account all the information from time-fixed and time-dependent predictors. DynForest also provides information about the most predictive variables using variable importance and minimal depth. Variable importance can also be computed on groups of variables. To display the results, several functions are available such as summary and plot functions. This paper aims to guide the user with a step-by-step example of the different functions for fitting random forests within DynForest.
    Individual Privacy Accounting for Differentially Private Stochastic Gradient Descent. (arXiv:2206.02617v5 [cs.LG] UPDATED)
    Differentially private stochastic gradient descent (DP-SGD) is the workhorse algorithm for recent advances in private deep learning. It provides a single privacy guarantee to all datapoints in the dataset. We propose output-specific $(\varepsilon,\delta)$-DP to characterize privacy guarantees for individual examples when releasing models trained by DP-SGD. We also design an efficient algorithm to investigate individual privacy across a number of datasets. We find that most examples enjoy stronger privacy guarantees than the worst-case bound. We further discover that the training loss and the privacy parameter of an example are well-correlated. This implies groups that are underserved in terms of model utility simultaneously experience weaker privacy guarantees. For example, on CIFAR-10, the average $\varepsilon$ of the class with the lowest test accuracy is 44.2% higher than that of the class with the highest accuracy.
    RLSbench: Domain Adaptation Under Relaxed Label Shift. (arXiv:2302.03020v1 [cs.LG])
    Despite the emergence of principled methods for domain adaptation under label shift, the sensitivity of these methods for minor shifts in the class conditional distributions remains precariously under explored. Meanwhile, popular deep domain adaptation heuristics tend to falter when faced with shifts in label proportions. While several papers attempt to adapt these heuristics to accommodate shifts in label proportions, inconsistencies in evaluation criteria, datasets, and baselines, make it hard to assess the state of the art. In this paper, we introduce RLSbench, a large-scale relaxed label shift benchmark, consisting of >500 distribution shift pairs that draw on 14 datasets across vision, tabular, and language modalities and compose them with varying label proportions. First, we evaluate 13 popular domain adaptation methods, demonstrating more widespread failures under label proportion shifts than were previously known. Next, we develop an effective two-step meta-algorithm that is compatible with most deep domain adaptation heuristics: (i) pseudo-balance the data at each epoch; and (ii) adjust the final classifier with (an estimate of) target label distribution. The meta-algorithm improves existing domain adaptation heuristics often by 2--10\% accuracy points under extreme label proportion shifts and has little (i.e., <0.5\%) effect when label proportions do not shift. We hope that these findings and the availability of RLSbench will encourage researchers to rigorously evaluate proposed methods in relaxed label shift settings. Code is publicly available at https://github.com/acmi-lab/RLSbench.
    Noise-cleaning the precision matrix of fMRI time series. (arXiv:2302.02951v1 [cond-mat.stat-mech])
    We present a comparison between various algorithms of inference of covariance and precision matrices in small datasets of real vectors, of the typical length and dimension of human brain activity time series retrieved by functional Magnetic Resonance Imaging (fMRI). Assuming a Gaussian model underlying the neural activity, the problem consists in denoising the empirically observed matrices in order to obtain a better estimator of the true precision and covariance matrices. We consider several standard noise-cleaning algorithms and compare them on two types of datasets. The first type are time series of fMRI brain activity of human subjects at rest. The second type are synthetic time series sampled from a generative Gaussian model of which we can vary the fraction of dimensions per sample q = N/T and the strength of off-diagonal correlations. The reliability of each algorithm is assessed in terms of test-set likelihood and, in the case of synthetic data, of the distance from the true precision matrix. We observe that the so called Optimal Rotationally Invariant Estimator, based on Random Matrix Theory, leads to a significantly lower distance from the true precision matrix in synthetic data, and higher test likelihood in natural fMRI data. We propose a variant of the Optimal Rotationally Invariant Estimator in which one of its parameters is optimised by cross-validation. In the severe undersampling regime (large q) typical of fMRI series, it outperforms all the other estimators. We furthermore propose a simple algorithm based on an iterative likelihood gradient ascent, providing an accurate estimation for weakly correlated datasets.
    Decentralized Differentially Private Without-Replacement Stochastic Gradient Descent. (arXiv:1809.02727v4 [cs.LG] UPDATED)
    While machine learning has achieved remarkable results in a wide variety of domains, the training of models often requires large datasets that may need to be collected from different individuals. As sensitive information may be contained in the individual's dataset, sharing training data may lead to severe privacy concerns. Therefore, there is a compelling need to develop privacy-aware machine learning methods, for which one effective approach is to leverage the generic framework of differential privacy. Considering that stochastic gradient descent (SGD) is one of the most commonly adopted methods for large-scale machine learning problems, a decentralized differentially private SGD algorithm is proposed in this work. Particularly, we focus on SGD without replacement due to its favorable structure for practical implementation. Both privacy and convergence analysis are provided for the proposed algorithm. Finally, extensive experiments are performed to demonstrate the effectiveness of the proposed method.
    Censored Quantile Regression Neural Networks for Distribution-Free Survival Analysis. (arXiv:2205.13496v4 [stat.ML] UPDATED)
    This paper considers doing quantile regression on censored data using neural networks (NNs). This adds to the survival analysis toolkit by allowing direct prediction of the target variable, along with a distribution-free characterisation of uncertainty, using a flexible function approximator. We begin by showing how an algorithm popular in linear models can be applied to NNs. However, the resulting procedure is inefficient, requiring sequential optimisation of an individual NN at each desired quantile. Our major contribution is a novel algorithm that simultaneously optimises a grid of quantiles output by a single NN. To offer theoretical insight into our algorithm, we show firstly that it can be interpreted as a form of expectation-maximisation, and secondly that it exhibits a desirable `self-correcting' property. Experimentally, the algorithm produces quantiles that are better calibrated than existing methods on 10 out of 12 real datasets.
    The Dual PC Algorithm and the Role of Gaussianity for Structure Learning of Bayesian Networks. (arXiv:2112.09036v4 [stat.ML] UPDATED)
    Learning the graphical structure of Bayesian networks is key to describing data-generating mechanisms in many complex applications but poses considerable computational challenges. Observational data can only identify the equivalence class of the directed acyclic graph underlying a Bayesian network model, and a variety of methods exist to tackle the problem. Under certain assumptions, the popular PC algorithm can consistently recover the correct equivalence class by reverse-engineering the conditional independence (CI) relationships holding in the variable distribution. The dual PC algorithm is a novel scheme to carry out the CI tests within the PC algorithm by leveraging the inverse relationship between covariance and precision matrices. By exploiting block matrix inversions we can simultaneously perform tests on partial correlations of complementary (or dual) conditioning sets. The multiple CI tests of the dual PC algorithm proceed by first considering marginal and full-order CI relationships and progressively moving to central-order ones. Simulation studies show that the dual PC algorithm outperforms the classic PC algorithm both in terms of run time and in recovering the underlying network structure, even in the presence of deviations from Gaussianity. Additionally, we show that the dual PC algorithm applies for Gaussian copula models, and demonstrate its performance in that setting.
    Side Effects of Learning from Low-dimensional Data Embedded in a Euclidean Space. (arXiv:2203.00614v5 [cs.LG] UPDATED)
    The low-dimensional manifold hypothesis posits that the data found in many applications, such as those involving natural images, lie (approximately) on low-dimensional manifolds embedded in a high-dimensional Euclidean space. In this setting, a typical neural network defines a function that takes a finite number of vectors in the embedding space as input. However, one often needs to consider evaluating the optimized network at points outside the training distribution. This paper considers the case in which the training data is distributed in a linear subspace of $\mathbb R^d$. We derive estimates on the variation of the learning function, defined by a neural network, in the direction transversal to the subspace. We study the potential regularization effects associated with the network's depth and noise in the codimension of the data manifold. We also present additional side effects in training due to the presence of noise.
    Probabilistic Contrastive Learning Recovers the Correct Aleatoric Uncertainty of Ambiguous Inputs. (arXiv:2302.02865v1 [cs.LG])
    Contrastively trained encoders have recently been proven to invert the data-generating process: they encode each input, e.g., an image, into the true latent vector that generated the image (Zimmermann et al., 2021). However, real-world observations often have inherent ambiguities. For instance, images may be blurred or only show a 2D view of a 3D object, so multiple latents could have generated them. This makes the true posterior for the latent vector probabilistic with heteroscedastic uncertainty. In this setup, we extend the common InfoNCE objective and encoders to predict latent distributions instead of points. We prove that these distributions recover the correct posteriors of the data-generating process, including its level of aleatoric uncertainty, up to a rotation of the latent space. In addition to providing calibrated uncertainty estimates, these posteriors allow the computation of credible intervals in image retrieval. They comprise images with the same latent as a given query, subject to its uncertainty.
    The Missing Indicator Method: From Low to High Dimensions. (arXiv:2211.09259v2 [cs.LG] UPDATED)
    Missing data is common in applied data science, particularly for tabular data sets found in healthcare, social sciences, and natural sciences. Most supervised learning methods only work on complete data, thus requiring preprocessing such as missing value imputation to work on incomplete data sets. However, imputation alone does not encode useful information about the missing values themselves. For data sets with informative missing patterns, the Missing Indicator Method (MIM), which adds indicator variables to indicate the missing pattern, can be used in conjunction with imputation to improve model performance. While commonly used in data science, MIM is surprisingly understudied from an empirical and especially theoretical perspective. In this paper, we show empirically and theoretically that MIM improves performance for informative missing values, and we prove that MIM does not hurt linear models asymptotically for uninformative missing values. Additionally, we find that for high-dimensional data sets with many uninformative indicators, MIM can induce model overfitting and thus test performance. To address this issue, we introduce Selective MIM (SMIM), a novel MIM extension that adds missing indicators only for features that have informative missing patterns. We show empirically that SMIM performs at least as well as MIM in general, and improves MIM for high-dimensional data. Lastly, to demonstrate the utility of MIM on real-world data science tasks, we demonstrate the effectiveness of MIM and SMIM on clinical tasks generated from the MIMIC-III database of electronic health records.
    Deep Latent State Space Models for Time-Series Generation. (arXiv:2212.12749v3 [stat.ML] UPDATED)
    Methods based on ordinary differential equations (ODEs) are widely used to build generative models of time-series. In addition to high computational overhead due to explicitly computing hidden states recurrence, existing ODE-based models fall short in learning sequence data with sharp transitions - common in many real-world systems - due to numerical challenges during optimization. In this work, we propose LS4, a generative model for sequences with latent variables evolving according to a state space ODE to increase modeling capacity. Inspired by recent deep state space models (S4), we achieve speedups by leveraging a convolutional representation of LS4 which bypasses the explicit evaluation of hidden states. We show that LS4 significantly outperforms previous continuous-time generative models in terms of marginal distribution, classification, and prediction scores on real-world datasets in the Monash Forecasting Repository, and is capable of modeling highly stochastic data with sharp temporal transitions. LS4 sets state-of-the-art for continuous-time latent generative models, with significant improvement of mean squared error and tighter variational lower bounds on irregularly-sampled datasets, while also being x100 faster than other baselines on long sequences.
    Online Learning via Offline Greedy Algorithms: Applications in Market Design and Optimization. (arXiv:2102.11050v4 [cs.LG] UPDATED)
    Motivated by online decision-making in time-varying combinatorial environments, we study the problem of transforming offline algorithms to their online counterparts. We focus on offline combinatorial problems that are amenable to a constant factor approximation using a greedy algorithm that is robust to local errors. For such problems, we provide a general framework that efficiently transforms offline robust greedy algorithms to online ones using Blackwell approachability. We show that the resulting online algorithms have $O(\sqrt{T})$ (approximate) regret under the full information setting. We further introduce a bandit extension of Blackwell approachability that we call Bandit Blackwell approachability. We leverage this notion to transform greedy robust offline algorithms into a $O(T^{2/3})$ (approximate) regret in the bandit setting. Demonstrating the flexibility of our framework, we apply our offline-to-online transformation to several problems at the intersection of revenue management, market design, and online optimization, including product ranking optimization in online platforms, reserve price optimization in auctions, and submodular maximization. We also extend our reduction to greedy-like first order methods used in continuous optimization, such as those used for maximizing continuous strong DR monotone submodular functions subject to convex constraints. We show that our transformation, when applied to these applications, leads to new regret bounds or improves the current known bounds. We complement our theoretical studies by conducting numerical simulations for two of our applications, in both of which we observe that the numerical performance of our transformations outperforms the theoretical guarantees in practical instances.
    First steps towards quantum machine learning applied to the classification of event-related potentials. (arXiv:2302.02648v1 [cs.HC])
    Low information transfer rate is a major bottleneck for brain-computer interfaces based on non-invasive electroencephalography (EEG) for clinical applications. This led to the development of more robust and accurate classifiers. In this study, we investigate the performance of quantum-enhanced support vector classifier (QSVC). Training (predicting) balanced accuracy of QSVC was 83.17 (50.25) %. This result shows that the classifier was able to learn from EEG data, but that more research is required to obtain higher predicting accuracy. This could be achieved by a better configuration of the classifier, such as increasing the number of shots.
    Adversarial Bandits with Knapsacks. (arXiv:1811.11881v10 [cs.DS] UPDATED)
    We consider Bandits with Knapsacks (henceforth, BwK), a general model for multi-armed bandits under supply/budget constraints. In particular, a bandit algorithm needs to solve a well-known knapsack problem: find an optimal packing of items into a limited-size knapsack. The BwK problem is a common generalization of numerous motivating examples, which range from dynamic pricing to repeated auctions to dynamic ad allocation to network routing and scheduling. While the prior work on BwK focused on the stochastic version, we pioneer the other extreme in which the outcomes can be chosen adversarially. This is a considerably harder problem, compared to both the stochastic version and the "classic" adversarial bandits, in that regret minimization is no longer feasible. Instead, the objective is to minimize the competitive ratio: the ratio of the benchmark reward to the algorithm's reward. We design an algorithm with competitive ratio O(log T) relative to the best fixed distribution over actions, where T is the time horizon; we also prove a matching lower bound. The key conceptual contribution is a new perspective on the stochastic version of the problem. We suggest a new algorithm for the stochastic version, which builds on the framework of regret minimization in repeated games and admits a substantially simpler analysis compared to prior work. We then analyze this algorithm for the adversarial version and use it as a subroutine to solve the latter.
    Sparse GCA and Thresholded Gradient Descent. (arXiv:2107.00371v2 [stat.ML] UPDATED)
    Generalized correlation analysis (GCA) is concerned with uncovering linear relationships across multiple datasets. It generalizes canonical correlation analysis that is designed for two datasets. We study sparse GCA when there are potentially multiple generalized correlation tuples in data and the loading matrix has a small number of nonzero rows. It includes sparse CCA and sparse PCA of correlation matrices as special cases. We first formulate sparse GCA as generalized eigenvalue problems at both population and sample levels via a careful choice of normalization constraints. Based on a Lagrangian form of the sample optimization problem, we propose a thresholded gradient descent algorithm for estimating GCA loading vectors and matrices in high dimensions. We derive tight estimation error bounds for estimators generated by the algorithm with proper initialization. We also demonstrate the prowess of the algorithm on a number of synthetic datasets.
    Support Vector Regression: Risk Quadrangle Framework. (arXiv:2212.09178v3 [stat.ML] UPDATED)
    This paper investigates Support Vector Regression (SVR) in the context of the fundamental risk quadrangle paradigm. It is shown that both formulations of SVR, $\varepsilon$-SVR and $\nu$-SVR, correspond to the minimization of equivalent regular error measures (Vapnik error and superquantile (CVaR) norm, respectively) with a regularization penalty. These error measures, in turn, give rise to corresponding risk quadrangles. By constructing the fundamental risk quadrangle, which corresponds to SVR, we show that SVR is the asymptotically unbiased estimator of the average of two symmetric conditional quantiles. Furthermore, the technique used for the construction of quadrangles serves as a powerful tool in proving the equivalence between $\varepsilon$-SVR and $\nu$-SVR. Additionally, SVR is formulated as a regular deviation minimization problem with a regularization penalty by invoking Error Shaping Decomposition of Regression and the dual formulation of SVR in the risk quadrangle framework is derived.
    Random Forest Weighted Local Fr\'echet Regression. (arXiv:2202.04912v2 [stat.ML] UPDATED)
    Statistical analysis is increasingly confronted with complex data from metric spaces. Petersen and M\"uller (2019) established a general paradigm of Fr\'echet regression with complex metric space valued responses and Euclidean predictors. However, the local approach therein involves nonparametric kernel smoothing and suffers from the curse of dimensionality. To address this issue, we in this paper propose a novel random forest weighted local Fr\'echet regression paradigm. The main mechanism of our approach relies on a locally adaptive kernel generated by random forests. Our first method utilizes these weights as the local average to solve the conditional Fr\'echet mean, while the second method performs local linear Fr\'echet regression, both significantly improving existing Fr\'echet regression methods. Based on the theory of infinite order U-processes and infinite order Mmn -estimator, we establish the consistency, rate of convergence, and asymptotic normality for our local constant estimator, which covers the current large sample theory of random forests with Euclidean responses as a special case. Numerical studies show the superiority of our methods with several commonly encountered types of responses such as distribution functions, symmetric positive-definite matrices, and sphere data. The practical merits of our proposals are also demonstrated through the application to human mortality distribution data.
    Improved Regret Analysis for Variance-Adaptive Linear Bandits and Horizon-Free Linear Mixture MDPs. (arXiv:2111.03289v4 [stat.ML] UPDATED)
    In online learning problems, exploiting low variance plays an important role in obtaining tight performance guarantees yet is challenging because variances are often not known a priori. Recently, considerable progress has been made by Zhang et al. (2021) where they obtain a variance-adaptive regret bound for linear bandits without knowledge of the variances and a horizon-free regret bound for linear mixture Markov decision processes (MDPs). In this paper, we present novel analyses that improve their regret bounds significantly. For linear bandits, we achieve $\tilde O(\min\{d\sqrt{K}, d^{1.5}\sqrt{\sum_{k=1}^K \sigma_k^2}\} + d^2)$ where $d$ is the dimension of the features, $K$ is the time horizon, and $\sigma_k^2$ is the noise variance at time step $k$, and $\tilde O$ ignores polylogarithmic dependence, which is a factor of $d^3$ improvement. For linear mixture MDPs with the assumption of maximum cumulative reward in an episode being in $[0,1]$, we achieve a horizon-free regret bound of $\tilde O(d \sqrt{K} + d^2)$ where $d$ is the number of base models and $K$ is the number of episodes. This is a factor of $d^{3.5}$ improvement in the leading term and $d^7$ in the lower order term. Our analysis critically relies on a novel peeling-based regret analysis that leverages the elliptical potential `count' lemma.
    A System for Morphology-Task Generalization via Unified Representation and Behavior Distillation. (arXiv:2211.14296v2 [cs.LG] UPDATED)
    The rise of generalist large-scale models in natural language and vision has made us expect that a massive data-driven approach could achieve broader generalization in other domains such as continuous control. In this work, we explore a method for learning a single policy that manipulates various forms of agents to solve various tasks by distilling a large amount of proficient behavioral data. In order to align input-output (IO) interface among multiple tasks and diverse agent morphologies while preserving essential 3D geometric relations, we introduce morphology-task graph, which treats observations, actions and goals/task in a unified graph representation. We also develop MxT-Bench for fast large-scale behavior generation, which supports procedural generation of diverse morphology-task combinations with a minimal blueprint and hardware-accelerated simulator. Through efficient representation and architecture selection on MxT-Bench, we find out that a morphology-task graph representation coupled with Transformer architecture improves the multi-task performances compared to other baselines including recent discrete tokenization, and provides better prior knowledge for zero-shot transfer or sample efficiency in downstream multi-task imitation learning. Our work suggests large diverse offline datasets, unified IO representation, and policy representation and architecture selection through supervised learning form a promising approach for studying and advancing morphology-task generalization.
    SE(3) diffusion model with application to protein backbone generation. (arXiv:2302.02277v1 [cs.LG])
    The design of novel protein structures remains a challenge in protein engineering for applications across biomedicine and chemistry. In this line of work, a diffusion model over rigid bodies in 3D (referred to as frames) has shown success in generating novel, functional protein backbones that have not been observed in nature. However, there exists no principled methodological framework for diffusion on SE(3), the space of orientation preserving rigid motions in R3, that operates on frames and confers the group invariance. We address these shortcomings by developing theoretical foundations of SE(3) invariant diffusion models on multiple frames followed by a novel framework, FrameDiff, for learning the SE(3) equivariant score over multiple frames. We apply FrameDiff on monomer backbone generation and find it can generate designable monomers up to 500 amino acids without relying on a pretrained protein structure prediction network that has been integral to previous methods. We find our samples are capable of generalizing beyond any known protein structure.
    Direct Uncertainty Quantification. (arXiv:2302.02420v1 [cs.LG])
    Traditional neural networks are simple to train but they produce overconfident predictions, while Bayesian neural networks provide good uncertainty quantification but optimizing them is time consuming. This paper introduces a new approach, direct uncertainty quantification (DirectUQ), that combines their advantages where the neural network directly models uncertainty in output space, and captures both aleatoric and epistemic uncertainty. DirectUQ can be derived as an alternative variational lower bound, and hence benefits from collapsed variational inference that provides improved regularizers. On the other hand, like non-probabilistic models, DirectUQ enjoys simple training and one can use Rademacher complexity to provide risk bounds for the model. Experiments show that DirectUQ and ensembles of DirectUQ provide a good tradeoff in terms of run time and uncertainty quantification, especially for out of distribution data.
    Multi-Center Federated Learning: Clients Clustering for Better Personalization. (arXiv:2005.01026v3 [cs.LG] UPDATED)
    Federated learning has received great attention for its capability to train a large-scale model in a decentralized manner without needing to access user data directly. It helps protect the users' private data from centralized collecting. Unlike distributed machine learning, federated learning aims to tackle non-IID data from heterogeneous sources in various real-world applications, such as those on smartphones. Existing federated learning approaches usually adopt a single global model to capture the shared knowledge of all users by aggregating their gradients, regardless of the discrepancy between their data distributions. However, due to the diverse nature of user behaviors, assigning users' gradients to different global models (i.e., centers) can better capture the heterogeneity of data distributions across users. Our paper proposes a novel multi-center aggregation mechanism for federated learning, which learns multiple global models from the non-IID user data and simultaneously derives the optimal matching between users and centers. We formulate the problem as a joint optimization that can be efficiently solved by a stochastic expectation maximization (EM) algorithm. Our experimental results on benchmark datasets show that our method outperforms several popular federated learning methods.
    A Permutation-free Kernel Two-Sample Test. (arXiv:2211.14908v2 [stat.ME] UPDATED)
    The kernel Maximum Mean Discrepancy~(MMD) is a popular multivariate distance metric between distributions that has found utility in two-sample testing. The usual kernel-MMD test statistic is a degenerate U-statistic under the null, and thus it has an intractable limiting distribution. Hence, to design a level-$\alpha$ test, one usually selects the rejection threshold as the $(1-\alpha)$-quantile of the permutation distribution. The resulting nonparametric test has finite-sample validity but suffers from large computational cost, since every permutation takes quadratic time. We propose the cross-MMD, a new quadratic-time MMD test statistic based on sample-splitting and studentization. We prove that under mild assumptions, the cross-MMD has a limiting standard Gaussian distribution under the null. Importantly, we also show that the resulting test is consistent against any fixed alternative, and when using the Gaussian kernel, it has minimax rate-optimal power against local alternatives. For large sample sizes, our new cross-MMD provides a significant speedup over the MMD, for only a slight loss in power.  ( 2 min )
    Improving Fair Training under Correlation Shifts. (arXiv:2302.02323v1 [cs.LG])
    Model fairness is an essential element for Trustworthy AI. While many techniques for model fairness have been proposed, most of them assume that the training and deployment data distributions are identical, which is often not true in practice. In particular, when the bias between labels and sensitive groups changes, the fairness of the trained model is directly influenced and can worsen. We make two contributions for solving this problem. First, we analytically show that existing in-processing fair algorithms have fundamental limits in accuracy and group fairness. We introduce the notion of correlation shifts, which can explicitly capture the change of the above bias. Second, we propose a novel pre-processing step that samples the input data to reduce correlation shifts and thus enables the in-processing approaches to overcome their limitations. We formulate an optimization problem for adjusting the data ratio among labels and sensitive groups to reflect the shifted correlation. A key benefit of our approach lies in decoupling the roles of pre- and in-processing approaches: correlation adjustment via pre-processing and unfairness mitigation on the processed data via in-processing. Experiments show that our framework effectively improves existing in-processing fair algorithms w.r.t. accuracy and fairness, both on synthetic and real datasets.  ( 2 min )
    Bayesian Fixed-Budget Best-Arm Identification. (arXiv:2211.08572v2 [cs.LG] UPDATED)
    Fixed-budget best-arm identification (BAI) is a bandit problem where the agent maximizes the probability of identifying the optimal arm within a fixed budget of observations. In this work, we study this problem in the Bayesian setting. We propose a Bayesian elimination algorithm and derive an upper bound on its probability of misidentifying the optimal arm. The bound reflects the quality of the prior and is the first distribution-dependent bound in this setting. We prove it using a frequentist-like argument, where we carry the prior through, and then integrate out the bandit instance at the end. We also provide the first lower bound on the probability of misidentification in a $2$-armed Bayesian bandit and show that our upper bound (almost) matches the lower bound. Our experiments show that Bayesian elimination is superior to frequentist methods and competitive with the state-of-the-art Bayesian algorithms that have no guarantees in our setting.  ( 2 min )
    U-Clip: On-Average Unbiased Stochastic Gradient Clipping. (arXiv:2302.02971v1 [cs.LG])
    U-Clip is a simple amendment to gradient clipping that can be applied to any iterative gradient optimization algorithm. Like regular clipping, U-Clip involves using gradients that are clipped to a prescribed size (e.g. with component wise or norm based clipping) but instead of discarding the clipped portion of the gradient, U-Clip maintains a buffer of these values that is added to the gradients on the next iteration (before clipping). We show that the cumulative bias of the U-Clip updates is bounded by a constant. This implies that the clipped updates are unbiased on average. Convergence follows via a lemma that guarantees convergence with updates $u_i$ as long as $\sum_{i=1}^t (u_i - g_i) = o(t)$ where $g_i$ are the gradients. Extensive experimental exploration is performed on CIFAR10 with further validation given on ImageNet.  ( 2 min )
    Adapting to Continuous Covariate Shift via Online Density Ratio Estimation. (arXiv:2302.02552v1 [cs.LG])
    Dealing with distribution shifts is one of the central challenges for modern machine learning. One fundamental situation is the \emph{covariate shift}, where the input distributions of data change from training to testing stages while the input-conditional output distribution remains unchanged. In this paper, we initiate the study of a more challenging scenario -- \emph{continuous} covariate shift -- in which the test data appear sequentially, and their distributions can shift continuously. Our goal is to adaptively train the predictor such that its prediction risk accumulated over time can be minimized. Starting with the importance-weighted learning, we show the method works effectively if the time-varying density ratios of test and train inputs can be accurately estimated. However, existing density ratio estimation methods would fail due to data scarcity at each time step. To this end, we propose an online method that can appropriately reuse historical information. Our density ratio estimation method is proven to perform well by enjoying a dynamic regret bound, which finally leads to an excess risk guarantee for the predictor. Empirical results also validate the effectiveness.  ( 2 min )
    Reinforcement Learning with History-Dependent Dynamic Contexts. (arXiv:2302.02061v1 [cs.LG])
    We introduce Dynamic Contextual Markov Decision Processes (DCMDPs), a novel reinforcement learning framework for history-dependent environments that generalizes the contextual MDP framework to handle non-Markov environments, where contexts change over time. We consider special cases of the model, with a focus on logistic DCMDPs, which break the exponential dependence on history length by leveraging aggregation functions to determine context transitions. This special structure allows us to derive an upper-confidence-bound style algorithm for which we establish regret bounds. Motivated by our theoretical results, we introduce a practical model-based algorithm for logistic DCMDPs that plans in a latent space and uses optimism over history-dependent features. We demonstrate the efficacy of our approach on a recommendation task (using MovieLens data) where user behavior dynamics evolve in response to recommendations.  ( 2 min )
    ODEWS: The Overdraft Early Warning System. (arXiv:2302.02455v1 [stat.ML])
    When a customer overdraws their account and their balance is negative they are assessed an overdraft fee. Americans pay approximately \$15 billion in unnecessary overdraft fees a year, often in \$35 increments; users of the Mint personal finance app pay approximately \$250 million in fees a year in particular. These overdraft fees are an excessive financial burden and lead to cascading overdraft fees trapping customers in financial hardship. To address this problem, we have created an ML-driven overdraft early warning system (ODEWS) that assesses a customer's risk of overdrafting within the next week using their banking and transaction data in the Mint app. At-risk customers are sent an alert so they can take steps to avoid the fee, ultimately changing their behavior and financial habits. The system deployed resulted in a \$3 million savings in overdraft fees for Mint customers compared to a control group. Moreover, the methodology outlined here can be generalized to provide ML-driven personalized financial advice for many different personal finance goals--increase credit score, build emergency savings fund, pay down debut, allocate capital for investment.  ( 2 min )
    Hierarchical Sliced Wasserstein Distance. (arXiv:2209.13570v5 [stat.ML] UPDATED)
    Sliced Wasserstein (SW) distance has been widely used in different application scenarios since it can be scaled to a large number of supports without suffering from the curse of dimensionality. The value of sliced Wasserstein distance is the average of transportation cost between one-dimensional representations (projections) of original measures that are obtained by Radon Transform (RT). Despite its efficiency in the number of supports, estimating the sliced Wasserstein requires a relatively large number of projections in high-dimensional settings. Therefore, for applications where the number of supports is relatively small compared with the dimension, e.g., several deep learning applications where the mini-batch approaches are utilized, the complexities from matrix multiplication of Radon Transform become the main computational bottleneck. To address this issue, we propose to derive projections by linearly and randomly combining a smaller number of projections which are named bottleneck projections. We explain the usage of these projections by introducing Hierarchical Radon Transform (HRT) which is constructed by applying Radon Transform variants recursively. We then formulate the approach into a new metric between measures, named Hierarchical Sliced Wasserstein (HSW) distance. By proving the injectivity of HRT, we derive the metricity of HSW. Moreover, we investigate the theoretical properties of HSW including its connection to SW variants and its computational and sample complexities. Finally, we compare the computational cost and generative quality of HSW with the conventional SW on the task of deep generative modeling using various benchmark datasets including CIFAR10, CelebA, and Tiny ImageNet.  ( 2 min )
    TAP: The Attention Patch for Cross-Modal Knowledge Transfer from Unlabeled Data. (arXiv:2302.02224v1 [cs.LG])
    This work investigates the intersection of cross modal learning and semi supervised learning, where we aim to improve the supervised learning performance of the primary modality by borrowing missing information from an unlabeled modality. We investigate this problem from a Nadaraya Watson (NW) kernel regression perspective and show that this formulation implicitly leads to a kernelized cross attention module. To this end, we propose The Attention Patch (TAP), a simple neural network plugin that allows data level knowledge transfer from the unlabeled modality. We provide numerical simulations on three real world datasets to examine each aspect of TAP and show that a TAP integration in a neural network can improve generalization performance using the unlabeled modality.  ( 2 min )
    Refined Value-Based Offline RL under Realizability and Partial Coverage. (arXiv:2302.02392v1 [cs.LG])
    In offline reinforcement learning (RL) we have no opportunity to explore so we must make assumptions that the data is sufficient to guide picking a good policy, taking the form of assuming some coverage, realizability, Bellman completeness, and/or hard margin (gap). In this work we propose value-based algorithms for offline RL with PAC guarantees under just partial coverage, specifically, coverage of just a single comparator policy, and realizability of soft (entropy-regularized) Q-function of the single policy and a related function defined as a saddle point of certain minimax optimization problem. This offers refined and generally more lax conditions for offline RL. We further show an analogous result for vanilla Q-functions under a soft margin condition. To attain these guarantees, we leverage novel minimax learning algorithms to accurately estimate soft or vanilla Q-functions with $L^2$-convergence guarantees. Our algorithms' loss functions arise from casting the estimation problems as nonlinear convex optimization problems and Lagrangifying.  ( 2 min )
    Tighter Information-Theoretic Generalization Bounds from Supersamples. (arXiv:2302.02432v1 [stat.ML])
    We present a variety of novel information-theoretic generalization bounds for learning algorithms, from the supersample setting of Steinke & Zakynthinou (2020)-the setting of the "conditional mutual information" framework. Our development exploits projecting the loss pair (obtained from a training instance and a testing instance) down to a single number and correlating loss values with a Rademacher sequence (and its shifted variants). The presented bounds include square-root bounds, fast-rate bounds, including those based on variance and sharpness, and bounds for interpolating algorithms etc. We show theoretically or empirically that these bounds are tighter than all information-theoretic bounds known to date on the same supersample setting.  ( 2 min )
    Linear-Time Gromov Wasserstein Distances using Low Rank Couplings and Costs. (arXiv:2106.01128v2 [cs.LG] UPDATED)
    The ability to align points across two related yet incomparable point clouds (e.g. living in different spaces) plays an important role in machine learning. The Gromov-Wasserstein (GW) framework provides an increasingly popular answer to such problems, by seeking a low-distortion, geometry-preserving assignment between these points. As a non-convex, quadratic generalization of optimal transport (OT), GW is NP-hard. While practitioners often resort to solving GW approximately as a nested sequence of entropy-regularized OT problems, the cubic complexity (in the number $n$ of samples) of that approach is a roadblock. We show in this work how a recent variant of the OT problem that restricts the set of admissible couplings to those having a low-rank factorization is remarkably well suited to the resolution of GW: when applied to GW, we show that this approach is not only able to compute a stationary point of the GW problem in time $O(n^2)$, but also uniquely positioned to benefit from the knowledge that the initial cost matrices are low-rank, to yield a linear time $O(n)$ GW approximation. Our approach yields similar results, yet orders of magnitude faster computation than the SoTA entropic GW approaches, on both simulated and real data.  ( 2 min )
    On Over-Squashing in Message Passing Neural Networks: The Impact of Width, Depth, and Topology. (arXiv:2302.02941v1 [cs.LG])
    Message Passing Neural Networks (MPNNs) are instances of Graph Neural Networks that leverage the graph to send messages over the edges. This inductive bias leads to a phenomenon known as over-squashing, where a node feature is insensitive to information contained at distant nodes. Despite recent methods introduced to mitigate this issue, an understanding of the causes for over-squashing and of possible solutions are lacking. In this theoretical work, we prove that: (i) Neural network width can mitigate over-squashing, but at the cost of making the whole network more sensitive; (ii) Conversely, depth cannot help mitigate over-squashing: increasing the number of layers leads to over-squashing being dominated by vanishing gradients; (iii) The graph topology plays the greatest role, since over-squashing occurs between nodes at high commute (access) time. Our analysis provides a unified framework to study different recent methods introduced to cope with over-squashing and serves as a justification for a class of methods that fall under `graph rewiring'.  ( 2 min )
    Guaranteed Tensor Recovery Fused Low-rankness and Smoothness. (arXiv:2302.02155v1 [cs.LG])
    The tensor data recovery task has thus attracted much research attention in recent years. Solving such an ill-posed problem generally requires to explore intrinsic prior structures underlying tensor data, and formulate them as certain forms of regularization terms for guiding a sound estimate of the restored tensor. Recent research have made significant progress by adopting two insightful tensor priors, i.e., global low-rankness (L) and local smoothness (S) across different tensor modes, which are always encoded as a sum of two separate regularization terms into the recovery models. However, unlike the primary theoretical developments on low-rank tensor recovery, these joint L+S models have no theoretical exact-recovery guarantees yet, making the methods lack reliability in real practice. To this crucial issue, in this work, we build a unique regularization term, which essentially encodes both L and S priors of a tensor simultaneously. Especially, by equipping this single regularizer into the recovery models, we can rigorously prove the exact recovery guarantees for two typical tensor recovery tasks, i.e., tensor completion (TC) and tensor robust principal component analysis (TRPCA). To the best of our knowledge, this should be the first exact-recovery results among all related L+S methods for tensor recovery. Significant recovery accuracy improvements over many other SOTA methods in several TC and TRPCA tasks with various kinds of visual tensor data are observed in extensive experiments. Typically, our method achieves a workable performance when the missing rate is extremely large, e.g., 99.5%, for the color image inpainting task, while all its peers totally fail in such challenging case.  ( 2 min )
    An Asymptotically Optimal Algorithm for the One-Dimensional Convex Hull Feasibility Problem. (arXiv:2302.02033v1 [stat.ML])
    This work studies the pure-exploration setting for the convex hull feasibility (CHF) problem where one aims to efficiently and accurately determine if a given point lies in the convex hull of means of a finite set of distributions. We give a complete characterization of the sample complexity of the CHF problem in the one-dimensional setting. We present the first asymptotically optimal algorithm called Thompson-CHF, whose modular design consists of a stopping rule and a sampling rule. In addition, we provide an extension of the algorithm that generalizes several important problems in the multi-armed bandit literature. Finally, we further investigate the Gaussian bandit case with unknown variances and address how the Thompson-CHF algorithm can be adjusted to be asymptotically optimal in this setting.  ( 2 min )
    Improved Policy Evaluation for Randomized Trials of Algorithmic Resource Allocation. (arXiv:2302.02570v1 [cs.AI])
    We consider the task of evaluating policies of algorithmic resource allocation through randomized controlled trials (RCTs). Such policies are tasked with optimizing the utilization of limited intervention resources, with the goal of maximizing the benefits derived. Evaluation of such allocation policies through RCTs proves difficult, notwithstanding the scale of the trial, because the individuals' outcomes are inextricably interlinked through resource constraints controlling the policy decisions. Our key contribution is to present a new estimator leveraging our proposed novel concept, that involves retrospective reshuffling of participants across experimental arms at the end of an RCT. We identify conditions under which such reassignments are permissible and can be leveraged to construct counterfactual trials, whose outcomes can be accurately ascertained, for free. We prove theoretically that such an estimator is more accurate than common estimators based on sample means -- we show that it returns an unbiased estimate and simultaneously reduces variance. We demonstrate the value of our approach through empirical experiments on synthetic, semi-synthetic as well as real case study data and show improved estimation accuracy across the board.  ( 2 min )
    Polynomial-time sparse measure recovery. (arXiv:2204.07879v3 [cs.LG] UPDATED)
    Many problems in computer science reduce to the recovery of an $n$-sparse measure from its (generalized) moments. Sparse measure recovery has been the research focus in super-resolution, tensor decomposition, and learning neural networks. The existing methods use either convex relaxations or overparameterization for recovery. Here, we propose recovery with non-convex optimization without overparameterization. Our algorithm is a (sub)gradient descent method optimizing a non-convex energy function studied in physics. We establish the global convergence of gradient descent on the energy function. This result enables us to solve super-resolution in $O(n^2)$ time, which significantly improves upon $O(n^3)$ time for solving convex relaxations. For a particular neural network, we prove the global convergence of subgradient descent on the population loss without overparameterization. The studied network has zero-one activations, and inputs drawn from the unit sphere.  ( 2 min )
    Offline Learning in Markov Games with General Function Approximation. (arXiv:2302.02571v1 [cs.LG])
    We study offline multi-agent reinforcement learning (RL) in Markov games, where the goal is to learn an approximate equilibrium -- such as Nash equilibrium and (Coarse) Correlated Equilibrium -- from an offline dataset pre-collected from the game. Existing works consider relatively restricted tabular or linear models and handle each equilibria separately. In this work, we provide the first framework for sample-efficient offline learning in Markov games under general function approximation, handling all 3 equilibria in a unified manner. By using Bellman-consistent pessimism, we obtain interval estimation for policies' returns, and use both the upper and the lower bounds to obtain a relaxation on the gap of a candidate policy, which becomes our optimization objective. Our results generalize prior works and provide several additional insights. Importantly, we require a data coverage condition that improves over the recently proposed "unilateral concentrability". Our condition allows selective coverage of deviation policies that optimally trade-off between their greediness (as approximate best responses) and coverage, and we show scenarios where this leads to significantly better guarantees. As a new connection, we also show how our algorithmic framework can subsume seemingly different solution concepts designed for the special case of two-player zero-sum games.  ( 2 min )
    On a continuous time model of gradient descent dynamics and instability in deep learning. (arXiv:2302.01952v1 [stat.ML])
    The recipe behind the success of deep learning has been the combination of neural networks and gradient-based optimization. Understanding the behavior of gradient descent however, and particularly its instability, has lagged behind its empirical success. To add to the theoretical tools available to study gradient descent we propose the principal flow (PF), a continuous time flow that approximates gradient descent dynamics. To our knowledge, the PF is the only continuous flow that captures the divergent and oscillatory behaviors of gradient descent, including escaping local minima and saddle points. Through its dependence on the eigendecomposition of the Hessian the PF sheds light on the recently observed edge of stability phenomena in deep learning. Using our new understanding of instability we propose a learning rate adaptation method which enables us to control the trade-off between training stability and test set evaluation performance.  ( 2 min )
    Domain Adaptation via Rebalanced Sub-domain Alignment. (arXiv:2302.02009v1 [cs.LG])
    Unsupervised domain adaptation (UDA) is a technique used to transfer knowledge from a labeled source domain to a different but related unlabeled target domain. While many UDA methods have shown success in the past, they often assume that the source and target domains must have identical class label distributions, which can limit their effectiveness in real-world scenarios. To address this limitation, we propose a novel generalization bound that reweights source classification error by aligning source and target sub-domains. We prove that our proposed generalization bound is at least as strong as existing bounds under realistic assumptions, and we empirically show that it is much stronger on real-world data. We then propose an algorithm to minimize this novel generalization bound. We demonstrate by numerical experiments that this approach improves performance in shifted class distribution scenarios compared to state-of-the-art methods.  ( 2 min )

  • Open

    [P] AI/ML Engineering Tutor
    I am looking for a AI/Machine Learning Engineering tutor. I am specifically interested in help developing and building ~2 Medium/large NLP projects, using SOTA LLM’s (GPT3, etc.). After these, I would potentially be interested in diving into RL. I am looking for a little bit of theory when needed, but predominantly hands-on coding project help (eg. real-time coding help if I get stuck). Reach out if you are interested! (Of course $ negotiated). submitted by /u/ThoseWhoAbandonViews [link] [comments]  ( 42 min )
    Is there Powerpoint Ai maker that works ? [D]
    It seems the others like tome ai, and chatbcg are completely trafficked out. Also you cant even save the powerpoints, so please is there one that i can use. I also have my own text that id like to turn into a powerpoint submitted by /u/Ancient_Vegetable_62 [link] [comments]  ( 43 min )
    [Discussion] Can an AMD Ryzen 5 3400G computer with 16GB of RAM effectively train an AI model?
    I'm exploring the possibility of using my AMD Ryzen 5 3400G computer with 16GB of RAM to train an AI model. I'm curious to know if this setup is adequate for the task and, if so, what kind of AI models would be appropriate. I'm interested in understanding any limitations and drawbacks that I may face with this setup. If you have any relevant experience or information, I would greatly appreciate your participation in this discussion. Thanks! <3 submitted by /u/erikaonline [link] [comments]  ( 43 min )
    [D] Image object detection, but for 1 dimensional data?
    I have had a lot of fun and success using YOLO and other image object detection models on 2D or 3D image data for personal projects. I am now working on some projects where I need to scan long periods of timeseries data and find specific waveforms that are variable durations. Are there techniques or models that function like YOLO that can scan large amounts of data and only highlight specific segments of interest as specific classes? If it doesn’t exist, I wonder how well the underlying CNN architecture of YOLO would translate to 1 dimensional CNN architectures. Any info is appreciated, thanks! submitted by /u/Optoplasm [link] [comments]  ( 43 min )
    [P] Best way to add a sampling step within a neural network end-to-end?
    I'm looking to combine two separate models together end-to-end, but need help understanding the best way to connect discrete parts. The first part: I trained a classifier that given an input vector (512 dimensional) is able to predict one of twenty possible labels. The second part: given an input label (from the previous classifier), embed the label and use that label to make a prediction. Both models work decently, but I'm wondering if I can make this end-to-end and get some serious gains. To do this, I'd need a way of sampling from the first softmax. Once I have a sample, I can get the embedding of the sampled class, continue as normal, and hopefully propagate the loss through everything. Are there any similar examples I can look at? Is there a term for this in the literature? submitted by /u/geomtry [link] [comments]  ( 44 min )
    [P] We've built ChatGPT for your pdf files.
    My friends and I were really excited to try out ChatGPT when it was released. We were amazed by its power and capabilities. We were thinking how great it would be if people could use ChatGPT to ask questions from their own data. So me and my friends had decided to build such a tool. We are delighted to announce that we are releasing a public beta of our tool> After you upload all your pdf files into it. It can search for relevant documents without perfect keyword match, summarize takeaways from the document specific to your question, and extract key information from the document. It can help with brainstorming and summarization. We would love if you try it out and let us know what you think about it. We look forward to hearing your feedback Link: https://askcorpora.com submitted by /u/Brian-Hose225 [link] [comments]  ( 43 min )
    [D] Can output time frame cover input time frame in machine learning?
    I recently had a disagreement with a friend and would like to hear other opinions. Say for a website, using the user actions for first week period, we want to predict total sales within 3 weeks. But one of the inputs is sales in the first week, so the output -total sales of 3 weeks- is including the sales in the first week. Is it ok to choose this output? Or should we adjust it in a way to prevent it from overlapping with the input time period and choose for ex. sales within 2 weeks after the first week for output What is the reasoning? submitted by /u/dencan06 [link] [comments]  ( 43 min )
    [N] Microsoft announces new "next-generation" LLM, will be integrated with Bing and Edge
    https://www.theverge.com/2023/2/7/23587454/microsoft-bing-edge-chatgpt-ai submitted by /u/currentscurrents [link] [comments]  ( 44 min )
    [Discussion] Is ChatGPT and/or OpenAI really the leader in the space?
    Or is it someone else (who just may or may not be as well known)? submitted by /u/wonderingandthinking [link] [comments]  ( 44 min )
    [D] Multi-class classifications when a few of the classes are not mutually-exclusive
    I am dealing with a multi-class classification problem. I know one of the main assumption of this problem is that the classes are mutually exclusive. However, I realized that in my problem, some of these classes may happen together. So my problem is not an entirely a milt-class nor a multi-label. One solution is to relax the exclusivity assumption and fit a model, however, I am not sure how realistic is that. I was wondering if there is a better way to approach this problem? Briefly, the problem is in ads domain where a user can do task A or B after seeing an ad or can do both A&B at the same time. submitted by /u/hopedallas [link] [comments]  ( 43 min )
    [D] Which is the fastest and lightweight ultra realistic TTS for real-time voice cloning?
    Hey everyone, I want to make a personal voice assistant who sounds exactly like a real person. I tried some TTS like tortoise TTS and coqui TTS, it done a good job but it takes too long time to perform. So is there any other good realistic sounding TTS which I can use with my own voice cloning training dataset? Also I'm a bit amazed by the TTS used by eleven labs, so can someone explain how can I achieve that level of real-time efficiency in a voice assistant? submitted by /u/akshaysri0001 [link] [comments]  ( 43 min )
    [D] Artificial Intelligence for Manufacturing
    Manufacturing 4.0 is undergoing a revolution with the integration of Artificial Intelligence (AI). AI is poised to revolutionize the process industry, where controlling input variables leads to an output. The current process industry, including pharmaceuticals, chemicals, and energy production, relies on human operators to turn knobs to achieve optimal output. However, this system is limited by several factors, including slow training, poor retention of large data sets, inaccurate sensors, and complex decision-making processes. ​ https://preview.redd.it/y6stc52zqsga1.png?width=734&format=png&auto=webp&s=c8f24ff9d7f7975380e6cd2fbb5021fce67c0e86 Here are some details about problems and AI solutions: 1) It takes forever to train this employee. This employee is running little mini experi…  ( 45 min )
    [N] Getty Images Claims Stable Diffusion Has Stolen 12 Million Copyrighted Images, Demands $150,000 For Each Image
    From Article: Getty Images new lawsuit claims that Stability AI, the company behind Stable Diffusion's AI image generator, stole 12 million Getty images with their captions, metadata, and copyrights "without permission" to "train its Stable Diffusion algorithm." The company has asked the court to order Stability AI to remove violating images from its website and pay $150,000 for each. However, it would be difficult to prove all the violations. Getty submitted over 7,000 images, metadata, and copyright registration, used by Stable Diffusion. submitted by /u/vadhavaniyafaijan [link] [comments]  ( 49 min )
    [D] question on first time training a model
    Basically I saw a stream the other day where someone used data from a person's YouTube channel and somehow used this data to create an AI version of them and interviewed them. It was fascinating and pretty accurate How difficult would this be to do myself? I don't even know where to start. Does anyone have any pointers? Is this a very large task that I'm underestimating or is it actually feasible? Here is the stream in question. The video and audio would be cool to have but I mean that's not necessary, even just having the text aspect would be pretty wild on its own. https://youtu.be/hjoYy5IVtfo (skip to any point, most of it is filled with the bot responding) submitted by /u/cobalt1137 [link] [comments]  ( 43 min )
    Model/paper ideas: reinforcement learning with a deterministic environment [D]
    I have a problem I need to solve that, as far as I can tell, doesn't fit very well into most of the existing RL literature. Essentially the task is to create on optimal plan over a time horizon extending a flexible number of steps into the future. The action space is both discrete and continuous - there are multiple available distinct actions, some of which need to be given continuous (but constrained) parameters. In this problem however, the state of the environment is known ahead of time for all the future time steps, and the updated state of the agent after each action can be calculated deterministically given the action and the environment state. Modelling the entire problem as a MILP is not feasible due to the size of the action and state space, and we have a very large data set for agent and environment state to play with. Does anyone have any suggestions for papers or models that might be appropriate for this scenario? submitted by /u/EmbarrassedFuel [link] [comments]  ( 44 min )
    [N] Beyond Transformers with PyNeuraLogic
    Going beyond Transformers? 🤖 In this article, I'm discussing how we can use the power of hybrid architecture, i.e., marrying deep learning with symbolic artificial intelligence, for implementing different kinds of Transformers. Including the one used in GPT-3! https://towardsdatascience.com/beyond-transformers-with-pyneuralogic-10b70cdc5e45 ​ ​ The attention computation graph visualized submitted by /u/Lukas_Zahradnik [link] [comments]  ( 42 min )
    [D] Name your favourite Github repositories for data scientists
    I thought it may be useful to gather the most popular repositories for data scientists. The goal is to read excellent code and learn from other projects. Please provide a short description of the project. submitted by /u/Illustrious-Law-2556 [link] [comments]  ( 42 min )
    [Discussion] Best practices for taking deep learning models to bare metal MCUs
    I would like to know what are some of the best practice is to convert pytorch to embedded C (bare metal micro-controllers) during A. initial phase and B. for deployment. A. Initial phase is to understand the profiling of the model performance (RAM usage and processing time) for a targetted hardware. I understand that Tensorflow lite might be the best route for initial profiling but there are restrictions. It will be great if you could tell the framework that you follow. Currently framework: 1. Pytorch -> 2. ONNX -> 3. Keras -> 4. Tensorflowlite or 5. Tensorflowlite micro B. Deployment is to run inference for production in a targetted hardware. I think hand coding in C is the best way. Please ignore optimisation techniques in the workflow for simplicity. submitted by /u/ramv0001 [link] [comments]  ( 43 min )
    [D] A ML-powered music description/tag generator? (a reverse-MusicML)?
    I know there are no useful text-to-music generators (YET), but is there at least a model where you can upload/input a recording and get a text description/hashtag list from it? Like a reverse MusicML? I have a very large personal catalog of music I am prepping for sale (approaching 500 songs), and this would be a very handy tool, especially if it came up with tags of genres/similar artists I am not aware of. submitted by /u/dreternal [link] [comments]  ( 43 min )
    [D] Papers that inject embeddings into LMs
    I am looking for papers that inject information into LMs directly using embeddings (without formatting information as text). I find it notoriously hard to search for these paper because they could come from various different domains, so I thought asking here might be a good option to reach people from many different domains. Some examples I already found are from the domain of knowledge graph augmented LMs: ERNIE https://arxiv.org/abs/1904.09223 K-BERT https://arxiv.org/abs/1909.07606 Prefix Tuning / Prompt Tuning are also somewhat similar to the idea, but they dont depend on any external information. Can you think of other papers that inject additional information into LMs via embeddings? submitted by /u/_Arsenie_Boca_ [link] [comments]  ( 44 min )
    [P] Pythae 0.1.0 is out and supports distributed training for 25 Variational Autoencoders
    📢 News 📢 Pythae 0.1.0 is now out and supports distributed training using PyTorch DDP ! Train your favorite Variational Autoencoders (VAEs) faster 🏎️ and on larger datasets, still with a few lines of code 🖥️. 👉github: https://github.com/clementchadebec/benchmark_VAE 👉pypi: https://pypi.org/project/pythae/ https://preview.redd.it/jk4ukkgarpga1.png?width=1335&format=png&auto=webp&s=07c1ab2eaad104879637ad04472935d87baa31e9 submitted by /u/cchad-8 [link] [comments]  ( 43 min )
    [P] ChatGPT without size limits: upload any pdf and apply any prompt to it
    hi all! I created a simple free tool where you can summarize and query documents of any size and estimate the cost to do so: https://www.wrotescan.com You can edit the prompts as well as automatically chunk and combine documents. There's also a cost estimator for any pdf you upload. Let me know if you want me to run some examples for you! Send me a pdf and tell me what you'd like summarized or extracted. Tips Please be sure to keep {text} in both prompts or the program will not input your document's text into the map reduce summarizer. {text} can only appear once in each prompt. It is where the text from each chunk to be summarized is input into the prompts. Create a temporary OpenAI key / org to use with this site so you do not have to provide credit card information then be sure to delete the temp key when you are done. Learnings Some interesting learnings I had while creating the tool: - Minimizing the number of steps through the AI improved summarization, so map reduce was often better than a more advanced refine workflow which passes the output through the model many more times. - LangChain is great for managing multiple step language model calls and bypassing the current limitations of ChatGPT submitted by /u/aicharades [link] [comments]  ( 50 min )
  • Open

    Im making Faceless Youtube videos with this AI software. Ultra easy and super fast.
    submitted by /u/aortigby [link] [comments]  ( 40 min )
    Bright Eye: free mobile AI app that generates art, code, text (short stories, poems, answers), analyzes photos, and more!
    Hey guys, I’m the co-founder of a tech startup focused on providing free AI services. We’re one of the first mobile multipurpose AI apps. We’ve developed a pretty cool app that offers AI services like image generation, code generation, image captioning, and more for free. We’re sort of like a Swiss Army knife of generative and analytical AI. We’ve released a new feature called AAIA (Ask AI Anything), which is capable of answering all types of questions, even requests to generate literature, story-lines, jokes, general information, etc. We’d love to have some people try it out, give us feedback, and keep in touch with us. https://apps.apple.com/us/app/bright-eye/id1593932475 submitted by /u/BrightEyeuser [link] [comments]  ( 41 min )
    Reimagining how developer-facing docs work using GPT-3
    Developers, have you ever found yourself stuck on a problem, searching through countless pages of outdated documentation, only to end up in a developer community asking for help? We know the feeling all too well. In response, we created https://www.mendable.ai/. Mendable is a GPT-3 (davinci + LangChain) powered developer support AI, designed to provide you with customized answers to your problems in real-time. We believe that developer-facing documentation should be tailored to the individual. We ingest documentation, community messages, and resources (blog posts, how-to guides, white papers, etc.) to serve as a middle-man between the entire knowledge base of a piece of software and you, resulting in a GPT like docs search. Our goal is to redefine search, making it easier for you to find the information you need, when you need it. Wasting time searching through outdated documentation is something that should be in the past. If you want to learn more about what we’re doing, feel free to visit our website and hit us up! submitted by /u/SideGuideDevs [link] [comments]  ( 41 min )
    Google REVEALS ChatGPT Rival Bard AI!
    submitted by /u/bukowski3000 [link] [comments]  ( 40 min )
    AI Dream 51 - InstantArt : Beeple inspired Synthwave
    submitted by /u/LordPewPew777 [link] [comments]  ( 40 min )
    Microsoft announces AI-powered Bing search and Edge browser
    submitted by /u/citizentim [link] [comments]  ( 40 min )
    Improved ChatGPT comes to Microsoft Bing and Edge
    submitted by /u/Number_5_alive [link] [comments]  ( 40 min )
    Created an AI database tool where you ask questions and it generates the query code. It's like a query co-pilot.
    submitted by /u/Mogen1000 [link] [comments]  ( 42 min )
    🚨 If 2022 is the year of pixels for generative AI, then 2023 is the year of sound waves
    submitted by /u/ai-lover [link] [comments]  ( 41 min )
    Introducing Bard: Google’s ChatGPT Competitor
    submitted by /u/liquidocelotYT [link] [comments]  ( 40 min )
    Is Jasper AI Worth Your Investment ? A Comprehensive Review
    Hello everyone! Have you heard of Jasper AI, the new copywriting software that promises to revolutionize the way we create content? In today's fast-paced world, time is a precious commodity, and finding ways to streamline our work is crucial. That's where Jasper AI comes in - it claims to be a one-stop-shop for all your copywriting needs. But is it worth the hype? First things first, what exactly is Jasper AI? It's an artificial intelligence tool that uses machine learning algorithms to generate high-quality, unique and compelling content for your website, social media, email, and other marketing channels. All you have to do is input the desired tone and style, topic, and target audience, and Jasper AI will generate a complete copy for you. It's like having a team of copywriters working …  ( 43 min )
    GIF2GIF Extension In Stable Diffusion! Easy Gif Processing!
    submitted by /u/PuppetHere [link] [comments]  ( 40 min )
    AI won’t make artists redundant - thanks to information theory
    submitted by /u/pmigdal [link] [comments]  ( 42 min )
    OpenAssistant | ChatGPT's Open Alternative (We need your help!)
    submitted by /u/pentin0 [link] [comments]  ( 40 min )
    AI: THE FUTURE REPLACEMENT OF HUMANS?
    submitted by /u/nowadayswow [link] [comments]  ( 40 min )
    A new video transformation model from Runway.ml
    https://medium.com/seeds-for-the-future/the-next-step-for-generative-ai-830112890d04?sk=1d6b4c96cc6cb0a4690bcf9df0d12bcc submitted by /u/arnolds112 [link] [comments]  ( 40 min )
    skeptic ai books/audiobooks/documentaries like unabomer's manifeso on the technology?
    i want to listen or read or watch something that predicts all the bad things (maybe in the guise of good) like how unabomber wrote his manifesto for the technological revolution but for the ai revolution submitted by /u/tonyhyeok [link] [comments]  ( 41 min )
    Benedict Cumberbatch calling the Dr
    submitted by /u/Esportage [link] [comments]  ( 39 min )
    AI voice Quentin Tarantino recalls what inspired him to create Kill Bill (via Eleven Labs)
    submitted by /u/CoolkidRR [link] [comments]  ( 41 min )
    I spent half a year doing research and testing to develop an AI tool which creates the perfect long-form blog articles
    Good content is key no matter what type of website or business you run, from blogs to SaaS tools or service based companies. Not only will it help you to rank higher in Google for the relevant keywords, but it also helps to attract visitors by providing them something of value for free to convert them into your funnel with a newsletter or free trial. Usually creating this content either required a lot of time, a lot of money, or both. That is why I launched https://writeseed.com It is powered by GPT-3 to create content for you with the help of AI. You only need to provide it with a general niche or keyword and it will provide you with a selection of blog post outlines, which are then used to write a complete 1,000+ word article. You also get a free stock photo which is relevant to the topic each time. The quality of the content is so good, I often get the feedback that people are surprised this is possible at all. We achieve these results by using our own proprietary fine-tuning, as well as a special way of processing your input and the output from GPT-3. It took me half a year of research and comparing the outputs of other AI writing tools to get to this point and I am really proud of the result. Besides blog articles the platform offers over 20 templates of content, from product descriptions to Tweets, cold emails, Quora answers, questions, jokes, stories, paraphrasing or summarizing your existing content, auto-correct spelling and grammar, press releases, real-estate listings. And we continuously add new templates to cover even more use-cases. On top of that we are one of the very few, maybe the only tool, which offers you unlimited content for a flat monthly price. Some of our users are creating 200+ full-length articles per month to rank their websites or run Amazon product blogs. Please give it a try with the 7-day free trial, of course you can also create unlimited content during the free trial! submitted by /u/spacpro [link] [comments]  ( 44 min )
    Google's Bard will be more useful and interesting than ChatGPT as it has Internet access
    submitted by /u/qptbook [link] [comments]  ( 40 min )
    You.com released v2 of YouChat, adding multimedia content to their chat agent for search
    submitted by /u/quanik_314 [link] [comments]  ( 40 min )
    New AI Streamer 24/7
    ​ https://preview.redd.it/nidkmqbtsnga1.png?width=1536&format=png&auto=webp&s=27719de8800d612139e93fe177e2fa63f03f4e0e https://www.twitch.tv/ai_media ​ Help me to learn by chatting with me! submitted by /u/BlueBug02 [link] [comments]  ( 40 min )
  • Open

    How to deal with planning before the match and actually playing the match?
    I was wondering: how to deal with a game like RPGs and card games where you have both a planning phase (character building for RPGs and deck building for card games) and an execution phase (actually playing through a battle or a card duel)? The most obvious approach to me is making two different models, but their successes are highly correlated, i.e. a model can play a RPG battle and lose mostly because of the pre-selected equipments/skills or it can win despite the poor choice of character building. How to deal with this problem? Do you know any "success case" where I can see how they dealt with it? Thank you very much! submitted by /u/victorsevero [link] [comments]  ( 41 min )
    Presenting the AI Soccer World Cup! Place your bets!
    submitted by /u/xWh0am1 [link] [comments]  ( 41 min )
    Applications of policy evaluation (prediction)
    Really elementary question, but are there applications where people have shown interesting results by just solving a prediction problem (evaluate a policy) instead of solving a control problem (find the optimal policy)? It seems like control applications are more popular, but perhaps I'm missing something very obvious. Phrased differently, when would you want to do prediction without doing control? submitted by /u/wardellinthehouse [link] [comments]  ( 41 min )
    Building a Game-Playing AI with Reinforcement Learning in Python
    submitted by /u/Historical-Pen9653 [link] [comments]  ( 41 min )
    if a policy is greedy with respect to its own value function, then it is an optimal policy
    I’m doing the RL specialization from university of alberta and just ran into this question. I think that it is false, because although the policy may be greedy to its own value function, the value function might not be always optimal, you would need to iterate through it some times in order to make it optimal, or close to it. The course says that “If a policy is greedy with respect to its own value function, it follows from the policy improvement theorem and the Bellman optimality equation that it must be an optimal policy. How does that make sense if the value function is not the optimal one? Thanks in advance submitted by /u/enzodtz [link] [comments]  ( 44 min )
  • Open

    Guidance on 512 node input and 512 node output
    I have 512 input nodes which all have a float value and an associated classification (0 or 1). I believe I need 512 output layers which correspond to the classification of each input node. Can anyone give me some guidance into an approach I can use for this to get this done? I know that’s not a lot to go on but any help is appreciated, I am using tensorflow by the way. Thank you. submitted by /u/waysteman [link] [comments]  ( 41 min )
  • Open

    Share medical image research on Amazon SageMaker Studio Lab for free
    This post is co-written with Stephen Aylward, Matt McCormick, Brianna Major from Kitware and Justin Kirby from the Frederick National Laboratory for Cancer Research (FNLCR). Amazon SageMaker Studio Lab provides no-cost access to a machine learning (ML) development environment to everyone with an email address. Like the fully featured Amazon SageMaker Studio, Studio Lab allows […]  ( 8 min )
    Amazon SageMaker Automatic Model Tuning now supports three new completion criteria for hyperparameter optimization
    Amazon SageMaker has announced the support of three new completion criteria for Amazon SageMaker automatic model tuning, providing you with an additional set of levers to control the stopping criteria of the tuning job when finding the best hyperparameter configuration for your model. In this post, we discuss these new completion criteria, when to use them, and […]  ( 8 min )
  • Open

    Roses are red
    I had so much fun getting GPT-3 to generate simple one-line Valentine's Day cards last year that this year I decided to see if I could generate cards with more complicated messages. I focused on the classic "roses are red, violets are blue" rhyme, figuring that  ( 5 min )
    Bonus: more Valentine rhymes
    AI Weirdness: the strange side of machine learning  ( 2 min )
  • Open

    Google Research, 2022 & beyond: Algorithms for efficient deep learning
    Posted by Sanjiv Kumar, VP and Google Fellow, Google Research (This is Part 4 in our series of posts covering different topical areas of research at Google. You can find other posts in the series here.) The explosion in deep learning a decade ago was catapulted in part by the convergence of new algorithms and architectures, a marked increase in data, and access to greater compute. In the last 10 years, AI and ML models have become bigger and more sophisticated — they’re deeper, more complex, with more parameters, and trained on much more data, resulting in some of the most transformative outcomes in the history of machine learning. As these models increasingly find themselves deployed in production and business applications, the efficiency and costs of these models has gon…  ( 97 min )
  • Open

    DSC Weekly 7 February 2023 – Machine Learning Controversy: From No-Code to No-Math
    Announcements Machine Learning Controversy: From No-Code to No-Math One controversial topic in machine learning circles is code versus no-code. Can you be a real data scientist if you don’t code? Of course you can: You may be leveraging platforms and the code is one or two layers below the responsibilities of your job. Maybe you… Read More »DSC Weekly 7 February 2023 – Machine Learning Controversy: From No-Code to No-Math The post DSC Weekly 7 February 2023 – Machine Learning Controversy: From No-Code to No-Math appeared first on Data Science Central.  ( 21 min )
    Top 8 Digital Marketing Trends That’ll Make a Comeback in 2023
    Digital marketing has become an efficient practice for growing and established businesses to promote their products and services. More and more organizations depend on digital channels to connect with customers and generate a vast customer base.  Because of its growing popularity, different digital marketing trends will reign in 2023. If businesses can pay attention to… Read More »Top 8 Digital Marketing Trends That’ll Make a Comeback in 2023 The post Top 8 Digital Marketing Trends That’ll Make a Comeback in 2023 appeared first on Data Science Central.  ( 20 min )
    The Impact of Data Labeling 2023: Current Trends & Future Demands
    Data labeling and/or data annotation has long been a critical component of many machine learning and AI initiatives. In recent years, the demand for accurate and reliable data labeling has risen dramatically as the process becomes increasingly vital to the success of numerous projects. But what is data labeling exactly? Data Labeling 2023 – how… Read More »The Impact of Data Labeling 2023: Current Trends & Future Demands The post The Impact of Data Labeling 2023: Current Trends & Future Demands appeared first on Data Science Central.  ( 22 min )
    A Quick Guide To iOS And Android App development In 2023
    If you look around, you’ll observe we have apps for everything. They’re like food, and we consume every bit of it. If you have a business, you must consider app development along with a website. Well, you’re at the right place. The app development journey can be overwhelming as it comes with tons of responsibilities,… Read More »A Quick Guide To iOS And Android App development In 2023 The post A Quick Guide To iOS And Android App development In 2023 appeared first on Data Science Central.  ( 22 min )
    Best 9 Mobile Apps to Develop Your Data Science Skills in 2023
    Mobile Apps to Develop Your Data Science Skills -Mobile phones are the most preferred medium of accomplishing minute-to-minutest tasks on a daily basis. We don’t need to visit any particular restaurant to take away the food, we can do this by just sitting on our favorite couch at home, thanks to food ordering apps. Not… Read More »Best 9 Mobile Apps to Develop Your Data Science Skills in 2023 The post Best 9 Mobile Apps to Develop Your Data Science Skills in 2023 appeared first on Data Science Central.  ( 23 min )
    Top Benefits of IoT for SMEs
    The Internet of Things (IoT) has been all the rage in the world for quite some time now. And understandably so, especially considering the many benefits it brings to the table. It has rendered the technology a lucrative choice for small-to-mid-sized enterprises (SMEs). Why so? Well, here are some of the benefits of IoT for… Read More »Top Benefits of IoT for SMEs The post Top Benefits of IoT for SMEs appeared first on Data Science Central.  ( 19 min )
  • Open

    Vietnam’s VinBrain Deploys Healthcare AI Models to 100+ Hospitals
    Doctors rarely make diagnoses based on a single factor — they look at a mix of data types, such as a patient’s symptoms, laboratory and radiology reports, and medical history. VinBrain, a Vietnam-based health-tech startup, is ensuring that AI diagnostics can take a similarly holistic view across vital signs, blood tests, medical images and more. Read article >  ( 6 min )
  • Open

    Mediant approximation trick
    Suppose you are trying to approximate some number x and you’ve got it sandwiched between two rational numbers: a/b < x < c/d. Now you’d like a better approximation. What would you do? The obvious approach would be to take the average of a/b and c/d. That’s fine, except it could be a fair amount […] Mediant approximation trick first appeared on John D. Cook.  ( 6 min )
  • Open

    Solving a machine-learning mystery
    A new study shows how large language models like GPT-3 can learn a new task from just a few examples, without the need for any new training data.  ( 10 min )
  • Open

    Generating Novel, Designable, and Diverse Protein Structures by Equivariantly Diffusing Oriented Residue Clouds. (arXiv:2301.12485v2 [q-bio.BM] UPDATED)
    Proteins power a vast array of functional processes in living cells. The capability to create new proteins with designed structures and functions would thus enable the engineering of cellular behavior and development of protein-based therapeutics and materials. Structure-based protein design aims to find structures that are designable (can be realized by a protein sequence), novel (have dissimilar geometry from natural proteins), and diverse (span a wide range of geometries). While advances in protein structure prediction have made it possible to predict structures of novel protein sequences, the combinatorially large space of sequences and structures limits the practicality of search-based methods. Generative models provide a compelling alternative, by implicitly learning the low-dimensional structure of complex data distributions. Here, we leverage recent advances in denoising diffusion probabilistic models and equivariant neural networks to develop Genie, a generative model of protein structures that performs discrete-time diffusion using a cloud of oriented reference frames in 3D space. Through in silico evaluations, we demonstrate that Genie generates protein backbones that are more designable, novel, and diverse than existing models. This indicates that Genie is capturing key aspects of the distribution of protein structure space and facilitates protein design with high success rates. Code for generating new proteins and training new versions of Genie is available at https://github.com/aqlaboratory/genie.  ( 2 min )
    Anti-Symmetric DGN: a stable architecture for Deep Graph Networks. (arXiv:2210.09789v2 [cs.LG] UPDATED)
    Deep Graph Networks (DGNs) currently dominate the research landscape of learning from graphs, due to their efficiency and ability to implement an adaptive message-passing scheme between the nodes. However, DGNs are typically limited in their ability to propagate and preserve long-term dependencies between nodes, i.e., they suffer from the over-squashing phenomena. This reduces their effectiveness, since predictive problems may require to capture interactions at different, and possibly large, radii in order to be effectively solved. In this work, we present Anti-Symmetric Deep Graph Networks (A-DGNs), a framework for stable and non-dissipative DGN design, conceived through the lens of ordinary differential equations. We give theoretical proof that our method is stable and non-dissipative, leading to two key results: long-range information between nodes is preserved, and no gradient vanishing or explosion occurs in training. We empirically validate the proposed approach on several graph benchmarks, showing that A-DGN yields to improved performance and enables to learn effectively even when dozens of layers are used.  ( 2 min )
    Knowledge Extraction in Low-Resource Scenarios: Survey and Perspective. (arXiv:2202.08063v3 [cs.CL] CROSS LISTED)
    Knowledge Extraction (KE), aiming to extract structural information from unstructured texts, often suffers from data scarcity and emerging unseen types, i.e., low-resource scenarios. Many neural approaches to low-resource KE have been widely investigated and achieved impressive performance. In this paper, we present a literature review towards KE in low-resource scenarios, and systematically categorize existing works into three paradigms: (1) exploiting higher-resource data, (2) exploiting stronger models, and (3) exploiting data and models together. In addition, we highlight promising applications and outline some potential directions for future research. We hope that our survey can help both the academic and industrial communities to better understand this field, inspire more ideas, and boost broader applications.  ( 2 min )
    Understanding Gradient Regularization in Deep Learning: Efficient Finite-Difference Computation and Implicit Bias. (arXiv:2210.02720v2 [cs.LG] UPDATED)
    Gradient regularization (GR) is a method that penalizes the gradient norm of the training loss during training. While some studies have reported that GR can improve generalization performance, little attention has been paid to it from the algorithmic perspective, that is, the algorithms of GR that efficiently improve the performance. In this study, we first reveal that a specific finite-difference computation, composed of both gradient ascent and descent steps, reduces the computational cost of GR. Next, we show that the finite-difference computation also works better in the sense of generalization performance. We theoretically analyze a solvable model, a diagonal linear network, and clarify that GR has a desirable implicit bias to so-called rich regime and finite-difference computation strengthens this bias. Furthermore, finite-difference GR is closely related to some other algorithms based on iterative ascent and descent steps for exploring flat minima. In particular, we reveal that the flooding method can perform finite-difference GR in an implicit way. Thus, this work broadens our understanding of GR for both practice and theory.  ( 2 min )
    A*Net: A Scalable Path-based Reasoning Approach for Knowledge Graphs. (arXiv:2206.04798v2 [cs.AI] UPDATED)
    Reasoning on large-scale knowledge graphs has been long dominated by embedding methods. While path-based methods possess the inductive capacity that embeddings lack, they suffer from the scalability issue due to the exponential number of paths. Here we present A*Net, a scalable path-based method for knowledge graph reasoning. Inspired by the A* algorithm for shortest path problems, our A*Net learns a priority function to select important nodes and edges at each iteration, to reduce time and memory footprint for both training and inference. The ratio of selected nodes and edges can be specified to trade off between performance and efficiency. Experiments on both transductive and inductive knowledge graph reasoning benchmarks show that A*Net achieves competitive performance with existing state-of-the-art path-based methods, while merely visiting 10% nodes and 10% edges at each iteration. On a million-scale dataset ogbl-wikikg2, A*Net achieves competitive performance with embedding methods and converges faster. To our best knowledge, A*Net is the first path-based method for knowledge graph reasoning at such a scale.  ( 2 min )
    One-shot domain adaptation in video-based assessment of surgical skills. (arXiv:2301.00812v2 [cs.CV] UPDATED)
    Deep Learning (DL) has achieved automatic and objective assessment of surgical skills. However, DL models are data-hungry and restricted to their training domain. This prevents them from transitioning to new tasks where data is limited. Hence, domain adaptation is crucial to implement DL in real life. Here, we propose a meta-learning model, A-VBANet, that can deliver domain-agnostic surgical skill classification via one-shot learning. We develop the A-VBANet on five laparoscopic and robotic surgical simulators. Additionally, we test it on operating room (OR) videos of laparoscopic cholecystectomy. Our model successfully adapts with accuracies up to 99.5% in one-shot and 99.9% in few-shot settings for simulated tasks and 89.7% for laparoscopic cholecystectomy. For the first time, we provide a domain-agnostic procedure for video-based assessment of surgical skills. A significant implication of this approach is that it allows the use of data from surgical simulators to assess performance in the operating room.  ( 2 min )
    Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners. (arXiv:2212.04979v2 [cs.CV] UPDATED)
    This work explores an efficient approach to establish a foundational video-text model for tasks including open-vocabulary video classification, text-to-video retrieval, video captioning and video question-answering. We present VideoCoCa that reuses a pretrained image-text contrastive captioner (CoCa) model and adapt it to video-text tasks with minimal extra training. While previous works adapt image-text models with various cross-frame fusion modules (for example, cross-frame attention layer or perceiver resampler) and finetune the modified architecture on video-text data, we surprisingly find that the generative attentional pooling and contrastive attentional pooling layers in the image-text CoCa design are instantly adaptable to "flattened frame embeddings", yielding a strong zero-shot transfer baseline for many video-text tasks. Specifically, the frozen image encoder of a pretrained image-text CoCa takes each video frame as inputs and generates $N$ token embeddings per frame for totally $T$ video frames. We flatten $(N \times T)$ token embeddings as a long sequence of frozen video representation and apply CoCa's generative attentional pooling and contrastive attentional pooling on top. All model weights including pooling layers are directly loaded from an image-text CoCa pretrained model. Without any video or video-text data, VideoCoCa's zero-shot transfer baseline already achieves state-of-the-art results on zero-shot video classification on Kinetics 400/600/700, UCF101, HMDB51, and Charades, as well as zero-shot text-to-video retrieval on MSR-VTT, ActivityNet Captions and VATEX. We also explore lightweight finetuning on top of VideoCoCa, and achieve strong results on video question-answering (iVQA, MSRVTT-QA, MSVD-QA, ActivityNet-QA) and video captioning (MSR-VTT, ActivityNet, VATEX, Youcook2). Our approach establishes a simple video-text baseline for future research.  ( 2 min )
    On the Opportunity of Causal Deep Generative Models: A Survey and Future Directions. (arXiv:2301.12351v2 [cs.LG] UPDATED)
    Deep generative models have gained popularity in recent years due to their ability to accurately replicate inherent empirical distributions and yield novel samples. In particular, certain advances are proposed wherein the model engenders data examples following specified attributes. Nevertheless, several challenges still exist and are to be overcome, i.e., difficulty in extrapolating out-of-sample data and insufficient learning of disentangled representations. Structural causal models (SCMs), on the other hand, encapsulate the causal factors that govern a generative process and characterize a generative model based on causal relationships, providing crucial insights for addressing the current obstacles in deep generative models. In this paper, we present a comprehensive survey of Causal deep Generative Models (CGMs), which combine SCMs and deep generative models in a way that boosts several trustworthy properties such as robustness, fairness, and interpretability. We provide an overview of the recent advances in CGMs, categorize them based on generative types, and discuss how causality is introduced into the family of deep generative models. We also explore potential avenues for future research in this field.  ( 2 min )
    Post-Selection Confidence Bounds for Prediction Performance. (arXiv:2210.13206v3 [stat.ML] UPDATED)
    In machine learning, the selection of a promising model from a potentially large number of competing models and the assessment of its generalization performance are critical tasks that need careful consideration. Typically, model selection and evaluation are strictly separated endeavors, splitting the sample at hand into a training, validation, and evaluation set, and only compute a single confidence interval for the prediction performance of the final selected model. We however propose an algorithm how to compute valid lower confidence bounds for multiple models that have been selected based on their prediction performances in the evaluation set by interpreting the selection problem as a simultaneous inference problem. We use bootstrap tilting and a maxT-type multiplicity correction. The approach is universally applicable for any combination of prediction models, any model selection strategy, and any prediction performance measure that accepts weights. We conducted various simulation experiments which show that our proposed approach yields lower confidence bounds that are at least comparably good as bounds from standard approaches, and that reliably reach the nominal coverage probability. In addition, especially when sample size is small, our proposed approach yields better performing prediction models than the default selection of only one model for evaluation does.  ( 3 min )
    When Data Geometry Meets Deep Function: Generalizing Offline Reinforcement Learning. (arXiv:2205.11027v2 [cs.LG] UPDATED)
    In offline reinforcement learning (RL), one detrimental issue to policy learning is the error accumulation of deep Q function in out-of-distribution (OOD) areas. Unfortunately, existing offline RL methods are often over-conservative, inevitably hurting generalization performance outside data distribution. In our study, one interesting observation is that deep Q functions approximate well inside the convex hull of training data. Inspired by this, we propose a new method, DOGE (Distance-sensitive Offline RL with better GEneralization). DOGE marries dataset geometry with deep function approximators in offline RL, and enables exploitation in generalizable OOD areas rather than strictly constraining policy within data distribution. Specifically, DOGE trains a state-conditioned distance function that can be readily plugged into standard actor-critic methods as a policy constraint. Simple yet elegant, our algorithm enjoys better generalization compared to state-of-the-art methods on D4RL benchmarks. Theoretical analysis demonstrates the superiority of our approach to existing methods that are solely based on data distribution or support constraints.  ( 2 min )
    S$^3$NN: Time Step Reduction of Spiking Surrogate Gradients for Training Energy Efficient Single-Step Spiking Neural Networks. (arXiv:2201.10879v2 [cs.LG] UPDATED)
    As the scales of neural networks increase, techniques that enable them to run with low computational cost and energy efficiency are required. From such demands, various efficient neural network paradigms, such as spiking neural networks (SNNs) or binary neural networks (BNNs), have been proposed. However, they have sticky drawbacks, such as degraded inference accuracy and latency. To solve these problems, we propose a single-step spiking neural network (S$^3$NN), an energy-efficient neural network with low computational cost and high precision. The proposed S$^3$NN processes the information between hidden layers by spikes as SNNs. Nevertheless, it has no temporal dimension so that there is no latency within training and inference phases as BNNs. Thus, the proposed S$^3$NN has a lower computational cost than SNNs that require time-series processing. However, S$^3$NN cannot adopt na\"{i}ve backpropagation algorithms due to the non-differentiability nature of spikes. We deduce a suitable neuron model by reducing the surrogate gradient for multi-time step SNNs to a single-time step. We experimentally demonstrated that the obtained surrogate gradient allows S$^3$NN to be trained appropriately. We also showed that the proposed S$^3$NN could achieve comparable accuracy to full-precision networks while being highly energy-efficient.  ( 2 min )
    Consistent Range Approximation for Fair Predictive Modeling. (arXiv:2212.10839v2 [cs.LG] UPDATED)
    This paper proposes a novel framework for certifying the fairness of predictive models trained on biased data. It draws from query answering for incomplete and inconsistent databases to formulate the problem of consistent range approximation (CRA) of fairness queries for a predictive model on a target population. The framework employs background knowledge of the data collection process and biased data, working with or without limited statistics about the target population, to compute a range of answers for fairness queries. Using CRA, the framework builds predictive models that are certifiably fair on the target population, regardless of the availability of external data during training. The framework's efficacy is demonstrated through evaluations on real data, showing substantial improvement over existing state-of-the-art methods.  ( 2 min )
    Nocturne: a scalable driving benchmark for bringing multi-agent learning one step closer to the real world. (arXiv:2206.09889v3 [cs.MA] UPDATED)
    We introduce Nocturne, a new 2D driving simulator for investigating multi-agent coordination under partial observability. The focus of Nocturne is to enable research into inference and theory of mind in real-world multi-agent settings without the computational overhead of computer vision and feature extraction from images. Agents in this simulator only observe an obstructed view of the scene, mimicking human visual sensing constraints. Unlike existing benchmarks that are bottlenecked by rendering human-like observations directly using a camera input, Nocturne uses efficient intersection methods to compute a vectorized set of visible features in a C++ back-end, allowing the simulator to run at over 2000 steps-per-second. Using open-source trajectory and map data, we construct a simulator to load and replay arbitrary trajectories and scenes from real-world driving data. Using this environment, we benchmark reinforcement-learning and imitation-learning agents and demonstrate that the agents are quite far from human-level coordination ability and deviate significantly from the expert trajectories.
    Statistical treatment of convolutional neural network super-resolution of inland surface wind for subgrid-scale variability quantification. (arXiv:2211.16708v2 [physics.ao-ph] UPDATED)
    Machine learning models have been employed to perform either physics-free data-driven or hybrid dynamical downscaling of climate data. Most of these implementations operate over relatively small downscaling factors because of the challenge of recovering fine-scale information from coarse data. This limits their compatibility with many global climate model outputs, often available between $\sim$50--100 km resolution, to scales of interest such as cloud resolving or urban scales. This study systematically examines the capability of convolutional neural networks (CNNs) to downscale surface wind speed data over land surface from different coarse resolutions (25 km, 48 km, and 100 km resolution) to 3 km. For each downscaling factor, we consider three CNN configurations that generate super-resolved predictions of fine-scale wind speed, which take between 1 to 3 input fields: coarse wind speed, fine-scale topography, and diurnal cycle. In addition to fine-scale wind speeds, probability density function parameters are generated, through which sample wind speeds can be generated accounting for the intrinsic stochasticity of wind speed. For generalizability assessment, CNN models are tested on regions with different topography and climate that are unseen during training. The evaluation of super-resolved predictions focuses on subgrid-scale variability and the recovery of extremes. Models with coarse wind and fine topography as inputs exhibit the best performance compared with other model configurations, operating across the same downscaling factor. Our diurnal cycle encoding results in lower out-of-sample generalizability compared with other input configurations.  ( 2 min )
    Learning Counterfactually Invariant Predictors. (arXiv:2207.09768v2 [cs.LG] UPDATED)
    Counterfactual invariance has proven an essential property for predictors that are fair, robust, and generalizable in the real world. We propose a general definition of counterfactual invariance and provide simple graphical criteria that yield a sufficient condition for a predictor to be counterfactually invariant in terms of (conditional independence in) the observational distribution. Any predictor that satisfies our criterion is provably counterfactually invariant. In order to learn such predictors, we propose a model-agnostic framework, called Counterfactual Invariance Prediction (CIP), based on a kernel-based conditional dependence measure called Hilbert-Schmidt Conditional Independence Criterion (HSCIC). Our experimental results demonstrate the effectiveness of CIP in enforcing counterfactual invariance across various types of data including tabular, high-dimensional, and real-world dataset.  ( 2 min )
    A Systematic Survey of Molecular Pre-trained Models. (arXiv:2210.16484v2 [cs.LG] UPDATED)
    Deep learning has achieved remarkable success in learning representations for molecules, which is crucial for various biochemical applications, ranging from property prediction to drug design. However, training Deep Neural Networks (DNNs) from scratch often requires abundant labeled molecules, which are expensive to acquire in the real world. To alleviate this issue, tremendous efforts have been devoted to Molecular Pre-trained Models (MPMs), where DNNs are pre-trained using large-scale unlabeled molecular databases and then fine-tuned over specific downstream tasks. Despite the prosperity, there lacks a systematic review of this fast-growing field. In this paper, we present the first survey that summarizes the current progress of MPMs. We first highlight the limitations of training molecular representation models from scratch to motivate MPM studies. Next, we systematically review recent advances on this topic from several key perspectives, including molecular descriptors, encoder architectures, pre-training strategies, and applications. We also highlight the challenges and promising avenues for future research, providing a useful resource for both machine learning and scientific communities.  ( 2 min )
    Relative Behavioral Attributes: Filling the Gap between Symbolic Goal Specification and Reward Learning from Human Preferences. (arXiv:2210.15906v3 [cs.AI] UPDATED)
    Generating complex behaviors that satisfy the preferences of non-expert users is a crucial requirement on AI agents. Interactive reward learning from trajectory comparisons is one way to allow non-expert users to convey complex objectives by expressing preferences over short clips of agent behaviors. Even though this parametric method can encode complex tacit knowledge present in the underlying tasks, it implicitly assumes that the human is unable to provide richer feedback than binary preference labels, leading to intolerably high feedback complexity and poor user experience. While providing a detailed symbolic closed-form specification of the objectives might be tempting, it is not always feasible even for an expert user. However, in most cases, humans are aware of how the agent should change its behavior along meaningful axes to fulfill their underlying purpose, even if they are not able to fully specify task objectives symbolically. Using this as motivation, we introduce the notion of Relative Behavioral Attributes, which allows the users to tweak the agent behavior through symbolic concepts (e.g., increasing the softness or speed of agents' movement). We propose two practical methods that can learn to model any kind of behavioral attributes from ordered behavior clips. We demonstrate the effectiveness of our methods on four tasks with nine different behavioral attributes, showing that once the attributes are learned, end users can produce desirable agent behaviors relatively effortlessly, by providing feedback just around ten times. This is over an order of magnitude less than that required by the popular learning-from-human-preferences baselines. The supplementary video and source code are available at: https://guansuns.github.io/pages/rba.
    Beyond Invariance: Test-Time Label-Shift Adaptation for Distributions with "Spurious" Correlations. (arXiv:2211.15646v2 [stat.ML] UPDATED)
    Spurious correlations, or correlations that change across domains where a model can be deployed, present significant challenges to real-world applications of machine learning models. However, such correlations are not always "spurious"; often, they provide valuable prior information for a prediction. Here, we present a test-time adaptation method that exploits the spurious correlation phenomenon, in contrast to recent approaches that attempt to eliminate spurious correlations through invariance. We consider situations where the prior distribution $p(y, z)$, which models the dependence between the class label $y$ and the "nuisance" factors $z$, may change across domains, but the generative model for features $p(\mathbf{x}|y, z)$ is constant. We note that this corresponds to an expanded version of the label shift assumption, where the labels now also include the nuisance factors $z$. Based on this observation, we train a classifier to predict $p(y, z|\mathbf{x})$ on the source distribution, and propose a test-time label shift correction that adapts to changes in the marginal distribution $p(y, z)$ using unlabeled samples from the target domain. We evaluate our method, which we call "Test-Time Label-Shift Adaptation" (TTLSA), on two different image datasets -- the CheXpert chest X-ray dataset and the Colored MNIST dataset -- and show a significant improvement over baseline methods. Code reproducing experiments is available at https://github.com/nalzok/test-time-label-shift .
    IR-MCL: Implicit Representation-Based Online Global Localization. (arXiv:2210.03113v2 [cs.RO] UPDATED)
    Determining the state of a mobile robot is an essential building block of robot navigation systems. In this paper, we address the problem of estimating the robots pose in an indoor environment using 2D LiDAR data and investigate how modern environment models can improve gold standard Monte-Carlo localization (MCL) systems. We propose a neural occupancy field to implicitly represent the scene using a neural network. With the pretrained network, we can synthesize 2D LiDAR scans for an arbitrary robot pose through volume rendering. Based on the implicit representation, we can obtain the similarity between a synthesized and actual scan as an observation model and integrate it into an MCL system to perform accurate localization. We evaluate our approach on self-recorded datasets and three publicly available ones. We show that we can accurately and efficiently localize a robot using our approach surpassing the localization performance of state-of-the-art methods. The experiments suggest that the presented implicit representation is able to predict more accurate 2D LiDAR scans leading to an improved observation model for our particle filter-based localization. The code of our approach will be available at: https://github.com/PRBonn/ir-mcl.
    Causal Modeling of Policy Interventions From Sequences of Treatments and Outcomes using Gaussian Processes. (arXiv:2209.04142v4 [cs.LG] UPDATED)
    A treatment policy defines when and what treatments are applied to affect some outcome of interest. Data-driven decision-making requires the ability to predict what happens if a policy is changed. Existing methods that predict how the outcome evolves under different scenarios assume that the tentative sequences of future treatments are fixed in advance, while in practice the treatments are determined stochastically by a policy and may depend for example on the efficiency of previous treatments. Therefore, the current methods are not applicable if the treatment policy is unknown or a counterfactual analysis is needed. To handle these limitations, we model the treatments and outcomes jointly in continuous time, by combining Gaussian processes and point processes. Our model enables the estimation of a treatment policy from observational sequences of treatments and outcomes, and it can predict the interventional and counterfactual progression of the outcome after an intervention on the treatment policy (in contrast with the causal effect of a single treatment). We show with real-world and semi-synthetic data on blood glucose progression that our method can answer causal queries more accurately than existing alternatives.
    Vicarious Offense and Noise Audit of Offensive Speech Classifiers. (arXiv:2301.12534v2 [cs.CL] UPDATED)
    This paper examines social web content moderation from two key perspectives: automated methods (machine moderators) and human evaluators (human moderators). We conduct a noise audit at an unprecedented scale using nine machine moderators trained on well-known offensive speech data sets evaluated on a corpus sampled from 92 million YouTube comments discussing a multitude of issues relevant to US politics. We introduce a first-of-its-kind data set of vicarious offense. We ask annotators: (1) if they find a given social media post offensive; and (2) how offensive annotators sharing different political beliefs would find the same content. Our experiments with machine moderators reveal that moderation outcomes wildly vary across different machine moderators. Our experiments with human moderators suggest that (1) political leanings considerably affect first-person offense perspective; (2) Republicans are the worst predictors of vicarious offense; (3) predicting vicarious offense for the Republicans is most challenging than predicting vicarious offense for the Independents and the Democrats; and (4) disagreement across political identity groups considerably increases when sensitive issues such as reproductive rights or gun control/rights are discussed. Both experiments suggest that offense, is indeed, highly subjective and raise important questions concerning content moderation practices.  ( 2 min )
    PDEBENCH: An Extensive Benchmark for Scientific Machine Learning. (arXiv:2210.07182v4 [cs.LG] UPDATED)
    Machine learning-based modeling of physical systems has experienced increased interest in recent years. Despite some impressive progress, there is still a lack of benchmarks for Scientific ML that are easy to use but still challenging and representative of a wide range of problems. We introduce PDEBench, a benchmark suite of time-dependent simulation tasks based on Partial Differential Equations (PDEs). PDEBench comprises both code and data to benchmark the performance of novel machine learning models against both classical numerical simulations and machine learning baselines. Our proposed set of benchmark problems contribute the following unique features: (1) A much wider range of PDEs compared to existing benchmarks, ranging from relatively common examples to more realistic and difficult problems; (2) much larger ready-to-use datasets compared to prior work, comprising multiple simulation runs across a larger number of initial and boundary conditions and PDE parameters; (3) more extensible source codes with user-friendly APIs for data generation and baseline results with popular machine learning models (FNO, U-Net, PINN, Gradient-Based Inverse Method). PDEBench allows researchers to extend the benchmark freely for their own purposes using a standardized API and to compare the performance of new models to existing baseline methods. We also propose new evaluation metrics with the aim to provide a more holistic understanding of learning methods in the context of Scientific ML. With those metrics we identify tasks which are challenging for recent ML methods and propose these tasks as future challenges for the community. The code is available at https://github.com/pdebench/PDEBench.
    SketchySGD: Reliable Stochastic Optimization via Robust Curvature Estimates. (arXiv:2211.08597v3 [math.OC] UPDATED)
    We introduce SketchySGD, a stochastic quasi-Newton method that uses sketching to approximate the curvature of the loss function. SketchySGD improves upon existing stochastic gradient methods in machine learning by using randomized low-rank approximations to the subsampled Hessian and by introducing an automated stepsize that works well across a wide range of convex machine learning problems. We show theoretically that SketchySGD with a fixed stepsize converges linearly to a small ball around the optimum. Further, in the ill-conditioned setting we show SketchySGD converges at a faster rate than SGD for least-squares problems. We validate this improvement empirically with ridge regression experiments on real data. Numerical experiments on both ridge and logistic regression problems show that SketchySGD can achieve comparable or better results to popular stochastic gradient methods with minimal hyperparameter tuning. The robustness of SketchySGD to hyperparameters is an advantage over other stochastic gradient methods, most of which require careful hyperparameter tuning (especially of the learning rate) to obtain good performance.
    RL4ReAl: Reinforcement Learning for Register Allocation. (arXiv:2204.02013v2 [cs.LG] UPDATED)
    We aim to automate decades of research and experience in register allocation, leveraging machine learning. We tackle this problem by embedding a multi-agent reinforcement learning algorithm within LLVM, training it with the state of the art techniques. We formalize the constraints that precisely define the problem for a given instruction-set architecture, while ensuring that the generated code preserves semantic correctness. We also develop a gRPC based framework providing a modular and efficient compiler interface for training and inference. Our approach is architecture independent: we show experimental results targeting Intel x86 and ARM AArch64. Our results match or out-perform the heavily tuned, production-grade register allocators of LLVM.  ( 2 min )
    SemEval 2023 Task 9: Multilingual Tweet Intimacy Analysis. (arXiv:2210.01108v2 [cs.CL] UPDATED)
    We propose MINT, a new Multilingual INTimacy analysis dataset covering 13,372 tweets in 10 languages including English, French, Spanish, Italian, Portuguese, Korean, Dutch, Chinese, Hindi, and Arabic. We benchmarked a list of popular multilingual pre-trained language models. The dataset is released along with the SemEval 2023 Task 9: Multilingual Tweet Intimacy Analysis (https://sites.google.com/umich.edu/semeval-2023-tweet-intimacy).
    New Machine Learning Techniques for Simulation-Based Inference: InferoStatic Nets, Kernel Score Estimation, and Kernel Likelihood Ratio Estimation. (arXiv:2210.01680v2 [stat.ML] UPDATED)
    We propose an intuitive, machine-learning approach to multiparameter inference, dubbed the InferoStatic Networks (ISN) method, to model the score and likelihood ratio estimators in cases when the probability density can be sampled but not computed directly. The ISN uses a backend neural network that models a scalar function called the inferostatic potential $\varphi$. In addition, we introduce new strategies, respectively called Kernel Score Estimation (KSE) and Kernel Likelihood Ratio Estimation (KLRE), to learn the score and the likelihood ratio functions from simulated data. We illustrate the new techniques with some toy examples and compare to existing approaches in the literature. We mention en passant some new loss functions that optimally incorporate latent information from simulations into the training procedure.
    Conditional Antibody Design as 3D Equivariant Graph Translation. (arXiv:2208.06073v4 [q-bio.BM] UPDATED)
    Antibody design is valuable for therapeutic usage and biological research. Existing deep-learning-based methods encounter several key issues: 1) incomplete context for Complementarity-Determining Regions (CDRs) generation; 2) incapability of capturing the entire 3D geometry of the input structure; 3) inefficient prediction of the CDR sequences in an autoregressive manner. In this paper, we propose Multi-channel Equivariant Attention Network (MEAN) to co-design 1D sequences and 3D structures of CDRs. To be specific, MEAN formulates antibody design as a conditional graph translation problem by importing extra components including the target antigen and the light chain of the antibody. Then, MEAN resorts to E(3)-equivariant message passing along with a proposed attention mechanism to better capture the geometrical correlation between different components. Finally, it outputs both the 1D sequences and 3D structure via a multi-round progressive full-shot scheme, which enjoys more efficiency and precision against previous autoregressive approaches. Our method significantly surpasses state-of-the-art models in sequence and structure modeling, antigen-binding CDR design, and binding affinity optimization. Specifically, the relative improvement to baselines is about 23% in antigen-binding CDR design and 34% for affinity optimization.
    Testing Rare Downstream Safety Violations via Upstream Adaptive Sampling of Perception Error Models. (arXiv:2209.09674v3 [cs.RO] UPDATED)
    Testing black-box perceptual-control systems in simulation faces two difficulties. Firstly, perceptual inputs in simulation lack the fidelity of real-world sensor inputs. Secondly, for a reasonably accurate perception system, encountering a rare failure trajectory may require running infeasibly many simulations. This paper combines perception error models -- surrogates for a sensor-based detection system -- with state-dependent adaptive importance sampling. This allows us to efficiently assess the rare failure probabilities for real-world perceptual control systems within simulation. Our experiments with an autonomous braking system equipped with an RGB obstacle-detector show that our method can calculate accurate failure probabilities with an inexpensive number of simulations. Further, we show how choice of safety metric can influence the process of learning proposal distributions capable of reliably sampling high-probability failures.
    Physically Consistent Learning of Conservative Lagrangian Systems with Gaussian Processes. (arXiv:2206.12272v3 [cs.LG] UPDATED)
    This paper proposes a physically consistent Gaussian Process (GP) enabling the identification of uncertain Lagrangian systems. The function space is tailored according to the energy components of the Lagrangian and the differential equation structure, analytically guaranteeing physical and mathematical properties such as energy conservation and quadratic form. The novel formulation of Cholesky decomposed matrix kernels allow the probabilistic preservation of positive definiteness. Only differential input-to-output measurements of the function map are required while Gaussian noise is permitted in torques, velocities, and accelerations. We demonstrate the effectiveness of the approach in numerical simulation.
    Socially Fair Reinforcement Learning. (arXiv:2208.12584v2 [cs.LG] UPDATED)
    We consider the problem of episodic reinforcement learning where there are multiple stakeholders with different reward functions. Our goal is to output a policy that is socially fair with respect to different reward functions. Prior works have proposed different objectives that a fair policy must optimize including minimum welfare, and generalized Gini welfare. We first take an axiomatic view of the problem, and propose four axioms that any such fair objective must satisfy. We show that the Nash social welfare is the unique objective that uniquely satisfies all four objectives, whereas prior objectives fail to satisfy all four axioms. We then consider the learning version of the problem where the underlying model i.e. Markov decision process is unknown. We consider the problem of minimizing regret with respect to the fair policies maximizing three different fair objectives -- minimum welfare, generalized Gini welfare, and Nash social welfare. Based on optimistic planning, we propose a generic learning algorithm and derive its regret bound with respect to the three different policies. For the objective of Nash social welfare, we also derive a lower bound in regret that grows exponentially with $n$, the number of agents. Finally, we show that for the objective of minimum welfare, one can improve regret by a factor of $O(H)$ for a weaker notion of regret.
    Distributionally Robust Causal Inference with Observational Data. (arXiv:2210.08326v3 [stat.ME] UPDATED)
    We consider the estimation of average treatment effects in observational studies and propose a new framework of robust causal inference with unobserved confounders. Our approach is based on distributionally robust optimization and proceeds in two steps. We first specify the maximal degree to which the distribution of unobserved potential outcomes may deviate from that of observed outcomes. We then derive sharp bounds on the average treatment effects under this assumption. Our framework encompasses the popular marginal sensitivity model as a special case, and we demonstrate how the proposed methodology can address a primary challenge of the marginal sensitivity model that it produces uninformative results when unobserved confounders substantially affect treatment and outcome. Specifically, we develop an alternative sensitivity model, called the distributional sensitivity model, under the assumption that heterogeneity of treatment effect due to unobserved variables is relatively small. Unlike the marginal sensitivity model, the distributional sensitivity model allows for potential lack of overlap and often produces informative bounds even when unobserved variables substantially affect both treatment and outcome. Finally, we show how to extend the distributional sensitivity model to difference-in-differences designs and settings with instrumental variables. Through simulation and empirical studies, we demonstrate the applicability of the proposed methodology.
    Searching for the Essence of Adversarial Perturbations. (arXiv:2205.15357v3 [cs.LG] UPDATED)
    Neural networks have demonstrated state-of-the-art performance in various machine learning fields. However, the introduction of malicious perturbations in input data, known as adversarial examples, has been shown to deceive neural network predictions. This poses potential risks for real-world applications such as autonomous driving and text identification. In order to mitigate these risks, a comprehensive understanding of the mechanisms underlying adversarial examples is essential. In this study, we demonstrate that adversarial perturbations contain human-recognizable information, which is the key conspirator responsible for a neural network's incorrect prediction, in contrast to the widely held belief that human-unidentifiable characteristics play a critical role in fooling a network. This concept of human-recognizable characteristics enables us to explain key features of adversarial perturbations, including their existence, transferability among different neural networks, and increased interpretability for adversarial training. We also uncover two unique properties of adversarial perturbations that deceive neural networks: masking and generation. Additionally, a special class, the complementary class, is identified when neural networks classify input images. The presence of human-recognizable information in adversarial perturbations allows researchers to gain insight into the working principles of neural networks and may lead to the development of techniques for detecting and defending against adversarial attacks.
    GOOD: Exploring Geometric Cues for Detecting Objects in an Open World. (arXiv:2212.11720v3 [cs.CV] UPDATED)
    We address the task of open-world class-agnostic object detection, i.e., detecting every object in an image by learning from a limited number of base object classes. State-of-the-art RGB-based models suffer from overfitting the training classes and often fail at detecting novel-looking objects. This is because RGB-based models primarily rely on appearance similarity to detect novel objects and are also prone to overfitting short-cut cues such as textures and discriminative parts. To address these shortcomings of RGB-based object detectors, we propose incorporating geometric cues such as depth and normals, predicted by general-purpose monocular estimators. Specifically, we use the geometric cues to train an object proposal network for pseudo-labeling unannotated novel objects in the training set. Our resulting Geometry-guided Open-world Object Detector (GOOD) significantly improves detection recall for novel object categories and already performs well with only a few training classes. Using a single "person" class for training on the COCO dataset, GOOD surpasses SOTA methods by 5.0% AR@100, a relative improvement of 24%.  ( 2 min )
    Realizable Learning is All You Need. (arXiv:2111.04746v3 [cs.LG] UPDATED)
    The equivalence of realizable and agnostic learnability is a fundamental phenomenon in learning theory. With variants ranging from classical settings like PAC learning and regression to recent trends such as adversarially robust learning, it's surprising that we still lack a unified theory; traditional proofs of the equivalence tend to be disparate, and rely on strong model-specific assumptions like uniform convergence and sample compression. In this work, we give the first model-independent framework explaining the equivalence of realizable and agnostic learnability: a three-line blackbox reduction that simplifies, unifies, and extends our understanding across a wide variety of settings. This includes models with no known characterization of learnability such as learning with arbitrary distributional assumptions and more general loss functions, as well as a host of other popular settings such as robust learning, partial learning, fair learning, and the statistical query model. More generally, we argue that the equivalence of realizable and agnostic learning is actually a special case of a broader phenomenon we call property generalization: any desirable property of a learning algorithm (e.g. noise tolerance, privacy, stability) that can be satisfied over finite hypothesis classes extends (possibly in some variation) to any learnable hypothesis class.  ( 2 min )
    Overcoming the Long Horizon Barrier for Sample-Efficient Reinforcement Learning with Latent Low-Rank Structure. (arXiv:2206.03569v3 [cs.LG] UPDATED)
    The practicality of reinforcement learning algorithms has been limited due to poor scaling with respect to the problem size, as the sample complexity of learning an $\epsilon$-optimal policy is $\tilde{\Omega}\left(|S||A|H^3 / \epsilon^2\right)$ over worst case instances of an MDP with state space $S$, action space $A$, and horizon $H$. We consider a class of MDPs for which the associated optimal $Q^*$ function is low rank, where the latent features are unknown. While one would hope to achieve linear sample complexity in $|S|$ and $|A|$ due to the low rank structure, we show that without imposing further assumptions beyond low rank of $Q^*$, if one is constrained to estimate the $Q$ function using only observations from a subset of entries, there is a worst case instance in which one must incur a sample complexity exponential in the horizon $H$ to learn a near optimal policy. We subsequently show that under stronger low rank structural assumptions, given access to a generative model, Low Rank Monte Carlo Policy Iteration (LR-MCPI) and Low Rank Empirical Value Iteration (LR-EVI) achieve the desired sample complexity of $\tilde{O}\left((|S|+|A|)\mathrm{poly}(d,H)/\epsilon^2\right)$ for a rank $d$ setting, which is minimax optimal with respect to the scaling of $|S|, |A|$, and $\epsilon$. In contrast to literature on linear and low-rank MDPs, we do not require a known feature mapping, our algorithm is computationally simple, and our results hold for long time horizons. Our results provide insights on the minimal low-rank structural assumptions required on the MDP with respect to the transition kernel versus the optimal action-value function.
    FiT: Parameter Efficient Few-shot Transfer Learning for Personalized and Federated Image Classification. (arXiv:2206.08671v2 [stat.ML] UPDATED)
    Modern deep learning systems are increasingly deployed in situations such as personalization and federated learning where it is necessary to support i) learning on small amounts of data, and ii) communication efficient distributed training protocols. In this work, we develop FiLM Transfer (FiT) which fulfills these requirements in the image classification setting by combining ideas from transfer learning (fixed pretrained backbones and fine-tuned FiLM adapter layers) and meta-learning (automatically configured Naive Bayes classifiers and episodic training) to yield parameter efficient models with superior classification accuracy at low-shot. The resulting parameter efficiency is key for enabling few-shot learning, inexpensive model updates for personalization, and communication efficient federated learning. We experiment with FiT on a wide range of downstream datasets and show that it achieves better classification accuracy than the leading Big Transfer (BiT) algorithm at low-shot and achieves state-of-the art accuracy on the challenging VTAB-1k benchmark, with fewer than 1% of the updateable parameters. Finally, we demonstrate the parameter efficiency and superior accuracy of FiT in distributed low-shot applications including model personalization and federated learning where model update size is an important performance metric.
    Self-Programming Artificial Intelligence Using Code-Generating Language Models. (arXiv:2205.00167v2 [cs.AI] UPDATED)
    Recent progress in large-scale language models has enabled breakthroughs in previously intractable computer programming tasks. Prior work in meta-learning and neural architecture search has led to substantial successes across various task domains, spawning myriad approaches for algorithmically optimizing the design and learning dynamics of deep learning models. At the intersection of these research areas, we implement a code-generating language model with the ability to modify its own source code. Self-programming AI algorithms have been of interest since the dawn of AI itself. Although various theoretical formulations of generalized self-programming AI have been posed, no such system has been successfully implemented to date under real-world computational constraints. Applying AI-based code generation to AI itself, we develop and experimentally validate the first practical implementation of a self-programming AI system. We empirically show that a self-programming AI implemented using a code generation model can successfully modify its own source code to improve performance and program sub-models to perform auxiliary tasks. Our model can self-modify various properties including model architecture, computational capacity, and learning dynamics.  ( 2 min )
    PyGlove: Efficiently Exchanging ML Ideas as Code. (arXiv:2302.01918v1 [cs.LG])
    The increasing complexity and scale of machine learning (ML) has led to the need for more efficient collaboration among multiple teams. For example, when a research team invents a new architecture like "ResNet," it is desirable for multiple engineering teams to adopt it. However, the effort required for each team to study and understand the invention does not scale well with the number of teams or inventions. In this paper, we present an extension of our PyGlove library to easily and scalably share ML ideas. PyGlove represents ideas as symbolic rule-based patches, enabling researchers to write down the rules for models they have not seen. For example, an inventor can write rules that will "add skip-connections." This permits a network effect among teams: at once, any team can issue patches to all other teams. Such a network effect allows users to quickly surmount the cost of adopting PyGlove by writing less code quicker, providing a benefit that scales with time. We describe the new paradigm of organizing ML through symbolic patches and compare it to existing approaches. We also perform a case study of a large codebase where PyGlove led to an 80% reduction in the number of lines of code.  ( 2 min )
    Fast Feature Selection with Fairness Constraints. (arXiv:2202.13718v2 [cs.LG] UPDATED)
    We study the fundamental problem of selecting optimal features for model construction. This problem is computationally challenging on large datasets, even with the use of greedy algorithm variants. To address this challenge, we extend the adaptive query model, recently proposed for the greedy forward selection for submodular functions, to the faster paradigm of Orthogonal Matching Pursuit for non-submodular functions. The proposed algorithm achieves exponentially fast parallel run time in the adaptive query model, scaling much better than prior work. Furthermore, our extension allows the use of downward-closed constraints, which can be used to encode certain fairness criteria into the feature selection process. We prove strong approximation guarantees for the algorithm based on standard assumptions. These guarantees are applicable to many parametric models, including Generalized Linear Models. Finally, we demonstrate empirically that the proposed algorithm competes favorably with state-of-the-art techniques for feature selection, on real-world and synthetic datasets.  ( 2 min )
    Fast Bayesian Optimization of Needle-in-a-Haystack Problems using Zooming Memory-Based Initialization (ZoMBI). (arXiv:2208.13771v2 [cs.LG] UPDATED)
    Needle-in-a-Haystack problems exist across a wide range of applications including rare disease prediction, ecological resource management, fraud detection, and material property optimization. A Needle-in-a-Haystack problem arises when there is an extreme imbalance of optimum conditions relative to the size of the dataset. For example, only $0.82\%$ out of $146$k total materials in the open-access Materials Project database have a negative Poisson's ratio. However, current state-of-the-art optimization algorithms are not designed with the capabilities to find solutions to these challenging multidimensional Needle-in-a-Haystack problems, resulting in slow convergence to a global optimum or pigeonholing into a local minimum. In this paper, we present a Zooming Memory-Based Initialization algorithm, entitled ZoMBI. ZoMBI actively extracts knowledge from the previously best-performing evaluated experiments to iteratively zoom in the sampling search bounds towards the global optimum "needle" and then prunes the memory of low-performing historical experiments to accelerate compute times by reducing the algorithm time complexity from $O(n^3)$ to $O(\phi^3)$ for $\phi$ forward experiments per activation, which trends to a constant $O(1)$ over several activations. Additionally, ZoMBI implements two custom adaptive acquisition functions to further guide the sampling of new experiments toward the global optimum. We validate the algorithm's optimization performance on three real-world datasets exhibiting Needle-in-a-Haystack and further stress-test the algorithm's performance on an additional 174 analytical datasets. The ZoMBI algorithm demonstrates compute time speed-ups of 400x compared to traditional Bayesian optimization as well as efficiently discovering optima in under 100 experiments that are up to 3x more highly optimized than those discovered by similar methods MiP-EGO, TuRBO, and HEBO.
    Accelerometry-based classification of circulatory states during out-of-hospital cardiac arrest. (arXiv:2205.06540v2 [eess.SP] UPDATED)
    Objective: Exploit accelerometry data for an automatic, reliable, and prompt detection of spontaneous circulation during cardiac arrest, as this is both vital for patient survival and practically challenging. Methods: We developed a machine learning algorithm to automatically predict the circulatory state during cardiopulmonary resuscitation from 4-second-long snippets of accelerometry and electrocardiogram (ECG) data from pauses of chest compressions of real-world defibrillator records. The algorithm was trained based on 422 cases from the German Resuscitation Registry, for which ground truth labels were created by a manual annotation of physicians. It uses a kernelized Support Vector Machine classifier based on 49 features, which partially reflect the correlation between accelerometry and electrocardiogram data. Results: Evaluating 50 different test-training data splits, the proposed algorithm exhibits a balanced accuracy of 81.2%, a sensitivity of 80.6%, and a specificity of 81.8%, whereas using only ECG leads to a balanced accuracy of 76.5%, a sensitivity of 80.2%, and a specificity of 72.8%. Conclusion: The first method employing accelerometry for pulse/no-pulse decision yields a significant increase in performance compared to single ECG-signal usage. Significance: This shows that accelerometry provides relevant information for pulse/no-pulse decisions. In application, such an algorithm may be used to simplify retrospective annotation for quality management and, moreover, to support clinicians to assess circulatory state during cardiac arrest treatment.
    The Solvability of Interpretability Evaluation Metrics. (arXiv:2205.08696v2 [cs.LG] UPDATED)
    Feature attribution methods are popular for explaining neural network predictions, and they are often evaluated on metrics such as comprehensiveness and sufficiency. In this paper, we highlight an intriguing property of these metrics: their solvability. Concretely, we can define the problem of optimizing an explanation for a metric, which can be solved by beam search. This observation leads to the obvious yet unaddressed question: why do we use explainers (e.g., LIME) not based on solving the target metric, if the metric value represents explanation quality? We present a series of investigations showing strong performance of this beam search explainer and discuss its broader implication: a definition-evaluation duality of interpretability concepts. We implement the explainer and release the Python solvex package for models of text, image and tabular domains.  ( 2 min )
    LEAF: Navigating Concept Drift in Cellular Networks. (arXiv:2109.03011v5 [cs.NI] UPDATED)
    Operational networks commonly rely on machine learning models for many tasks, including detecting anomalies, inferring application performance, and forecasting demand. Yet, model accuracy can degrade due to concept drift, whereby the relationship between the features and the target to be predicted changes. Mitigating concept drift is an essential part of operationalizing machine learning models in general, but is of particular importance in networking's highly dynamic deployment environments. In this paper, we first characterize concept drift in a large cellular network for a major metropolitan area in the United States. We find that concept drift occurs across many important key performance indicators (KPIs), independently of the model, training set size, and time interval -- thus necessitating practical approaches to detect, explain, and mitigate it. We then show that frequent model retraining with newly available data is not sufficient to mitigate concept drift, and can even degrade model accuracy further. Finally, we develop a new methodology for concept drift mitigation, Local Error Approximation of Features (LEAF). LEAF works by detecting drift; explaining the features and time intervals that contribute the most to drift; and mitigates it using forgetting and over-sampling. We evaluate LEAF against industry-standard mitigation approaches (notably, periodic retraining) with more than four years of cellular KPI data. Our initial tests with a major cellular provider in the US show that LEAF consistently outperforms periodic and triggered retraining on complex, real-world data while reducing costly retraining operations.  ( 2 min )
    Online Verification of Deep Neural Networks under Domain Shift or Network Updates. (arXiv:2106.12732v2 [cs.LG] UPDATED)
    Although neural networks are widely used, it remains challenging to formally verify the safety and robustness of neural networks in real-world applications. Existing methods are designed to verify the network before deployment, which are limited to relatively simple specifications and fixed networks. These methods are not ready to be applied to real-world problems with complex and/or dynamically changing specifications and networks. To effectively handle such problems, verification needs to be performed online when these changes take place. However, it is still challenging to run existing verification algorithms online. Our key insight is that we can leverage the temporal dependencies of these changes to accelerate the verification process. This paper establishes a novel framework for scalable online verification to solve real-world verification problems with dynamically changing specifications and/or networks. We propose three types of acceleration algorithms: Branch Management to reduce repetitive computation, Perturbation Tolerance to tolerate changes, and Incremental Computation to reuse previous results. Experiment results show that our algorithms achieve up to $100\times$ acceleration, and thus show a promising way to extend neural network verification to real-world applications.  ( 2 min )
    Discovering Policies with DOMiNO: Diversity Optimization Maintaining Near Optimality. (arXiv:2205.13521v2 [cs.AI] UPDATED)
    Finding different solutions to the same problem is a key aspect of intelligence associated with creativity and adaptation to novel situations. In reinforcement learning, a set of diverse policies can be useful for exploration, transfer, hierarchy, and robustness. We propose DOMiNO, a method for Diversity Optimization Maintaining Near Optimality. We formalize the problem as a Constrained Markov Decision Process where the objective is to find diverse policies, measured by the distance between the state occupancies of the policies in the set, while remaining near-optimal with respect to the extrinsic reward. We demonstrate that the method can discover diverse and meaningful behaviors in various domains, such as different locomotion patterns in the DeepMind Control Suite. We perform extensive analysis of our approach, compare it with other multi-objective baselines, demonstrate that we can control both the quality and the diversity of the set via interpretable hyperparameters, and show that the discovered set is robust to perturbations.  ( 2 min )
    Towards Optimal Branching of Linear and Semidefinite Relaxations for Neural Network Robustness Certification. (arXiv:2101.09306v2 [cs.LG] UPDATED)
    In this paper, we study certifying the robustness of ReLU neural networks against adversarial input perturbations. To diminish the relaxation error suffered by the popular linear programming (LP) and semidefinite programming (SDP) certification methods, we take a branch-and-bound approach to propose partitioning the input uncertainty set and solving the relaxations on each part separately. We show that this approach reduces relaxation error, and that the error is eliminated entirely upon performing an LP relaxation with a partition intelligently designed to exploit the nature of the ReLU activations. To scale this approach to large networks, we consider using a coarser partition whereby the number of parts in the partition is reduced. We prove that computing such a coarse partition that directly minimizes the LP relaxation error is NP-hard. By instead minimizing the worst-case LP relaxation error, we develop a closed-form branching scheme. We extend the analysis to the SDP, where the feasible set geometry is exploited to design a branching scheme that minimizes the worst-case SDP relaxation error. Experiments on MNIST, CIFAR-10, and Wisconsin breast cancer diagnosis classifiers demonstrate significant increases in the percentages of test samples certified. By independently increasing the input size and the number of layers, we empirically illustrate under which regimes the branched LP and branched SDP are best applied.  ( 2 min )
    Aligning Robot and Human Representations. (arXiv:2302.01928v1 [cs.RO])
    To act in the world, robots rely on a representation of salient task aspects: for example, to carry a cup of coffee, a robot must consider movement efficiency and cup orientation in its behaviour. However, if we want robots to act for and with people, their representations must not be just functional but also reflective of what humans care about, i.e. their representations must be aligned with humans'. In this survey, we pose that current reward and imitation learning approaches suffer from representation misalignment, where the robot's learned representation does not capture the human's representation. We suggest that because humans will be the ultimate evaluator of robot performance in the world, it is critical that we explicitly focus our efforts on aligning learned task representations with humans, in addition to learning the downstream task. We advocate that current representation learning approaches in robotics should be studied from the perspective of how well they accomplish the objective of representation alignment. To do so, we mathematically define the problem, identify its key desiderata, and situate current robot learning methods within this formalism. We conclude the survey by suggesting future directions for exploring open challenges.  ( 2 min )
    Enhancing Once-For-All: A Study on Parallel Blocks, Skip Connections and Early Exits. (arXiv:2302.01888v1 [cs.LG])
    The use of Neural Architecture Search (NAS) techniques to automate the design of neural networks has become increasingly popular in recent years. The proliferation of devices with different hardware characteristics using such neural networks, as well as the need to reduce the power consumption for their search, has led to the realisation of Once-For-All (OFA), an eco-friendly algorithm characterised by the ability to generate easily adaptable models through a single learning process. In order to improve this paradigm and develop high-performance yet eco-friendly NAS techniques, this paper presents OFAv2, the extension of OFA aimed at improving its performance while maintaining the same ecological advantage. The algorithm is improved from an architectural point of view by including early exits, parallel blocks and dense skip connections. The training process is extended by two new phases called Elastic Level and Elastic Height. A new Knowledge Distillation technique is presented to handle multi-output networks, and finally a new strategy for dynamic teacher network selection is proposed. These modifications allow OFAv2 to improve its accuracy performance on the Tiny ImageNet dataset by up to 12.07% compared to the original version of OFA, while maintaining the algorithm flexibility and advantages.  ( 2 min )
    Unsupervised hierarchical clustering using the learning dynamics of RBMs. (arXiv:2302.01851v1 [cs.LG])
    Datasets in the real world are often complex and to some degree hierarchical, with groups and sub-groups of data sharing common characteristics at different levels of abstraction. Understanding and uncovering the hidden structure of these datasets is an important task that has many practical applications. To address this challenge, we present a new and general method for building relational data trees by exploiting the learning dynamics of the Restricted Boltzmann Machine (RBM). Our method is based on the mean-field approach, derived from the Plefka expansion, and developed in the context of disordered systems. It is designed to be easily interpretable. We tested our method in an artificially created hierarchical dataset and on three different real-world datasets (images of digits, mutations in the human genome, and a homologous family of proteins). The method is able to automatically identify the hierarchical structure of the data. This could be useful in the study of homologous protein sequences, where the relationships between proteins are critical for understanding their function and evolution.  ( 2 min )
    Data Representativity for Machine Learning and AI Systems. (arXiv:2203.04706v2 [stat.ML] UPDATED)
    Data representativity is crucial when drawing inference from data through machine learning models. Scholars have increased focus on unraveling the bias and fairness in models, also in relation to inherent biases in the input data. However, limited work exists on the representativity of samples (datasets) for appropriate inference in AI systems. This paper reviews definitions and notions of a representative sample and surveys their use in scientific AI literature. We introduce three measurable concepts to help focus the notions and evaluate different data samples. Furthermore, we demonstrate that the contrast between a representative sample in the sense of coverage of the input space, versus a representative sample mimicking the distribution of the target population is of particular relevance when building AI systems. Through empirical demonstrations on US Census data, we evaluate the opposing inherent qualities of these concepts. Finally, we propose a framework of questions for creating and documenting data with data representativity in mind, as an addition to existing dataset documentation templates.  ( 2 min )
    A Case Study for Compliance as Code with Graphs and Language Models: Public release of the Regulatory Knowledge Graph. (arXiv:2302.01842v1 [cs.AI])
    The paper presents a study on using language models to automate the construction of executable Knowledge Graph (KG) for compliance. The paper focuses on Abu Dhabi Global Market regulations and taxonomy, involves manual tagging a portion of the regulations, training BERT-based models, which are then applied to the rest of the corpus. Coreference resolution and syntax analysis were used to parse the relationships between the tagged entities and to form KG stored in a Neo4j database. The paper states that the use of machine learning models released by regulators to automate the interpretation of rules is a vital step towards compliance automation, demonstrates the concept querying with Cypher, and states that the produced sub-graphs combined with Graph Neural Networks (GNN) will achieve expandability in judgment automation systems. The graph is open sourced on GitHub to provide structured data for future advancements in the field.  ( 2 min )
    Online Ad Allocation with Predictions. (arXiv:2302.01827v1 [cs.LG])
    Display Ads and the generalized assignment problem are two well-studied online packing problems with important applications in ad allocation and other areas. In both problems, ad impressions arrive online and have to be allocated immediately to budget-constrained advertisers. Worst-case algorithms that achieve the ideal competitive ratio are known, but might act overly conservative given the predictable and usually tame nature of real-world input. Given this discrepancy, we develop an algorithm for both problems that incorporate machine-learned predictions and can thus improve the performance beyond the worst-case. Our algorithm is based on the work of Feldman et al. (2009) and similar in nature to Mahdian et al. (2007) who were the first to develop a learning-augmented algorithm for the related, but more structured Ad Words problem. We use a novel analysis to show that our algorithm is able to capitalize on a good prediction, while being robust against poor predictions. We experimentally evaluate our algorithm on synthetic and real-world data on a wide range of predictions. Our algorithm is consistently outperforming the worst-case algorithm without predictions.  ( 2 min )
    From Robustness to Privacy and Back. (arXiv:2302.01855v1 [cs.LG])
    We study the relationship between two desiderata of algorithms in statistical inference and machine learning: differential privacy and robustness to adversarial data corruptions. Their conceptual similarity was first observed by Dwork and Lei (STOC 2009), who observed that private algorithms satisfy robustness, and gave a general method for converting robust algorithms to private ones. However, all general methods for transforming robust algorithms into private ones lead to suboptimal error rates. Our work gives the first black-box transformation that converts any adversarially robust algorithm into one that satisfies pure differential privacy. Moreover, we show that for any low-dimensional estimation task, applying our transformation to an optimal robust estimator results in an optimal private estimator. Thus, we conclude that for any low-dimensional task, the optimal error rate for $\varepsilon$-differentially private estimators is essentially the same as the optimal error rate for estimators that are robust to adversarially corrupting $1/\varepsilon$ training samples. We apply our transformation to obtain new optimal private estimators for several high-dimensional tasks, including Gaussian (sparse) linear regression and PCA. Finally, we present an extension of our transformation that leads to approximate differentially private algorithms whose error does not depend on the range of the output space, which is impossible under pure differential privacy.  ( 2 min )
    AdaptDiffuser: Diffusion Models as Adaptive Self-evolving Planners. (arXiv:2302.01877v1 [cs.LG])
    Diffusion models have demonstrated their powerful generative capability in many tasks, with great potential to serve as a paradigm for offline reinforcement learning. However, the quality of the diffusion model is limited by the insufficient diversity of training data, which hinders the performance of planning and the generalizability to new tasks. This paper introduces AdaptDiffuser, an evolutionary planning method with diffusion that can self-evolve to improve the diffusion model hence a better planner, not only for seen tasks but can also adapt to unseen tasks. AdaptDiffuser enables the generation of rich synthetic expert data for goal-conditioned tasks using guidance from reward gradients. It then selects high-quality data via a discriminator to finetune the diffusion model, which improves the generalization ability to unseen tasks. Empirical experiments on two benchmark environments and two carefully designed unseen tasks in KUKA industrial robot arm and Maze2D environments demonstrate the effectiveness of AdaptDiffuser. For example, AdaptDiffuser not only outperforms the previous art Diffuser by 20.8% on Maze2D and 7.5% on MuJoCo locomotion, but also adapts better to new tasks, e.g., KUKA pick-and-place, by 27.9% without requiring additional expert data.  ( 2 min )
    PINN Training using Biobjective Optimization: The Trade-off between Data Loss and Residual Loss. (arXiv:2302.01810v1 [cs.LG])
    Physics informed neural networks (PINNs) have proven to be an efficient tool to represent problems for which measured data are available and for which the dynamics in the data are expected to follow some physical laws. In this paper, we suggest a multiobjective perspective on the training of PINNs by treating the data loss and the residual loss as two individual objective functions in a truly biobjective optimization approach. As a showcase example, we consider COVID-19 predictions in Germany and built an extended susceptibles-infected-recovered (SIR) model with additionally considered leaky-vaccinated and hospitalized populations (SVIHR model) to model the transition rates and to predict future infections. SIR-type models are expressed by systems of ordinary differential equations (ODEs). We investigate the suitability of the generated PINN for COVID-19 predictions and compare the resulting predicted curves with those obtained by applying the method of non-standard finite differences to the system of ODEs and initial data. The approach is applicable to various systems of ODEs that define dynamical regimes. Those regimes do not need to be SIR-type models, and the corresponding underlying data sets do not have to be associated with COVID-19.  ( 2 min )
    Analyzing the impact of climate change on critical infrastructure from the scientific literature: A weakly supervised NLP approach. (arXiv:2302.01887v1 [cs.LG])
    Natural language processing (NLP) is a promising approach for analyzing large volumes of climate-change and infrastructure-related scientific literature. However, best-in-practice NLP techniques require large collections of relevant documents (corpus). Furthermore, NLP techniques using machine learning and deep learning techniques require labels grouping the articles based on user-defined criteria for a significant subset of a corpus in order to train the supervised model. Even labeling a few hundred documents with human subject-matter experts is a time-consuming process. To expedite this process, we developed a weak supervision-based NLP approach that leverages semantic similarity between categories and documents to (i) establish a topic-specific corpus by subsetting a large-scale open-access corpus and (ii) generate category labels for the topic-specific corpus. In comparison with a months-long process of subject-matter expert labeling, we assign category labels to the whole corpus using weak supervision and supervised learning in about 13 hours. The labeled climate and NCF corpus enable targeted, efficient identification of documents discussing a topic (or combination of topics) of interest and identification of various effects of climate change on critical infrastructure, improving the usability of scientific literature and ultimately supporting enhanced policy and decision making. To demonstrate this capability, we conduct topic modeling on pairs of climate hazards and NCFs to discover trending topics at the intersection of these categories. This method is useful for analysts and decision-makers to quickly grasp the relevant topics and most important documents linked to the topic.  ( 2 min )
    Target specific peptide design using latent space approximate trajectory collector. (arXiv:2302.01435v1 [cs.CE])
    Despite the prevalence and many successes of deep learning applications in de novo molecular design, the problem of peptide generation targeting specific proteins remains unsolved. A main barrier for this is the scarcity of the high-quality training data. To tackle the issue, we propose a novel machine learning based peptide design architecture, called Latent Space Approximate Trajectory Collector (LSATC). It consists of a series of samplers on an optimization trajectory on a highly non-convex energy landscape that approximates the distributions of peptides with desired properties in a latent space. The process involves little human intervention and can be implemented in an end-to-end manner. We demonstrate the model by the design of peptide extensions targeting Beta-catenin, a key nuclear effector protein involved in canonical Wnt signalling. When compared with a random sampler, LSATC can sample peptides with $36\%$ lower binding scores in a $16$ times smaller interquartile range (IQR) and $284\%$ less hydrophobicity with a $1.4$ times smaller IQR. LSATC also largely outperforms other common generative models. Finally, we utilized a clustering algorithm to select 4 peptides from the 100 LSATC designed peptides for experimental validation. The result confirms that all the four peptides extended by LSATC show improved Beta-catenin binding by at least $20.0\%$, and two of the peptides show a $3$ fold increase in binding affinity as compared to the base peptide.
    Machine Learning Extreme Acoustic Non-reciprocity in a Linear Waveguide with Multiple Nonlinear Asymmetric Gates. (arXiv:2302.01746v1 [eess.AS])
    This work is a study of acoustic non-reciprocity exhibited by a passive one-dimensional linear waveguide incorporating two local strongly nonlinear, asymmetric gates. Two local nonlinear gates break the symmetry and linearity of the waveguide, yielding strong global non-reciprocal acoustics, in the way that extremely different acoustical responses occur depending on the side of application of harmonic excitation. To the authors' best knowledge that the present two-gated waveguide is capable of extremely high acoustic non-reciprocity, at a much higher level to what is reported by active or passive devices in the current literature; moreover, this extreme performance combines with acceptable levels of transmissibility in the desired direction of wave propagation. Machine learning is utilized for predictive design of this gated waveguide in terms of the measures of transmissibility and non-reciprocity, with the aim of reducing the required computational time for high-dimensional parameter space analysis. The study sheds new light into the physics of these media and considers the advantages and limitations of using neural networks to analyze this type of physical problems. In the predicted desirable parameter space for intense non-reciprocity, the maximum transmissibility reaches as much as 40%, and the transmitted energy from upstream to downstream varies up to nine orders of magnitude, depending on the direction of wave transmission. The machine learning tools along with the numerical methods of this work can inform predictive designs of practical non-reciprocal waveguides and acoustic metamaterials that incorporate local nonlinear gates. The current paper shows that combinations of nonlinear gates can lead to extremely high non-reciprocity while maintaining desired levels of transmissibility.  ( 2 min )
    AIROGS: Artificial Intelligence for RObust Glaucoma Screening Challenge. (arXiv:2302.01738v1 [eess.IV])
    The early detection of glaucoma is essential in preventing visual impairment. Artificial intelligence (AI) can be used to analyze color fundus photographs (CFPs) in a cost-effective manner, making glaucoma screening more accessible. While AI models for glaucoma screening from CFPs have shown promising results in laboratory settings, their performance decreases significantly in real-world scenarios due to the presence of out-of-distribution and low-quality images. To address this issue, we propose the Artificial Intelligence for Robust Glaucoma Screening (AIROGS) challenge. This challenge includes a large dataset of around 113,000 images from about 60,000 patients and 500 different screening centers, and encourages the development of algorithms that are robust to ungradable and unexpected input data. We evaluated solutions from 14 teams in this paper, and found that the best teams performed similarly to a set of 20 expert ophthalmologists and optometrists. The highest-scoring team achieved an area under the receiver operating characteristic curve of 0.99 (95% CI: 0.98-0.99) for detecting ungradable images on-the-fly. Additionally, many of the algorithms showed robust performance when tested on three other publicly available datasets. These results demonstrate the feasibility of robust AI-enabled glaucoma screening.  ( 2 min )
    Transformers in Action Recognition: A Review on Temporal Modeling. (arXiv:2302.01921v1 [cs.CV])
    In vision-based action recognition, spatio-temporal features from different modalities are used for recognizing activities. Temporal modeling is a long challenge of action recognition. However, there are limited methods such as pre-computed motion features, three-dimensional (3D) filters, and recurrent neural networks (RNN) for modeling motion information in deep-based approaches. Recently, transformers success in modeling long-range dependencies in natural language processing (NLP) tasks has gotten great attention from other domains; including speech, image, and video, to rely entirely on self-attention without using sequence-aligned RNNs or convolutions. Although the application of transformers to action recognition is relatively new, the amount of research proposed on this topic within the last few years is astounding. This paper especially reviews recent progress in deep learning methods for modeling temporal variations. It focuses on action recognition methods that use transformers for temporal modeling, discussing their main features, used modalities, and identifying opportunities and challenges for future research.
    Leveraging weak complementary labels to improve semantic segmentation of hepatocellular carcinoma and cholangiocarcinoma in H&E-stained slides. (arXiv:2302.01813v1 [cs.CV])
    In this paper, we present a deep learning segmentation approach to classify and quantify the two most prevalent primary liver cancers - hepatocellular carcinoma and intrahepatic cholangiocarcinoma - from hematoxylin and eosin (H&E) stained whole slide images. While semantic segmentation of medical images typically requires costly pixel-level annotations by domain experts, there often exists additional information which is routinely obtained in clinical diagnostics but rarely utilized for model training. We propose to leverage such weak information from patient diagnoses by deriving complementary labels that indicate to which class a sample cannot belong to. To integrate these labels, we formulate a complementary loss for segmentation. Motivated by the medical application, we demonstrate for general segmentation tasks that including additional patches with solely weak complementary labels during model training can significantly improve the predictive performance and robustness of a model. On the task of diagnostic differentiation between hepatocellular carcinoma and intrahepatic cholangiocarcinoma, we achieve a balanced accuracy of 0.91 (CI 95%: 0.86 - 0.95) at case level for 165 hold-out patients. Furthermore, we also show that leveraging complementary labels improves the robustness of segmentation and increases performance at case level.  ( 2 min )
    Coinductive guide to inductive transformer heads. (arXiv:2302.01834v1 [cs.LG])
    We argue that all building blocks of transformer models can be expressed with a single concept: combinatorial Hopf algebra. Transformer learning emerges as a result of the subtle interplay between the algebraic and coalgebraic operations of the combinatorial Hopf algebra. Viewed through this lens, the transformer model becomes a linear time-invariant system where the attention mechanism computes a generalized convolution transform and the residual stream serves as a unit impulse. Attention-only transformers then learn by enforcing an invariant between these two paths. We call this invariant Hopf coherence. Due to this, with a degree of poetic license, one could call combinatorial Hopf algebras "tensors with a built-in loss function gradient". This loss function gradient occurs within the single layers and no backward pass is needed. This is in contrast to automatic differentiation which happens across the whole graph and needs a explicit backward pass. This property is the result of the fact that combinatorial Hopf algebras have the surprising property of calculating eigenvalues by repeated squaring.  ( 2 min )
    Learning finite difference methods for reaction-diffusion type equations with FCNN. (arXiv:2201.01854v2 [cs.LG] UPDATED)
    In recent years, Physics-informed neural networks (PINNs) have been widely used to solve partial differential equations alongside numerical methods because PINNs can be trained without observations and deal with continuous-time problems directly. In contrast, optimizing the parameters of such models is difficult, and individual training sessions must be performed to predict the evolutions of each different initial condition. To alleviate the first problem, observed data can be injected directly into the loss function part. To solve the second problem, a network architecture can be built as a framework to learn a finite difference method. In view of the two motivations, we propose Five-point stencil CNNs (FCNNs) containing a five-point stencil kernel and a trainable approximation function for reaction-diffusion type equations including the heat, Fisher's, Allen-Cahn, and other reaction-diffusion equations with trigonometric function terms. We show that FCNNs can learn finite difference schemes using few data and achieve the low relative errors of diverse reaction-diffusion evolutions with unseen initial conditions. Furthermore, we demonstrate that FCNNs can still be trained well even with using noisy data.
    From slides (through tiles) to pixels: an explainability framework for weakly supervised models in pre-clinical pathology. (arXiv:2302.01653v1 [cs.CV])
    In pre-clinical pathology, there is a paradox between the abundance of raw data (whole slide images from many organs of many individual animals) and the lack of pixel-level slide annotations done by pathologists. Due to time constraints and requirements from regulatory authorities, diagnoses are instead stored as slide labels. Weakly supervised training is designed to take advantage of those data, and the trained models can be used by pathologists to rank slides by their probability of containing a given lesion of interest. In this work, we propose a novel contextualized eXplainable AI (XAI) framework and its application to deep learning models trained on Whole Slide Images (WSIs) in Digital Pathology. Specifically, we apply our methods to a multi-instance-learning (MIL) model, which is trained solely on slide-level labels, without the need for pixel-level annotations. We validate quantitatively our methods by quantifying the agreements of our explanations' heatmaps with pathologists' annotations, as well as with predictions from a segmentation model trained on such annotations. We demonstrate the stability of the explanations with respect to input shifts, and the fidelity with respect to increased model performance. We quantitatively evaluate the correlation between available pixel-wise annotations and explainability heatmaps. We show that the explanations on important tiles of the whole slide correlate with tissue changes between healthy regions and lesions, but do not exactly behave like a human annotator. This result is coherent with the model training strategy.  ( 2 min )
    Fixing by Mixing: A Recipe for Optimal Byzantine ML under Heterogeneity. (arXiv:2302.01772v1 [cs.LG])
    Byzantine machine learning (ML) aims to ensure the resilience of distributed learning algorithms to misbehaving (or Byzantine) machines. Although this problem received significant attention, prior works often assume the data held by the machines to be homogeneous, which is seldom true in practical settings. Data heterogeneity makes Byzantine ML considerably more challenging, since a Byzantine machine can hardly be distinguished from a non-Byzantine outlier. A few solutions have been proposed to tackle this issue, but these provide suboptimal probabilistic guarantees and fare poorly in practice. This paper closes the theoretical gap, achieving optimality and inducing good empirical results. In fact, we show how to automatically adapt existing solutions for (homogeneous) Byzantine ML to the heterogeneous setting through a powerful mechanism, we call nearest neighbor mixing (NNM), which boosts any standard robust distributed gradient descent variant to yield optimal Byzantine resilience under heterogeneity. We obtain similar guarantees (in expectation) by plugging NNM in the distributed stochastic heavy ball method, a practical substitute to distributed gradient descent. We obtain empirical results that significantly outperform state-of-the-art Byzantine ML solutions.  ( 2 min )
    Avalanche: A PyTorch Library for Deep Continual Learning. (arXiv:2302.01766v1 [cs.LG])
    Continual learning is the problem of learning from a nonstationary stream of data, a fundamental issue for sustainable and efficient training of deep neural networks over time. Unfortunately, deep learning libraries only provide primitives for offline training, assuming that model's architecture and data are fixed. Avalanche is an open source library maintained by the ContinualAI non-profit organization that extends PyTorch by providing first-class support for dynamic architectures, streams of datasets, and incremental training and evaluation methods. Avalanche provides a large set of predefined benchmarks and training algorithms and it is easy to extend and modular while supporting a wide range of continual learning scenarios. Documentation is available at \url{https://avalanche.continualai.org}.  ( 2 min )
    Certified Robustness of Learning-based Static Malware Detectors. (arXiv:2302.01757v1 [cs.CR])
    Certified defenses are a recent development in adversarial machine learning (ML), which aim to rigorously guarantee the robustness of ML models to adversarial perturbations. A large body of work studies certified defenses in computer vision, where $\ell_p$ norm-bounded evasion attacks are adopted as a tractable threat model. However, this threat model has known limitations in vision, and is not applicable to other domains -- e.g., where inputs may be discrete or subject to complex constraints. Motivated by this gap, we study certified defenses for malware detection, a domain where attacks against ML-based systems are a real and current threat. We consider static malware detection systems that operate on byte-level data. Our certified defense is based on the approach of randomized smoothing which we adapt by: (1) replacing the standard Gaussian randomization scheme with a novel deletion randomization scheme that operates on bytes or chunks of an executable; and (2) deriving a certificate that measures robustness to evasion attacks in terms of generalized edit distance. To assess the size of robustness certificates that are achievable while maintaining high accuracy, we conduct experiments on malware datasets using a popular convolutional malware detection model, MalConv. We are able to accurately classify 91% of the inputs while being certifiably robust to any adversarial perturbations of edit distance 128 bytes or less. By comparison, an existing certification of up to 128 bytes of substitutions (without insertions or deletions) achieves an accuracy of 78%. In addition, given that robustness certificates are conservative, we evaluate practical robustness to several recently published evasion attacks and, in some cases, find robustness beyond certified guarantees.  ( 2 min )
    Mind the Gap: Offline Policy Optimization for Imperfect Rewards. (arXiv:2302.01667v1 [cs.LG])
    Reward function is essential in reinforcement learning (RL), serving as the guiding signal to incentivize agents to solve given tasks, however, is also notoriously difficult to design. In many cases, only imperfect rewards are available, which inflicts substantial performance loss for RL agents. In this study, we propose a unified offline policy optimization approach, \textit{RGM (Reward Gap Minimization)}, which can smartly handle diverse types of imperfect rewards. RGM is formulated as a bi-level optimization problem: the upper layer optimizes a reward correction term that performs visitation distribution matching w.r.t. some expert data; the lower layer solves a pessimistic RL problem with the corrected rewards. By exploiting the duality of the lower layer, we derive a tractable algorithm that enables sampled-based learning without any online interactions. Comprehensive experiments demonstrate that RGM achieves superior performance to existing methods under diverse settings of imperfect rewards. Further, RGM can effectively correct wrong or inconsistent rewards against expert preference and retrieve useful information from biased rewards.  ( 2 min )
    BackdoorBox: A Python Toolbox for Backdoor Learning. (arXiv:2302.01762v1 [cs.CR])
    Third-party resources ($e.g.$, samples, backbones, and pre-trained models) are usually involved in the training of deep neural networks (DNNs), which brings backdoor attacks as a new training-phase threat. In general, backdoor attackers intend to implant hidden backdoor in DNNs, so that the attacked DNNs behave normally on benign samples whereas their predictions will be maliciously changed to a pre-defined target label if hidden backdoors are activated by attacker-specified trigger patterns. To facilitate the research and development of more secure training schemes and defenses, we design an open-sourced Python toolbox that implements representative and advanced backdoor attacks and defenses under a unified and flexible framework. Our toolbox has four important and promising characteristics, including consistency, simplicity, flexibility, and co-development. It allows researchers and developers to easily implement and compare different methods on benchmark or their local datasets. This Python toolbox, namely \texttt{BackdoorBox}, is available at \url{https://github.com/THUYimingLi/BackdoorBox}.  ( 2 min )
    Creating Probabilistic Forecasts from Arbitrary Deterministic Forecasts using Conditional Invertible Neural Networks. (arXiv:2302.01800v1 [cs.LG])
    In various applications, probabilistic forecasts are required to quantify the inherent uncertainty associated with the forecast. However, numerous modern forecasting methods are still designed to create deterministic forecasts. Transforming these deterministic forecasts into probabilistic forecasts is often challenging and based on numerous assumptions that may not hold in real-world situations. Therefore, the present article proposes a novel approach for creating probabilistic forecasts from arbitrary deterministic forecasts. In order to implement this approach, we use a conditional Invertible Neural Network (cINN). More specifically, we apply a cINN to learn the underlying distribution of the data and then combine the uncertainty from this distribution with an arbitrary deterministic forecast to generate accurate probabilistic forecasts. Our approach enables the simple creation of probabilistic forecasts without complicated statistical loss functions or further assumptions. Besides showing the mathematical validity of our approach, we empirically show that our approach noticeably outperforms traditional methods for including uncertainty in deterministic forecasts and generally outperforms state-of-the-art probabilistic forecasting benchmarks.  ( 2 min )
    Leveraging a Probabilistic PCA Model to Understand the Multivariate Statistical Network Monitoring Framework for Network Security Anomaly Detection. (arXiv:2302.01759v1 [stat.ML])
    Network anomaly detection is a very relevant research area nowadays, especially due to its multiple applications in the field of network security. The boost of new models based on variational autoencoders and generative adversarial networks has motivated a reevaluation of traditional techniques for anomaly detection. It is, however, essential to be able to understand these new models from the perspective of the experience attained from years of evaluating network security data for anomaly detection. In this paper, we revisit anomaly detection techniques based on PCA from a probabilistic generative model point of view, and contribute a mathematical model that relates them. Specifically, we start with the probabilistic PCA model and explain its connection to the Multivariate Statistical Network Monitoring (MSNM) framework. MSNM was recently successfully proposed as a means of incorporating industrial process anomaly detection experience into the field of networking. We have evaluated the mathematical model using two different datasets. The first, a synthetic dataset created to better understand the analysis proposed, and the second, UGR'16, is a specifically designed real-traffic dataset for network security anomaly detection. We have drawn conclusions that we consider to be useful when applying generative models to network security detection.
    DEUP: Direct Epistemic Uncertainty Prediction. (arXiv:2102.08501v4 [cs.LG] UPDATED)
    Epistemic Uncertainty is a measure of the lack of knowledge of a learner which diminishes with more evidence. While existing work focuses on using the variance of the Bayesian posterior due to parameter uncertainty as a measure of epistemic uncertainty, we argue that this does not capture the part of lack of knowledge induced by model misspecification. We discuss how the excess risk, which is the gap between the generalization error of a predictor and the Bayes predictor, is a sound measure of epistemic uncertainty which captures the effect of model misspecification. We thus propose a principled framework for directly estimating the excess risk by learning a secondary predictor for the generalization error and subtracting an estimate of aleatoric uncertainty, i.e., intrinsic unpredictability. We discuss the merits of this novel measure of epistemic uncertainty, and highlight how it differs from variance-based measures of epistemic uncertainty and addresses its major pitfall. Our framework, Direct Epistemic Uncertainty Prediction (DEUP) is particularly interesting in interactive learning environments, where the learner is allowed to acquire novel examples in each round. Through a wide set of experiments, we illustrate how existing methods in sequential model optimization can be improved with epistemic uncertainty estimates from DEUP, and how DEUP can be used to drive exploration in reinforcement learning. We also evaluate the quality of uncertainty estimates from DEUP for probabilistic image classification and predicting synergies of drug combinations.
    Learning a Fourier Transform for Linear Relative Positional Encodings in Transformers. (arXiv:2302.01925v1 [cs.LG])
    We propose a new class of linear Transformers called FourierLearner-Transformers (FLTs), which incorporate a wide range of relative positional encoding mechanisms (RPEs). These include regular RPE techniques applied for nongeometric data, as well as novel RPEs operating on the sequences of tokens embedded in higher-dimensional Euclidean spaces (e.g. point clouds). FLTs construct the optimal RPE mechanism implicitly by learning its spectral representation. As opposed to other architectures combining efficient low-rank linear attention with RPEs, FLTs remain practical in terms of their memory usage and do not require additional assumptions about the structure of the RPE-mask. FLTs allow also for applying certain structural inductive bias techniques to specify masking strategies, e.g. they provide a way to learn the so-called local RPEs introduced in this paper and providing accuracy gains as compared with several other linear Transformers for language modeling. We also thoroughly tested FLTs on other data modalities and tasks, such as: image classification and 3D molecular modeling. For 3D-data FLTs are, to the best of our knowledge, the first Transformers architectures providing RPE-enhanced linear attention.
    Show me your NFT and I tell you how it will perform: Multimodal representation learning for NFT selling price prediction. (arXiv:2302.01676v1 [cs.LG])
    Non-Fungible Tokens (NFTs) represent deeds of ownership, based on blockchain technologies and smart contracts, of unique crypto assets on digital art forms (e.g., artworks or collectibles). In the spotlight after skyrocketing in 2021, NFTs have attracted the attention of crypto enthusiasts and investors intent on placing promising investments in this profitable market. However, the NFT financial performance prediction has not been widely explored to date. In this work, we address the above problem based on the hypothesis that NFT images and their textual descriptions are essential proxies to predict the NFT selling prices. To this purpose, we propose MERLIN, a novel multimodal deep learning framework designed to train Transformer-based language and visual models, along with graph neural network models, on collections of NFTs' images and texts. A key aspect in MERLIN is its independence on financial features, as it exploits only the primary data a user interested in NFT trading would like to deal with, i.e., NFT images and textual descriptions. By learning dense representations of such data, a price-category classification task is performed by MERLIN models, which can also be tuned according to user preferences in the inference phase to mimic different risk-return investment profiles. Experimental evaluation on a publicly available dataset has shown that MERLIN models achieve significant performances according to several financial assessment criteria, fostering profitable investments, and also beating baseline machine-learning classifiers based on financial features.  ( 2 min )
    Using Explainability to Inform Statistical Downscaling Based on Deep Learning Beyond Standard Validation Approaches. (arXiv:2302.01771v1 [stat.ML])
    Deep learning (DL) has emerged as a promising tool to downscale climate projections at regional-to-local scales from large-scale atmospheric fields following the perfect-prognosis (PP) approach. Given their complexity, it is crucial to properly evaluate these methods, especially when applied to changing climatic conditions where the ability to extrapolate/generalise is key. In this work, we intercompare several DL models extracted from the literature for the same challenging use-case (downscaling temperature in the CORDEX North America domain) and expand standard evaluation methods building on eXplainable artifical intelligence (XAI) techniques. We show how these techniques can be used to unravel the internal behaviour of these models, providing new evaluation dimensions and aiding in their diagnostic and design. These results show the usefulness of incorporating XAI techniques into statistical downscaling evaluation frameworks, especially when working with large regions and/or under climate change conditions.
    Stochastic Policy Gradient Methods: Improved Sample Complexity for Fisher-non-degenerate Policies. (arXiv:2302.01734v1 [cs.LG])
    Recently, the impressive empirical success of policy gradient (PG) methods has catalyzed the development of their theoretical foundations. Despite the huge efforts directed at the design of efficient stochastic PG-type algorithms, the understanding of their convergence to a globally optimal policy is still limited. In this work, we develop improved global convergence guarantees for a general class of Fisher-non-degenerate parameterized policies which allows to address the case of continuous state action spaces. First, we propose a Normalized Policy Gradient method with Implicit Gradient Transport (N-PG-IGT) and derive a $\tilde{\mathcal{O}}(\varepsilon^{-2.5})$ sample complexity of this method for finding a global $\varepsilon$-optimal policy. Improving over the previously known $\tilde{\mathcal{O}}(\varepsilon^{-3})$ complexity, this algorithm does not require the use of importance sampling or second-order information and samples only one trajectory per iteration. Second, we further improve this complexity to $\tilde{ \mathcal{\mathcal{O}} }(\varepsilon^{-2})$ by considering a Hessian-Aided Recursive Policy Gradient ((N)-HARPG) algorithm enhanced with a correction based on a Hessian-vector product. Interestingly, both algorithms are $(i)$ simple and easy to implement: single-loop, do not require large batches of trajectories and sample at most two trajectories per iteration; $(ii)$ computationally and memory efficient: they do not require expensive subroutines at each iteration and can be implemented with memory linear in the dimension of parameters.
    Reinforcing User Retention in a Billion Scale Short Video Recommender System. (arXiv:2302.01724v1 [cs.LG])
    Recently, short video platforms have achieved rapid user growth by recommending interesting content to users. The objective of the recommendation is to optimize user retention, thereby driving the growth of DAU (Daily Active Users). Retention is a long-term feedback after multiple interactions of users and the system, and it is hard to decompose retention reward to each item or a list of items. Thus traditional point-wise and list-wise models are not able to optimize retention. In this paper, we choose reinforcement learning methods to optimize the retention as they are designed to maximize the long-term performance. We formulate the problem as an infinite-horizon request-based Markov Decision Process, and our objective is to minimize the accumulated time interval of multiple sessions, which is equal to improving the app open frequency and user retention. However, current reinforcement learning algorithms can not be directly applied in this setting due to uncertainty, bias, and long delay time incurred by the properties of user retention. We propose a novel method, dubbed RLUR, to address the aforementioned challenges. Both offline and live experiments show that RLUR can significantly improve user retention. RLUR has been fully launched in Kuaishou app for a long time, and achieves consistent performance improvement on user retention and DAU.  ( 2 min )
    Distributional constrained reinforcement learning for supply chain optimization. (arXiv:2302.01727v1 [cs.LG])
    This work studies reinforcement learning (RL) in the context of multi-period supply chains subject to constraints, e.g., on production and inventory. We introduce Distributional Constrained Policy Optimization (DCPO), a novel approach for reliable constraint satisfaction in RL. Our approach is based on Constrained Policy Optimization (CPO), which is subject to approximation errors that in practice lead it to converge to infeasible policies. We address this issue by incorporating aspects of distributional RL into DCPO. Specifically, we represent the return and cost value functions using neural networks that output discrete distributions, and we reshape costs based on the associated confidence. Using a supply chain case study, we show that DCPO improves the rate at which the RL policy converges and ensures reliable constraint satisfaction by the end of training. The proposed method also improves predictability, greatly reducing the variance of returns between runs, respectively; this result is significant in the context of policy gradient methods, which intrinsically introduce significant variance during training.  ( 2 min )
    Interpretations of Domain Adaptations via Layer Variational Analysis. (arXiv:2302.01798v1 [cs.LG])
    Transfer learning is known to perform efficiently in many applications empirically, yet limited literature reports the mechanism behind the scene. This study establishes both formal derivations and heuristic analysis to formulate the theory of transfer learning in deep learning. Our framework utilizing layer variational analysis proves that the success of transfer learning can be guaranteed with corresponding data conditions. Moreover, our theoretical calculation yields intuitive interpretations towards the knowledge transfer process. Subsequently, an alternative method for network-based transfer learning is derived. The method shows an increase in efficiency and accuracy for domain adaptation. It is particularly advantageous when new domain data is sufficiently sparse during adaptation. Numerical experiments over diverse tasks validated our theory and verified that our analytic expression achieved better performance in domain adaptation than the gradient descent method.  ( 2 min )
    Command Line Interface Risk Modeling. (arXiv:2302.01749v1 [cs.CR])
    Protecting sensitive data is an essential part of security in cloud computing. However, only specific privileged individuals have access to view or interact with this data; therefore, it is unscalable to depend on these individuals also to maintain the software. A solution to this is to allow non-privileged individuals access to maintain these systems but mask sensitive information from egressing. To this end, we have created a machine-learning model to predict and redact fields with sensitive data. This work concentrates on Azure PowerShell, showing how it applies to other command-line interfaces and APIs. Using the F5-score as a weighted metric, we demonstrate different transformation techniques to map this problem from an unknown field to the well-researched area of natural language processing.  ( 2 min )
    Motion ID: Human Authentication Approach. (arXiv:2302.01751v1 [cs.CR])
    We introduce a novel approach to user authentication called Motion ID. The method employs motion sensing provided by inertial measurement units (IMUs), using it to verify the persons identity via short time series of IMU data captured by the mobile device. The paper presents two labeled datasets with unlock events: the first features IMU measurements, provided by six users who continuously collected data on six different smartphones for a period of 12 weeks. The second one contains 50 hours of IMU data for one specific motion pattern, provided by 101 users. Moreover, we present a two-stage user authentication process that employs motion pattern identification and user verification and is based on data preprocessing and machine learning. The Results section details the assessment of the method proposed, comparing it with existing biometric authentication methods and the Android biometric standard. The method has demonstrated high accuracy, indicating that it could be successfully used in combination with existing methods. Furthermore, the method exhibits significant promise as a standalone solution. We provide the datasets to the scholarly community and share our project code.  ( 2 min )
    Leveraging Contaminated Datasets to Learn Clean-Data Distribution with Purified Generative Adversarial Networks. (arXiv:2302.01722v1 [cs.LG])
    Generative adversarial networks (GANs) are known for their strong abilities on capturing the underlying distribution of training instances. Since the seminal work of GAN, many variants of GAN have been proposed. However, existing GANs are almost established on the assumption that the training dataset is clean. But in many real-world applications, this may not hold, that is, the training dataset may be contaminated by a proportion of undesired instances. When training on such datasets, existing GANs will learn a mixture distribution of desired and contaminated instances, rather than the desired distribution of desired data only (target distribution). To learn the target distribution from contaminated datasets, two purified generative adversarial networks (PuriGAN) are developed, in which the discriminators are augmented with the capability to distinguish between target and contaminated instances by leveraging an extra dataset solely composed of contamination instances. We prove that under some mild conditions, the proposed PuriGANs are guaranteed to converge to the distribution of desired instances. Experimental results on several datasets demonstrate that the proposed PuriGANs are able to generate much better images from the desired distribution than comparable baselines when trained on contaminated datasets. In addition, we also demonstrate the usefulness of PuriGAN on downstream applications by applying it to the tasks of semi-supervised anomaly detection on contaminated datasets and PU-learning. Experimental results show that PuriGAN is able to deliver the best performance over comparable baselines on both tasks.  ( 2 min )
    Better Training of GFlowNets with Local Credit and Incomplete Trajectories. (arXiv:2302.01687v1 [cs.LG])
    Generative Flow Networks or GFlowNets are related to Monte-Carlo Markov chain methods (as they sample from a distribution specified by an energy function), reinforcement learning (as they learn a policy to sample composed objects through a sequence of steps), generative models (as they learn to represent and sample from a distribution) and amortized variational methods (as they can be used to learn to approximate and sample from an otherwise intractable posterior, given a prior and a likelihood). They are trained to generate an object $x$ through a sequence of steps with probability proportional to some reward function $R(x)$ (or $\exp(-\mathcal{E}(x))$ with $\mathcal{E}(x)$ denoting the energy function), given at the end of the generative trajectory. Like for other RL settings where the reward is only given at the end, the efficiency of training and credit assignment may suffer when those trajectories are longer. With previous GFlowNet work, no learning was possible from incomplete trajectories (lacking a terminal state and the computation of the associated reward). In this paper, we consider the case where the energy function can be applied not just to terminal states but also to intermediate states. This is for example achieved when the energy function is additive, with terms available along the trajectory. We show how to reparameterize the GFlowNet state flow function to take advantage of the partial reward already accrued at each state. This enables a training objective that can be applied to update parameters even with incomplete trajectories. Even when complete trajectories are available, being able to obtain more localized credit and gradients is found to speed up training convergence, as demonstrated across many simulations.  ( 2 min )
    Learning End-to-End Channel Coding with Diffusion Models. (arXiv:2302.01714v1 [cs.IT])
    It is a known problem that deep-learning-based end-to-end (E2E) channel coding systems depend on a known and differentiable channel model, due to the learning process and based on the gradient-descent optimization methods. This places the challenge to approximate or generate the channel or its derivative from samples generated by pilot signaling in real-world scenarios. Currently, there are two prevalent methods to solve this problem. One is to generate the channel via a generative adversarial network (GAN), and the other is to, in essence, approximate the gradient via reinforcement learning methods. Other methods include using score-based methods, variational autoencoders, or mutual-information-based methods. In this paper, we focus on generative models and, in particular, on a new promising method called diffusion models, which have shown a higher quality of generation in image-based tasks. We will show that diffusion models can be used in wireless E2E scenarios and that they work as good as Wasserstein GANs while having a more stable training procedure and a better generalization ability in testing.  ( 2 min )
    Where and How to Improve Graph-based Spatio-temporal Predictors. (arXiv:2302.01701v1 [stat.ML])
    This paper introduces a novel residual correlation analysis, called AZ-analysis, to assess the optimality of spatio-temporal predictive models. The proposed AZ-analysis constitutes a valuable asset for discovering and highlighting those space-time regions where the model can be improved with respect to performance. The AZ-analysis operates under very mild assumptions and is based on a spatio-temporal graph that encodes serial and functional dependencies in the data; asymptotically distribution-free summary statistics identify existing residual correlation in space and time regions, hence localizing time frames and/or communities of sensors, where the predictor can be improved.  ( 2 min )
    Rethinking Semi-Supervised Medical Image Segmentation: A Variance-Reduction Perspective. (arXiv:2302.01735v1 [cs.CV])
    For medical image segmentation, contrastive learning is the dominant practice to improve the quality of visual representations by contrasting semantically similar and dissimilar pairs of samples. This is enabled by the observation that without accessing ground truth label, negative examples with truly dissimilar anatomical features, if sampled, can significantly improve the performance. In reality, however, these samples may come from similar anatomical features and the models may struggle to distinguish the minority tail-class samples, making the tail classes more prone to misclassification, both of which typically lead to model collapse. In this paper, we propose ARCO, a semi-supervised contrastive learning (CL) framework with stratified group sampling theory in medical image segmentation. In particular, we first propose building ARCO through the concept of variance-reduced estimation, and show that certain variance-reduction techniques are particularly beneficial in medical image segmentation tasks with extremely limited labels. Furthermore, we theoretically prove these sampling techniques are universal in variance reduction. Finally, we experimentally validate our approaches on three benchmark datasets with different label settings, and our methods consistently outperform state-of-the-art semi- and fully-supervised methods. Additionally, we augment the CL frameworks with these sampling techniques and demonstrate significant gains over previous methods. We believe our work is an important step towards semi-supervised medical image segmentation by quantifying the limitation of current self-supervision objectives for accomplishing medical image analysis tasks.  ( 2 min )
    A Systematic Evaluation of Backdoor Trigger Characteristics in Image Classification. (arXiv:2302.01740v1 [cs.CV])
    Deep learning achieves outstanding results in many machine learning tasks. Nevertheless, it is vulnerable to backdoor attacks that modify the training set to embed a secret functionality in the trained model. The modified training samples have a secret property, i.e., a trigger. At inference time, the secret functionality is activated when the input contains the trigger, while the model functions correctly in other cases. While there are many known backdoor attacks (and defenses), deploying a stealthy attack is still far from trivial. Successfully creating backdoor triggers heavily depends on numerous parameters. Unfortunately, research has not yet determined which parameters contribute most to the attack performance. This paper systematically analyzes the most relevant parameters for the backdoor attacks, i.e., trigger size, position, color, and poisoning rate. Using transfer learning, which is very common in computer vision, we evaluate the attack on numerous state-of-the-art models (ResNet, VGG, AlexNet, and GoogLeNet) and datasets (MNIST, CIFAR10, and TinyImageNet). Our attacks cover the majority of backdoor settings in research, providing concrete directions for future works. Our code is publicly available to facilitate the reproducibility of our results.  ( 2 min )
    Improving the Timing Resolution of Positron Emission Tomography Detectors using Boosted Learning -- A Residual Physics Approach. (arXiv:2302.01681v1 [cs.LG])
    Artificial intelligence is finding its way into medical imaging, usually focusing on image reconstruction or enhancing analytical reconstructed images. However, optimizations along the complete processing chain, from detecting signals to computing data, enable significant improvements. Thus, we present an approach toward detector optimization using boosted learning by exploiting the concept of residual physics. In our work, we improve the coincidence time resolution (CTR) of positron emission tomography (PET) detectors. PET enables imaging of metabolic processes by detecting {\gamma}-photons with scintillation detectors. Current research exploits light-sharing detectors, where the scintillation light is distributed over and digitized by an array of readout channels. While these detectors demonstrate excellent performance parameters, e.g., regarding spatial resolution, extracting precise timing information for time-of-flight (TOF) becomes more challenging due to deteriorating effects called time skews. Conventional correction methods mainly rely on analytical formulations, theoretically capable of covering all time skew effects, e.g., caused by signal runtimes or physical effects. However, additional effects are involved for light-sharing detectors, so finding suitable analytical formulations can become arbitrarily complicated. The residual physics-based strategy uses gradient tree boosting (GTB) and a physics-informed data generation mimicking an actual imaging process by shifting a radiation source. We used clinically relevant detectors with a height of 19 mm, coupled to digital photosensor arrays. All trained models improved the CTR significantly. Using the best model, we achieved CTRs down to 198 ps (185 ps) for energies ranging from 300 keV to 700 keV (450 keV to 550 keV).  ( 2 min )
    Structure-informed Language Models Are Protein Designers. (arXiv:2302.01649v1 [cs.LG])
    This paper demonstrates that language models are strong structure-based protein designers. We present LM-Design, a generic approach to reprogramming sequence-based protein language models (pLMs), that have learned massive sequential evolutionary knowledge from the universe of natural protein sequences, to acquire an immediate capability to design preferable protein sequences for given folds. We conduct a structural surgery on pLMs, where a lightweight structural adapter is implanted into pLMs and endows it with structural awareness. During inference, iterative refinement is performed to effectively optimize the generated protein sequences. Experiments show that our approach outperforms the state-of-the-art methods by a large margin, leading to up to 4% to 12% accuracy gains in sequence recovery (e.g., 55.65% and 56.63% on CATH 4.2 and 4.3 single-chain benchmarks, and >60% when designing protein complexes). We provide extensive and in-depth analyses, which verify that LM-Design can (1) indeed leverage both structural and sequential knowledge to accurately handle structurally non-deterministic regions, (2) benefit from scaling data and model size, and (3) generalize to other proteins (e.g., antibodies and de novo proteins)  ( 2 min )
    Two-Stage Constrained Actor-Critic fo Short Video Recommendation. (arXiv:2302.01680v1 [cs.LG])
    The wide popularity of short videos on social media poses new opportunities and challenges to optimize recommender systems on the video-sharing platforms. Users sequentially interact with the system and provide complex and multi-faceted responses, including watch time and various types of interactions with multiple videos. One the one hand, the platforms aims at optimizing the users' cumulative watch time (main goal) in long term, which can be effectively optimized by Reinforcement Learning. On the other hand, the platforms also needs to satisfy the constraint of accommodating the responses of multiple user interactions (auxiliary goals) such like, follow, share etc. In this paper, we formulate the problem of short video recommendation as a Constrained Markov Decision Process (CMDP). We find that traditional constrained reinforcement learning algorithms can not work well in this setting. We propose a novel two-stage constrained actor-critic method: At stage one, we learn individual policies to optimize each auxiliary signal. At stage two, we learn a policy to (i) optimize the main signal and (ii) stay close to policies learned at the first stage, which effectively guarantees the performance of this main policy on the auxiliaries. Through extensive offline evaluations, we demonstrate effectiveness of our method over alternatives in both optimizing the main goal as well as balancing the others. We further show the advantage of our method in live experiments of short video recommendations, where it significantly outperforms other baselines in terms of both watch time and interactions. Our approach has been fully launched in the production system to optimize user experiences on the platform.  ( 2 min )
    GTV: Generating Tabular Data via Vertical Federated Learning. (arXiv:2302.01706v1 [cs.LG])
    Generative Adversarial Networks (GANs) have achieved state-of-the-art results in tabular data synthesis, under the presumption of direct accessible training data. Vertical Federated Learning (VFL) is a paradigm which allows to distributedly train machine learning model with clients possessing unique features pertaining to the same individuals, where the tabular data learning is the primary use case. However, it is unknown if tabular GANs can be learned in VFL. Demand for secure data transfer among clients and GAN during training and data synthesizing poses extra challenge. Conditional vector for tabular GANs is a valuable tool to control specific features of generated data. But it contains sensitive information from real data - risking privacy guarantees. In this paper, we propose GTV, a VFL framework for tabular GANs, whose key components are generator, discriminator and the conditional vector. GTV proposes an unique distributed training architecture for generator and discriminator to access training data in a privacy-preserving manner. To accommodate conditional vector into training without privacy leakage, GTV designs a mechanism training-with-shuffling to ensure that no party can reconstruct training data with conditional vector. We evaluate the effectiveness of GTV in terms of synthetic data quality, and overall training scalability. Results show that GTV can consistently generate high-fidelity synthetic tabular data of comparable quality to that generated by centralized GAN algorithm. The difference on machine learning utility can be as low as to 2.7%, even under extremely imbalanced data distributions across clients and different number of clients.  ( 2 min )
    Revisiting Personalized Federated Learning: Robustness Against Backdoor Attacks. (arXiv:2302.01677v1 [cs.LG])
    In this work, besides improving prediction accuracy, we study whether personalization could bring robustness benefits to backdoor attacks. We conduct the first study of backdoor attacks in the pFL framework, testing 4 widely used backdoor attacks against 6 pFL methods on benchmark datasets FEMNIST and CIFAR-10, a total of 600 experiments. The study shows that pFL methods with partial model-sharing can significantly boost robustness against backdoor attacks. In contrast, pFL methods with full model-sharing do not show robustness. To analyze the reasons for varying robustness performances, we provide comprehensive ablation studies on different pFL methods. Based on our findings, we further propose a lightweight defense method, Simple-Tuning, which empirically improves defense performance against backdoor attacks. We believe that our work could provide both guidance for pFL application in terms of its robustness and offer valuable insights to design more robust FL methods in the future.  ( 2 min )
    Private, fair and accurate: Training large-scale, privacy-preserving AI models in radiology. (arXiv:2302.01622v1 [eess.IV])
    Artificial intelligence (AI) models are increasingly used in the medical domain. However, as medical data is highly sensitive, special precautions to ensure the protection of said data are required. The gold standard for privacy preservation is the introduction of differential privacy (DP) to model training. However, prior work has shown that DP has negative implications on model accuracy and fairness. Therefore, the purpose of this study is to demonstrate that the privacy-preserving training of AI models for chest radiograph diagnosis is possible with high accuracy and fairness compared to non-private training. N=193,311 high quality clinical chest radiographs were retrospectively collected and manually labeled by experienced radiologists, who assigned one or more of the following diagnoses: cardiomegaly, congestion, pleural effusion, pneumonic infiltration and atelectasis, to each side (where applicable). The non-private AI models were compared with privacy-preserving (DP) models with respect to privacy-utility trade-offs (measured as area under the receiver-operator-characteristic curve (AUROC)), and privacy-fairness trade-offs (measured as Pearson-R or Statistical Parity Difference). The non-private AI model achieved an average AUROC score of 0.90 over all labels, whereas the DP AI model with a privacy budget of epsilon=7.89 resulted in an AUROC of 0.87, i.e., a mere 2.6% performance decrease compared to non-private training. The privacy-preserving training of diagnostic AI models can achieve high performance with a small penalty on model accuracy and does not amplify discrimination against age, sex or co-morbidity. We thus encourage practitioners to integrate state-of-the-art privacy-preserving techniques into medical AI model development.  ( 2 min )
    A Feature Selection Method for Driver Stress Detection Using Heart Rate Variability and Breathing Rate. (arXiv:2302.01602v1 [cs.LG])
    Driver stress is a major cause of car accidents and death worldwide. Furthermore, persistent stress is a health problem, contributing to hypertension and other diseases of the cardiovascular system. Stress has a measurable impact on heart and breathing rates and stress levels can be inferred from such measurements. Galvanic skin response is a common test to measure the perspiration caused by both physiological and psychological stress, as well as extreme emotions. In this paper, galvanic skin response is used to estimate the ground truth stress levels. A feature selection technique based on the minimal redundancy-maximal relevance method is then applied to multiple heart rate variability and breathing rate metrics to identify a novel and optimal combination for use in detecting stress. The support vector machine algorithm with a radial basis function kernel was used along with these features to reliably predict stress. The proposed method has achieved a high level of accuracy on the target dataset.  ( 2 min )
    Convergence Analysis of Split Learning on Non-IID Data. (arXiv:2302.01633v1 [cs.LG])
    Split Learning (SL) is one promising variant of Federated Learning (FL), where the AI model is split and trained at the clients and the server collaboratively. By offloading the computation-intensive portions to the server, SL enables efficient model training on resource-constrained clients. Despite its booming applications, SL still lacks rigorous convergence analysis on non-IID data, which is critical for hyperparameter selection. In this paper, we first prove that SL exhibits an $\mathcal{O}(1/\sqrt{R})$ convergence rate for non-convex objectives on non-IID data, where $R$ is the number of total training rounds. The derived convergence results can facilitate understanding the effect of some crucial factors in SL (e.g., data heterogeneity and synchronization interval). Furthermore, comparing with the convergence result of FL, we show that the guarantee of SL is worse than FL in terms of training rounds on non-IID data. The experimental results verify our theory. More findings on the comparison between FL and SL in cross-device settings are also reported.  ( 2 min )
    Beyond the Universal Law of Robustness: Sharper Laws for Random Features and Neural Tangent Kernels. (arXiv:2302.01629v1 [stat.ML])
    Machine learning models are vulnerable to adversarial perturbations, and a thought-provoking paper by Bubeck and Sellke has analyzed this phenomenon through the lens of over-parameterization: interpolating smoothly the data requires significantly more parameters than simply memorizing it. However, this "universal" law provides only a necessary condition for robustness, and it is unable to discriminate between models. In this paper, we address these gaps by focusing on empirical risk minimization in two prototypical settings, namely, random features and the neural tangent kernel (NTK). We prove that, for random features, the model is not robust for any degree of over-parameterization, even when the necessary condition coming from the universal law of robustness is satisfied. In contrast, for even activations, the NTK model meets the universal lower bound, and it is robust as soon as the necessary condition on over-parameterization is fulfilled. This also addresses a conjecture in prior work by Bubeck, Li and Nagaraj. Our analysis decouples the effect of the kernel of the model from an "interaction matrix", which describes the interaction with the test data and captures the effect of the activation. Our theoretical results are corroborated by numerical evidence on both synthetic and standard datasets (MNIST, CIFAR-10).  ( 2 min )
    SCCAM: Supervised Contrastive Convolutional Attention Mechanism for Ante-hoc Interpretable Fault Diagnosis with Limited Fault Samples. (arXiv:2302.01599v1 [cs.LG])
    In real industrial processes, fault diagnosis methods are required to learn from limited fault samples since the procedures are mainly under normal conditions and the faults rarely occur. Although attention mechanisms have become popular in the field of fault diagnosis, the existing attention-based methods are still unsatisfying for the above practical applications. First, pure attention-based architectures like transformers need a large number of fault samples to offset the lack of inductive biases thus performing poorly under limited fault samples. Moreover, the poor fault classification dilemma further leads to the failure of the existing attention-based methods to identify the root causes. To address the aforementioned issues, we innovatively propose a supervised contrastive convolutional attention mechanism (SCCAM) with ante-hoc interpretability, which solves the root cause analysis problem under limited fault samples for the first time. The proposed SCCAM method is tested on a continuous stirred tank heater and the Tennessee Eastman industrial process benchmark. Three common fault diagnosis scenarios are covered, including a balanced scenario for additional verification and two scenarios with limited fault samples (i.e., imbalanced scenario and long-tail scenario). The comprehensive results demonstrate that the proposed SCCAM method can achieve better performance compared with the state-of-the-art methods on fault classification and root cause analysis.  ( 2 min )
    A Novel Fuzzy Bi-Clustering Algorithm with AFS for Identification of Co-Regulated Genes. (arXiv:2302.01596v1 [cs.LG])
    The identification of co-regulated genes and their transcription-factor binding sites (TFBS) are the key steps toward understanding transcription regulation. In addition to effective laboratory assays, various bi-clustering algorithms for detection of the co-expressed genes have been developed. Bi-clustering methods are used to discover subgroups of genes with similar expression patterns under to-be-identified subsets of experimental conditions when applied to gene expression data. By building two fuzzy partition matrices of the gene expression data with the Axiomatic Fuzzy Set (AFS) theory, this paper proposes a novel fuzzy bi-clustering algorithm for identification of co-regulated genes. Specifically, the gene expression data is transformed into two fuzzy partition matrices via sub-preference relations theory of AFS at first. One of the matrices is considering the genes as the universe and the conditions as the concept, the other one is considering the genes as the concept and the conditions as the universe. The identification of the co-regulated genes (bi-clusters) is carried out on the two partition matrices at the same time. Then, a novel fuzzy-based similarity criterion is defined based on the partition matrixes, and a cyclic optimization algorithm is designed to discover the significant bi-clusters at expression level. The above procedures guarantee that the generated bi-clusters have more significant expression values than that of extracted by the traditional bi-clustering methods. Finally, the performance of the proposed method is evaluated with the performance of the three well-known bi-clustering algorithms on publicly available real microarray datasets. The experimental results are in agreement with the theoretical analysis and show that the proposed algorithm can effectively detect the co-regulated genes without any prior knowledge of the gene expression data.  ( 2 min )
    Learning to Decouple Complex Systems. (arXiv:2302.01581v1 [cs.LG])
    A complex system with cluttered observations may be a coupled mixture of multiple simple sub-systems corresponding to latent entities. Such sub-systems may hold distinct dynamics in the continuous-time domain; therein, complicated interactions between sub-systems also evolve over time. This setting is fairly common in the real world but has been less considered. In this paper, we propose a sequential learning approach under this setting by decoupling a complex system for handling irregularly sampled and cluttered sequential observations. Such decoupling brings about not only subsystems describing the dynamics of each latent entity but also a meta-system capturing the interaction between entities over time. Specifically, we argue that the meta-system evolving within a simplex is governed by projected differential equations (ProjDEs). We further analyze and provide neural-friendly projection operators in the context of Bregman divergence. Experimental results on synthetic and real-world datasets show the advantages of our approach when facing complex and cluttered sequential data compared to the state-of-the-art.  ( 2 min )
    Deep Reinforcement Learning for Cyber System Defense under Dynamic Adversarial Uncertainties. (arXiv:2302.01595v1 [cs.LG])
    Development of autonomous cyber system defense strategies and action recommendations in the real-world is challenging, and includes characterizing system state uncertainties and attack-defense dynamics. We propose a data-driven deep reinforcement learning (DRL) framework to learn proactive, context-aware, defense countermeasures that dynamically adapt to evolving adversarial behaviors while minimizing loss of cyber system operations. A dynamic defense optimization problem is formulated with multiple protective postures against different types of adversaries with varying levels of skill and persistence. A custom simulation environment was developed and experiments were devised to systematically evaluate the performance of four model-free DRL algorithms against realistic, multi-stage attack sequences. Our results suggest the efficacy of DRL algorithms for proactive cyber defense under multi-stage attack profiles and system uncertainties.  ( 2 min )
    Blockwise Self-Supervised Learning at Scale. (arXiv:2302.01647v1 [cs.CV])
    Current state-of-the-art deep networks are all powered by backpropagation. In this paper, we explore alternatives to full backpropagation in the form of blockwise learning rules, leveraging the latest developments in self-supervised learning. We show that a blockwise pretraining procedure consisting of training independently the 4 main blocks of layers of a ResNet-50 with Barlow Twins' loss function at each block performs almost as well as end-to-end backpropagation on ImageNet: a linear probe trained on top of our blockwise pretrained model obtains a top-1 classification accuracy of 70.48%, only 1.1% below the accuracy of an end-to-end pretrained network (71.57% accuracy). We perform extensive experiments to understand the impact of different components within our method and explore a variety of adaptations of self-supervised learning to the blockwise paradigm, building an exhaustive understanding of the critical avenues for scaling local learning rules to large networks, with implications ranging from hardware design to neuroscience.  ( 2 min )
    An Operational Perspective to Fairness Interventions: Where and How to Intervene. (arXiv:2302.01574v1 [cs.LG])
    As AI-based decision systems proliferate, their successful operationalization requires balancing multiple desiderata: predictive performance, disparity across groups, safeguarding sensitive group attributes (e.g., race), and engineering cost. We present a holistic framework for evaluating and contextualizing fairness interventions with respect to the above desiderata. The two key points of practical consideration are where (pre-, in-, post-processing) and how (in what way the sensitive group data is used) the intervention is introduced. We demonstrate our framework using a thorough benchmarking study on predictive parity; we study close to 400 methodological variations across two major model types (XGBoost vs. Neural Net) and ten datasets. Methodological insights derived from our empirical study inform the practical design of ML workflow with fairness as a central concern. We find predictive parity is difficult to achieve without using group data, and despite requiring group data during model training (but not inference), distributionally robust methods provide significant Pareto improvement. Moreover, a plain XGBoost model often Pareto-dominates neural networks with fairness interventions, highlighting the importance of model inductive bias.  ( 2 min )
    ResMem: Learn what you can and memorize the rest. (arXiv:2302.01576v1 [cs.LG])
    The impressive generalization performance of modern neural networks is attributed in part to their ability to implicitly memorize complex training patterns. Inspired by this, we explore a novel mechanism to improve model generalization via explicit memorization. Specifically, we propose the residual-memorization (ResMem) algorithm, a new method that augments an existing prediction model (e.g. a neural network) by fitting the model's residuals with a $k$-nearest neighbor based regressor. The final prediction is then the sum of the original model and the fitted residual regressor. By construction, ResMem can explicitly memorize the training labels. Empirically, we show that ResMem consistently improves the test set generalization of the original prediction model across various standard vision and natural language processing benchmarks. Theoretically, we formulate a stylized linear regression problem and rigorously show that ResMem results in a more favorable test risk over the base predictor.  ( 2 min )
    Uniform tensor clustering by jointly exploring sample affinities of various orders. (arXiv:2302.01569v1 [cs.LG])
    Conventional clustering methods based on pairwise affinity usually suffer from the concentration effect while processing huge dimensional features yet low sample sizes data, resulting in inaccuracy to encode the sample proximity and suboptimal performance in clustering. To address this issue, we propose a unified tensor clustering method (UTC) that characterizes sample proximity using multiple samples' affinity, thereby supplementing rich spatial sample distributions to boost clustering. Specifically, we find that the triadic tensor affinity can be constructed via the Khari-Rao product of two affinity matrices. Furthermore, our early work shows that the fourth-order tensor affinity is defined by the Kronecker product. Therefore, we utilize arithmetical products, Khatri-Rao and Kronecker products, to mathematically integrate different orders of affinity into a unified tensor clustering framework. Thus, the UTC jointly learns a joint low-dimensional embedding to combine various orders. Finally, a numerical scheme is designed to solve the problem. Experiments on synthetic datasets and real-world datasets demonstrate that 1) the usage of high-order tensor affinity could provide a supplementary characterization of sample proximity to the popular affinity matrix; 2) the proposed method of UTC is affirmed to enhance clustering by exploiting different order affinities when processing high-dimensional data.  ( 2 min )
    DynaMIX: Resource Optimization for DNN-Based Real-Time Applications on a Multi-Tasking System. (arXiv:2302.01568v1 [cs.LG])
    As deep neural networks (DNNs) prove their importance and feasibility, more and more DNN-based apps, such as detection and classification of objects, have been developed and deployed on autonomous vehicles (AVs). To meet their growing expectations and requirements, AVs should "optimize" use of their limited onboard computing resources for multiple concurrent in-vehicle apps while satisfying their timing requirements (especially for safety). That is, real-time AV apps should share the limited on-board resources with other concurrent apps without missing their deadlines dictated by the frame rate of a camera that generates and provides input images to the apps. However, most, if not all, of existing DNN solutions focus on enhancing the concurrency of their specific hardware without dynamically optimizing/modifying the DNN apps' resource requirements, subject to the number of running apps, owing to their high computational cost. To mitigate this limitation, we propose DynaMIX (Dynamic MIXed-precision model construction), which optimizes the resource requirement of concurrent apps and aims to maximize execution accuracy. To realize a real-time resource optimization, we formulate an optimization problem using app performance profiles to consider both the accuracy and worst-case latency of each app. We also propose dynamic model reconfiguration by lazy loading only the selected layers at runtime to reduce the overhead of loading the entire model. DynaMIX is evaluated in terms of constraint satisfaction and inference accuracy for a multi-tasking system and compared against state-of-the-art solutions, demonstrating its effectiveness and feasibility under various environmental/operating conditions.  ( 2 min )
    Multiplier Bootstrap-based Exploration. (arXiv:2302.01543v1 [cs.LG])
    Despite the great interest in the bandit problem, designing efficient algorithms for complex models remains challenging, as there is typically no analytical way to quantify uncertainty. In this paper, we propose Multiplier Bootstrap-based Exploration (MBE), a novel exploration strategy that is applicable to any reward model amenable to weighted loss minimization. We prove both instance-dependent and instance-independent rate-optimal regret bounds for MBE in sub-Gaussian multi-armed bandits. With extensive simulation and real data experiments, we show the generality and adaptivity of MBE.  ( 2 min )
    Example-Based Explainable AI and its Application for Remote Sensing Image Classification. (arXiv:2302.01526v1 [cs.AI])
    We present a method of explainable artificial intelligence (XAI), "What I Know (WIK)", to provide additional information to verify the reliability of a deep learning model by showing an example of an instance in a training dataset that is similar to the input data to be inferred and demonstrate it in a remote sensing image classification task. One of the expected roles of XAI methods is verifying whether inferences of a trained machine learning model are valid for an application, and it is an important factor that what datasets are used for training the model as well as the model architecture. Our data-centric approach can help determine whether the training dataset is sufficient for each inference by checking the selected example data. If the selected example looks similar to the input data, we can confirm that the model was not trained on a dataset with a feature distribution far from the feature of the input data. With this method, the criteria for selecting an example are not merely data similarity with the input data but also data similarity in the context of the model task. Using a remote sensing image dataset from the Sentinel-2 satellite, the concept was successfully demonstrated with reasonably selected examples. This method can be applied to various machine-learning tasks, including classification and regression.  ( 2 min )
    DCM: Deep energy method based on the principle of minimum complementary energy. (arXiv:2302.01538v1 [cs.LG])
    The principle of minimum potential and complementary energy are the most important variational principles in solid mechanics. The deep energy method (DEM), which has received much attention, is based on the principle of minimum potential energy and lacks the important form of minimum complementary energy. Thus, we propose the deep energy method based on the principle of minimum complementary energy (DCM). The output function of DCM is the stress function that naturally satisfies the equilibrium equation. We extend the proposed DCM algorithm (DCM-P), adding the terms that naturally satisfy the biharmonic equation in the Airy stress function. We combine operator learning with physical equations and propose a deep complementary energy operator method (DCM-O), including branch net, trunk net, basis net, and particular net. DCM-O first combines existing high-fidelity numerical results to train DCM-O through data. Then the complementary energy is used to train the branch net and trunk net in DCM-O. To analyze DCM performance, we present the numerical result of the most common stress functions, the Prandtl and Airy stress function. The proposed method DCM is used to model the representative mechanical problems with the different types of boundary conditions. We compare DCM with the existing PINNs and DEM algorithms. The result shows the advantage of the proposed DCM is suitable for dealing with problems of dominated displacement boundary conditions, which is reflected in theory and our numerical experiments. DCM-P and DCM-O improve the accuracy of DCM and the speed of calculation convergence. DCM is an essential supplementary energy form of the deep energy method. We believe that operator learning based on the energy method can balance data and physical equations well, giving computational mechanics broad research prospects.  ( 2 min )
    Robust Camera Pose Refinement for Multi-Resolution Hash Encoding. (arXiv:2302.01571v1 [cs.CV])
    Multi-resolution hash encoding has recently been proposed to reduce the computational cost of neural renderings, such as NeRF. This method requires accurate camera poses for the neural renderings of given scenes. However, contrary to previous methods jointly optimizing camera poses and 3D scenes, the naive gradient-based camera pose refinement method using multi-resolution hash encoding severely deteriorates performance. We propose a joint optimization algorithm to calibrate the camera pose and learn a geometric representation using efficient multi-resolution hash encoding. Showing that the oscillating gradient flows of hash encoding interfere with the registration of camera poses, our method addresses the issue by utilizing smooth interpolation weighting to stabilize the gradient oscillation for the ray samplings across hash grids. Moreover, the curriculum training procedure helps to learn the level-wise hash encoding, further increasing the pose refinement. Experiments on the novel-view synthesis datasets validate that our learning frameworks achieve state-of-the-art performance and rapid convergence of neural rendering, even when initial camera poses are unknown.  ( 2 min )
    Machine Learning for UAV Propeller Fault Detection based on a Hybrid Data Generation Model. (arXiv:2302.01556v1 [cs.LG])
    This paper describes the development of an on-board data-driven system that can monitor and localize the fault in a quadrotor unmanned aerial vehicle (UAV) and at the same time, evaluate the degree of damage of the fault under real scenarios. To achieve offline training data generation, a hybrid approach is proposed for the development of a virtual data-generative model using a combination of data-driven models as well as well-established dynamic models that describe the kinematics of the UAV. To effectively represent the drop in performance of a faulty propeller, a variation of the deep neural network, a LSTM network is proposed. With the RPM of the propeller as input and based on the fault condition of the propeller, the proposed propeller model estimates the resultant torque and thrust. Then, flight datasets of the UAV under various fault scenarios are generated via simulation using the developed data-generative model. Lastly, a fault classifier using a CNN model is proposed to identify as well as evaluate the degree of damage to the damaged propeller. The scope of this paper focuses on the identification of faulty propellers and classification of the fault level for quadrotor UAVs using RPM as well as flight data. Doing so allows for early minor fault detection to prevent serious faults from occurring if the fault is left unrepaired. To further validate the workability of this approach outside of simulation, a real-flight test is conducted indoors. The real flight data is collected and a simulation to real sim-real test is conducted. Due to the imperfections in the build of our experimental UAV, a slight calibration approach to our simulation model is further proposed and the experimental results obtained show that our trained model can identify the location of propeller fault as well as the degree/type of damage. Currently, the diagnosis accuracy on the testing set is over 80%.  ( 3 min )
    Deep Reinforcement Learning for Online Error Detection in Cyber-Physical Systems. (arXiv:2302.01567v1 [cs.LG])
    Reliability is one of the major design criteria in Cyber-Physical Systems (CPSs). This is because of the existence of some critical applications in CPSs and their failure is catastrophic. Therefore, employing strong error detection and correction mechanisms in CPSs is inevitable. CPSs are composed of a variety of units, including sensors, networks, and microcontrollers. Each of these units is probable to be in a faulty state at any time and the occurred fault can result in erroneous output. The fault may cause the units of CPS to malfunction and eventually crash. Traditional fault-tolerant approaches include redundancy time, hardware, information, and/or software. However, these approaches impose significant overheads besides their low error coverage, which limits their applicability. In addition, the interval between error occurrence and detection is too long in these approaches. In this paper, based on Deep Reinforcement Learning (DRL), a new error detection approach is proposed that not only detects errors with high accuracy but also can perform error detection at the moment due to very low inference time. The proposed approach can categorize different types of errors from normal data and predict whether the system will fail. The evaluation results illustrate that the proposed approach has improved more than 2x in terms of accuracy and more than 5x in terms of inference time compared to other approaches.  ( 2 min )
    Ordered GNN: Ordering Message Passing to Deal with Heterophily and Over-smoothing. (arXiv:2302.01524v1 [cs.LG])
    Most graph neural networks follow the message passing mechanism. However, it faces the over-smoothing problem when multiple times of message passing is applied to a graph, causing indistinguishable node representations and prevents the model to effectively learn dependencies between farther-away nodes. On the other hand, features of neighboring nodes with different labels are likely to be falsely mixed, resulting in the heterophily problem. In this work, we propose to order the messages passing into the node representation, with specific blocks of neurons targeted for message passing within specific hops. This is achieved by aligning the hierarchy of the rooted-tree of a central node with the ordered neurons in its node representation. Experimental results on an extensive set of datasets show that our model can simultaneously achieve the state-of-the-art in both homophily and heterophily settings, without any targeted design. Moreover, its performance maintains pretty well while the model becomes really deep, effectively preventing the over-smoothing problem. Finally, visualizing the gating vectors shows that our model learns to behave differently between homophily and heterophily settings, providing an explainable graph neural model.  ( 2 min )
    Causal Inference Based Single-branch Ensemble Trees For Uplift Modeling. (arXiv:2302.01563v1 [cs.LG])
    In this manuscript (ms), we propose causal inference based single-branch ensemble trees for uplift modeling, namely CIET. Different from standard classification methods for predictive probability modeling, CIET aims to achieve the change in the predictive probability of outcome caused by an action or a treatment. According to our CIET, two partition criteria are specifically designed to maximize the difference in outcome distribution between the treatment and control groups. Next, a novel single-branch tree is built by taking a top-down node partition approach, and the remaining samples are censored since they are not covered by the upper node partition logic. Repeating the tree-building process on the censored data, single-branch ensemble trees with a set of inference rules are thus formed. Moreover, CIET is experimentally demonstrated to outperform previous approaches for uplift modeling in terms of both area under uplift curve (AUUC) and Qini coefficient significantly. At present, CIET has already been applied to online personal loans in a national financial holdings group in China. CIET will also be of use to analysts applying machine learning techniques to causal inference in broader business domains such as web advertising, medicine and economics.  ( 2 min )
    Group Fairness in Non-monotone Submodular Maximization. (arXiv:2302.01546v1 [cs.LG])
    Maximizing a submodular function has a wide range of applications in machine learning and data mining. One such application is data summarization whose goal is to select a small set of representative and diverse data items from a large dataset. However, data items might have sensitive attributes such as race or gender, in this setting, it is important to design \emph{fairness-aware} algorithms to mitigate potential algorithmic bias that may cause over- or under- representation of particular groups. Motivated by that, we propose and study the classic non-monotone submodular maximization problem subject to novel group fairness constraints. Our goal is to select a set of items that maximizes a non-monotone submodular function, while ensuring that the number of selected items from each group is proportionate to its size, to the extent specified by the decision maker. We develop the first constant-factor approximation algorithms for this problem. We also extend the basic model to incorporate an additional global size constraint on the total number of selected items.  ( 2 min )
    Optimality of Thompson Sampling with Noninformative Priors for Pareto Bandits. (arXiv:2302.01544v1 [cs.LG])
    In the stochastic multi-armed bandit problem, a randomized probability matching policy called Thompson sampling (TS) has shown excellent performance in various reward models. In addition to the empirical performance, TS has been shown to achieve asymptotic problem-dependent lower bounds in several models. However, its optimality has been mainly addressed under light-tailed or one-parameter models that belong to exponential families. In this paper, we consider the optimality of TS for the Pareto model that has a heavy tail and is parameterized by two unknown parameters. Specifically, we discuss the optimality of TS with probability matching priors that include the Jeffreys prior and the reference priors. We first prove that TS with certain probability matching priors can achieve the optimal regret bound. Then, we show the suboptimality of TS with other priors, including the Jeffreys and the reference priors. Nevertheless, we find that TS with the Jeffreys and reference priors can achieve the asymptotic lower bound if one uses a truncation procedure. These results suggest carefully choosing noninformative priors to avoid suboptimality and show the effectiveness of truncation procedures in TS-based policies.  ( 2 min )
    Vertical Federated Learning: Taxonomies, Threats, and Prospects. (arXiv:2302.01550v1 [cs.LG])
    Federated learning (FL) is the most popular distributed machine learning technique. FL allows machine-learning models to be trained without acquiring raw data to a single point for processing. Instead, local models are trained with local data; the models are then shared and combined. This approach preserves data privacy as locally trained models are shared instead of the raw data themselves. Broadly, FL can be divided into horizontal federated learning (HFL) and vertical federated learning (VFL). For the former, different parties hold different samples over the same set of features; for the latter, different parties hold different feature data belonging to the same set of samples. In a number of practical scenarios, VFL is more relevant than HFL as different companies (e.g., bank and retailer) hold different features (e.g., credit history and shopping history) for the same set of customers. Although VFL is an emerging area of research, it is not well-established compared to HFL. Besides, VFL-related studies are dispersed, and their connections are not intuitive. Thus, this survey aims to bring these VFL-related studies to one place. Firstly, we classify existing VFL structures and algorithms. Secondly, we present the threats from security and privacy perspectives to VFL. Thirdly, for the benefit of future researchers, we discussed the challenges and prospects of VFL in detail.  ( 2 min )
    Revisiting Intermediate Layer Distillation for Compressing Language Models: An Overfitting Perspective. (arXiv:2302.01530v1 [cs.CL])
    Knowledge distillation (KD) is a highly promising method for mitigating the computational problems of pre-trained language models (PLMs). Among various KD approaches, Intermediate Layer Distillation (ILD) has been a de facto standard KD method with its performance efficacy in the NLP field. In this paper, we find that existing ILD methods are prone to overfitting to training datasets, although these methods transfer more information than the original KD. Next, we present the simple observations to mitigate the overfitting of ILD: distilling only the last Transformer layer and conducting ILD on supplementary tasks. Based on our two findings, we propose a simple yet effective consistency-regularized ILD (CR-ILD), which prevents the student model from overfitting the training dataset. Substantial experiments on distilling BERT on the GLUE benchmark and several synthetic datasets demonstrate that our proposed ILD method outperforms other KD techniques. Our code is available at https://github.com/jongwooko/CR-ILD.  ( 2 min )
    Using natural language processing and structured medical data to phenotype patients hospitalized due to COVID-19. (arXiv:2302.01536v1 [cs.CL])
    To identify patients who are hospitalized because of COVID-19 as opposed to those who were admitted for other indications, we compared the performance of different computable phenotype definitions for COVID-19 hospitalizations that use different types of data from the electronic health records (EHR), including structured EHR data elements, provider notes, or a combination of both data types. And conduct a retrospective data analysis utilizing chart review-based validation. Participants are 586 hospitalized individuals who tested positive for SARS-CoV-2 during January 2022. We used natural language processing to incorporate data from provider notes and LASSO regression and Random Forests to fit classification algorithms that incorporated structured EHR data elements, provider notes, or a combination of structured data and provider notes. Results: Based on a chart review, 38% of 586 patients were determined to be hospitalized for reasons other than COVID-19 despite having tested positive for SARS-CoV-2. A classification algorithm that used provider notes had significantly better discrimination than one that used structured EHR data elements (AUROC: 0.894 vs 0.841, p < 0.001), and performed similarly to a model that combined provider notes with structured data elements (AUROC: 0.894 vs 0.893). Assessments of hospital outcome metrics significantly differed based on whether the population included all hospitalized patients who tested positive for SARS-CoV-2 versus those who were determined to have been hospitalized due to COVID-19. This work demonstrates the utility of natural language processing approaches to derive information related to patient hospitalizations in cases where there may be multiple conditions that could serve as the primary indication for hospitalization.  ( 3 min )
    Multi-channel Autobidding with Budget and ROI Constraints. (arXiv:2302.01523v1 [cs.GT])
    In digital online advertising, advertisers procure ad impressions simultaneously on multiple platforms, or so-called channels, such as Google Ads, Meta Ads Manager, etc., each of which consists of numerous ad auctions. We study how an advertiser maximizes total conversion (e.g. ad clicks) while satisfying aggregate return-on-investment (ROI) and budget constraints across all channels. In practice, an advertiser does not have control over, and thus cannot globally optimize, which individual ad auctions she participates in for each channel, and instead authorizes a channel to procure impressions on her behalf: the advertiser can only utilize two levers on each channel, namely setting a per-channel budget and per-channel target ROI. In this work, we first analyze the effectiveness of each of these levers for solving the advertiser's global multi-channel problem. We show that when an advertiser only optimizes over per-channel ROIs, her total conversion can be arbitrarily worse than what she could have obtained in the global problem. Further, we show that the advertiser can achieve the global optimal conversion when she only optimizes over per-channel budgets. In light of this finding, under a bandit feedback setting that mimics real-world scenarios where advertisers have limited information on ad auctions in each channels and how channels procure ads, we present an efficient learning algorithm that produces per-channel budgets whose resulting conversion approximates that of the global optimal problem. Finally, we argue that all our results hold for both single-item and multi-item auctions from which channels procure impressions on advertisers' behalf.  ( 2 min )
    Pseudonorm Approachability and Applications to Regret Minimization. (arXiv:2302.01517v1 [cs.LG])
    Blackwell's celebrated approachability theory provides a general framework for a variety of learning problems, including regret minimization. However, Blackwell's proof and implicit algorithm measure approachability using the $\ell_2$ (Euclidean) distance. We argue that in many applications such as regret minimization, it is more useful to study approachability under other distance metrics, most commonly the $\ell_\infty$-metric. But, the time and space complexity of the algorithms designed for $\ell_\infty$-approachability depend on the dimension of the space of the vectorial payoffs, which is often prohibitively large. Thus, we present a framework for converting high-dimensional $\ell_\infty$-approachability problems to low-dimensional pseudonorm approachability problems, thereby resolving such issues. We first show that the $\ell_\infty$-distance between the average payoff and the approachability set can be equivalently defined as a pseudodistance between a lower-dimensional average vector payoff and a new convex set we define. Next, we develop an algorithmic theory of pseudonorm approachability, analogous to previous work on approachability for $\ell_2$ and other norms, showing that it can be achieved via online linear optimization (OLO) over a convex set given by the Fenchel dual of the unit pseudonorm ball. We then use that to show, modulo mild normalization assumptions, that there exists an $\ell_\infty$-approachability algorithm whose convergence is independent of the dimension of the original vectorial payoff. We further show that that algorithm admits a polynomial-time complexity, assuming that the original $\ell_\infty$-distance can be computed efficiently. We also give an $\ell_\infty$-approachability algorithm whose convergence is logarithmic in that dimension using an FTRL algorithm with a maximum-entropy regularizer.  ( 2 min )
    Improving Recommendation Relevance by simulating User Interest. (arXiv:2302.01522v1 [math.NA])
    Most if not all on-line item-to-item recommendation systems rely on estimation of a distance like measure (rank) of similarity between items. For on-line recommendation systems, time sensitivity of this similarity measure is extremely important. We observe that recommendation "recency" can be straightforwardly and transparently maintained by iterative reduction of ranks of inactive items. The paper briefly summarizes algorithmic developments based on this self-explanatory observation. The basic idea behind this work is patented in a context of online recommendation systems.  ( 2 min )
    Randomized Gaussian Process Upper Confidence Bound with Tight Bayesian Regret Bounds. (arXiv:2302.01511v1 [cs.LG])
    Gaussian process upper confidence bound (GP-UCB) is a theoretically promising approach for black-box optimization; however, the confidence parameter $\beta$ is considerably large in the theorem and chosen heuristically in practice. Then, randomized GP-UCB (RGP-UCB) uses a randomized confidence parameter, which follows the Gamma distribution, to mitigate the impact of manually specifying $\beta$. This study first generalizes the regret analysis of RGP-UCB to a wider class of distributions, including the Gamma distribution. Furthermore, we propose improved RGP-UCB (IRGP-UCB) based on a two-parameter exponential distribution, which achieves tight Bayesian regret bounds. IRGP-UCB does not require an increase in the confidence parameter in terms of the number of iterations, which avoids over-exploration in the later iterations. Finally, we demonstrate the effectiveness of IRGP-UCB through extensive experiments.  ( 2 min )
    A Lipschitz Bandits Approach for Continuous Hyperparameter Optimization. (arXiv:2302.01539v1 [cs.LG])
    One of the most critical problems in machine learning is HyperParameter Optimization (HPO), since choice of hyperparameters has a significant impact on final model performance. Although there are many HPO algorithms, they either have no theoretical guarantees or require strong assumptions. To this end, we introduce BLiE -- a Lipschitz-bandit-based algorithm for HPO that only assumes Lipschitz continuity of the objective function. BLiE exploits the landscape of the objective function to adaptively search over the hyperparameter space. Theoretically, we show that $(i)$ BLiE finds an $\epsilon$-optimal hyperparameter with $O \left( \frac{1}{\epsilon} \right)^{d_z + \beta}$ total budgets, where $d_z$ and $\beta$ are problem intrinsic; $(ii)$ BLiE is highly parallelizable. Empirically, we demonstrate that BLiE outperforms the state-of-the-art HPO algorithms on benchmark tasks. We also apply BLiE to search for noise schedule of diffusion models. Comparison with the default schedule shows that BLiE schedule greatly improves the sampling speed.  ( 2 min )
    Support Recovery in Sparse PCA with Non-Random Missing Data. (arXiv:2302.01535v1 [stat.ML])
    We analyze a practical algorithm for sparse PCA on incomplete and noisy data under a general non-random sampling scheme. The algorithm is based on a semidefinite relaxation of the $\ell_1$-regularized PCA problem. We provide theoretical justification that under certain conditions, we can recover the support of the sparse leading eigenvector with high probability by obtaining a unique solution. The conditions involve the spectral gap between the largest and second-largest eigenvalues of the true data matrix, the magnitude of the noise, and the structural properties of the observed entries. The concepts of algebraic connectivity and irregularity are used to describe the structural properties of the observed entries. We empirically justify our theorem with synthetic and real data analysis. We also show that our algorithm outperforms several other sparse PCA approaches especially when the observed entries have good structural properties. As a by-product of our analysis, we provide two theorems to handle a deterministic sampling scheme, which can be applied to other matrix-related problems.  ( 2 min )
    ANTM: An Aligned Neural Topic Model for Exploring Evolving Topics. (arXiv:2302.01501v1 [cs.IR])
    As the amount of text data generated by humans and machines increases, the necessity of understanding large corpora and finding a way to extract insights from them is becoming more crucial than ever. Dynamic topic models are effective methods that primarily focus on studying the evolution of topics present in a collection of documents. These models are widely used for understanding trends, exploring public opinion in social networks, or tracking research progress and discoveries in scientific archives. Since topics are defined as clusters of semantically similar documents, it is necessary to observe the changes in the content or themes of these clusters in order to understand how topics evolve as new knowledge is discovered over time. In this paper, we introduce the Aligned Neural Topic Model (ANTM), a dynamic neural topic model that uses document embeddings to compute clusters of semantically similar documents at different periods and to align document clusters to represent their evolution. This alignment procedure preserves the temporal similarity of document clusters over time and captures the semantic change of words characterized by their context within different periods. Experiments on four different datasets show that ANTM outperforms probabilistic dynamic topic models (e.g. DTM, DETM) and significantly improves topic coherence and diversity over other existing dynamic neural topic models (e.g. BERTopic).  ( 2 min )
    LSA-PINN: Linear Boundary Connectivity Loss for Solving PDEs on Complex Geometry. (arXiv:2302.01518v1 [cs.LG])
    We present a novel loss formulation for efficient learning of complex dynamics from governing physics, typically described by partial differential equations (PDEs), using physics-informed neural networks (PINNs). In our experiments, existing versions of PINNs are seen to learn poorly in many problems, especially for complex geometries, as it becomes increasingly difficult to establish appropriate sampling strategy at the near boundary region. Overly dense sampling can adversely impede training convergence if the local gradient behaviors are too complex to be adequately modelled by PINNs. On the other hand, if the samples are too sparse, existing PINNs tend to overfit the near boundary region, leading to incorrect solution. To prevent such issues, we propose a new Boundary Connectivity (BCXN) loss function which provides linear local structure approximation (LSA) to the gradient behaviors at the boundary for PINN. Our BCXN-loss implicitly imposes local structure during training, thus facilitating fast physics-informed learning across entire problem domains with order of magnitude sparser training samples. This LSA-PINN method shows a few orders of magnitude smaller errors than existing methods in terms of the standard L2-norm metric, while using dramatically fewer training samples and iterations. Our proposed LSA-PINN does not pose any requirement on the differentiable property of the networks, and we demonstrate its benefits and ease of implementation on both multi-layer perceptron and convolutional neural network versions as commonly used in current PINN literature.  ( 2 min )
    Xtal2DoS: Attention-based Crystal to Sequence Learning for Density of States Prediction. (arXiv:2302.01486v1 [cs.LG])
    Modern machine learning techniques have been extensively applied to materials science, especially for property prediction tasks. A majority of these methods address scalar property predictions, while more challenging spectral properties remain less emphasized. We formulate a crystal-to-sequence learning task and propose a novel attention-based learning method, Xtal2DoS, which decodes the sequential representation of the material density of states (DoS) properties by incorporating the learned atomic embeddings through attention networks. Experiments show Xtal2DoS is faster than the existing models, and consistently outperforms other state-of-the-art methods on four metrics for two fundamental spectral properties, phonon and electronic DoS.  ( 2 min )
    User-centric Heterogeneous-action Deep Reinforcement Learning for Virtual Reality in the Metaverse over Wireless Networks. (arXiv:2302.01471v1 [cs.NI])
    The Metaverse is emerging as maturing technologies are empowering the different facets. Virtual Reality (VR) technologies serve as the backbone of the virtual universe within the Metaverse to offer a highly immersive user experience. As mobility is emphasized in the Metaverse context, VR devices reduce their weights at the sacrifice of local computation abilities. In this paper, for a system consisting of a Metaverse server and multiple VR users, we consider two cases of (i) the server generating frames and transmitting them to users, and (ii) users generating frames locally and thus consuming device energy. Moreover, in our multi-user VR scenario for the Metaverse, users have different characteristics and demands for Frames Per Second (FPS). Then the channel access arrangement (including the decisions on frame generation location), and transmission powers for the downlink communications from the server to the users are jointly optimized to improve the utilities of users. This joint optimization is addressed by deep reinforcement learning (DRL) with heterogeneous actions. Our proposed user-centric DRL algorithm is called User-centric Critic with Heterogenous Actors (UCHA). Extensive experiments demonstrate that our UCHA algorithm leads to remarkable results under various requirements and constraints.  ( 2 min )
    Clustered Embedding Learning for Recommender Systems. (arXiv:2302.01478v1 [cs.AI])
    In recent years, recommender systems have advanced rapidly, where embedding learning for users and items plays a critical role. A standard method learns a unique embedding vector for each user and item. However, such a method has two important limitations in real-world applications: 1) it is hard to learn embeddings that generalize well for users and items with rare interactions on their own; and 2) it may incur unbearably high memory costs when the number of users and items scales up. Existing approaches either can only address one of the limitations or have flawed overall performances. In this paper, we propose Clustered Embedding Learning (CEL) as an integrated solution to these two problems. CEL is a plug-and-play embedding learning framework that can be combined with any differentiable feature interaction model. It is capable of achieving improved performance, especially for cold users and items, with reduced memory cost. CEL enables automatic and dynamic clustering of users and items in a top-down fashion, where clustered entities jointly learn a shared embedding. The accelerated version of CEL has an optimal time complexity, which supports efficient online updates. Theoretically, we prove the identifiability and the existence of a unique optimal number of clusters for CEL in the context of nonnegative matrix factorization. Empirically, we validate the effectiveness of CEL on three public datasets and one business dataset, showing its consistently superior performance against current state-of-the-art methods. In particular, when incorporating CEL into the business model, it brings an improvement of $+0.6\%$ in AUC, which translates into a significant revenue gain; meanwhile, the size of the embedding table gets $2650$ times smaller.  ( 2 min )
    Learning to Optimize for Reinforcement Learning. (arXiv:2302.01470v1 [cs.LG])
    In recent years, by leveraging more data, computation, and diverse tasks, learned optimizers have achieved remarkable success in supervised learning optimization, outperforming classical hand-designed optimizers. However, in practice, these learned optimizers fail to generalize to reinforcement learning tasks due to unstable and complex loss landscapes. Moreover, neither hand-designed optimizers nor learned optimizers have been specifically designed to address the unique optimization properties in reinforcement learning. In this work, we take a data-driven approach to learn to optimize for reinforcement learning using meta-learning. We introduce a novel optimizer structure that significantly improves the training efficiency of learned optimizers, making it possible to learn an optimizer for reinforcement learning from scratch. Although trained in toy tasks, our learned optimizer demonstrates its generalization ability to unseen complex tasks. Finally, we design a set of small gridworlds to train the first general-purpose optimizer for reinforcement learning.  ( 2 min )
    Towards Practical Preferential Bayesian Optimization with Skew Gaussian Processes. (arXiv:2302.01513v1 [cs.LG])
    We study preferential Bayesian optimization (BO) where reliable feedback is limited to pairwise comparison called duels. An important challenge in preferential BO, which uses the preferential Gaussian process (GP) model to represent flexible preference structure, is that the posterior distribution is a computationally intractable skew GP. The most widely used approach for preferential BO is Gaussian approximation, which ignores the skewness of the true posterior. Alternatively, Markov chain Monte Carlo (MCMC) based preferential BO is also proposed. In this work, we first verify the accuracy of Gaussian approximation, from which we reveal the critical problem that the predictive probability of duels can be inaccurate. This observation motivates us to improve the MCMC-based estimation for skew GP, for which we show the practical efficiency of Gibbs sampling and derive the low variance MC estimator. However, the computational time of MCMC can still be a bottleneck in practice. Towards building a more practical preferential BO, we develop a new method that achieves both high computational efficiency and low sample complexity, and then demonstrate its effectiveness through extensive numerical experiments.  ( 2 min )
    Perfect Is the Enemy of Test Oracle. (arXiv:2302.01488v1 [cs.SE])
    Automation of test oracles is one of the most challenging facets of software testing, but remains comparatively less addressed compared to automated test input generation. Test oracles rely on a ground-truth that can distinguish between the correct and buggy behavior to determine whether a test fails (detects a bug) or passes. What makes the oracle problem challenging and undecidable is the assumption that the ground-truth should know the exact expected, correct, or buggy behavior. However, we argue that one can still build an accurate oracle without knowing the exact correct or buggy behavior, but how these two might differ. This paper presents SEER, a learning-based approach that in the absence of test assertions or other types of oracle, can determine whether a unit test passes or fails on a given method under test (MUT). To build the ground-truth, SEER jointly embeds unit tests and the implementation of MUTs into a unified vector space, in such a way that the neural representation of tests are similar to that of MUTs they pass on them, but dissimilar to MUTs they fail on them. The classifier built on top of this vector representation serves as the oracle to generate "fail" labels, when test inputs detect a bug in MUT or "pass" labels, otherwise. Our extensive experiments on applying SEER to more than 5K unit tests from a diverse set of open-source Java projects show that the produced oracle is (1) effective in predicting the fail or pass labels, achieving an overall accuracy, precision, recall, and F1 measure of 93%, 86%, 94%, and 90%, (2) generalizable, predicting the labels for the unit test of projects that were not in training or validation set with negligible performance drop, and (3) efficient, detecting the existence of bugs in only 6.5 milliseconds on average.  ( 3 min )
    LazyGNN: Large-Scale Graph Neural Networks via Lazy Propagation. (arXiv:2302.01503v1 [cs.LG])
    Recent works have demonstrated the benefits of capturing long-distance dependency in graphs by deeper graph neural networks (GNNs). But deeper GNNs suffer from the long-lasting scalability challenge due to the neighborhood explosion problem in large-scale graphs. In this work, we propose to capture long-distance dependency in graphs by shallower models instead of deeper models, which leads to a much more efficient model, LazyGNN, for graph representation learning. Moreover, we demonstrate that LazyGNN is compatible with existing scalable approaches (such as sampling methods) for further accelerations through the development of mini-batch LazyGNN. Comprehensive experiments demonstrate its superior prediction performance and scalability on large-scale benchmarks. LazyGNN also achieves state-of-art performance on the OGB leaderboard.  ( 2 min )
    Defensive ML: Defending Architectural Side-channels with Adversarial Obfuscation. (arXiv:2302.01474v1 [cs.CR])
    Side-channel attacks that use machine learning (ML) for signal analysis have become prominent threats to computer security, as ML models easily find patterns in signals. To address this problem, this paper explores using Adversarial Machine Learning (AML) methods as a defense at the computer architecture layer to obfuscate side channels. We call this approach Defensive ML, and the generator to obfuscate signals, defender. Defensive ML is a workflow to design, implement, train, and deploy defenders for different environments. First, we design a defender architecture given the physical characteristics and hardware constraints of the side-channel. Next, we use our DefenderGAN structure to train the defender. Finally, we apply defensive ML to thwart two side-channel attacks: one based on memory contention and the other on application power. The former uses a hardware defender with ns-level response time that attains a high level of security with half the performance impact of a traditional scheme; the latter uses a software defender with ms-level response time that provides better security than a traditional scheme with only 70% of its power overhead.  ( 2 min )
    Gradient Estimation for Unseen Domain Risk Minimization with Pre-Trained Models. (arXiv:2302.01497v1 [cs.LG])
    Domain generalization aims to build generalized models that perform well on unseen domains when only source domains are available for model optimization. Recent studies have demonstrated that large-scale pre-trained models could play an important role in domain generalization by providing their generalization power. However, large-scale pre-trained models are not fully equipped with target task-specific knowledge due to a discrepancy between the pre-training objective and the target task. Although the task-specific knowledge could be learned from source domains by fine-tuning, this hurts the generalization power of the pre-trained models because of gradient bias toward the source domains. To address this issue, we propose a new domain generalization method that estimates unobservable gradients that reduce potential risks in unseen domains, using a large-scale pre-trained model. Our proposed method allows the pre-trained model to learn task-specific knowledge further while preserving its generalization ability with the estimated gradients. Experimental results show that our proposed method outperforms baseline methods on DomainBed, a standard benchmark in domain generalization. We also provide extensive analyses to demonstrate that the estimated unobserved gradients relieve the gradient bias, and the pre-trained model learns the task-specific knowledge without sacrificing its generalization power.  ( 2 min )
    SPADE: Self-supervised Pretraining for Acoustic DisEntanglement. (arXiv:2302.01483v1 [cs.LG])
    Self-supervised representation learning approaches have grown in popularity due to the ability to train models on large amounts of unlabeled data and have demonstrated success in diverse fields such as natural language processing, computer vision, and speech. Previous self-supervised work in the speech domain has disentangled multiple attributes of speech such as linguistic content, speaker identity, and rhythm. In this work, we introduce a self-supervised approach to disentangle room acoustics from speech and use the acoustic representation on the downstream task of device arbitration. Our results demonstrate that our proposed approach significantly improves performance over a baseline when labeled training data is scarce, indicating that our pretraining scheme learns to encode room acoustic information while remaining invariant to other attributes of the speech signal.  ( 2 min )
    Efficient Domain Adaptation for Speech Foundation Models. (arXiv:2302.01496v1 [cs.CL])
    Foundation models (FMs), that are trained on broad data at scale and are adaptable to a wide range of downstream tasks, have brought large interest in the research community. Benefiting from the diverse data sources such as different modalities, languages and application domains, foundation models have demonstrated strong generalization and knowledge transfer capabilities. In this paper, we present a pioneering study towards building an efficient solution for FM-based speech recognition systems. We adopt the recently developed self-supervised BEST-RQ for pretraining, and propose the joint finetuning with both source and unsupervised target domain data using JUST Hydra. The FM encoder adapter and decoder are then finetuned to the target domain with a small amount of supervised in-domain data. On a large-scale YouTube and Voice Search task, our method is shown to be both data and model parameter efficient. It achieves the same quality with only 21.6M supervised in-domain data and 130.8M finetuned parameters, compared to the 731.1M model trained from scratch on additional 300M supervised in-domain data.  ( 2 min )
    Performance Bounds for Policy-Based Average Reward Reinforcement Learning Algorithms. (arXiv:2302.01450v1 [cs.LG])
    Many policy-based reinforcement learning (RL) algorithms can be viewed as instantiations of approximate policy iteration (PI), i.e., where policy improvement and policy evaluation are both performed approximately. In applications where the average reward objective is the meaningful performance metric, often discounted reward formulations are used with the discount factor being close to 1, which is equivalent to making the expected horizon very large. However, the corresponding theoretical bounds for error performance scale with the square of the horizon. Thus, even after dividing the total reward by the length of the horizon, the corresponding performance bounds for average reward problems go to infinity. Therefore, an open problem has been to obtain meaningful performance bounds for approximate PI and RL algorithms for the average-reward setting. In this paper, we solve this open problem by obtaining the first non-trivial error bounds for average-reward MDPs which go to zero in the limit where when policy evaluation and policy improvement errors go to zero.  ( 2 min )
    Spiking Synaptic Penalty: Appropriate Penalty Term for Energy-Efficient Spiking Neural Networks. (arXiv:2302.01500v1 [cs.LG])
    Spiking neural networks (SNNs) are energy-efficient neural networks because of their spiking nature. However, as the spike firing rate of SNNs increases, the energy consumption does as well, and thus, the advantage of SNNs diminishes. Here, we tackle this problem by introducing a novel penalty term for the spiking activity into the objective function in the training phase. Our method is designed so as to optimize the energy consumption metric directly without modifying the network architecture. Therefore, the proposed method can reduce the energy consumption more than other methods while maintaining the accuracy. We conducted experiments for image classification tasks, and the results indicate the effectiveness of the proposed method, which mitigates the dilemma of the energy--accuracy trade-off.  ( 2 min )
    Convergence of Gradient Descent with Linearly Correlated Noise and Applications to Differentially Private Learning. (arXiv:2302.01463v1 [cs.LG])
    We study stochastic optimization with linearly correlated noise. Our study is motivated by recent methods for optimization with differential privacy (DP), such as DP-FTRL, which inject noise via matrix factorization mechanisms. We propose an optimization problem that distils key facets of these DP methods and that involves perturbing gradients by linearly correlated noise. We derive improved convergence rates for gradient descent in this framework for convex and non-convex loss functions. Our theoretical analysis is novel and might be of independent interest. We use these convergence rates to develop new, effective matrix factorizations for differentially private optimization, and highlight the benefits of these factorizations theoretically and empirically.  ( 2 min )
    Commonsense-Aware Prompting for Controllable Empathetic Dialogue Generation. (arXiv:2302.01441v1 [cs.CL])
    Improving the emotional awareness of pre-trained language models is an emerging important problem for dialogue generation tasks. Although prior studies have introduced methods to improve empathetic dialogue generation, few have discussed how to incorporate commonsense knowledge into pre-trained language models for controllable dialogue generation. In this study, we propose a novel framework that improves empathetic dialogue generation using pre-trained language models by 1) incorporating commonsense knowledge through prompt verbalization, and 2) controlling dialogue generation using a strategy-driven future discriminator. We conducted experiments to reveal that both the incorporation of social commonsense knowledge and enforcement of control over generation help to improve generation performance. Finally, we discuss the implications of our study for future research.  ( 2 min )
    Generalized Uncertainty of Deep Neural Networks: Taxonomy and Applications. (arXiv:2302.01440v1 [cs.LG])
    Deep neural networks have seen enormous success in various real-world applications. Beyond their predictions as point estimates, increasing attention has been focused on quantifying the uncertainty of their predictions. In this review, we show that the uncertainty of deep neural networks is not only important in a sense of interpretability and transparency, but also crucial in further advancing their performance, particularly in learning systems seeking robustness and efficiency. We will generalize the definition of the uncertainty of deep neural networks to any number or vector that is associated with an input or an input-label pair, and catalog existing methods on ``mining'' such uncertainty from a deep model. We will include those methods from the classic field of uncertainty quantification as well as those methods that are specific to deep neural networks. We then show a wide spectrum of applications of such generalized uncertainty in realistic learning tasks including robust learning such as noisy learning, adversarially robust learning; data-efficient learning such as semi-supervised and weakly-supervised learning; and model-efficient learning such as model compression and knowledge distillation.  ( 2 min )
    Mixed Precision Post Training Quantization of Neural Networks with Sensitivity Guided Search. (arXiv:2302.01382v1 [cs.LG])
    Serving large-scale machine learning (ML) models efficiently and with low latency has become challenging owing to increasing model size and complexity. Quantizing models can simultaneously reduce memory and compute requirements, facilitating their widespread access. However, for large models not all layers are equally amenable to the same numerical precision and aggressive quantization can lead to unacceptable loss in model accuracy. One approach to prevent this accuracy degradation is mixed-precision quantization, which allows different tensors to be quantized to varying levels of numerical precision, leveraging the capabilities of modern hardware. Such mixed-precision quantiztaion can more effectively allocate numerical precision to different tensors `as needed' to preserve model accuracy while reducing footprint and compute latency. In this paper, we propose a method to efficiently determine quantization configurations of different tensors in ML models using post-training mixed precision quantization. We analyze three sensitivity metrics and evaluate them for guiding configuration search of two algorithms. We evaluate our method for computer vision and natural language processing and demonstrate latency reductions of up to 27.59% and 34.31% compared to the baseline 16-bit floating point model while guaranteeing no more than 1% accuracy degradation.  ( 2 min )
    A Reduction-based Framework for Sequential Decision Making with Delayed Feedback. (arXiv:2302.01477v1 [cs.LG])
    We study stochastic delayed feedback in general multi-agent sequential decision making, which includes bandits, single-agent Markov decision processes (MDPs), and Markov games (MGs). We propose a novel reduction-based framework, which turns any multi-batched algorithm for sequential decision making with instantaneous feedback into a sample-efficient algorithm that can handle stochastic delays in sequential decision making. By plugging different multi-batched algorithms into our framework, we provide several examples demonstrating that our framework not only matches or improves existing results for bandits, tabular MDPs, and tabular MGs, but also provides the first line of studies on delays in sequential decision making with function approximation. In summary, we provide a complete set of sharp results for multi-agent sequential decision making with delayed feedback.  ( 2 min )
    Out of Context: Investigating the Bias and Fairness Concerns of "Artificial Intelligence as a Service". (arXiv:2302.01448v1 [cs.LG])
    "AI as a Service" (AIaaS) is a rapidly growing market, offering various plug-and-play AI services and tools. AIaaS enables its customers (users) - who may lack the expertise, data, and/or resources to develop their own systems - to easily build and integrate AI capabilities into their applications. Yet, it is known that AI systems can encapsulate biases and inequalities that can have societal impact. This paper argues that the context-sensitive nature of fairness is often incompatible with AIaaS' 'one-size-fits-all' approach, leading to issues and tensions. Specifically, we review and systematise the AIaaS space by proposing a taxonomy of AI services based on the levels of autonomy afforded to the user. We then critically examine the different categories of AIaaS, outlining how these services can lead to biases or be otherwise harmful in the context of end-user applications. In doing so, we seek to draw research attention to the challenges of this emerging area.  ( 2 min )
    Continual Learning with Scaled Gradient Projection. (arXiv:2302.01386v1 [cs.LG])
    In neural networks, continual learning results in gradient interference among sequential tasks, leading to catastrophic forgetting of old tasks while learning new ones. This issue is addressed in recent methods by storing the important gradient spaces for old tasks and updating the model orthogonally during new tasks. However, such restrictive orthogonal gradient updates hamper the learning capability of the new tasks resulting in sub-optimal performance. To improve new learning while minimizing forgetting, in this paper we propose a Scaled Gradient Projection (SGP) method, where we combine the orthogonal gradient projections with scaled gradient steps along the important gradient spaces for the past tasks. The degree of gradient scaling along these spaces depends on the importance of the bases spanning them. We propose an efficient method for computing and accumulating importance of these bases using the singular value decomposition of the input representations for each task. We conduct extensive experiments ranging from continual image classification to reinforcement learning tasks and report better performance with less training overhead than the state-of-the-art approaches.  ( 2 min )
    Effective Robustness against Natural Distribution Shifts for Models with Different Training Data. (arXiv:2302.01381v1 [cs.LG])
    ``Effective robustness'' measures the extra out-of-distribution (OOD) robustness beyond what can be predicted from the in-distribution (ID) performance. Existing effective robustness evaluations typically use a single test set such as ImageNet to evaluate ID accuracy. This becomes problematic when evaluating models trained on different data distributions, e.g., comparing models trained on ImageNet vs. zero-shot language-image pre-trained models trained on LAION. In this paper, we propose a new effective robustness evaluation metric to compare the effective robustness of models trained on different data distributions. To do this we control for the accuracy on multiple ID test sets that cover the training distributions for all the evaluated models. Our new evaluation metric provides a better estimate of the effectiveness robustness and explains the surprising effective robustness gains of zero-shot CLIP-like models exhibited when considering only one ID dataset, while the gains diminish under our evaluation.  ( 2 min )
    Hyper-parameter Tuning for Fair Classification without Sensitive Attribute Access. (arXiv:2302.01385v1 [cs.LG])
    Fair machine learning methods seek to train models that balance model performance across demographic subgroups defined over sensitive attributes like race and gender. Although sensitive attributes are typically assumed to be known during training, they may not be available in practice due to privacy and other logistical concerns. Recent work has sought to train fair models without sensitive attributes on training data. However, these methods need extensive hyper-parameter tuning to achieve good results, and hence assume that sensitive attributes are known on validation data. However, this assumption too might not be practical. Here, we propose Antigone, a framework to train fair classifiers without access to sensitive attributes on either training or validation data. Instead, we generate pseudo sensitive attributes on the validation data by training a biased classifier and using the classifier's incorrectly (correctly) labeled examples as proxies for minority (majority) groups. Since fairness metrics like demographic parity, equal opportunity and subgroup accuracy can be estimated to within a proportionality constant even with noisy sensitive attribute information, we show theoretically and empirically that these proxy labels can be used to maximize fairness under average accuracy constraints. Key to our results is a principled approach to select the hyper-parameters of the biased classifier in a completely unsupervised fashion (meaning without access to ground truth sensitive attributes) that minimizes the gap between fairness estimated using noisy versus ground-truth sensitive labels.  ( 2 min )
    Neural Insights for Digital Marketing Content Design. (arXiv:2302.01416v1 [cs.LG])
    In digital marketing, experimenting with new website content is one of the key levers to improve customer engagement. However, creating successful marketing content is a manual and time-consuming process that lacks clear guiding principles. This paper seeks to close the loop between content creation and online experimentation by offering marketers AI-driven actionable insights based on historical data to improve their creative process. We present a neural-network-based system that scores and extracts insights from a marketing content design, namely, a multimodal neural network predicts the attractiveness of marketing contents, and a post-hoc attribution method generates actionable insights for marketers to improve their content in specific marketing locations. Our insights not only point out the advantages and drawbacks of a given current content, but also provide design recommendations based on historical data. We show that our scoring model and insights work well both quantitatively and qualitatively.  ( 2 min )
    Provably Bounding Neural Network Preimages. (arXiv:2302.01404v1 [cs.LG])
    Most work on the formal verification of neural networks has focused on bounding forward images of neural networks, i.e., the set of outputs of a neural network that correspond to a given set of inputs (for example, bounded perturbations of a nominal input). However, many use cases of neural network verification require solving the inverse problem, i.e, over-approximating the set of inputs that lead to certain outputs. In this work, we present the first efficient bound propagation algorithm, INVPROP, for verifying properties over the preimage of a linearly constrained output set of a neural network, which can be combined with branch-and-bound to achieve completeness. Our efficient algorithm allows multiple passes of intermediate bound refinements, which are crucial for tight inverse verification because the bounds of an intermediate layer depend on relaxations both before and after this layer. We demonstrate our algorithm on applications related to quantifying safe control regions for a dynamical system and detecting out-of-distribution inputs to a neural network. Our results show that in certain settings, we can find over-approximations that are over 2500 times tighter than prior work while being 2.5 times faster on the same hardware.  ( 2 min )
    A Convolutional-based Model for Early Prediction of Alzheimer's based on the Dementia Stage in the MRI Brain Images. (arXiv:2302.01417v1 [cs.LG])
    Alzheimer's disease is a degenerative brain disease. Being the primary cause of Dementia in adults and progressively destroys brain memory. Though Alzheimer's disease does not have a cure currently, diagnosing it at an earlier stage will help reduce the severity of the disease. Thus, early diagnosis of Alzheimer's could help to reduce or stop the disease from progressing. In this paper, we proposed a deep convolutional neural network-based model for learning model using to determine the stage of Dementia in adults based on the Magnetic Resonance Imaging (MRI) images to detect the early onset of Alzheimer's.  ( 2 min )
    Hyperbolic Contrastive Learning. (arXiv:2302.01409v1 [cs.CV])
    Learning good image representations that are beneficial to downstream tasks is a challenging task in computer vision. As such, a wide variety of self-supervised learning approaches have been proposed. Among them, contrastive learning has shown competitive performance on several benchmark datasets. The embeddings of contrastive learning are arranged on a hypersphere that results in using the inner (dot) product as a distance measurement in Euclidean space. However, the underlying structure of many scientific fields like social networks, brain imaging, and computer graphics data exhibit highly non-Euclidean latent geometry. We propose a novel contrastive learning framework to learn semantic relationships in the hyperbolic space. Hyperbolic space is a continuous version of trees that naturally owns the ability to model hierarchical structures and is thus beneficial for efficient contrastive representation learning. We also extend the proposed Hyperbolic Contrastive Learning (HCL) to the supervised domain and studied the adversarial robustness of HCL. The comprehensive experiments show that our proposed method achieves better results on self-supervised pretraining, supervised classification, and higher robust accuracy than baseline methods.  ( 2 min )
    Dataset Distillation Fixes Dataset Reconstruction Attacks. (arXiv:2302.01428v1 [cs.LG])
    Modern deep learning requires large volumes of data, which could contain sensitive or private information which cannot be leaked. Recent work has shown for homogeneous neural networks a large portion of this training data could be reconstructed with only access to the trained network parameters. While the attack was shown to work empirically, there exists little formal understanding of its effectiveness regime, and ways to defend against it. In this work, we first build a stronger version of the dataset reconstruction attack and show how it can provably recover its entire training set in the infinite width regime. We then empirically study the characteristics of this attack on two-layer networks and reveal that its success heavily depends on deviations from the frozen infinite-width Neural Tangent Kernel limit. More importantly, we formally show for the first time that dataset reconstruction attacks are a variation of dataset distillation. This key theoretical result on the unification of dataset reconstruction and distillation not only sheds more light on the characteristics of the attack but enables us to design defense mechanisms against them via distillation algorithms.  ( 2 min )
    On the Robustness of Randomized Ensembles to Adversarial Perturbations. (arXiv:2302.01375v1 [cs.LG])
    Randomized ensemble classifiers (RECs), where one classifier is randomly selected during inference, have emerged as an attractive alternative to traditional ensembling methods for realizing adversarially robust classifiers with limited compute requirements. However, recent works have shown that existing methods for constructing RECs are more vulnerable than initially claimed, casting major doubts on their efficacy and prompting fundamental questions such as: "When are RECs useful?", "What are their limits?", and "How do we train them?". In this work, we first demystify RECs as we derive fundamental results regarding their theoretical limits, necessary and sufficient conditions for them to be useful, and more. Leveraging this new understanding, we propose a new boosting algorithm (BARRE) for training robust RECs, and empirically demonstrate its effectiveness at defending against strong $\ell_\infty$ norm-bounded adversaries across various network architectures and datasets.  ( 2 min )
    Fast, Differentiable and Sparse Top-k: a Convex Analysis Perspective. (arXiv:2302.01425v1 [cs.LG])
    The top-k operator returns a k-sparse vector, where the non-zero values correspond to the k largest values of the input. Unfortunately, because it is a discontinuous function, it is difficult to incorporate in neural networks trained end-to-end with backpropagation. Recent works have considered differentiable relaxations, based either on regularization or perturbation techniques. However, to date, no approach is fully differentiable and sparse. In this paper, we propose new differentiable and sparse top-k operators. We view the top-k operator as a linear program over the permutahedron, the convex hull of permutations. We then introduce a p-norm regularization term to smooth out the operator, and show that its computation can be reduced to isotonic optimization. Our framework is significantly more general than the existing one and allows for example to express top-k operators that select values in magnitude. On the algorithmic side, in addition to pool adjacent violator (PAV) algorithms, we propose a new GPU/TPU-friendly Dykstra algorithm to solve isotonic optimization problems. We successfully use our operators to prune weights in neural networks, to fine-tune vision transformers, and as a router in sparse mixture of experts.  ( 2 min )
    Accelerating Policy Gradient by Estimating Value Function from Prior Computation in Deep Reinforcement Learning. (arXiv:2302.01399v1 [cs.LG])
    This paper investigates the use of prior computation to estimate the value function to improve sample efficiency in on-policy policy gradient methods in reinforcement learning. Our approach is to estimate the value function from prior computations, such as from the Q-network learned in DQN or the value function trained for different but related environments. In particular, we learn a new value function for the target task while combining it with a value estimate from the prior computation. Finally, the resulting value function is used as a baseline in the policy gradient method. This use of a baseline has the theoretical property of reducing variance in gradient computation and thus improving sample efficiency. The experiments show the successful use of prior value estimates in various settings and improved sample efficiency in several tasks.  ( 2 min )
    Personalized Understanding of Blood Glucose Dynamics via Mobile Sensor Data. (arXiv:2302.01400v1 [cs.HC])
    Continuous Blood Glucose (CGM) monitors have revolutionized the ability of diabetics to manage their blood glucose, and paved the way for artificial pancreas systems. In this paper we augment CGM data with sensor input collected by a smart phone and use it to provide analytical tools for patients and clinicians. We collected GPS data, activity classifications, and blood glucose data with a custom iOS application over a 9 month period from a single free-living type-1 diabetic patient. This data set is novel in terms of it's size, the inclusion of GPS data, and the fact that it was collected non-intrusively from a free-living patient. We describe a method to measure the occurrence of lifestyle \textit{events} based on GPS and activity data, and show that they can capture instances of food consumption and are therefore correlated to changes in blood glucose. Finally, we incorporate these event representations into our system to create useful visualizations and notifications to aid patients in managing their diabetes.  ( 2 min )
    Learning with Exposure Constraints in Recommendation Systems. (arXiv:2302.01377v1 [cs.LG])
    Recommendation systems are dynamic economic systems that balance the needs of multiple stakeholders. A recent line of work studies incentives from the content providers' point of view. Content providers, e.g., vloggers and bloggers, contribute fresh content and rely on user engagement to create revenue and finance their operations. In this work, we propose a contextual multi-armed bandit setting to model the dependency of content providers on exposure. In our model, the system receives a user context in every round and has to select one of the arms. Every arm is a content provider who must receive a minimum number of pulls every fixed time period (e.g., a month) to remain viable in later rounds; otherwise, the arm departs and is no longer available. The system aims to maximize the users' (content consumers) welfare. To that end, it should learn which arms are vital and ensure they remain viable by subsidizing arm pulls if needed. We develop algorithms with sub-linear regret, as well as a lower bound that demonstrates that our algorithms are optimal up to logarithmic factors.  ( 2 min )
    Neural Network Architecture for Database Augmentation Using Shared Features. (arXiv:2302.01374v1 [cs.LG])
    The popularity of learning from data with machine learning and neural networks has lead to the creation of many new datasets for almost every problem domain. However, even within a single domain, these datasets are often collected with disparate features, sampled from different sub-populations, and recorded at different time points. Even with the plethora of individual datasets, large data science projects can be difficult as it is often not trivial to merge these smaller datasets. Inherent challenges in some domains such as medicine also makes it very difficult to create large single source datasets or multi-source datasets with identical features. Instead of trying to merge these non-matching datasets directly, we propose a neural network architecture that can provide data augmentation using features common between these datasets. Our results show that this style of data augmentation can work for both image and tabular data.  ( 2 min )
    Hypothesis Testing and Machine Learning: Interpreting Variable Effects in Deep Artificial Neural Networks using Cohen's f2. (arXiv:2302.01407v1 [stat.ME])
    Deep artificial neural networks show high predictive performance in many fields, but they do not afford statistical inferences and their black-box operations are too complicated for humans to comprehend. Because positing that a relationship exists is often more important than prediction in scientific experiments and research models, machine learning is far less frequently used than inferential statistics. Additionally, statistics calls for improving the test of theory by showing the magnitude of the phenomena being studied. This article extends current XAI methods and develops a model agnostic hypothesis testing framework for machine learning. First, Fisher's variable permutation algorithm is tweaked to compute an effect size measure equivalent to Cohen's f2 for OLS regression models. Second, the Mann-Kendall test of monotonicity and the Theil-Sen estimator is applied to Apley's accumulated local effect plots to specify a variable's direction of influence and statistical significance. The usefulness of this approach is demonstrated on an artificial data set and a social survey with a Python sandbox implementation.  ( 2 min )
    Augmented Learning of Heterogeneous Treatment Effects via Gradient Boosting Trees. (arXiv:2302.01367v1 [stat.ML])
    Heterogeneous treatment effects (HTE) based on patients' genetic or clinical factors are of significant interest to precision medicine. Simultaneously modeling HTE and corresponding main effects for randomized clinical trials with high-dimensional predictive markers is challenging. Motivated by the modified covariates approach, we propose a two-stage statistical learning procedure for estimating HTE with optimal efficiency augmentation, generalizing to arbitrary interaction model and exploiting powerful extreme gradient boosting trees (XGBoost). Target estimands for HTE are defined in the scale of mean difference for quantitative outcomes, or risk ratio for binary outcomes, which are the minimizers of specialized loss functions. The first stage is to estimate the main-effect equivalency of the baseline markers on the outcome, which is then used as an augmentation term in the second stage estimation for HTE. The proposed two-stage procedure is robust to model mis-specification of main effects and improves efficiency for estimating HTE through nonparametric function estimation, e.g., XGBoost. A permutation test is proposed for global assessment of evidence for HTE. An analysis of a genetic study in Prostate Cancer Prevention Trial led by the SWOG Cancer Research Network, is conducted to showcase the properties and the utilities of the two-stage method.  ( 2 min )
  • Open

    Post-Selection Confidence Bounds for Prediction Performance. (arXiv:2210.13206v3 [stat.ML] UPDATED)
    In machine learning, the selection of a promising model from a potentially large number of competing models and the assessment of its generalization performance are critical tasks that need careful consideration. Typically, model selection and evaluation are strictly separated endeavors, splitting the sample at hand into a training, validation, and evaluation set, and only compute a single confidence interval for the prediction performance of the final selected model. We however propose an algorithm how to compute valid lower confidence bounds for multiple models that have been selected based on their prediction performances in the evaluation set by interpreting the selection problem as a simultaneous inference problem. We use bootstrap tilting and a maxT-type multiplicity correction. The approach is universally applicable for any combination of prediction models, any model selection strategy, and any prediction performance measure that accepts weights. We conducted various simulation experiments which show that our proposed approach yields lower confidence bounds that are at least comparably good as bounds from standard approaches, and that reliably reach the nominal coverage probability. In addition, especially when sample size is small, our proposed approach yields better performing prediction models than the default selection of only one model for evaluation does.  ( 3 min )
    Realizable Learning is All You Need. (arXiv:2111.04746v3 [cs.LG] UPDATED)
    The equivalence of realizable and agnostic learnability is a fundamental phenomenon in learning theory. With variants ranging from classical settings like PAC learning and regression to recent trends such as adversarially robust learning, it's surprising that we still lack a unified theory; traditional proofs of the equivalence tend to be disparate, and rely on strong model-specific assumptions like uniform convergence and sample compression. In this work, we give the first model-independent framework explaining the equivalence of realizable and agnostic learnability: a three-line blackbox reduction that simplifies, unifies, and extends our understanding across a wide variety of settings. This includes models with no known characterization of learnability such as learning with arbitrary distributional assumptions and more general loss functions, as well as a host of other popular settings such as robust learning, partial learning, fair learning, and the statistical query model. More generally, we argue that the equivalence of realizable and agnostic learning is actually a special case of a broader phenomenon we call property generalization: any desirable property of a learning algorithm (e.g. noise tolerance, privacy, stability) that can be satisfied over finite hypothesis classes extends (possibly in some variation) to any learnable hypothesis class.  ( 2 min )
    Understanding Gradient Regularization in Deep Learning: Efficient Finite-Difference Computation and Implicit Bias. (arXiv:2210.02720v2 [cs.LG] UPDATED)
    Gradient regularization (GR) is a method that penalizes the gradient norm of the training loss during training. While some studies have reported that GR can improve generalization performance, little attention has been paid to it from the algorithmic perspective, that is, the algorithms of GR that efficiently improve the performance. In this study, we first reveal that a specific finite-difference computation, composed of both gradient ascent and descent steps, reduces the computational cost of GR. Next, we show that the finite-difference computation also works better in the sense of generalization performance. We theoretically analyze a solvable model, a diagonal linear network, and clarify that GR has a desirable implicit bias to so-called rich regime and finite-difference computation strengthens this bias. Furthermore, finite-difference GR is closely related to some other algorithms based on iterative ascent and descent steps for exploring flat minima. In particular, we reveal that the flooding method can perform finite-difference GR in an implicit way. Thus, this work broadens our understanding of GR for both practice and theory.  ( 2 min )
    New Machine Learning Techniques for Simulation-Based Inference: InferoStatic Nets, Kernel Score Estimation, and Kernel Likelihood Ratio Estimation. (arXiv:2210.01680v2 [stat.ML] UPDATED)
    We propose an intuitive, machine-learning approach to multiparameter inference, dubbed the InferoStatic Networks (ISN) method, to model the score and likelihood ratio estimators in cases when the probability density can be sampled but not computed directly. The ISN uses a backend neural network that models a scalar function called the inferostatic potential $\varphi$. In addition, we introduce new strategies, respectively called Kernel Score Estimation (KSE) and Kernel Likelihood Ratio Estimation (KLRE), to learn the score and the likelihood ratio functions from simulated data. We illustrate the new techniques with some toy examples and compare to existing approaches in the literature. We mention en passant some new loss functions that optimally incorporate latent information from simulations into the training procedure.  ( 2 min )
    Beyond Invariance: Test-Time Label-Shift Adaptation for Distributions with "Spurious" Correlations. (arXiv:2211.15646v2 [stat.ML] UPDATED)
    Spurious correlations, or correlations that change across domains where a model can be deployed, present significant challenges to real-world applications of machine learning models. However, such correlations are not always "spurious"; often, they provide valuable prior information for a prediction. Here, we present a test-time adaptation method that exploits the spurious correlation phenomenon, in contrast to recent approaches that attempt to eliminate spurious correlations through invariance. We consider situations where the prior distribution $p(y, z)$, which models the dependence between the class label $y$ and the "nuisance" factors $z$, may change across domains, but the generative model for features $p(\mathbf{x}|y, z)$ is constant. We note that this corresponds to an expanded version of the label shift assumption, where the labels now also include the nuisance factors $z$. Based on this observation, we train a classifier to predict $p(y, z|\mathbf{x})$ on the source distribution, and propose a test-time label shift correction that adapts to changes in the marginal distribution $p(y, z)$ using unlabeled samples from the target domain. We evaluate our method, which we call "Test-Time Label-Shift Adaptation" (TTLSA), on two different image datasets -- the CheXpert chest X-ray dataset and the Colored MNIST dataset -- and show a significant improvement over baseline methods. Code reproducing experiments is available at https://github.com/nalzok/test-time-label-shift .  ( 2 min )
    Statistical treatment of convolutional neural network super-resolution of inland surface wind for subgrid-scale variability quantification. (arXiv:2211.16708v2 [physics.ao-ph] UPDATED)
    Machine learning models have been employed to perform either physics-free data-driven or hybrid dynamical downscaling of climate data. Most of these implementations operate over relatively small downscaling factors because of the challenge of recovering fine-scale information from coarse data. This limits their compatibility with many global climate model outputs, often available between $\sim$50--100 km resolution, to scales of interest such as cloud resolving or urban scales. This study systematically examines the capability of convolutional neural networks (CNNs) to downscale surface wind speed data over land surface from different coarse resolutions (25 km, 48 km, and 100 km resolution) to 3 km. For each downscaling factor, we consider three CNN configurations that generate super-resolved predictions of fine-scale wind speed, which take between 1 to 3 input fields: coarse wind speed, fine-scale topography, and diurnal cycle. In addition to fine-scale wind speeds, probability density function parameters are generated, through which sample wind speeds can be generated accounting for the intrinsic stochasticity of wind speed. For generalizability assessment, CNN models are tested on regions with different topography and climate that are unseen during training. The evaluation of super-resolved predictions focuses on subgrid-scale variability and the recovery of extremes. Models with coarse wind and fine topography as inputs exhibit the best performance compared with other model configurations, operating across the same downscaling factor. Our diurnal cycle encoding results in lower out-of-sample generalizability compared with other input configurations.  ( 2 min )
    Consistent Range Approximation for Fair Predictive Modeling. (arXiv:2212.10839v2 [cs.LG] UPDATED)
    This paper proposes a novel framework for certifying the fairness of predictive models trained on biased data. It draws from query answering for incomplete and inconsistent databases to formulate the problem of consistent range approximation (CRA) of fairness queries for a predictive model on a target population. The framework employs background knowledge of the data collection process and biased data, working with or without limited statistics about the target population, to compute a range of answers for fairness queries. Using CRA, the framework builds predictive models that are certifiably fair on the target population, regardless of the availability of external data during training. The framework's efficacy is demonstrated through evaluations on real data, showing substantial improvement over existing state-of-the-art methods.  ( 2 min )
    FiT: Parameter Efficient Few-shot Transfer Learning for Personalized and Federated Image Classification. (arXiv:2206.08671v2 [stat.ML] UPDATED)
    Modern deep learning systems are increasingly deployed in situations such as personalization and federated learning where it is necessary to support i) learning on small amounts of data, and ii) communication efficient distributed training protocols. In this work, we develop FiLM Transfer (FiT) which fulfills these requirements in the image classification setting by combining ideas from transfer learning (fixed pretrained backbones and fine-tuned FiLM adapter layers) and meta-learning (automatically configured Naive Bayes classifiers and episodic training) to yield parameter efficient models with superior classification accuracy at low-shot. The resulting parameter efficiency is key for enabling few-shot learning, inexpensive model updates for personalization, and communication efficient federated learning. We experiment with FiT on a wide range of downstream datasets and show that it achieves better classification accuracy than the leading Big Transfer (BiT) algorithm at low-shot and achieves state-of-the art accuracy on the challenging VTAB-1k benchmark, with fewer than 1% of the updateable parameters. Finally, we demonstrate the parameter efficiency and superior accuracy of FiT in distributed low-shot applications including model personalization and federated learning where model update size is an important performance metric.  ( 2 min )
    Distributionally Robust Causal Inference with Observational Data. (arXiv:2210.08326v3 [stat.ME] UPDATED)
    We consider the estimation of average treatment effects in observational studies and propose a new framework of robust causal inference with unobserved confounders. Our approach is based on distributionally robust optimization and proceeds in two steps. We first specify the maximal degree to which the distribution of unobserved potential outcomes may deviate from that of observed outcomes. We then derive sharp bounds on the average treatment effects under this assumption. Our framework encompasses the popular marginal sensitivity model as a special case, and we demonstrate how the proposed methodology can address a primary challenge of the marginal sensitivity model that it produces uninformative results when unobserved confounders substantially affect treatment and outcome. Specifically, we develop an alternative sensitivity model, called the distributional sensitivity model, under the assumption that heterogeneity of treatment effect due to unobserved variables is relatively small. Unlike the marginal sensitivity model, the distributional sensitivity model allows for potential lack of overlap and often produces informative bounds even when unobserved variables substantially affect both treatment and outcome. Finally, we show how to extend the distributional sensitivity model to difference-in-differences designs and settings with instrumental variables. Through simulation and empirical studies, we demonstrate the applicability of the proposed methodology.  ( 2 min )
    Learning Counterfactually Invariant Predictors. (arXiv:2207.09768v2 [cs.LG] UPDATED)
    Counterfactual invariance has proven an essential property for predictors that are fair, robust, and generalizable in the real world. We propose a general definition of counterfactual invariance and provide simple graphical criteria that yield a sufficient condition for a predictor to be counterfactually invariant in terms of (conditional independence in) the observational distribution. Any predictor that satisfies our criterion is provably counterfactually invariant. In order to learn such predictors, we propose a model-agnostic framework, called Counterfactual Invariance Prediction (CIP), based on a kernel-based conditional dependence measure called Hilbert-Schmidt Conditional Independence Criterion (HSCIC). Our experimental results demonstrate the effectiveness of CIP in enforcing counterfactual invariance across various types of data including tabular, high-dimensional, and real-world dataset.  ( 2 min )
    DEUP: Direct Epistemic Uncertainty Prediction. (arXiv:2102.08501v4 [cs.LG] UPDATED)
    Epistemic Uncertainty is a measure of the lack of knowledge of a learner which diminishes with more evidence. While existing work focuses on using the variance of the Bayesian posterior due to parameter uncertainty as a measure of epistemic uncertainty, we argue that this does not capture the part of lack of knowledge induced by model misspecification. We discuss how the excess risk, which is the gap between the generalization error of a predictor and the Bayes predictor, is a sound measure of epistemic uncertainty which captures the effect of model misspecification. We thus propose a principled framework for directly estimating the excess risk by learning a secondary predictor for the generalization error and subtracting an estimate of aleatoric uncertainty, i.e., intrinsic unpredictability. We discuss the merits of this novel measure of epistemic uncertainty, and highlight how it differs from variance-based measures of epistemic uncertainty and addresses its major pitfall. Our framework, Direct Epistemic Uncertainty Prediction (DEUP) is particularly interesting in interactive learning environments, where the learner is allowed to acquire novel examples in each round. Through a wide set of experiments, we illustrate how existing methods in sequential model optimization can be improved with epistemic uncertainty estimates from DEUP, and how DEUP can be used to drive exploration in reinforcement learning. We also evaluate the quality of uncertainty estimates from DEUP for probabilistic image classification and predicting synergies of drug combinations.  ( 2 min )
    Data Representativity for Machine Learning and AI Systems. (arXiv:2203.04706v2 [stat.ML] UPDATED)
    Data representativity is crucial when drawing inference from data through machine learning models. Scholars have increased focus on unraveling the bias and fairness in models, also in relation to inherent biases in the input data. However, limited work exists on the representativity of samples (datasets) for appropriate inference in AI systems. This paper reviews definitions and notions of a representative sample and surveys their use in scientific AI literature. We introduce three measurable concepts to help focus the notions and evaluate different data samples. Furthermore, we demonstrate that the contrast between a representative sample in the sense of coverage of the input space, versus a representative sample mimicking the distribution of the target population is of particular relevance when building AI systems. Through empirical demonstrations on US Census data, we evaluate the opposing inherent qualities of these concepts. Finally, we propose a framework of questions for creating and documenting data with data representativity in mind, as an addition to existing dataset documentation templates.  ( 2 min )
    From Robustness to Privacy and Back. (arXiv:2302.01855v1 [cs.LG])
    We study the relationship between two desiderata of algorithms in statistical inference and machine learning: differential privacy and robustness to adversarial data corruptions. Their conceptual similarity was first observed by Dwork and Lei (STOC 2009), who observed that private algorithms satisfy robustness, and gave a general method for converting robust algorithms to private ones. However, all general methods for transforming robust algorithms into private ones lead to suboptimal error rates. Our work gives the first black-box transformation that converts any adversarially robust algorithm into one that satisfies pure differential privacy. Moreover, we show that for any low-dimensional estimation task, applying our transformation to an optimal robust estimator results in an optimal private estimator. Thus, we conclude that for any low-dimensional task, the optimal error rate for $\varepsilon$-differentially private estimators is essentially the same as the optimal error rate for estimators that are robust to adversarially corrupting $1/\varepsilon$ training samples. We apply our transformation to obtain new optimal private estimators for several high-dimensional tasks, including Gaussian (sparse) linear regression and PCA. Finally, we present an extension of our transformation that leads to approximate differentially private algorithms whose error does not depend on the range of the output space, which is impossible under pure differential privacy.  ( 2 min )
    Sample Complexity of Probability Divergences under Group Symmetry. (arXiv:2302.01915v1 [math.ST])
    We rigorously quantify the improvement in the sample complexity of variational divergence estimations for group-invariant distributions. In the cases of the Wasserstein-1 metric and the Lipschitz-regularized $\alpha$-divergences, the reduction of sample complexity is proportional to an ambient-dimension-dependent power of the group size. For the maximum mean discrepancy (MMD), the improvement of sample complexity is more nuanced, as it depends on not only the group size but also the choice of kernel. Numerical simulations verify our theories.  ( 2 min )
    Certified Robustness of Learning-based Static Malware Detectors. (arXiv:2302.01757v1 [cs.CR])
    Certified defenses are a recent development in adversarial machine learning (ML), which aim to rigorously guarantee the robustness of ML models to adversarial perturbations. A large body of work studies certified defenses in computer vision, where $\ell_p$ norm-bounded evasion attacks are adopted as a tractable threat model. However, this threat model has known limitations in vision, and is not applicable to other domains -- e.g., where inputs may be discrete or subject to complex constraints. Motivated by this gap, we study certified defenses for malware detection, a domain where attacks against ML-based systems are a real and current threat. We consider static malware detection systems that operate on byte-level data. Our certified defense is based on the approach of randomized smoothing which we adapt by: (1) replacing the standard Gaussian randomization scheme with a novel deletion randomization scheme that operates on bytes or chunks of an executable; and (2) deriving a certificate that measures robustness to evasion attacks in terms of generalized edit distance. To assess the size of robustness certificates that are achievable while maintaining high accuracy, we conduct experiments on malware datasets using a popular convolutional malware detection model, MalConv. We are able to accurately classify 91% of the inputs while being certifiably robust to any adversarial perturbations of edit distance 128 bytes or less. By comparison, an existing certification of up to 128 bytes of substitutions (without insertions or deletions) achieves an accuracy of 78%. In addition, given that robustness certificates are conservative, we evaluate practical robustness to several recently published evasion attacks and, in some cases, find robustness beyond certified guarantees.  ( 2 min )
    Using Explainability to Inform Statistical Downscaling Based on Deep Learning Beyond Standard Validation Approaches. (arXiv:2302.01771v1 [stat.ML])
    Deep learning (DL) has emerged as a promising tool to downscale climate projections at regional-to-local scales from large-scale atmospheric fields following the perfect-prognosis (PP) approach. Given their complexity, it is crucial to properly evaluate these methods, especially when applied to changing climatic conditions where the ability to extrapolate/generalise is key. In this work, we intercompare several DL models extracted from the literature for the same challenging use-case (downscaling temperature in the CORDEX North America domain) and expand standard evaluation methods building on eXplainable artifical intelligence (XAI) techniques. We show how these techniques can be used to unravel the internal behaviour of these models, providing new evaluation dimensions and aiding in their diagnostic and design. These results show the usefulness of incorporating XAI techniques into statistical downscaling evaluation frameworks, especially when working with large regions and/or under climate change conditions.  ( 2 min )
    Leveraging a Probabilistic PCA Model to Understand the Multivariate Statistical Network Monitoring Framework for Network Security Anomaly Detection. (arXiv:2302.01759v1 [stat.ML])
    Network anomaly detection is a very relevant research area nowadays, especially due to its multiple applications in the field of network security. The boost of new models based on variational autoencoders and generative adversarial networks has motivated a reevaluation of traditional techniques for anomaly detection. It is, however, essential to be able to understand these new models from the perspective of the experience attained from years of evaluating network security data for anomaly detection. In this paper, we revisit anomaly detection techniques based on PCA from a probabilistic generative model point of view, and contribute a mathematical model that relates them. Specifically, we start with the probabilistic PCA model and explain its connection to the Multivariate Statistical Network Monitoring (MSNM) framework. MSNM was recently successfully proposed as a means of incorporating industrial process anomaly detection experience into the field of networking. We have evaluated the mathematical model using two different datasets. The first, a synthetic dataset created to better understand the analysis proposed, and the second, UGR'16, is a specifically designed real-traffic dataset for network security anomaly detection. We have drawn conclusions that we consider to be useful when applying generative models to network security detection.  ( 2 min )
    ResMem: Learn what you can and memorize the rest. (arXiv:2302.01576v1 [cs.LG])
    The impressive generalization performance of modern neural networks is attributed in part to their ability to implicitly memorize complex training patterns. Inspired by this, we explore a novel mechanism to improve model generalization via explicit memorization. Specifically, we propose the residual-memorization (ResMem) algorithm, a new method that augments an existing prediction model (e.g. a neural network) by fitting the model's residuals with a $k$-nearest neighbor based regressor. The final prediction is then the sum of the original model and the fitted residual regressor. By construction, ResMem can explicitly memorize the training labels. Empirically, we show that ResMem consistently improves the test set generalization of the original prediction model across various standard vision and natural language processing benchmarks. Theoretically, we formulate a stylized linear regression problem and rigorously show that ResMem results in a more favorable test risk over the base predictor.  ( 2 min )
    Optimality of Thompson Sampling with Noninformative Priors for Pareto Bandits. (arXiv:2302.01544v1 [cs.LG])
    In the stochastic multi-armed bandit problem, a randomized probability matching policy called Thompson sampling (TS) has shown excellent performance in various reward models. In addition to the empirical performance, TS has been shown to achieve asymptotic problem-dependent lower bounds in several models. However, its optimality has been mainly addressed under light-tailed or one-parameter models that belong to exponential families. In this paper, we consider the optimality of TS for the Pareto model that has a heavy tail and is parameterized by two unknown parameters. Specifically, we discuss the optimality of TS with probability matching priors that include the Jeffreys prior and the reference priors. We first prove that TS with certain probability matching priors can achieve the optimal regret bound. Then, we show the suboptimality of TS with other priors, including the Jeffreys and the reference priors. Nevertheless, we find that TS with the Jeffreys and reference priors can achieve the asymptotic lower bound if one uses a truncation procedure. These results suggest carefully choosing noninformative priors to avoid suboptimality and show the effectiveness of truncation procedures in TS-based policies.  ( 2 min )
    Beyond the Universal Law of Robustness: Sharper Laws for Random Features and Neural Tangent Kernels. (arXiv:2302.01629v1 [stat.ML])
    Machine learning models are vulnerable to adversarial perturbations, and a thought-provoking paper by Bubeck and Sellke has analyzed this phenomenon through the lens of over-parameterization: interpolating smoothly the data requires significantly more parameters than simply memorizing it. However, this "universal" law provides only a necessary condition for robustness, and it is unable to discriminate between models. In this paper, we address these gaps by focusing on empirical risk minimization in two prototypical settings, namely, random features and the neural tangent kernel (NTK). We prove that, for random features, the model is not robust for any degree of over-parameterization, even when the necessary condition coming from the universal law of robustness is satisfied. In contrast, for even activations, the NTK model meets the universal lower bound, and it is robust as soon as the necessary condition on over-parameterization is fulfilled. This also addresses a conjecture in prior work by Bubeck, Li and Nagaraj. Our analysis decouples the effect of the kernel of the model from an "interaction matrix", which describes the interaction with the test data and captures the effect of the activation. Our theoretical results are corroborated by numerical evidence on both synthetic and standard datasets (MNIST, CIFAR-10).  ( 2 min )
    Using natural language processing and structured medical data to phenotype patients hospitalized due to COVID-19. (arXiv:2302.01536v1 [cs.CL])
    To identify patients who are hospitalized because of COVID-19 as opposed to those who were admitted for other indications, we compared the performance of different computable phenotype definitions for COVID-19 hospitalizations that use different types of data from the electronic health records (EHR), including structured EHR data elements, provider notes, or a combination of both data types. And conduct a retrospective data analysis utilizing chart review-based validation. Participants are 586 hospitalized individuals who tested positive for SARS-CoV-2 during January 2022. We used natural language processing to incorporate data from provider notes and LASSO regression and Random Forests to fit classification algorithms that incorporated structured EHR data elements, provider notes, or a combination of structured data and provider notes. Results: Based on a chart review, 38% of 586 patients were determined to be hospitalized for reasons other than COVID-19 despite having tested positive for SARS-CoV-2. A classification algorithm that used provider notes had significantly better discrimination than one that used structured EHR data elements (AUROC: 0.894 vs 0.841, p < 0.001), and performed similarly to a model that combined provider notes with structured data elements (AUROC: 0.894 vs 0.893). Assessments of hospital outcome metrics significantly differed based on whether the population included all hospitalized patients who tested positive for SARS-CoV-2 versus those who were determined to have been hospitalized due to COVID-19. This work demonstrates the utility of natural language processing approaches to derive information related to patient hospitalizations in cases where there may be multiple conditions that could serve as the primary indication for hospitalization.  ( 3 min )
    Where and How to Improve Graph-based Spatio-temporal Predictors. (arXiv:2302.01701v1 [stat.ML])
    This paper introduces a novel residual correlation analysis, called AZ-analysis, to assess the optimality of spatio-temporal predictive models. The proposed AZ-analysis constitutes a valuable asset for discovering and highlighting those space-time regions where the model can be improved with respect to performance. The AZ-analysis operates under very mild assumptions and is based on a spatio-temporal graph that encodes serial and functional dependencies in the data; asymptotically distribution-free summary statistics identify existing residual correlation in space and time regions, hence localizing time frames and/or communities of sensors, where the predictor can be improved.  ( 2 min )
    A Lipschitz Bandits Approach for Continuous Hyperparameter Optimization. (arXiv:2302.01539v1 [cs.LG])
    One of the most critical problems in machine learning is HyperParameter Optimization (HPO), since choice of hyperparameters has a significant impact on final model performance. Although there are many HPO algorithms, they either have no theoretical guarantees or require strong assumptions. To this end, we introduce BLiE -- a Lipschitz-bandit-based algorithm for HPO that only assumes Lipschitz continuity of the objective function. BLiE exploits the landscape of the objective function to adaptively search over the hyperparameter space. Theoretically, we show that $(i)$ BLiE finds an $\epsilon$-optimal hyperparameter with $O \left( \frac{1}{\epsilon} \right)^{d_z + \beta}$ total budgets, where $d_z$ and $\beta$ are problem intrinsic; $(ii)$ BLiE is highly parallelizable. Empirically, we demonstrate that BLiE outperforms the state-of-the-art HPO algorithms on benchmark tasks. We also apply BLiE to search for noise schedule of diffusion models. Comparison with the default schedule shows that BLiE schedule greatly improves the sampling speed.  ( 2 min )
    Support Recovery in Sparse PCA with Non-Random Missing Data. (arXiv:2302.01535v1 [stat.ML])
    We analyze a practical algorithm for sparse PCA on incomplete and noisy data under a general non-random sampling scheme. The algorithm is based on a semidefinite relaxation of the $\ell_1$-regularized PCA problem. We provide theoretical justification that under certain conditions, we can recover the support of the sparse leading eigenvector with high probability by obtaining a unique solution. The conditions involve the spectral gap between the largest and second-largest eigenvalues of the true data matrix, the magnitude of the noise, and the structural properties of the observed entries. The concepts of algebraic connectivity and irregularity are used to describe the structural properties of the observed entries. We empirically justify our theorem with synthetic and real data analysis. We also show that our algorithm outperforms several other sparse PCA approaches especially when the observed entries have good structural properties. As a by-product of our analysis, we provide two theorems to handle a deterministic sampling scheme, which can be applied to other matrix-related problems.  ( 2 min )
    Failure-informed adaptive sampling for PINNs, Part II: combining with re-sampling and subset simulation. (arXiv:2302.01529v1 [math.NA])
    This is the second part of our series works on failure-informed adaptive sampling for physic-informed neural networks (FI-PINNs). In our previous work \cite{gao2022failure}, we have presented an adaptive sampling framework by using the failure probability as the posterior error indicator, where the truncated Gaussian model has been adopted for estimating the indicator. In this work, we present two novel extensions to FI-PINNs. The first extension consist in combining with a re-sampling technique, so that the new algorithm can maintain a constant training size. This is achieved through a cosine-annealing, which gradually transforms the sampling of collocation points from uniform to adaptive via training progress. The second extension is to present the subset simulation algorithm as the posterior model (instead of the truncated Gaussian model) for estimating the error indicator, which can more effectively estimate the failure probability and generate new effective training points in the failure region. We investigate the performance of the new approach using several challenging problems, and numerical experiments demonstrate a significant improvement over the original algorithm.  ( 2 min )
    Dataset Distillation Fixes Dataset Reconstruction Attacks. (arXiv:2302.01428v1 [cs.LG])
    Modern deep learning requires large volumes of data, which could contain sensitive or private information which cannot be leaked. Recent work has shown for homogeneous neural networks a large portion of this training data could be reconstructed with only access to the trained network parameters. While the attack was shown to work empirically, there exists little formal understanding of its effectiveness regime, and ways to defend against it. In this work, we first build a stronger version of the dataset reconstruction attack and show how it can provably recover its entire training set in the infinite width regime. We then empirically study the characteristics of this attack on two-layer networks and reveal that its success heavily depends on deviations from the frozen infinite-width Neural Tangent Kernel limit. More importantly, we formally show for the first time that dataset reconstruction attacks are a variation of dataset distillation. This key theoretical result on the unification of dataset reconstruction and distillation not only sheds more light on the characteristics of the attack but enables us to design defense mechanisms against them via distillation algorithms.  ( 2 min )
    Fast, Differentiable and Sparse Top-k: a Convex Analysis Perspective. (arXiv:2302.01425v1 [cs.LG])
    The top-k operator returns a k-sparse vector, where the non-zero values correspond to the k largest values of the input. Unfortunately, because it is a discontinuous function, it is difficult to incorporate in neural networks trained end-to-end with backpropagation. Recent works have considered differentiable relaxations, based either on regularization or perturbation techniques. However, to date, no approach is fully differentiable and sparse. In this paper, we propose new differentiable and sparse top-k operators. We view the top-k operator as a linear program over the permutahedron, the convex hull of permutations. We then introduce a p-norm regularization term to smooth out the operator, and show that its computation can be reduced to isotonic optimization. Our framework is significantly more general than the existing one and allows for example to express top-k operators that select values in magnitude. On the algorithmic side, in addition to pool adjacent violator (PAV) algorithms, we propose a new GPU/TPU-friendly Dykstra algorithm to solve isotonic optimization problems. We successfully use our operators to prune weights in neural networks, to fine-tune vision transformers, and as a router in sparse mixture of experts.  ( 2 min )
    Augmented Learning of Heterogeneous Treatment Effects via Gradient Boosting Trees. (arXiv:2302.01367v1 [stat.ML])
    Heterogeneous treatment effects (HTE) based on patients' genetic or clinical factors are of significant interest to precision medicine. Simultaneously modeling HTE and corresponding main effects for randomized clinical trials with high-dimensional predictive markers is challenging. Motivated by the modified covariates approach, we propose a two-stage statistical learning procedure for estimating HTE with optimal efficiency augmentation, generalizing to arbitrary interaction model and exploiting powerful extreme gradient boosting trees (XGBoost). Target estimands for HTE are defined in the scale of mean difference for quantitative outcomes, or risk ratio for binary outcomes, which are the minimizers of specialized loss functions. The first stage is to estimate the main-effect equivalency of the baseline markers on the outcome, which is then used as an augmentation term in the second stage estimation for HTE. The proposed two-stage procedure is robust to model mis-specification of main effects and improves efficiency for estimating HTE through nonparametric function estimation, e.g., XGBoost. A permutation test is proposed for global assessment of evidence for HTE. An analysis of a genetic study in Prostate Cancer Prevention Trial led by the SWOG Cancer Research Network, is conducted to showcase the properties and the utilities of the two-stage method.  ( 2 min )
    Hypothesis Testing and Machine Learning: Interpreting Variable Effects in Deep Artificial Neural Networks using Cohen's f2. (arXiv:2302.01407v1 [stat.ME])
    Deep artificial neural networks show high predictive performance in many fields, but they do not afford statistical inferences and their black-box operations are too complicated for humans to comprehend. Because positing that a relationship exists is often more important than prediction in scientific experiments and research models, machine learning is far less frequently used than inferential statistics. Additionally, statistics calls for improving the test of theory by showing the magnitude of the phenomena being studied. This article extends current XAI methods and develops a model agnostic hypothesis testing framework for machine learning. First, Fisher's variable permutation algorithm is tweaked to compute an effect size measure equivalent to Cohen's f2 for OLS regression models. Second, the Mann-Kendall test of monotonicity and the Theil-Sen estimator is applied to Apley's accumulated local effect plots to specify a variable's direction of influence and statistical significance. The usefulness of this approach is demonstrated on an artificial data set and a social survey with a Python sandbox implementation.  ( 2 min )

  • Open

    12 highlights from Google's BARD announcement
    I went through the entire blog post from Google and pulled out some quotes and highlights: ​ 1) “we re-oriented the company around AI six years ago” Right off the bat, “Pich-AI” lets it be known that Google is now an AI company. Partially true? Yes, of course. Would that phrase be coming out of his mouth at this point if not for the release and success of ChatGPT? No. 2) their mission: “organize the world’s information and make it universally accessible and useful” There’s a book called The Innovator’s Dilemma: When New Technologies Cause Great Firms to Fail. I'm certainly not here to say that Google is going to fail, but the re-stating of the mission makes it clear that they view AI (and Bard) as a way to improve, supplement, and perhaps protect their search business. This is …  ( 45 min )
    Weekly China AI News: Baidu's Language Model Behind Rumored ChatGPT Search; Tencent-Backed Robot Startup Files for Hong Kong IPO; Xpeng Targets "Full Autonomy" in 2023
    submitted by /u/trcytony [link] [comments]  ( 40 min )
    The Blair Witch Project come to life with Midjourney as an 80's Horror Film
    submitted by /u/barrese87 [link] [comments]  ( 40 min )
    Harry Potter come to life with Midjourney as an 80s Love Film
    submitted by /u/barrese87 [link] [comments]  ( 40 min )
    [Project] I used a new ML algo called "AnimeSR" to restore the Cowboy Bebop movie and up rez it to full 4K. Here's a link to the end result - honestly think it looks amazing! (Video and Model link in post)
    submitted by /u/VR_Angel [link] [comments]  ( 41 min )
    Over the weekend I used a new ML algo called "AnimeSR" to restore the Cowboy Bebop movie and up rez it to full 4K. Here's a link to the end result - honestly think it looks amazing! Enjoy!
    submitted by /u/VR_Angel [link] [comments]  ( 40 min )
    Built a Telegram AI tutor bot + updates
    Hey! Since first posting here we've got 800+ users taking almost 1,000 courses! In short, this is how the bot works: 1️⃣ Send the captain a topic - usually, one or a few words is enough 2️⃣ Get a mini-course divided into 5 chapters 3️⃣ Receive your content packed into a beautiful magazine-style sharable link Would love to know what you think! http://edwardbot.com/ submitted by /u/Itaydr [link] [comments]  ( 41 min )
    AI Dream 144 - DARK CLOUD AI FUSION - MINDBLOW MONDAY
    submitted by /u/LordPewPew777 [link] [comments]  ( 40 min )
    Google launches ChatGPT competitor "Bard" and more
    submitted by /u/Peaking_AI [link] [comments]  ( 40 min )
    Change My Mind: You either believe consciousness is biomechanistic and therefore replicable, or you believe consciousness comes from some supernatural source and therefore not replicable. If it's the former, then you should believe AI is/can be sentient.
    I'll just try and add some detail in case the title doesn't make sense for some. The two major options I see: You believe consciousness arises from a very complicated system of biology > cells > molecules > atoms > elements. Ultimately a mechanistic view that, metaphorically, life is complicated machinery. If not the former, then it seems the only other major option is that life is somehow endowed some supernatural or divine force that humans could never synthetically replicate. This is of course generally considered a spiritual/religious belief. So if you believe in item 1, doesn't this mean you pretty much have to believe that AI is or can have consciousness? submitted by /u/sidianmsjones [link] [comments]  ( 44 min )
    OpenAI Have Started a Search Engine Revolution.
    submitted by /u/shauryadevil [link] [comments]  ( 40 min )
    AI Seinfeld show banned on Twitch for transphobic comments
    submitted by /u/Zirius_Sadfaces [link] [comments]  ( 40 min )
    Gen-1: The Next Step Forward for Generative AI - Use words and images to generate new videos out of existing ones
    submitted by /u/magenta_placenta [link] [comments]  ( 40 min )
    Spotify's Founder Has Developed An AI-Powered Body Health Scanner
    submitted by /u/Flaky_Preparation_50 [link] [comments]  ( 40 min )
    runway announces GEN-1 — video to video generative AI
    submitted by /u/AR_MR_XR [link] [comments]  ( 40 min )
    Integrate OpenAI with .NET Core and Angular15
    submitted by /u/TheDotnetoffice [link] [comments]  ( 40 min )
    Google invests 500 million dollars in the next rival of ChatGPT
    submitted by /u/nikesh96 [link] [comments]  ( 41 min )
    Seinfeld AI makes transgender joke and gets banned on twitch
    AI Seinfeld Transphobic rant - YouTube submitted by /u/Status_Signal_4083 [link] [comments]  ( 42 min )
    Hi-ResNet: High resolution image classifier. (448, 896, 1792 sq.px.)
    submitted by /u/johnGettings [link] [comments]  ( 41 min )
    Spotify’s Founder Has Developed An AI-powered Body Health Scanner
    submitted by /u/vadhavaniyafaijan [link] [comments]  ( 41 min )
    ChatFAI: Chat with your favorite characters (updates and a challenge)
    ​ Characters as shown on https://chatfai.com/characters/ Hi everyone! I have recently made some exciting changes to my ChatFAI web app. The public characters library is now live - it's now easy to share and install public characters. Added a regenerate reply option. Created a new plan without any daily limit. I have gotten a lot of help and support from this community. The feedback and support from you all are really helpful and that is how I am improving ChatFAI (based on the feedback and suggestions). So, here I am again. What do you think about the latest updates? Is it going in the right direction? Another challenge I have not resolved yet is finding B2B use cases for ChatFAI. Thank you for your help and support - it's greatly appreciated! submitted by /u/usamaejazch [link] [comments]  ( 42 min )
    Streamline Ticket Triage and Reduce Customer Churn with AI
    submitted by /u/DarronFeldstein [link] [comments]  ( 40 min )
    I Made a Text Bot Powered by ChatGPT, DALLE 2, and Wolfram Alpha
    submitted by /u/ImplodingCoding [link] [comments]  ( 44 min )
  • Open

    MC vs TD(0) on windy gridworld
    Hi, I'm relatively new to the field of RL but with strong experience in DL, I'm currently studying the Sutton & Barto book, and more specifically TD(0) method. At some point the discuss a "windy" gridworld example, in that we have the normal gridworld setup, but there is an upward wind on some tiles that may move the agent upwards as many squares as indicated in the bottom of each column. Here is a schematic of the setup: Windy Gridworld example from Sutton Barto book If you perform an action that would lead to the agent falling of the grid, the agent remains at the state it was. E.G. if the agent is in the top-left square and performs the action `up`, they will remain at the top-left square. They essentially use this to present SARSA, but then they mention the following for MC methods: Note that Monte Carlo methods cannot easily be used here because termination is not guaranteed for all policies. If a policy was ever found that caused the agent to stay in the same state, then the next episode would never end. This strikes me quite as odd. Does this mean that I cannot (or that it's not safe to) use MC methods in any problem where it's possible to do a transition from a state that will result in the agent remaining in the same state? The first thing that comes to mind is the normal gridworld example (i.e. w/out any wind), does this restriction that they mention here mean that MC is not safe to use? submitted by /u/dep0 [link] [comments]  ( 42 min )
    [Discussion] League of Legends Reinforcement Learning Library - Interest
    Hello everyone. I am considering making a reinforcement learning library for the most recent version of League of Legends based on discussions on an existing library here. Would there be any interest in an RL library for League of Legends? The interface would work like the following: ```python import tlol.gym as leaguegym env = leaguegym.make() while True: obs = env.reset() done = False while not done: #Here we sample a random action. If you have an agent, you would get an action from it here. action = env.action_space.sample() next_obs, reward, done, gameinfo = env.step(action) obs = next_obs ``` If you look at my previous posts, I have already created an RL environment for League of Legends v4.20 where other people have also taken the project and successfully trained agents which learn adversarially against each other here. I've also released many gameplay datasets for League of Legends during Season 12 here also for supervised learning and RL. At the moment tlol-py contains an interface for ML models to play League of Legends but I'm considering creating a purpose built library for RL for League of Legends. Would there be interest in a project like this? submitted by /u/Ok-Alps-7918 [link] [comments]  ( 42 min )
    Why the sim2real problem in robotic manipulation?
    Hi all, assuming the task is opening the door with a robot, as far as I understand the sim2real problem happens as the robot behaves differently in the real world as the physics in the simulator (where the agent is trained) are not 100% identical in the real world. From my understanding the sim2real problem occurs if we let the agent also handle this controller part. But why cant we just extract the trajectory of the manipulator that the agent generates to open the door and executes it with the controller from the real world? Am I missing something here? submitted by /u/Fun-Moose-3841 [link] [comments]  ( 43 min )
    Question on return values of the .step() method in a multi-agent environment
    Is it possible that, in a multi-agent environment, the return values for reward, terminated, and truncated of a .step(actions) method call have different values for each agent? If so, is there an example environment? submitted by /u/Toni-SM [link] [comments]  ( 41 min )
    I have implemented an RL agent for trading EUR/USD and I don't know what to do next...
    So, after months of learning about RL and doing toy implentations, I have coded a DQN, with experience buffer and dual nets. The network design is like the most average thing you can come across in ML scene. A simple deep feed forward with Relu and Linear as activation functions. I have also coded a simplified version of the Forex market for my agent to train in. It has bid ask prices, leverage, call margin, and buy/sell/not-in-the-market positions. The whole given state to the model is nothing fancy. It is merely the historical, model's balance and a few binary indictors about the environment. Since I'm cripplingly poor, I don't have any specialized hardware for training the model. After burning like 100 hours into the free version of Google collab with three different learning rates I…  ( 46 min )
    Why is my PPO algorithm not learning a simple environment?
    I have made a Stack Overflow post here. I will highly appreciate all your help on this. Thank you! submitted by /u/Academic-Rent7800 [link] [comments]  ( 42 min )
    Does it make sense to use RL for trading?
    I have seen some blog posts and papers about using RL for financial trading. I have to be hones, I didn't read that stuff in details. However the main idea seems kind of clear: you can model the market as a MDP where the state space encodes the relevant features of the market and your current portfolio, the possible actions are what/how much to sell/buy and the reward function should express the value change of your portfolio. However I am a bit puzzled. Clearly your actions as a trader do not really affect the market that much meaning that the transition function (probability distribution for the next state) does not depend on your action (excluding the changes in your portfolio). Why would RL provide any advantages over more classical approaches? I might miss something, but maybe formalizing the trading problem as a (contextual) multi-armed bandit seems more reasonable to me. submitted by /u/AdministrativeBank48 [link] [comments]  ( 48 min )
  • Open

    [Discussion] League of Legends Reinforcement Learning Library - Interest
    Hello everyone. I am considering making a reinforcement learning library for the most recent version of League of Legends based on discussions on an existing library here. Would there be any interest in an RL library for League of Legends? The interface would work like the following: ```python import tlol.gym as leaguegym env = leaguegym.make() while True: obs = env.reset() done = False while not done: #Here we sample a random action. If you have an agent, you would get an action from it here. action = env.action_space.sample() next_obs, reward, done, gameinfo = env.step(action) obs = next_obs ``` If you look at my previous posts, I have already created an RL environment for League of Legends v4.20 where other people have also taken the project and successfully trained agents which learn adversarially against each other here. I've also released many gameplay datasets for League of Legends during Season 12 here also for supervised learning and RL. At the moment tlol-py contains an interface for ML models to play League of Legends but I'm considering creating a purpose built library for RL for League of Legends. Would there be interest in a project like this? submitted by /u/Ok-Alps-7918 [link] [comments]  ( 43 min )
    [Project] Need Suggestions Improving the Model evaluation scores.
    Hi, I'm working on a project, where we're to classify a user into High or Low Income. The dataset contains 9000+ features and the number of observations/rows are, 30000 representing household. The features include the media consumption habits of people. Hourly, Weekly, Monthly and Yearly for different TV channels. So far I have tried SVC, Random Forest and Logistic Regression. I used an ensemble of these three. However, I haven't been able to get past 63% accuracy. I tried PCA, however, the results range b/w 61-63% accuracy, recall and precision. I do wanna add that the data is already scaled between 0-1 and most of the columns are sparse (0 values for many rows). Honestly, I have tried pretty much everything, but can't seem to raise the evaluation metrics. Can someone direct me to the right path on what I can do to improve the scores? submitted by /u/Toko_yami [link] [comments]  ( 43 min )
    [P] I have implemented an RL agent for trading EUR/USD and I don't know what to do next...
    So, after months of learning about RL and doing toy implentations, I have coded a DQN, with experience buffer and dual nets. The network design is like the most average thing you can come across in ML scene. A simple deep feed forward with Relu and Linear as activation functions. I have also coded a simplified version of the Forex market for my agent to train in. It has bid ask prices, leverage, call margin, and buy/sell/not-in-the-market positions. The whole given state to the model is nothing fancy. It is merely the historical, model's balance and a few binary indictors about the environment. Since I'm cripplingly poor, I don't have any specialized hardware for training the model. After burning like 100 hours into the free version of Google collab with three different learning rates I…  ( 45 min )
    [Project] I used a new ML algo called "AnimeSR" to restore the Cowboy Bebop movie and up rez it to full 4K. Here's a link to the end result - honestly think it looks amazing! (Video and Model link in post)
    It took me about 46 hours to run this on my 3080 at home. The original files was from the Blu-ray release that was unfortunately pretty poorly done in my opinion. This version really gives it new life I think. Here's a link to the video result to see for yourself: https://vimeo.com/796411232 And a link to the model I used! https://github.com/TencentARC/AnimeSR submitted by /u/VR_Angel [link] [comments]  ( 43 min )
    [P] Looking for string generation GAN
    I have had minimal luck finding documentation on creating or using a premade string generator. It can't be a text generator really... because I am building it for translation from one language to another. I want to teach the generator to produce guesses on what the best translation for a single word would be based on underlying language semantics. I don't need it to be accurate necessarily, just a point-of-reference for observing language and phonetic mechanics. submitted by /u/lullaby876 [link] [comments]  ( 42 min )
    [D] What techniques can I use to tell if a problem is likely enough to be solved by ML so as to justify compiling the dataset?
    I have a problem that if I solve it with ML, I'll make money, with an outside chance of it being a lot of money. Compiling a dataset will take significant work. Are there any techniques that I can apply to let me know if this is going to be worth it? Perhaps there are certain hallmarks that a problem would have if it is likely to be solvable with available data? Maybe something I can do with a small initial dataset? Thanks. submitted by /u/SnuggleWuggleSleep [link] [comments]  ( 43 min )
    [N] Google: An Important Next Step On Our AI Journey
    https://blog.google/technology/ai/bard-google-ai-search-updates/ submitted by /u/EducationalCicada [link] [comments]  ( 50 min )
    [N] Getty Images sues AI art generator Stable Diffusion in the US for copyright infringement
    From the article: Getty Images has filed a lawsuit in the US against Stability AI, creators of open-source AI art generator Stable Diffusion, escalating its legal battle against the firm. The stock photography company is accusing Stability AI of “brazen infringement of Getty Images’ intellectual property on a staggering scale.” It claims that Stability AI copied more than 12 million images from its database “without permission ... or compensation ... as part of its efforts to build a competing business,” and that the startup has infringed on both the company’s copyright and trademark protections. This is different from the UK-based news from weeks ago. submitted by /u/Wiskkey [link] [comments]  ( 44 min )
    Does the high dimensionality of AI systems that model the real world tell us something about the abstract space of ideas? [D]
    Physical world we live in has 4 dimensions, string theory posits like up to 10. It seems like in order to successfully model the abstract space of ideas which relates things in the physical world to each other and describes them, machine learning needs thousands of dimensions. Also to the extent that ML algos/matrices can be made sparse, that seems to me to tell us something about the density of the mapping between abstract space and physical space... anyone know any papers w/this line of thinking? It also seems a bit unintuitive to me because it seems like geometrically space gets exponentially more complicated as you add dimensions but ML scales linearly or better in many cases with matrix dimensionality. submitted by /u/Frumpagumpus [link] [comments]  ( 44 min )
    [P] Forecasting methods in Time Series
    Hi all! For the longest time, I was having issues understanding how to use time series to do forecasting. Over the last few weeks, I have been writing a series of posts to guide anyone through the process! I am also in the process of writing a detailed practical guide with step-by-step instructions. ​ Right now I have 6 articles on the topic: * Introduction to ARIMA models (https://mlpills.dev/time-series/introduction-to-arima-models/) * Parameters selection in ARIMA models (https://mlpills.dev/time-series/parameters-selection-in-arima-models/) * Seasonal ARIMA (https://mlpills.dev/time-series/seasonal-arima/) * ARCH / GARCH models for Time Series (https://mlpills.dev/time-series/arch-garch-models-for-time-series/) * ARIMA-GARCH models (https://mlpills.dev/time-series/arima-garch-models/) * And today's -> Forecasting in Time Series (https://mlpills.dev/time-series/forecasting-in-time-series/) Let me know if there are any topics that you would like me to cover in the future! submitted by /u/daansan-ml [link] [comments]  ( 43 min )
    Which strategies,framework and applications tools can be implement to automatically monitor the health of the machine learning model? [D]
    Machine LearningModels when deployed in the production environment, model degradation can arise where their output will change if the relationship between the incoming serving data and the predicted target drifts apart. Please can someone briefly elaborate on what strategies, frameworks and application tools can be implemented to automatically monitor the health of the model and alert the Data Scientist of any decay in data quality, data drift, and model quality? submitted by /u/astronaut1971 [link] [comments]  ( 42 min )
    [P] I made image clustering and captioning tools
    I made an image captioning and clustering tools for computer vision and diffusion projects. You can run almost everything automatically and with a simple CLI command. All contributions are welcome. https://github.com/cobanov/image-clustering https://github.com/cobanov/image-captioning submitted by /u/metover [link] [comments]  ( 42 min )
    High-speed cameras and deep learning [Research]
    I haven’t been able to find research on deep learning using high-speed cameras that capture images at frame rates higher than 250fps. I wonder if they are rather useless for image/video processing or do any of you have any ideas about potential applications. submitted by /u/A15L [link] [comments]  ( 44 min )
    [R] Research trends in Graph Neural Networks (GNN)
    Deep connections discovered between Graph Diffusion Networks and Partial Differential Equations modelling heat transfer. https://towardsdatascience.com/graph-neural-networks-as-neural-diffusion-pdes-8571b8c0c774 https://arxiv.org/abs/2106.10934 Strange connections uncovered between GNNs and Structural Causal Models. https://arxiv.org/abs/2109.04173 https://www.youtube.com/watch?v=XC-Bfg3dO0I GNNs used to enhance the factualness of LLMs by providing embeddings from Knowledge Graphs (KEs). https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00360/98089 GNNs used to categorize objects from only their 3D mesh. https://arxiv.org/pdf/2106.15778.pdf Prediction of intuitive physics among physical objects. https://proceedings.neurips.cc/paper/2016/hash/3147da8ab4a0437c15ef51a5cc7f2dc4-Abstract.html Zero-shot generalization in robot Task Planning. https://arxiv.org/abs/2102.13177 https://www.youtube.com/watch?v=POxaTDAj7aY submitted by /u/moschles [link] [comments]  ( 42 min )
    [R] Creating a Large Language Model of a Philosopher
    Paper : https://arxiv.org/abs/2302.01339 Abstract : Can large language models be trained to produce philosophical texts that are difficult to distinguish from texts produced by human philosophers? To address this question, we fine-tuned OpenAI's GPT-3 with the works of philosopher Daniel C. Dennett as additional training data. To explore the Dennett model, we asked the real Dennett ten philosophical questions and then posed the same questions to the language model, collecting four responses for each question without cherry-picking. We recruited 425 participants to distinguish Dennett's answer from the four machine-generated answers. Experts on Dennett's work (N = 25) succeeded 51% of the time, above the chance rate of 20% but short of our hypothesized rate of 80% correct. For two of the ten questions, the language model produced at least one answer that experts selected more frequently than Dennett's own answer. Philosophy blog readers (N = 302) performed similarly to the experts, while ordinary research participants (N = 98) were near chance distinguishing GPT-3's responses from those of an "actual human philosopher". submitted by /u/starstruckmon [link] [comments]  ( 44 min )
    [P] I Made a Text Bot Powered by ChatGPT, DALLE 2, and Wolfram Alpha
    submitted by /u/ImplodingCoding [link] [comments]  ( 43 min )
    [R] deep learning and session-specific rapid recalibration for dynamic hand gesture recognition from EMG
    submitted by /u/t0ns0fph0t0ns [link] [comments]  ( 44 min )
    [D] AtheneWins just showcased an AI streamer bot, Does anyone know how he did this?
    submitted by /u/imagoons [link] [comments]  ( 42 min )
    [D] RNN and S4 etc
    Hello what's the state of modern RNNs, why does S4 not use nonlinearity on the state vector? What happened to unitary RNN or independent RNN (which sounds like exponential moving average)? submitted by /u/windoze [link] [comments]  ( 42 min )
  • Open

    Automating the math for decision-making under uncertainty
    A new tool brings the benefits of AI programming to a much broader class of problems.  ( 8 min )
  • Open

    Create powerful self-service experiences with Amazon Lex on Talkdesk CX Cloud contact center
    This blog post is co-written with Bruno Mateus, Jonathan Diedrich and Crispim Tribuna at Talkdesk. Contact centers are using artificial intelligence (AI) and natural language processing (NLP) technologies to build a personalized customer experience and deliver effective self-service support through conversational bots. This is the first of a two-part series dedicated to the integration of […]  ( 8 min )
    Image classification model selection using Amazon SageMaker JumpStart
    Researchers continue to develop new model architectures for common machine learning (ML) tasks. One such task is image classification, where images are accepted as input and the model attempts to classify the image as a whole with object label outputs. With many models available today that perform this image classification task, an ML practitioner may […]  ( 11 min )
  • Open

    Is there any value in having non uniform activation functions in the hidden layer?
    I wrote my first neural network implementation following a tutorial. I was amazed that it was less than 140 lines of code. A thought occurred to me and I brought it to chatgpt first, but I'd like some human input from people who know better. I'm wondering what the pros and cons would be for having a network where the hidden layer chooses an activation function when it is initialized that may be different per node. Chatgpt was polite in its responses but it seems very clear that this is a bad idea for most common goals with a neural network. I'm making little creatures that just kind of exist, adding more inputs and outputs as I think of them. So I wondered recently \1. Would there be any benefit from 1 node using a sigmoid and another node using relu? \2. I know it would really screw up the network, but what would happen if I also had a small chance for the activation function to change in a node when creating the next generation? It's baby's first simulation where the best nodes survive to make a new generation. Since I'm using a sort of evolution model I'm not worried about "bad" training results or overly complicated brains, the 2 points chatgpt kept stressing no matter how I asked it. Does anyone have some thoughts they can share on this? It's possible that the question is hard to answer because it's so stupid / pointless for what a neural network would be used for. I get that impression from stack overflow answers to similar questions, hah. I don't have a goal, I just want to watch the little critters exist and struggle. submitted by /u/Tomnnn [link] [comments]  ( 44 min )
    Create a fake person
    Hi. I don’t know if it is a good place to ask this question but i have to make a project for my uni and i am a bit confused. The topic is „creating fake avatar”. I need to create a fake person based on other people images and make it „alive”. By alive I mean that i can create many pictures of this person in different situations. I know i need to use GAN but I just can’t get my head around on how to do it. I mean. First i need a neural network to create a fake person. But how to use this fake person to create different scenarios? Thank you for any help in advance. submitted by /u/Acrobatic_Ad6507 [link] [comments]  ( 42 min )
  • Open

    It’s No Big Deal, but ChatGPT Changes Everything – Part III
    “I’ll tell you the problem with the scientific power that you’re using here: it didn’t require any discipline to attain it. You read what others had done and you took the next step. You didn’t earn the knowledge for yourselves, so you don’t take any responsibility for it. You stood on the shoulders of geniuses… Read More »It’s No Big Deal, but ChatGPT Changes Everything – Part III The post It’s No Big Deal, but ChatGPT Changes Everything – Part III appeared first on Data Science Central.  ( 24 min )
    Guide to Best Flutter State Management Libraries for 2023
    Flutter is a free, open-source mobile user interface (UI) framework that Google developed in 2017. It allows users to create native mobile applications using a single codebase. Using a single codebase and one programming language, users can develop apps for two different platforms, Android and iOS. It’s considered the most effective cross-platform framework available. Flutter… Read More »Guide to Best Flutter State Management Libraries for 2023 The post Guide to Best Flutter State Management Libraries for 2023 appeared first on Data Science Central.  ( 21 min )
    Ensuring Data Security in Realtime Operating System (RTOS) Devices
    Just a few days ago, January 28, we celebrated Data Protection Day, an international event aimed at promoting data privacy and security. In line with the goal of raising awareness about data protection, it would be a good time to discuss data security with Realtime Operating System. This unconventional operating system is widely used, so… Read More »Ensuring Data Security in Realtime Operating System (RTOS) Devices The post Ensuring Data Security in Realtime Operating System (RTOS) Devices appeared first on Data Science Central.  ( 21 min )
  • Open

    AI Joins Hunt for ET: Study Finds 8 Potential Alien Signals
    A University of Toronto undergrad among an international team of researchers unleashing deep learning in the search for extraterrestrial civilizations.  ( 6 min )

  • Open

    [D] Overview of of Chatbot Research?
    Is there a good overview of the state of chatbot research? I'm wondering if the ChatGPT approach of big LLM + RLHF is now considered the only way forward? How about alternatives like BlenderBot3? And what are the best open source chatbots right now? Or if you can't create your own ChatGPT, how does using a GPT3 sized model + prompt engineering compare to smaller models with supervised fine tuning on a conversation dataset? submitted by /u/renbid [link] [comments]  ( 42 min )
    [D] Large language models (LLM) as priority / conflict resolver for embodied AI or in general
    I wanted to discuss the possibilities to use LLM in generating answer based on the context and resolving conflict. Some recent work leveraging LLM in robotics planning, like Language Models as Zero-Shot Planner use LLM to generate plans for robot. What are your views in terms of LLM which leverage the background knowledge and visual clues together to generate correct next action by robots or embodied systems. As a human we decide actions based on resolving priority or conflict based on rules/ concepts , can LLM takes these rules /concept explicitly in decision making to generate new set of actions? Example: while chopping the veggies by robots, if hand comes in between then robot will stop the chopping process of veggies. As chopping task and human hand presence are in conflict and humans hand safety is of higher priority than cutting. How such small-small kind of knowledge be encoded in these robotics system which makes them more safer and trustworthy in general. As LLM requires larges corpus of knowledge/data. submitted by /u/projekt_treadstone [link] [comments]  ( 43 min )
    [D] Is there a database of English language tokens, including all dictionary words and common word segments?
    I find it odd that I have to regenerate this from my input set each time. It should be something we can just start with pre-created. submitted by /u/MrOfficialCandy [link] [comments]  ( 42 min )
    [D] Is English the optimal language to train NLP models on?
    While the greatest amount of training content is available for English at the moment, it seems unlikely to me that it's an efficient language to train AI. A more optimal language would reduce training time and model size. It might, for example, be much more efficient to train AI on Chinese, Korean, or Japanese due to a reduce grammatical token-set when constructing sentences/ideas. But taking the idea further, I wonder if we should be using a human language at all. Perhaps it's more efficient to use something altogether new in order to both communicate with AI more exactingly and also to reduce model size/training. What do y'all think? submitted by /u/MrOfficialCandy [link] [comments]  ( 44 min )
    [R] BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
    submitted by /u/Illustrious_Row_9971 [link] [comments]  ( 42 min )
    [P] Interactive Map of NeurIPS Proceedings 1987-2022
    submitted by /u/NomicAI [link] [comments]  ( 42 min )
    [P] I made CoPilot for writing LaTeX (in Overleaf) - what do you think?
    submitted by /u/alistairmcleay [link] [comments]  ( 42 min )
    [P] NeuralFit now allows evolution of recurrent neural networks
    Hi all! Some time ago I made a post about NeuralFit (https://github.com/neural-fit/neuralfit), which allows you to evolve neural networks in 🐍 Python with just a few lines of code. Good news: it is now possible to evolve recurrent models on timeseries, which is useful for doing stock market predictions for example 📈. There is currently a simple example, but more examples will be added soon! In addition: the free limitations have been increased, so you can use NeuralFit for most hobby projects without needing a license. Just like last time, feedback is immensely appreciated! submitted by /u/wagenaartje [link] [comments]  ( 43 min )
    [R] Can anyone direct me to academic sources arguing that Big Tech using AI for targeted Social Media ads is a good thing for actual users?
    Been struggling to find sources relating to this, it’s mostly just tech websites or blogs I keep coming across. I’m struggling to find any academic papers arguing for specifically the use of user data to create targeted ads. submitted by /u/lara_lara24 [link] [comments]  ( 43 min )
    [D] How Machine Learning is Transforming Cybersecurity
    submitted by /u/DenofBlerds [link] [comments]  ( 42 min )
    [R] [D] PADL: Language-Directed Physics-Based Character Control by NVIDIA
    submitted by /u/WarmFormal9881 [link] [comments]  ( 42 min )
    [R] [D] The New XOR Problem
    submitted by /u/shawntan [link] [comments]  ( 45 min )
    [P] I made a browser extension that uses ChatGPT to answer every StackOverflow question
    submitted by /u/jsonathan [link] [comments]  ( 46 min )
    Are PhDs in statistics useful for ML research? [D]
    How much do research labs or research jobs in ML hire statisticians vs computer scientists or mathematicians? submitted by /u/AdFew4357 [link] [comments]  ( 44 min )
    Why not use Stable Diffusion’s VAE to get textual embeddings? [D]
    submitted by /u/sudo_fuck_you [link] [comments]  ( 42 min )
    [D] Does the M2 Max 30-core GPU have any advantage over M2 Pro 19-core GPU in Machine Learning Tasks?
    submitted by /u/dona6603 [link] [comments]  ( 42 min )
    [P] tradeslyPro - AI Roboadvisor
    submitted by /u/mrtkp9993 [link] [comments]  ( 42 min )
    [D] List of Large Language Models to play with.
    Hello! I'm trying to understand what available LLMs one can "relatively easily" play with. My goal is to understand the landscape since I haven't worked in this field before. I'm trying to run them "from the largest to the smallest". By "relatively easy", I mean doesn't require to setup a GPU cluster or costs more than $20:) Here are some examples I have found so far: ChatGPT (obviously) - 175B params OpenAI api to access GPT-3s (from ada (0.5B) to davinci (175B)). Also CodeX Bloom (176B) - text window on that page seems to work reliably, you just need to keep pressing "generate" OPT-175B (Facebook LLM), the hosting works surprisingly fast, but slower than ChatGPT Several models on HuggingFace that I made to run with Colab Pro subscription: GPT-NeoX 20B, Flan-t5-xxl 11B, Xlm-roberta-xxl 10.7B, GPT-j 6B. I spent about $20 total on running the models below. None of the Hugging face API interfaces/spaces didn't work for me :(. Here is an example notebook I made for NeoX. Does anyone know more models that are easily accessible? P.S. Some large models I couldn't figure out (yet) how to run easily: Galactica-120b 120B Opt-30b 30B submitted by /u/sinavski [link] [comments]  ( 44 min )
    [N] "I got access to Google LaMDA, the Chatbot that was so realistic that one Google engineer thought it was conscious. First impressions"
    Tweet thread: https://twitter.com/WholeMarsBlog/status/1622139178439036928 First impressions: this sucks ass I can only ask about dogs and a few different types of prompts Does anyone else have experiences to share with this nerfed LaMDA beta google released? submitted by /u/That_Violinist_18 [link] [comments]  ( 44 min )
    Objects Color Matching against a Reference Standard (ColorCODEX)? [D]
    Im trying to build and train a Machine Learning model that autonomously performs color matching between the target gemstone and the Reference Standard color chart. A digital photo image of the target gemstone is first captured in a controlled environment in terms of illumination and background. This digital image is further pre-processed and fed into an algorithm that recognizes and match its color distribution to the closest color in the Reference Standard color chart. Numerous Reference Standards exist but I will use the ColorCODEX (this link ColorCODEX) So I would like to know which Machine Learning Model to use in this case to ensure high matching accuracy and like what performance metric can I use to measure matching accuracy and the color space for the color model. And at the end what image pre-processing needs to be done? I found this article (https://www.atlantis-press.com/proceedings/icosat-17/25895985)with backpropagation NN but not sure if it the best choice. Any other option? submitted by /u/astronaut1971 [link] [comments]  ( 44 min )
    [R] AudioLDM: Text-to-Audio Generation with Latent Diffusion Models
    submitted by /u/Illustrious_Row_9971 [link] [comments]  ( 42 min )
    [D] GNN Is node information required ?
    [D] Hey, It's kind of a simple one but just putting it out for opinions: when passing a graph through graph neural networks to obtain vectors for all the nodes. Is the info in the node required because all we care is about the position of certain node in context to the whole graph and that's how gnn outputs the vectors of each node. Sorry if that was messy. submitted by /u/ab_11nav [link] [comments]  ( 42 min )
    What text to speech does this guy use? [R]
    https://youtu.be/ktdUeqzzhiA what text to speech does he use? he's been popping up on my yt feed lately and i can see he has different voices in his videos and most of them sound robotic, what do you think it's being used here? submitted by /u/candidhorse4 [link] [comments]  ( 42 min )
  • Open

    Fullmetal Alchemist as an 80's Dark Fantasy movie (Chapters in Descripti...
    submitted by /u/EIDANart [link] [comments]  ( 40 min )
    AI text generator for news
    Hey all! For an art installation I want to be able to generate war news in specific countries and have a specific word used in it. For instance, a 2-3 sentence news about the war in Yemen with the word "love" in it. Any AI platforms you can suggest? submitted by /u/jarjar_bigh [link] [comments]  ( 41 min )
    GPT3-Assisted Google search, document/video/audio/website/youtube video indexer and composer conveniently built into Discord!
    submitted by /u/yikeshardware [link] [comments]  ( 42 min )
    AI Ethics, Deep Fakes, and the Dark Side of the Algorithm
    submitted by /u/IndependenceFun4627 [link] [comments]  ( 40 min )
    AI Dream 156 - This EPIC AI Video might break Youtube
    submitted by /u/LordPewPew777 [link] [comments]  ( 41 min )
    This web can make voices only with AI
    https://voice.ai/r/AfpKl. This web can create very good quality AI voices. They have majorly famous people voices. For using this voices you only need to write something and then the AI will say it with the voice you choose. submitted by /u/Marcosio9083 [link] [comments]  ( 41 min )
    Most advanced AI for clothing try-on?
    There have been already some ai developers teamed up with big firms to create AI for virtual clothing try-on even few years ago, but considering lastest models can generate images of very realistic human (including garments on them) from nothing I suppose there must be practical ones that can generate better quality try-on image given a series of photos of a model and clothes to put on. Do you know any such models? submitted by /u/PMATHbreaker [link] [comments]  ( 41 min )
    Is Microsoft's Azure OpenAI service generally 'cost effective' for a small number of specialist users in a small business?
    Imagine a type of user that is a senior manager, they do not have an analysis background. What i'd like is for them to be able to tell a dashboard "what happened to our customer reviews last week?" and a text summary is generated and parameters to charts sent out, and charts generated I'm exploring the aspect of cost here I read Azure's OpenAI marketing and see: > Azure OpenAI Service is enabling customers across industries from health care to financial services to manufacturing to quickly perform an array of tasks. Innovations include generating unique content for customers, summarizing and classifying customer feedback, and extracting text from medical records to streamline billing. The most common uses have been writing assistance, translating natural language to code and gaining data insights through search, entity extraction, sentiment and classification. Great But the pricing looks expensive for Base Series Fine-tuned, Curie: > Inferencing per 1,000 tokens | $0.002 > Hosting per hour | $0.24 and so i'd need to think about some caching intermediary stage to prevent too many calls to the Azure OpenAI API So is using Azure here for an ai based solution cost effective for a small business? submitted by /u/Work_Owl [link] [comments]  ( 42 min )
    Universities are acting like a fish out of the water due to AI
    submitted by /u/foundersblock [link] [comments]  ( 40 min )
    Starting a career in AI
    Starting a career in AI Hello folks, I’m a Product Manager with 10 years of experience in B2C mobile apps. I love mobile apps and I’m really passionate about user experience. However, I lately started using various AI products and I’m really impressed. Since I believe it the future I was thinking if it’s possible to start learning the basics and even make a career change in AI development (as soon as I learn what fields exist). I have studied computer science and I’m familiar with software development. Any idea/recommendations where to start from and what a possible career path would be? Thanks submitted by /u/EquivalentMongoose95 [link] [comments]  ( 42 min )
    Technology Readiness Levels (TRL) in AI development
    How can we move from an idea to production in AI? Does the technology readiness levels (TRL) help? If you want to get some answers please read this article in medium: https://medium.com/towards-artificial-intelligence/technology-readiness-levels-trl-in-ai-development-c6ed1190fbd6 All the ideas are more than welcome! submitted by /u/Nice-Tomorrow2926 [link] [comments]  ( 40 min )
    Request: Voice cloning app. (Website or app)
    I am looking for a free voice cloning app or website (Like elevenlabs, but I would need a quota bypass for that) and I can't find any. Any suggestions? submitted by /u/1sydxyz [link] [comments]  ( 40 min )
    I have a question for you? I am in need of specific AI tool I am not aware if exists yet.
    I am searching for an image and I need to find where the image comes from and it can be anywhere on the web or as a part of video as whole so is there an AI you can give a name , the said picture, photo, sentence or a specific data to find the exact photo in a random youtube video, the same name on some old site or inclusion of A photo I have on another site where it may have originated. An Ai that will scan the entire youtube , watch every vid and then list all the instances where the image was used. submitted by /u/Plajomzn [link] [comments]  ( 42 min )
    ChatGPT Becomes Fastest Growing App, Beating TikTok In Popularity
    submitted by /u/liquidocelotYT [link] [comments]  ( 42 min )
    The Fast and the Furious come to life with Midjourney as an 80's Film
    submitted by /u/barrese87 [link] [comments]  ( 40 min )
    Ai that let’s you Play As Neo in the matrix, in an open ended movie experience
    submitted by /u/techmanj [link] [comments]  ( 40 min )
    Why Overfitting and Underfitting Happen
    Hi guys, I have made a video on YouTube here where I explain why underfitting and overfitting happen in machine learning models by looking at the fundamental theory behind bias variance trade-off. I hope it may be of use to some of you out there. As always, feedback is more than welcomed! :) submitted by /u/Personal-Trainer-541 [link] [comments]  ( 41 min )
    How will AI affect Day-to-day business?
    https://buddingmanager.com/2023/01/26/https-buddingmanager-com-artificialintelligenceinbusiness/ submitted by /u/OriginalRecklessPark [link] [comments]  ( 40 min )
    🌎 Make your best prediction: HOW will AI systems change the world in the coming 10 years? What will be different 10 years later, because of AI systems like ChatGPT, Midjourney, Codex, Whisper and others?
    submitted by /u/DrMelbourne [link] [comments]  ( 43 min )
    I created a stream where AI bots watch movies and deliver a running commentary
    Hi all, For my weekend project I figured I would build an AI driven spiritual successor to Mystery Science Theater 3000... Stop on by and watch the AI characters watch movies and make comments! Today they are watching "The House on Haunted Hill" and "Plan 9 From Outer Space." There's still a lot to do but I'm excited to play around with this more and see how it plays out and would love some feedback! https://twitch.tv/MysteryAItheater submitted by /u/caseigl [link] [comments]  ( 42 min )
    AI That Can Put Objects from Different Pictures Together?
    I'm trying to put together pictures of my best friend (he unfortunately passed away) and myself together. We don't have any pictures of us together since our childhood /teen days. I would like to put our individually taken pics together using AI. Can someone recommend any tools? submitted by /u/mustufa2020 [link] [comments]  ( 41 min )
    Amazing "Jailbreak" Bypasses ChatGPT's Ethics Safeguards
    submitted by /u/Mental_Character7367 [link] [comments]  ( 54 min )
    Breaking: Google Invests in AnthropicAI and Claude with $300 Million Round for 10 Percent of the A.I. Lab valued at $5 Billion
    submitted by /u/BackgroundResult [link] [comments]  ( 41 min )
  • Open

    Getty Images v. Stability AI – Lawsuit filing
    submitted by /u/nickb [link] [comments]  ( 40 min )
  • Open

    How to teach the agent to arrive at the goal by creating a search pattern
    Hi all, assuming the goal is to reach a ball on the table. The reward function used for this task is often: d= norm( gripper_position - ball_position ) , which will solve the problem. However, how can one teach the agent not to "directly" go to the ball, but creating a search pattern, for example, "scratching the surface with the gripper until you find the ball"? submitted by /u/Fun-Moose-3841 [link] [comments]  ( 43 min )
    Autonomous Driving Off-Road | Swaayatt Robots | Dense Fog
    submitted by /u/shani_786 [link] [comments]  ( 41 min )
  • Open

    Improved Analysis of Score-based Generative Modeling: User-Friendly Bounds under Minimal Smoothness Assumptions. (arXiv:2211.01916v2 [cs.LG] UPDATED)
    We give an improved theoretical analysis of score-based generative modeling. Under a score estimate with small $L^2$ error (averaged across timesteps), we provide efficient convergence guarantees for any data distribution with second-order moment, by either employing early stopping or assuming smoothness condition on the score function of the data distribution. Our result does not rely on any log-concavity or functional inequality assumption and has a logarithmic dependence on the smoothness. In particular, we show that under only a finite second moment condition, approximating the following in reverse KL divergence in $\epsilon$-accuracy can be done in $\tilde O\left(\frac{d \log (1/\delta)}{\epsilon}\right)$ steps: 1) the variance-$\delta$ Gaussian perturbation of any data distribution; 2) data distributions with $1/\delta$-smooth score functions. Our analysis also provides a quantitative comparison between different discrete approximations and may guide the choice of discretization points in practice.
    Learning PDE Solution Operator for Continuous Modeling of Time-Series. (arXiv:2302.00854v1 [cs.LG])
    Learning underlying dynamics from data is important and challenging in many real-world scenarios. Incorporating differential equations (DEs) to design continuous networks has drawn much attention recently, however, most prior works make specific assumptions on the type of DEs, making the model specialized for particular problems. This work presents a partial differential equation (PDE) based framework which improves the dynamics modeling capability. Building upon the recent Fourier neural operator, we propose a neural operator that can handle time continuously without requiring iterative operations or specific grids of temporal discretization. A theoretical result demonstrating its universality is provided. We also uncover an intrinsic property of neural operators that improves data efficiency and model generalization by ensuring stability. Our model achieves superior accuracy in dealing with time-dependent PDEs compared to existing models. Furthermore, several numerical pieces of evidence validate that our method better represents a wide range of dynamics and outperforms state-of-the-art DE-based models in real-time-series applications. Our framework opens up a new way for a continuous representation of neural networks that can be readily adopted for real-world applications.
    Is Model Ensemble Necessary? Model-based RL via a Single Model with Lipschitz Regularized Value Function. (arXiv:2302.01244v1 [cs.LG])
    Probabilistic dynamics model ensemble is widely used in existing model-based reinforcement learning methods as it outperforms a single dynamics model in both asymptotic performance and sample efficiency. In this paper, we provide both practical and theoretical insights on the empirical success of the probabilistic dynamics model ensemble through the lens of Lipschitz continuity. We find that, for a value function, the stronger the Lipschitz condition is, the smaller the gap between the true dynamics- and learned dynamics-induced Bellman operators is, thus enabling the converged value function to be closer to the optimal value function. Hence, we hypothesize that the key functionality of the probabilistic dynamics model ensemble is to regularize the Lipschitz condition of the value function using generated samples. To test this hypothesis, we devise two practical robust training mechanisms through computing the adversarial noise and regularizing the value network's spectral norm to directly regularize the Lipschitz condition of the value functions. Empirical results show that combined with our mechanisms, model-based RL algorithms with a single dynamics model outperform those with an ensemble of probabilistic dynamics models. These findings not only support the theoretical insight, but also provide a practical solution for developing computationally efficient model-based RL algorithms.
    Dynamic Recognition of Speakers for Consent Management by Contrastive Embedding Replay. (arXiv:2205.08459v2 [cs.SD] UPDATED)
    Voice assistants overhear conversations and a consent management mechanism is required. Consent management can be implemented using speaker recognition. Users that do not give consent enrol their voice and all their further recordings are discarded. Building speaker recognition-based consent management is challenging as dynamic registration, removal, and re-registration of speakers must be efficiently handled. This work proposes a consent management system addressing the aforementioned challenges. A contrastive based training is applied to learn the underlying speaker equivariance inductive bias. The contrastive features for buckets of speakers are trained a few steps into each iteration and act as replay buffers. These features are progressively selected using a multi-strided random sampler for classification. Moreover, new methods for dynamic registration using a portion of old utterances, removal, and re-registration of speakers are proposed. The results verify memory efficiency and dynamic capabilities of the proposed methods and outperform the existing approach from the literature.
    Live in the Moment: Learning Dynamics Model Adapted to Evolving Policy. (arXiv:2207.12141v2 [cs.LG] UPDATED)
    Model-based reinforcement learning (RL) often achieves higher sample efficiency in practice than model-free RL by learning a dynamics model to generate samples for policy learning. Previous works learn a dynamics model that fits under the empirical state-action visitation distribution for all historical policies, i.e., the sample replay buffer. However, in this paper, we observe that fitting the dynamics model under the distribution for \emph{all historical policies} does not necessarily benefit model prediction for the \emph{current policy} since the policy in use is constantly evolving over time. The evolving policy during training will cause state-action visitation distribution shifts. We theoretically analyze how this distribution shift over historical policies affects the model learning and model rollouts. We then propose a novel dynamics model learning method, named \textit{Policy-adapted Dynamics Model Learning (PDML)}. PDML dynamically adjusts the historical policy mixture distribution to ensure the learned model can continually adapt to the state-action visitation distribution of the evolving policy. Experiments on a range of continuous control environments in MuJoCo show that PDML achieves significant improvement in sample efficiency and higher asymptotic performance combined with the state-of-the-art model-based RL methods.
    Modelling the long-term fairness dynamics of data-driven targeted help on job seekers. (arXiv:2208.08881v2 [cs.LG] UPDATED)
    The use of data-driven decision support by public agencies is becoming more widespread and already influences the allocation of public resources. This raises ethical concerns, as it has adversely affected minorities and historically discriminated groups. In this paper, we use an approach that combines statistics and data-driven approaches with dynamical modeling to assess long-term fairness effects of labor market interventions. Specifically, we develop and use a model to investigate the impact of decisions caused by a public employment authority that selectively supports job-seekers through targeted help. The selection of who receives what help is based on a data-driven intervention model that estimates an individual's chances of finding a job in a timely manner and rests upon data that describes a population in which skills relevant to the labor market are unevenly distributed between two groups (e.g., males and females). The intervention model has incomplete access to the individual's actual skills and can augment this with knowledge of the individual's group affiliation, thus using a protected attribute to increase predictive accuracy. We assess this intervention model's dynamics -- especially fairness-related issues and trade-offs between different fairness goals -- over time and compare it to an intervention model that does not use group affiliation as a predictive feature. We conclude that in order to quantify the trade-off correctly and to assess the long-term fairness effects of such a system in the real-world, careful modeling of the surrounding labor market is indispensable.
    Do Kernel and Neural Embeddings Help in Training and Generalization?. (arXiv:1905.05095v3 [cs.LG] UPDATED)
    Recent results on optimization and generalization properties of neural networks showed that in a simple two-layer network, the alignment of the labels to the eigenvectors of the corresponding Gram matrix determines the convergence of the optimization during training. Such analyses also provide upper bounds on the generalization error. We experimentally investigate the implications of these results to deeper networks via embeddings. We regard the layers preceding the final hidden layer as producing different representations of the input data which are then fed to the two-layer model. We show that these representations improve both optimization and generalization. In particular, we investigate three kernel representations when fed to the final hidden layer: the Gaussian kernel and its approximation by random Fourier features, kernels designed to imitate representations produced by neural networks and finally an optimal kernel designed to align the data with target labels. The approximated representations induced by these kernels are fed to the neural network and the optimization and generalization properties of the final model are evaluated and compared.
    Learning-To-Ensemble by Contextual Rank Aggregation in E-Commerce. (arXiv:2107.08598v3 [cs.LG] UPDATED)
    Ensemble models in E-commerce combine predictions from multiple sub-models for ranking and revenue improvement. Industrial ensemble models are typically deep neural networks, following the supervised learning paradigm to infer conversion rate given inputs from sub-models. However, this process has the following two problems. Firstly, the point-wise scoring approach disregards the relationships between items and leads to homogeneous displayed results, while diversified display benefits user experience and revenue. Secondly, the learning paradigm focuses on the ranking metrics and does not directly optimize the revenue. In our work, we propose a new Learning-To-Ensemble (LTE) framework RAEGO, which replaces the ensemble model with a contextual Rank Aggregator (RA) and explores the best weights of sub-models by the Evaluator-Generator Optimization (EGO). To achieve the best online performance, we propose a new rank aggregation algorithm TournamentGreedy as a refinement of classic rank aggregators, which also produces the best average weighted Kendall Tau Distance (KTD) amongst all the considered algorithms with quadratic time complexity. Under the assumption that the best output list should be Pareto Optimal on the KTD metric for sub-models, we show that our RA algorithm has higher efficiency and coverage in exploring the optimal weights. Combined with the idea of Bayesian Optimization and gradient descent, we solve the online contextual Black-Box Optimization task that finds the optimal weights for sub-models given a chosen RA model. RA-EGO has been deployed in our online system and has improved the revenue significantly.
    Neural Design for Genetic Perturbation Experiments. (arXiv:2207.12805v2 [q-bio.QM] UPDATED)
    The problem of how to genetically modify cells in order to maximize a certain cellular phenotype has taken center stage in drug development over the last few years (with, for example, genetically edited CAR-T, CAR-NK, and CAR-NKT cells entering cancer clinical trials). Exhausting the search space for all possible genetic edits (perturbations) or combinations thereof is infeasible due to cost and experimental limitations. This work provides a theoretically sound framework for iteratively exploring the space of perturbations in pooled batches in order to maximize a target phenotype under an experimental budget. Inspired by this application domain, we study the problem of batch query bandit optimization and introduce the Optimistic Arm Elimination ($\mathrm{OAE}$) principle designed to find an almost optimal arm under different functional relationships between the queries (arms) and the outputs (rewards). We analyze the convergence properties of $\mathrm{OAE}$ by relating it to the Eluder dimension of the algorithm's function class and validate that $\mathrm{OAE}$ outperforms other strategies in finding optimal actions in experiments on simulated problems, public datasets well-studied in bandit contexts, and in genetic perturbation datasets when the regression model is a deep neural network. OAE also outperforms the benchmark algorithms in 3 of 4 datasets in the GeneDisco experimental planning challenge.
    Summarization Programs: Interpretable Abstractive Summarization with Neural Modular Trees. (arXiv:2209.10492v2 [cs.CL] UPDATED)
    Current abstractive summarization models either suffer from a lack of clear interpretability or provide incomplete rationales by only highlighting parts of the source document. To this end, we propose the Summarization Program (SP), an interpretable modular framework consisting of an (ordered) list of binary trees, each encoding the step-by-step generative process of an abstractive summary sentence from the source document. A Summarization Program contains one root node per summary sentence, and a distinct tree connects each summary sentence (root node) to the document sentences (leaf nodes) from which it is derived, with the connecting nodes containing intermediate generated sentences. Edges represent different modular operations involved in summarization such as sentence fusion, compression, and paraphrasing. We first propose an efficient best-first search method over neural modules, SP-Search that identifies SPs for human summaries by directly optimizing for ROUGE scores. Next, using these programs as automatic supervision, we propose seq2seq models that generate Summarization Programs, which are then executed to obtain final summaries. We demonstrate that SP-Search effectively represents the generative process behind human summaries using modules that are typically faithful to their intended behavior. We also conduct a simulation study to show that Summarization Programs improve the interpretability of summarization models by allowing humans to better simulate model reasoning. Summarization Programs constitute a promising step toward interpretable and modular abstractive summarization, a complex task previously addressed primarily through blackbox end-to-end neural systems. Supporting code available at https://github.com/swarnaHub/SummarizationPrograms
    Towards Understanding and Mitigating Dimensional Collapse in Heterogeneous Federated Learning. (arXiv:2210.00226v2 [cs.LG] UPDATED)
    Federated learning aims to train models collaboratively across different clients without the sharing of data for privacy considerations. However, one major challenge for this learning paradigm is the {\em data heterogeneity} problem, which refers to the discrepancies between the local data distributions among various clients. To tackle this problem, we first study how data heterogeneity affects the representations of the globally aggregated models. Interestingly, we find that heterogeneous data results in the global model suffering from severe {\em dimensional collapse}, in which representations tend to reside in a lower-dimensional space instead of the ambient space. Moreover, we observe a similar phenomenon on models locally trained on each client and deduce that the dimensional collapse on the global model is inherited from local models. In addition, we theoretically analyze the gradient flow dynamics to shed light on how data heterogeneity result in dimensional collapse for local models. To remedy this problem caused by the data heterogeneity, we propose {\sc FedDecorr}, a novel method that can effectively mitigate dimensional collapse in federated learning. Specifically, {\sc FedDecorr} applies a regularization term during local training that encourages different dimensions of representations to be uncorrelated. {\sc FedDecorr}, which is implementation-friendly and computationally-efficient, yields consistent improvements over baselines on standard benchmark datasets. Code: https://github.com/Yujun-Shi/FedCLS.
    An Instrumental Variable Approach to Confounded Off-Policy Evaluation. (arXiv:2212.14468v2 [stat.ML] UPDATED)
    Off-policy evaluation (OPE) is a method for estimating the return of a target policy using some pre-collected observational data generated by a potentially different behavior policy. In some cases, there may be unmeasured variables that can confound the action-reward or action-next-state relationships, rendering many existing OPE approaches ineffective. This paper develops an instrumental variable (IV)-based method for consistent OPE in confounded Markov decision processes (MDPs). Similar to single-stage decision making, we show that IV enables us to correctly identify the target policy's value in infinite horizon settings as well. Furthermore, we propose an efficient and robust value estimator and illustrate its effectiveness through extensive simulations and analysis of real data from a world-leading short-video platform.
    Analysis of Knowledge Transfer in Kernel Regime. (arXiv:2003.13438v3 [cs.LG] UPDATED)
    Knowledge transfer is shown to be a very successful technique for training neural classifiers: together with the ground truth data, it uses the "privileged information" (PI) obtained by a "teacher" network to train a "student" network. It has been observed that classifiers learn much faster and more reliably via knowledge transfer. However, there has been little or no theoretical analysis of this phenomenon. To bridge this gap, we propose to approach the problem of knowledge transfer by regularizing the fit between the teacher and the student with PI provided by the teacher. Using tools from dynamical systems theory, we show that when the student is an extremely wide two layer network, we can analyze it in the kernel regime and show that it is able to interpolate between PI and the given data. This characterization sheds new light on the relation between the training error and capacity of the student relative to the teacher. Another contribution of the paper is a quantitative statement on the convergence of student network. We prove that the teacher reduces the number of required iterations for a student to learn, and consequently improves the generalization power of the student. We give corresponding experimental analysis that validates the theoretical results and yield additional insights.
    A Light Recipe to Train Robust Vision Transformers. (arXiv:2209.07399v2 [cs.CV] UPDATED)
    In this paper, we ask whether Vision Transformers (ViTs) can serve as an underlying architecture for improving the adversarial robustness of machine learning models against evasion attacks. While earlier works have focused on improving Convolutional Neural Networks, we show that also ViTs are highly suitable for adversarial training to achieve competitive performance. We achieve this objective using a custom adversarial training recipe, discovered using rigorous ablation studies on a subset of the ImageNet dataset. The canonical training recipe for ViTs recommends strong data augmentation, in part to compensate for the lack of vision inductive bias of attention modules, when compared to convolutions. We show that this recipe achieves suboptimal performance when used for adversarial training. In contrast, we find that omitting all heavy data augmentation, and adding some additional bag-of-tricks ($\varepsilon$-warmup and larger weight decay), significantly boosts the performance of robust ViTs. We show that our recipe generalizes to different classes of ViT architectures and large-scale models on full ImageNet-1k. Additionally, investigating the reasons for the robustness of our models, we show that it is easier to generate strong attacks during training when using our recipe and that this leads to better robustness at test time. Finally, we further study one consequence of adversarial training by proposing a way to quantify the semantic nature of adversarial perturbations and highlight its correlation with the robustness of the model. Overall, we recommend that the community should avoid translating the canonical training recipes in ViTs to robust training and rethink common training choices in the context of adversarial training.
    "Why did the Model Fail?": Attributing Model Performance Changes to Distribution Shifts. (arXiv:2210.10769v2 [cs.LG] UPDATED)
    Performance of machine learning models may differ between training and deployment for many reasons. For instance, model performance can change between environments due to changes in data quality, observing a different population than the one in training, or changes in the relationship between labels and features. These changes result in distribution shifts across environments. Attributing model performance changes to specific shifts is critical for identifying sources of model failures, and for taking mitigating actions that ensure robust models. In this work, we introduce the problem of attributing performance differences between environments to distribution shifts in the underlying data generating mechanisms. We formulate the problem as a cooperative game where the players are distributions. We define the value of a set of distributions to be the change in model performance when only this set of distributions has changed between environments, and derive an importance weighting method for computing the value of an arbitrary set of distributions. The contribution of each distribution to the total performance change is then quantified as its Shapley value. We demonstrate the correctness and utility of our method on synthetic, semi-synthetic, and real-world case studies, showing its effectiveness in attributing performance changes to a wide range of distribution shifts.
    Safe Optimization of an Industrial Refrigeration Process Using an Adaptive and Explorative Framework. (arXiv:2211.13019v2 [math.OC] UPDATED)
    Many industrial applications rely on real-time optimization to improve key performance indicators. In the case of unknown process characteristics, real-time optimization becomes challenging, particularly for the satisfaction of safety constraints. In this paper, we demonstrate the application of an adaptive and explorative real-time optimization framework to an industrial refrigeration process, where we learn the process characteristics through changes in process control targets and through exploration to satisfy safety constraints. We quantify the uncertainty in unknown compressor characteristics of the refrigeration plant by using Gaussian processes and incorporate this uncertainty into the objective function of the real-time optimization problem as a weighted cost term. We adaptively control the weight of this term to drive exploration. The results of our simulation experiments indicate the proposed approach can help to increase the energy efficiency of the considered refrigeration process, closely approximating the performance of a solution that has complete information about the compressor performance characteristics.
    High-Probability Bounds for Stochastic Optimization and Variational Inequalities: the Case of Unbounded Variance. (arXiv:2302.00999v1 [math.OC])
    During recent years the interest of optimization and machine learning communities in high-probability convergence of stochastic optimization methods has been growing. One of the main reasons for this is that high-probability complexity bounds are more accurate and less studied than in-expectation ones. However, SOTA high-probability non-asymptotic convergence results are derived under strong assumptions such as the boundedness of the gradient noise variance or of the objective's gradient itself. In this paper, we propose several algorithms with high-probability convergence results under less restrictive assumptions. In particular, we derive new high-probability convergence results under the assumption that the gradient/operator noise has bounded central $\alpha$-th moment for $\alpha \in (1,2]$ in the following setups: (i) smooth non-convex / Polyak-Lojasiewicz / convex / strongly convex / quasi-strongly convex minimization problems, (ii) Lipschitz / star-cocoercive and monotone / quasi-strongly monotone variational inequalities. These results justify the usage of the considered methods for solving problems that do not fit standard functional classes studied in stochastic optimization.
    Large-scale Stochastic Optimization of NDCG Surrogates for Deep Learning with Provable Convergence. (arXiv:2202.12183v5 [cs.LG] UPDATED)
    NDCG, namely Normalized Discounted Cumulative Gain, is a widely used ranking metric in information retrieval and machine learning. However, efficient and provable stochastic methods for maximizing NDCG are still lacking, especially for deep models. In this paper, we propose a principled approach to optimize NDCG and its top-$K$ variant. First, we formulate a novel compositional optimization problem for optimizing the NDCG surrogate, and a novel bilevel compositional optimization problem for optimizing the top-$K$ NDCG surrogate. Then, we develop efficient stochastic algorithms with provable convergence guarantees for the non-convex objectives. Different from existing NDCG optimization methods, the per-iteration complexity of our algorithms scales with the mini-batch size instead of the number of total items. To improve the effectiveness for deep learning, we further propose practical strategies by using initial warm-up and stop gradient operator. Experimental results on multiple datasets demonstrate that our methods outperform prior ranking approaches in terms of NDCG. To the best of our knowledge, this is the first time that stochastic algorithms are proposed to optimize NDCG with a provable convergence guarantee. Our proposed methods are implemented in the LibAUC library at https://libauc.org/.
    Multi-agent Reinforcement Learning with Graph Q-Networks for Antenna Tuning. (arXiv:2302.01199v1 [cs.NI])
    Future generations of mobile networks are expected to contain more and more antennas with growing complexity and more parameters. Optimizing these parameters is necessary for ensuring the good performance of the network. The scale of mobile networks makes it challenging to optimize antenna parameters using manual intervention or hand-engineered strategies. Reinforcement learning is a promising technique to address this challenge but existing methods often use local optimizations to scale to large network deployments. We propose a new multi-agent reinforcement learning algorithm to optimize mobile network configurations globally. By using a value decomposition approach, our algorithm can be trained from a global reward function instead of relying on an ad-hoc decomposition of the network performance across the different cells. The algorithm uses a graph neural network architecture which generalizes to different network topologies and learns coordination behaviors. We empirically demonstrate the performance of the algorithm on an antenna tilt tuning problem and a joint tilt and power control problem in a simulated environment.
    Robust Estimation under the Wasserstein Distance. (arXiv:2302.01237v1 [stat.ML])
    We study the problem of robust distribution estimation under the Wasserstein metric, a popular discrepancy measure between probability distributions rooted in optimal transport (OT) theory. We introduce a new outlier-robust Wasserstein distance $\mathsf{W}_p^\varepsilon$ which allows for $\varepsilon$ outlier mass to be removed from its input distributions, and show that minimum distance estimation under $\mathsf{W}_p^\varepsilon$ achieves minimax optimal robust estimation risk. Our analysis is rooted in several new results for partial OT, including an approximate triangle inequality, which may be of independent interest. To address computational tractability, we derive a dual formulation for $\mathsf{W}_p^\varepsilon$ that adds a simple penalty term to the classic Kantorovich dual objective. As such, $\mathsf{W}_p^\varepsilon$ can be implemented via an elementary modification to standard, duality-based OT solvers. Our results are extended to sliced OT, where distributions are projected onto low-dimensional subspaces, and applications to homogeneity and independence testing are explored. We illustrate the virtues of our framework via applications to generative modeling with contaminated datasets.
    From Traditional Adaptive Data Caching to Adaptive Context Caching: A Survey. (arXiv:2211.11259v2 [cs.HC] UPDATED)
    Context data is in demand more than ever with the rapid increase in the development of many context-aware Internet of Things applications. Research in context and context-awareness is being conducted to broaden its applicability in light of many practical and technical challenges. One of the challenges is improving performance when responding to large number of context queries. Context Management Platforms that infer and deliver context to applications measure this problem using Quality of Service (QoS) parameters. Although caching is a proven way to improve QoS, transiency of context and features such as variability, heterogeneity of context queries pose an additional real-time cost management problem. This paper presents a critical survey of state-of-the-art in adaptive data caching with the objective of developing a body of knowledge in cost- and performance-efficient adaptive caching strategies. We comprehensively survey a large number of research publications and evaluate, compare, and contrast different techniques, policies, approaches, and schemes in adaptive caching. Our critical analysis is motivated by the focus on adaptively caching context as a core research problem. A formal definition for adaptive context caching is then proposed, followed by identified features and requirements of a well-designed, objective optimal adaptive context caching strategy.
    Uncertainty in Fairness Assessment: Maintaining Stable Conclusions Despite Fluctuations. (arXiv:2302.01079v1 [cs.LG])
    Several recent works encourage the use of a Bayesian framework when assessing performance and fairness metrics of a classification algorithm in a supervised setting. We propose the Uncertainty Matters (UM) framework that generalizes a Beta-Binomial approach to derive the posterior distribution of any criteria combination, allowing stable performance assessment in a bias-aware setting.We suggest modeling the confusion matrix of each demographic group using a Multinomial distribution updated through a Bayesian procedure. We extend UM to be applicable under the popular K-fold cross-validation procedure. Experiments highlight the benefits of UM over classical evaluation frameworks regarding informativeness and stability.
    A Survey on Efficient Training of Transformers. (arXiv:2302.01107v1 [cs.LG])
    Recent advances in Transformers have come with a huge requirement on computing resources, highlighting the importance of developing efficient training techniques to make Transformer training faster, at lower cost, and to higher accuracy by the efficient use of computation and memory resources. This survey provides the first systematic overview of the efficient training of Transformers, covering the recent progress in acceleration arithmetic and hardware, with a focus on the former. We analyze and compare methods that save computation and memory costs for intermediate tensors during training, together with techniques on hardware/algorithm co-design. We finally discuss challenges and promising areas for future research.
    Optimal Stopping via Randomized Neural Networks. (arXiv:2104.13669v3 [stat.ML] UPDATED)
    This paper presents new machine learning approaches to approximate the solutions of optimal stopping problems. The key idea of these methods is to use neural networks, where the parameters of the hidden layers are generated randomly and only the last layer is trained, in order to approximate the continuation value. Our approaches are applicable to high dimensional problems where the existing approaches become increasingly impractical. In addition, since our approaches can be optimized using simple linear regression, they are easy to implement and theoretical guarantees are provided. Our randomized reinforcement learning approach and randomized recurrent neural network approach outperform the state-of-the-art and other relevant machine learning approaches in Markovian and non-Markovian examples, respectively. In particular, we test our approaches on Black-Scholes, Heston, rough Heston and fractional Brownian motion. Moreover, we show that they can also be used to efficiently compute Greeks of American options.
    Epistemic Neural Networks. (arXiv:2107.08924v7 [cs.LG] UPDATED)
    Intelligence relies on an agent's knowledge of what it does not know. This capability can be assessed based on the quality of joint predictions of labels across multiple inputs. In principle, ensemble-based approaches produce effective joint predictions, but the computational costs of training large ensembles can become prohibitive. We introduce the epinet: an architecture that can supplement any conventional neural network, including large pretrained models, and can be trained with modest incremental computation to estimate uncertainty. With an epinet, conventional neural networks outperform very large ensembles, consisting of hundreds or more particles, with orders of magnitude less computation. The epinet does not fit the traditional framework of Bayesian neural networks. To accommodate development of approaches beyond BNNs, such as the epinet, we introduce the epistemic neural network (ENN) as an interface for models that produce joint predictions.
    LogLG: Weakly Supervised Log Anomaly Detection via Log-Event Graph Construction. (arXiv:2208.10833v4 [cs.SE] UPDATED)
    Fully supervised log anomaly detection methods suffer the heavy burden of annotating massive unlabeled log data. Recently, many semi-supervised methods have been proposed to reduce annotation costs with the help of parsed templates. However, these methods consider each keyword independently, which disregards the correlation between keywords and the contextual relationships among log sequences. In this paper, we propose a novel weakly supervised log anomaly detection framework, named LogLG, to explore the semantic connections among keywords from sequences. Specifically, we design an end-to-end iterative process, where the keywords of unlabeled logs are first extracted to construct a log-event graph. Then, we build a subgraph annotator to generate pseudo labels for unlabeled log sequences. To ameliorate the annotation quality, we adopt a self-supervised task to pre-train a subgraph annotator. After that, a detection model is trained with the generated pseudo labels. Conditioned on the classification results, we re-extract the keywords from the log sequences and update the log-event graph for the next iteration. Experiments on five benchmarks validate the effectiveness of LogLG for detecting anomalies on unlabeled log data and demonstrate that LogLG, as the state-of-the-art weakly supervised method, achieves significant performance improvements compared to existing methods.
    OntoED: Low-resource Event Detection with Ontology Embedding. (arXiv:2105.10922v4 [cs.IR] CROSS LISTED)
    Event Detection (ED) aims to identify event trigger words from a given text and classify it into an event type. Most of current methods to ED rely heavily on training instances, and almost ignore the correlation of event types. Hence, they tend to suffer from data scarcity and fail to handle new unseen event types. To address these problems, we formulate ED as a process of event ontology population: linking event instances to pre-defined event types in event ontology, and propose a novel ED framework entitled OntoED with ontology embedding. We enrich event ontology with linkages among event types, and further induce more event-event correlations. Based on the event ontology, OntoED can leverage and propagate correlation knowledge, particularly from data-rich to data-poor event types. Furthermore, OntoED can be applied to new unseen event types, by establishing linkages to existing ones. Experiments indicate that OntoED is more predominant and robust than previous approaches to ED, especially in data-scarce scenarios.
    A general Markov decision process formalism for action-state entropy-regularized reward maximization. (arXiv:2302.01098v1 [cs.LG])
    Previous work has separately addressed different forms of action, state and action-state entropy regularization, pure exploration and space occupation. These problems have become extremely relevant for regularization, generalization, speeding up learning and providing robust solutions at unprecedented levels. However, solutions of those problems are hectic, ranging from convex and non-convex optimization, and unconstrained optimization to constrained optimization. Here we provide a general dual function formalism that transforms the constrained optimization problem into an unconstrained convex one for any mixture of action and state entropies. The cases with pure action entropy and pure state entropy are understood as limits of the mixture.
    Learning Globally Smooth Functions on Manifolds. (arXiv:2210.00301v3 [cs.LG] UPDATED)
    Smoothness and low dimensional structures play central roles in improving generalization and stability in learning and statistics. This work combines techniques from semi-infinite constrained learning and manifold regularization to learn representations that are globally smooth on a manifold. To do so, it shows that under typical conditions the problem of learning a Lipschitz continuous function on a manifold is equivalent to a dynamically weighted manifold regularization problem. This observation leads to a practical algorithm based on a weighted Laplacian penalty whose weights are adapted using stochastic gradient techniques. It is shown that under mild conditions, this method estimates the Lipschitz constant of the solution, learning a globally smooth solution as a byproduct. Experiments on real world data illustrate the advantages of the proposed method relative to existing alternatives.
    Boundary-Aware Uncertainty for Feature Attribution Explainers. (arXiv:2210.02419v3 [cs.LG] UPDATED)
    Post-hoc explanation methods have become a critical tool for understanding black-box classifiers in high-stakes applications, precipitating a need for reliable explanations. Nevertheless, recent works have shown that many existing methods can be inconsistent or lack robustness. In addition, high-performing classifiers are often highly nonlinear and can exhibit complex behavior around the decision boundary, leading to brittle or misleading local explanations. Therefore there is an impending need to quantify the uncertainty of such explanation methods in order to understand when explanations are trustworthy. In this work, we propose a novel geodesic-based kernel which captures the complexity of the target black-box decision boundary. We show theoretically that the proposed kernel similarity increases with the complexity of the decision boundary. In addition, we introduce the Gaussian Process Explanation UnCertainty (GPEC) framework, which generates a unified uncertainty estimate combining decision boundary-aware uncertainty with existing explanation uncertainty methods. The proposed framework is highly flexible; it can be used with any black-box classifier and feature attribution method. Empirical results on multiple tabular and image datasets show that the GPEC uncertainty estimate improves understanding of explanations as compared to existing methods.
    Encouraging Intra-Class Diversity Through a Reverse Contrastive Loss for Better Single-Source Domain Generalization. (arXiv:2106.07916v2 [cs.CV] CROSS LISTED)
    Traditional deep learning algorithms often fail to generalize when they are tested outside of the domain of the training data. The issue can be mitigated by using unlabeled data from the target domain at training time, but because data distributions can change dynamically in real-life applications once a learned model is deployed, it is critical to create networks robust to unknown and unforeseen domain shifts. In this paper we focus on one of the reasons behind the inability of neural networks to be so: deep networks focus only on the most obvious, potentially spurious, clues to make their predictions and are blind to useful but slightly less efficient or more complex patterns. This behaviour has been identified and several methods partially addressed the issue. To investigate their effectiveness and limits, we first design a publicly available MNIST-based benchmark to precisely measure the ability of an algorithm to find the ''hidden'' patterns. Then, we evaluate state-of-the-art algorithms through our benchmark and show that the issue is largely unsolved. Finally, we propose a partially reversed contrastive loss to encourage intra-class diversity and find less strongly correlated patterns, whose efficiency is demonstrated by our experiments.
    An Exponentially Increasing Step-size for Parameter Estimation in Statistical Models. (arXiv:2205.07999v2 [stat.ML] UPDATED)
    Using gradient descent (GD) with fixed or decaying step-size is a standard practice in unconstrained optimization problems. However, when the loss function is only locally convex, such a step-size schedule artificially slows GD down as it cannot explore the flat curvature of the loss function. To overcome that issue, we propose to exponentially increase the step-size of the GD algorithm. Under homogeneous assumptions on the loss function, we demonstrate that the iterates of the proposed \emph{exponential step size gradient descent} (EGD) algorithm converge linearly to the optimal solution. Leveraging that optimization insight, we then consider using the EGD algorithm for solving parameter estimation under both regular and non-regular statistical models whose loss function becomes locally convex when the sample size goes to infinity. We demonstrate that the EGD iterates reach the final statistical radius within the true parameter after a logarithmic number of iterations, which is in stark contrast to a \emph{polynomial} number of iterations of the GD algorithm in non-regular statistical models. Therefore, the total computational complexity of the EGD algorithm is \emph{optimal} and exponentially cheaper than that of the GD for solving parameter estimation in non-regular statistical models while being comparable to that of the GD in regular statistical settings. To the best of our knowledge, it resolves a long-standing gap between statistical and algorithmic computational complexities of parameter estimation in non-regular statistical models. Finally, we provide targeted applications of the general theory to several classes of statistical models, including generalized linear models with polynomial link functions and location Gaussian mixture models.
    On the Effective Number of Linear Regions in Shallow Univariate ReLU Networks: Convergence Guarantees and Implicit Bias. (arXiv:2205.09072v2 [cs.LG] UPDATED)
    We study the dynamics and implicit bias of gradient flow (GF) on univariate ReLU neural networks with a single hidden layer in a binary classification setting. We show that when the labels are determined by the sign of a target network with $r$ neurons, with high probability over the initialization of the network and the sampling of the dataset, GF converges in direction (suitably defined) to a network achieving perfect training accuracy and having at most $\mathcal{O}(r)$ linear regions, implying a generalization bound. Unlike many other results in the literature, under an additional assumption on the distribution of the data, our result holds even for mild over-parameterization, where the width is $\tilde{\mathcal{O}}(r)$ and independent of the sample size.
    Mini-Batch Learning Strategies for modeling long term temporal dependencies: A study in environmental applications. (arXiv:2210.08347v2 [cs.LG] UPDATED)
    In many environmental applications, recurrent neural networks (RNNs) are often used to model physical variables with long temporal dependencies. However, due to mini-batch training, temporal relationships between training segments within the batch (intra-batch) as well as between batches (inter-batch) are not considered, which can lead to limited performance. Stateful RNNs aim to address this issue by passing hidden states between batches. Since Stateful RNNs ignore intra-batch temporal dependency, there exists a trade-off between training stability and capturing temporal dependency. In this paper, we provide a quantitative comparison of different Stateful RNN modeling strategies, and propose two strategies to enforce both intra- and inter-batch temporal dependency. First, we extend Stateful RNNs by defining a batch as a temporally ordered set of training segments, which enables intra-batch sharing of temporal information. While this approach significantly improves the performance, it leads to much larger training times due to highly sequential training. To address this issue, we further propose a new strategy which augments a training segment with an initial value of the target variable from the timestep right before the starting of the training segment. In other words, we provide an initial value of the target variable as additional input so that the network can focus on learning changes relative to that initial value. By using this strategy, samples can be passed in any order (mini-batch training) which significantly reduces the training time while maintaining the performance. In demonstrating our approach in hydrological modeling, we observe that the most significant gains in predictive accuracy occur when these methods are applied to state variables whose values change more slowly, such as soil water and snowpack, rather than continuously moving flux variables such as streamflow.
    An Advantage Using Feature Selection with a Quantum Annealer. (arXiv:2211.09756v3 [quant-ph] UPDATED)
    Feature selection is a technique in statistical prediction modeling that identifies features in a record with a strong statistical connection to the target variable. Excluding features with a weak statistical connection to the target variable in training not only drops the dimension of the data, which decreases the time complexity of the algorithm, it also decreases noise within the data which assists in avoiding overfitting. In all, feature selection assists in training a robust statistical model that performs well and is stable. Given the lack of scalability in classical computation, current techniques only consider the predictive power of the feature and not redundancy between the features themselves. Recent advancements in feature selection that leverages quantum annealing (QA) gives a scalable technique that aims to maximize the predictive power of the features while minimizing redundancy. As a consequence, it is expected that this algorithm would assist in the bias/variance trade-off yielding better features for training a statistical model. This paper tests this intuition against classical methods by utilizing open-source data sets and evaluate the efficacy of each trained statistical model well-known prediction algorithms. The numerical results display an advantage utilizing the features selected from the algorithm that leveraged QA.
    Optimization-Based Separations for Neural Networks. (arXiv:2112.02393v3 [cs.LG] UPDATED)
    Depth separation results propose a possible theoretical explanation for the benefits of deep neural networks over shallower architectures, establishing that the former possess superior approximation capabilities. However, there are no known results in which the deeper architecture leverages this advantage into a provable optimization guarantee. We prove that when the data are generated by a distribution with radial symmetry which satisfies some mild assumptions, gradient descent can efficiently learn ball indicator functions using a depth 2 neural network with two layers of sigmoidal activations, and where the hidden layer is held fixed throughout training. By building on and refining existing techniques for approximation lower bounds of neural networks with a single layer of non-linearities, we show that there are $d$-dimensional radial distributions on the data such that ball indicators cannot be learned efficiently by any algorithm to accuracy better than $\Omega(d^{-4})$, nor by a standard gradient descent implementation to accuracy better than a constant. These results establish what is to the best of our knowledge, the first optimization-based separations where the approximation benefits of the stronger architecture provably manifest in practice. Our proof technique introduces new tools and ideas that may be of independent interest in the theoretical study of both the approximation and optimization of neural networks.
    Neural Estimation of the Rate-Distortion Function With Applications to Operational Source Coding. (arXiv:2204.01612v2 [cs.IT] UPDATED)
    A fundamental question in designing lossy data compression schemes is how well one can do in comparison with the rate-distortion function, which describes the known theoretical limits of lossy compression. Motivated by the empirical success of deep neural network (DNN) compressors on large, real-world data, we investigate methods to estimate the rate-distortion function on such data, which would allow comparison of DNN compressors with optimality. While one could use the empirical distribution of the data and apply the Blahut-Arimoto algorithm, this approach presents several computational challenges and inaccuracies when the datasets are large and high-dimensional, such as the case of modern image datasets. Instead, we re-formulate the rate-distortion objective, and solve the resulting functional optimization problem using neural networks. We apply the resulting rate-distortion estimator, called NERD, on popular image datasets, and provide evidence that NERD can accurately estimate the rate-distortion function. Using our estimate, we show that the rate-distortion achievable by DNN compressors are within several bits of the rate-distortion function for real-world datasets. Additionally, NERD provides access to the rate-distortion achieving channel, as well as samples from its output marginal. Therefore, using recent results in reverse channel coding, we describe how NERD can be used to construct an operational one-shot lossy compression scheme with guarantees on the achievable rate and distortion. Experimental results demonstrate competitive performance with DNN compressors.
    Information-theoretic limitations of data-based price discrimination. (arXiv:2204.12723v2 [cs.GT] UPDATED)
    This paper studies third-degree price discrimination (3PD) based on a random sample of valuation and covariate data, where the covariate is continuous, and the distribution of the data is unknown to the seller. The main results of this paper are twofold. The first set of results is pricing strategy independent and reveals the fundamental information-theoretic limitation of any data-based pricing strategy in revenue generation for two cases: 3PD and uniform pricing. The second set of results proposes the $K$-markets empirical revenue maximization (ERM) strategy and shows that the $K$-markets ERM and the uniform ERM strategies achieve the optimal rate of convergence in revenue to that generated by their respective true-distribution 3PD and uniform pricing optima. Our theoretical and numerical results suggest that the uniform (i.e., $1$-market) ERM strategy generates a larger revenue than the $K$-markets ERM strategy when the sample size is small enough, and vice versa.
    Revisiting Simple Regret: Fast Rates for Returning a Good Arm. (arXiv:2210.16913v2 [cs.LG] UPDATED)
    Simple regret is a natural and parameter-free performance criterion for pure exploration in multi-armed bandits yet is less popular than the probability of missing the best arm or an $\epsilon$-good arm, perhaps due to lack of easy ways to characterize it. In this paper, we make significant progress on minimizing simple regret in both data-rich ($T\ge n$) and data-poor regime ($T \le n$) where $n$ is the number of arms, and $T$ is the number of samples. At its heart is our improved instance-dependent analysis of the well-known Sequential Halving (SH) algorithm, where we bound the probability of returning an arm whose mean reward is not within $\epsilon$ from the best (i.e., not $\epsilon$-good) for \textit{any} choice of $\epsilon>0$, although $\epsilon$ is not an input to SH. Our bound not only leads to an optimal worst-case simple regret bound of $\sqrt{n/T}$ up to logarithmic factors but also essentially matches the instance-dependent lower bound for returning an $\epsilon$-good arm reported by Katz-Samuels and Jamieson (2020). For the more challenging data-poor regime, we propose Bracketing SH (BSH) that enjoys the same improvement even without sampling each arm at least once. Our empirical study shows that BSH outperforms existing methods on real-world tasks.
    Correlated Initialization for Correlated Data. (arXiv:2003.04422v2 [cs.LG] UPDATED)
    Spatial data exhibits the property that nearby points are correlated. This also holds for learnt representations across layers, but not for commonly used weight initialization methods. Our theoretical analysis quantifies the learning behavior of weights of a single spatial filter. It is thus in contrast to a large body of work that discusses statistical properties of weights. It shows that uncorrelated initialization (i) might lead to poor convergence behavior and (ii) training of (some) parameters is likely subject to slow convergence. Empirical analysis shows that these findings for a single spatial filter extend to networks with many spatial filters. The impact of (correlated) initialization depends strongly on learning rates and l2-regularization.
    The KFIoU Loss for Rotated Object Detection. (arXiv:2201.12558v5 [cs.CV] UPDATED)
    Differing from the well-developed horizontal object detection area whereby the computing-friendly IoU based loss is readily adopted and well fits with the detection metrics. In contrast, rotation detectors often involve a more complicated loss based on SkewIoU which is unfriendly to gradient-based training. In this paper, we propose an effective approximate SkewIoU loss based on Gaussian modeing and Kalman filter, which mainly consists of two items. The first term is a scale-insensitive center point loss, which is used to quickly get the center points between bounding boxes closer to assist the second term. In the distance-independent second term, Kalman filter is adopted to inherently mimic the mechanism of SkewIoU by its definition, and show its alignment with the SkewIoU loss at trend-level within a certain distance (i.e. within 9 pixels). This is in contrast to recent Gaussian modeling based rotation detectors e.g. GWD loss and KLD loss that involve a human-specified distribution distance metric which require additional hyperparameter tuning that vary across datasets and detectors. The resulting new loss called KFIoU loss is easier to implement and works better compared with exact SkewIoU loss, thanks to its full differentiability and ability to handle the non-overlapping cases. We further extend our technique to the 3-D case which also suffers from the same issues as 2-D detection. Extensive results on various public datasets (2-D/3-D, aerial/text/face images) with different base detectors show the effectiveness of our approach.
    Unravelling the Performance of Physics-informed Graph Neural Networks for Dynamical Systems. (arXiv:2211.05520v2 [cs.LG] UPDATED)
    Recently, graph neural networks have been gaining a lot of attention to simulate dynamical systems due to their inductive nature leading to zero-shot generalizability. Similarly, physics-informed inductive biases in deep-learning frameworks have been shown to give superior performance in learning the dynamics of physical systems. There is a growing volume of literature that attempts to combine these two approaches. Here, we evaluate the performance of thirteen different graph neural networks, namely, Hamiltonian and Lagrangian graph neural networks, graph neural ODE, and their variants with explicit constraints and different architectures. We briefly explain the theoretical formulation highlighting the similarities and differences in the inductive biases and graph architecture of these systems. We evaluate these models on spring, pendulum, gravitational, and 3D deformable solid systems to compare the performance in terms of rollout error, conserved quantities such as energy and momentum, and generalizability to unseen system sizes. Our study demonstrates that GNNs with additional inductive biases, such as explicit constraints and decoupling of kinetic and potential energies, exhibit significantly enhanced performance. Further, all the physics-informed GNNs exhibit zero-shot generalizability to system sizes an order of magnitude larger than the training system, thus providing a promising route to simulate large-scale realistic systems.
    Meta-Learning with Dynamic-Memory-Based Prototypical Network for Few-Shot Event Detection. (arXiv:1910.11621v2 [cs.CL] CROSS LISTED)
    Event detection (ED), a sub-task of event extraction, involves identifying triggers and categorizing event mentions. Existing methods primarily rely upon supervised learning and require large-scale labeled event datasets which are unfortunately not readily available in many real-life applications. In this paper, we consider and reformulate the ED task with limited labeled data as a Few-Shot Learning problem. We propose a Dynamic-Memory-Based Prototypical Network (DMB-PN), which exploits Dynamic Memory Network (DMN) to not only learn better prototypes for event types, but also produce more robust sentence encodings for event mentions. Differing from vanilla prototypical networks simply computing event prototypes by averaging, which only consume event mentions once, our model is more robust and is capable of distilling contextual information from event mentions for multiple times due to the multi-hop mechanism of DMNs. The experiments show that DMB-PN not only deals with sample scarcity better than a series of baseline models but also performs more robustly when the variety of event types is relatively large and the instance quantity is extremely small.
    Augmentation Component Analysis: Modeling Similarity via the Augmentation Overlaps. (arXiv:2206.00471v2 [cs.LG] UPDATED)
    Self-supervised learning aims to learn a embedding space where semantically similar samples are close. Contrastive learning methods pull views of samples together and push different samples away, which utilizes semantic invariance of augmentation but ignores the relationship between samples. To better exploit the power of augmentation, we observe that semantically similar samples are more likely to have similar augmented views. Therefore, we can take the augmented views as a special description of a sample. In this paper, we model such a description as the augmentation distribution and we call it augmentation feature. The similarity in augmentation feature reflects how much the views of two samples overlap and is related to their semantical similarity. Without computational burdens to explicitly estimate values of the augmentation feature, we propose Augmentation Component Analysis (ACA) with a contrastive-like loss to learn principal components and an on-the-fly projection loss to embed data. ACA equals an efficient dimension reduction by PCA and extracts low-dimensional embeddings, theoretically preserving the similarity of augmentation distribution between samples. Empirical results show our method can achieve competitive results against various traditional contrastive learning methods on different benchmarks.
    Provably Doubly Accelerated Federated Learning: The First Theoretically Successful Combination of Local Training and Communication Compression. (arXiv:2210.13277v3 [cs.LG] UPDATED)
    In federated learning, a large number of users are involved in a global learning task, in a collaborative way. They alternate local computations and two-way communication with a distant orchestrating server. Communication, which can be slow and costly, is the main bottleneck in this setting. To reduce the communication load and therefore accelerate distributed gradient descent, two strategies are popular: 1) communicate less frequently; that is, perform several iterations of local computations between the communication rounds; and 2) communicate compressed information instead of full-dimensional vectors. We propose the first algorithm for distributed optimization and federated learning, which harnesses these two strategies jointly and converges linearly to an exact solution in the strongly convex setting, with a doubly accelerated rate: our algorithm benefits from the two acceleration mechanisms provided by local training and compression, namely a better dependency on the condition number of the functions and on the dimension of the model, respectively.
    $IC^3$: Image Captioning by Committee Consensus. (arXiv:2302.01328v1 [cs.CV])
    If you ask a human to describe an image, they might do so in a thousand different ways. Traditionally, image captioning models are trained to approximate the reference distribution of image captions, however, doing so encourages captions that are viewpoint-impoverished. Such captions often focus on only a subset of the possible details, while ignoring potentially useful information in the scene. In this work, we introduce a simple, yet novel, method: "Image Captioning by Committee Consensus" ($IC^3$), designed to generate a single caption that captures high-level details from several viewpoints. Notably, humans rate captions produced by $IC^3$ at least as helpful as baseline SOTA models more than two thirds of the time, and $IC^3$ captions can improve the performance of SOTA automated recall systems by up to 84%, indicating significant material improvements over existing SOTA approaches for visual description. Our code is publicly available at https://github.com/DavidMChan/caption-by-committee
    Lower Bounds for Learning in Revealing POMDPs. (arXiv:2302.01333v1 [cs.LG])
    This paper studies the fundamental limits of reinforcement learning (RL) in the challenging \emph{partially observable} setting. While it is well-established that learning in Partially Observable Markov Decision Processes (POMDPs) requires exponentially many samples in the worst case, a surge of recent work shows that polynomial sample complexities are achievable under the \emph{revealing condition} -- A natural condition that requires the observables to reveal some information about the unobserved latent states. However, the fundamental limits for learning in revealing POMDPs are much less understood, with existing lower bounds being rather preliminary and having substantial gaps from the current best upper bounds. We establish strong PAC and regret lower bounds for learning in revealing POMDPs. Our lower bounds scale polynomially in all relevant problem parameters in a multiplicative fashion, and achieve significantly smaller gaps against the current best upper bounds, providing a solid starting point for future studies. In particular, for \emph{multi-step} revealing POMDPs, we show that (1) the latent state-space dependence is at least $\Omega(S^{1.5})$ in the PAC sample complexity, which is notably harder than the $\widetilde{\Theta}(S)$ scaling for fully-observable MDPs; (2) Any polynomial sublinear regret is at least $\Omega(T^{2/3})$, suggesting its fundamental difference from the \emph{single-step} case where $\widetilde{O}(\sqrt{T})$ regret is achievable. Technically, our hard instance construction adapts techniques in \emph{distribution testing}, which is new to the RL literature and may be of independent interest.
    Dual PatchNorm. (arXiv:2302.01327v1 [cs.CV])
    We propose Dual PatchNorm: two Layer Normalization layers (LayerNorms), before and after the patch embedding layer in Vision Transformers. We demonstrate that Dual PatchNorm outperforms the result of exhaustive search for alternative LayerNorm placement strategies in the Transformer block itself. In our experiments, incorporating this trivial modification, often leads to improved accuracy over well-tuned Vision Transformers and never hurts.
    Bayesian Metric Learning for Uncertainty Quantification in Image Retrieval. (arXiv:2302.01332v1 [cs.LG])
    We propose the first Bayesian encoder for metric learning. Rather than relying on neural amortization as done in prior works, we learn a distribution over the network weights with the Laplace Approximation. We actualize this by first proving that the contrastive loss is a valid log-posterior. We then propose three methods that ensure a positive definite Hessian. Lastly, we present a novel decomposition of the Generalized Gauss-Newton approximation. Empirically, we show that our Laplacian Metric Learner (LAM) estimates well-calibrated uncertainties, reliably detects out-of-distribution examples, and yields state-of-the-art predictive performance.
    The Value of Out-of-Distribution Data. (arXiv:2208.10967v3 [cs.LG] UPDATED)
    We expect the generalization error to improve with more samples from a similar task, and to deteriorate with more samples from an out-of-distribution (OOD) task. In this work, we show a counter-intuitive phenomenon: the generalization error of a task can be a non-monotonic function of the number of OOD samples. As the number of OOD samples increases, the generalization error on the target task improves before deteriorating beyond a threshold. In other words, there is value in training on small amounts of OOD data. We use Fisher's Linear Discriminant on synthetic datasets and deep networks on computer vision benchmarks such as MNIST, CIFAR-10, CINIC-10, PACS and DomainNet to demonstrate and analyze this phenomenon. In the idealistic setting where we know which samples are OOD, we show that these non-monotonic trends can be exploited using an appropriately weighted objective of the target and OOD empirical risk. While its practical utility is limited, this does suggest that if we can detect OOD samples, then there may be ways to benefit from them. When we do not know which samples are OOD, we show how a number of go-to strategies such as data-augmentation, hyper-parameter optimization, and pre-training are not enough to ensure that the target generalization error does not deteriorate with the number of OOD samples in the dataset.
    Computational Discovery of Microstructured Composites with Optimal Strength-Toughness Trade-Offs. (arXiv:2302.01078v1 [cond-mat.mtrl-sci])
    The conflict between strength and toughness is a fundamental problem in engineering materials design. However, systematic discovery of microstructured composites with optimal strength-toughness trade-offs has never been demonstrated due to the discrepancies between simulation and reality and the lack of data-efficient exploration of the entire Pareto front. Here, we report a widely applicable pipeline harnessing physical experiments, numerical simulations, and artificial neural networks to efficiently discover microstructured designs that are simultaneously tough and strong. Using a physics-based simulator with moderate complexity, our strategy runs a data-driven proposal-validation workflow in a nested-loop fashion to bridge the gap between simulation and reality in high sample efficiency. Without any prescribed expert knowledge of materials design, our approach automatically identifies existing toughness enhancement mechanisms that were traditionally discovered through trial-and-error or biomimicry. We provide a blueprint for the computational discovery of optimal designs, which inverts traditional scientific approaches, and is applicable to a wide range of research problems beyond composites, including polymer chemistry, fluid dynamics, meteorology, and robotics.
    Error estimates for physics informed neural networks approximating the Navier-Stokes equations. (arXiv:2203.09346v2 [math.NA] UPDATED)
    We prove rigorous bounds on the errors resulting from the approximation of the incompressible Navier-Stokes equations with (extended) physics informed neural networks. We show that the underlying PDE residual can be made arbitrarily small for tanh neural networks with two hidden layers. Moreover, the total error can be estimated in terms of the training error, network size and number of quadrature points. The theory is illustrated with numerical experiments.
    Predicting Molecule-Target Interaction by Learning Biomedical Network and Molecule Representations. (arXiv:2302.00981v1 [cs.LG])
    The study of molecule-target interaction is quite important for drug discovery in terms of target identification, pathway study, drug-drug interaction, etc. Most existing methodologies utilize either biomedical network information or molecule structural features to predict potential interaction link. However, the biomedical network information based methods usually suffer from cold start problem, while structure based methods often give limited performance due to the structure/interaction assumption and data quality. To address these issues, we propose a pseudo-siamese Graph Neural Network method, namely MTINet+, which learns both biomedical network topological and molecule structural/chemical information as representations to predict potential interaction of given molecule and target pair. In MTINet+, 1-hop subgraphs of given molecule and target pair are extracted from known interaction of biomedical network as topological information, meanwhile the molecule structural and chemical attributes are processed as molecule information. MTINet+ learns these two types of information as embedding features for predicting the pair link. In the experiments of different molecule-target interaction tasks, MTINet+ significantly outperforms over the state-of-the-art baselines. In addition, in our designed network sparsity experiments , MTINet+ shows strong robustness against different sparse biomedical networks.
    Bayesian Optimization of Multiple Objectives with Different Latencies. (arXiv:2302.01310v1 [stat.ML])
    Multi-objective Bayesian optimization aims to find the Pareto front of optimal trade-offs between a set of expensive objectives while collecting as few samples as possible. In some cases, it is possible to evaluate the objectives separately, and a different latency or evaluation cost can be associated with each objective. This presents an opportunity to learn the Pareto front faster by evaluating the cheaper objectives more frequently. We propose a scalarization based knowledge gradient acquisition function which accounts for the different evaluation costs of the objectives. We prove consistency of the algorithm and show empirically that it significantly outperforms a benchmark algorithm which always evaluates both objectives.
    Efficient Privacy-Preserving Stochastic Nonconvex Optimization. (arXiv:1910.13659v3 [cs.LG] UPDATED)
    While many solutions for privacy-preserving convex empirical risk minimization (ERM) have been developed, privacy-preserving nonconvex ERM remains a challenge. We study nonconvex ERM, which takes the form of minimizing a finite-sum of nonconvex loss functions over a training set. We propose a new differentially private stochastic gradient descent algorithm for nonconvex ERM that achieves strong privacy guarantees efficiently, and provide a tight analysis of its privacy and utility guarantees, as well as its gradient complexity. Our algorithm reduces gradient complexity while improves the best previous utility guarantee given by Wang et al. (NeurIPS 2017). Our experiments on benchmark nonconvex ERM problems demonstrate superior performance in terms of both training cost and utility gains compared with previous differentially private methods using the same privacy budgets.
    Best Possible Q-Learning. (arXiv:2302.01188v1 [cs.LG])
    Fully decentralized learning, where the global information, i.e., the actions of other agents, is inaccessible, is a fundamental challenge in cooperative multi-agent reinforcement learning. However, the convergence and optimality of most decentralized algorithms are not theoretically guaranteed, since the transition probabilities are non-stationary as all agents are updating policies simultaneously. To tackle this challenge, we propose best possible operator, a novel decentralized operator, and prove that the policies of agents will converge to the optimal joint policy if each agent independently updates its individual state-action value by the operator. Further, to make the update more efficient and practical, we simplify the operator and prove that the convergence and optimality still hold with the simplified one. By instantiating the simplified operator, the derived fully decentralized algorithm, best possible Q-learning (BQL), does not suffer from non-stationarity. Empirically, we show that BQL achieves remarkable improvement over baselines in a variety of cooperative multi-agent tasks.
    Factor Fields: A Unified Framework for Neural Fields and Beyond. (arXiv:2302.01226v1 [cs.CV])
    We present Factor Fields, a novel framework for modeling and representing signals. Factor Fields decomposes a signal into a product of factors, each of which is represented by a neural or regular field representation operating on a coordinate transformed input signal. We show that this decomposition yields a unified framework that generalizes several recent signal representations including NeRF, PlenOxels, EG3D, Instant-NGP, and TensoRF. Moreover, the framework allows for the creation of powerful new signal representations, such as the Coefficient-Basis Factorization (CoBaFa) which we propose in this paper. As evidenced by our experiments, CoBaFa leads to improvements over previous fast reconstruction methods in terms of the three critical goals in neural signal representation: approximation quality, compactness and efficiency. Experimentally, we demonstrate that our representation achieves better image approximation quality on 2D image regression tasks, higher geometric quality when reconstructing 3D signed distance fields and higher compactness for radiance field reconstruction tasks compared to previous fast reconstruction methods. Besides, our CoBaFa representation enables generalization by sharing the basis across signals during training, enabling generalization tasks such as image regression with sparse observations and few-shot radiance field reconstruction.
    Diagrammatization: Rationalizing with diagrammatic AI explanations for abductive reasoning on hypotheses. (arXiv:2302.01241v1 [cs.AI])
    Many visualizations have been developed for explainable AI (XAI), but they often require further reasoning by users to interpret. We argue that XAI should support abductive reasoning - inference to the best explanation - with diagrammatic reasoning to convey hypothesis generation and evaluation. Inspired by Peircean diagrammatic reasoning and the 5-step abduction process, we propose Diagrammatization, an approach to provide diagrammatic, abductive explanations based on domain hypotheses. We implemented DiagramNet for a clinical application to predict diagnoses from heart auscultation, and explain with shape-based murmur diagrams. In modeling studies, we found that DiagramNet not only provides faithful murmur shape explanations, but also has better prediction performance than baseline models. We further demonstrate the usefulness of diagrammatic explanations in a qualitative user study with medical students, showing that clinically-relevant, diagrammatic explanations are preferred over technical saliency map explanations. This work contributes insights into providing domain-conventional abductive explanations for user-centric XAI.
    Geometric Deep Learning for Autonomous Driving: Unlocking the Power of Graph Neural Networks With CommonRoad-Geometric. (arXiv:2302.01259v1 [cs.LG])
    Heterogeneous graphs offer powerful data representations for traffic, given their ability to model the complex interaction effects among a varying number of traffic participants and the underlying road infrastructure. With the recent advent of graph neural networks (GNNs) as the accompanying deep learning framework, the graph structure can be efficiently leveraged for various machine learning applications such as trajectory prediction. As a first of its kind, our proposed Python framework offers an easy-to-use and fully customizable data processing pipeline to extract standardized graph datasets from traffic scenarios. Providing a platform for GNN-based autonomous driving research, it improves comparability between approaches and allows researchers to focus on model implementation instead of dataset curation.
    Neuro Symbolic Continual Learning: Knowledge, Reasoning Shortcuts and Concept Rehearsal. (arXiv:2302.01242v1 [cs.LG])
    We introduce Neuro-Symbolic Continual Learning, where a model has to solve a sequence of neuro-symbolic tasks, that is, it has to map sub-symbolic inputs to high-level concepts and compute predictions by reasoning consistently with prior knowledge. Our key observation is that neuro-symbolic tasks, although different, often share concepts whose semantics remains stable over time. Traditional approaches fall short: existing continual strategies ignore knowledge altogether, while stock neuro-symbolic architectures suffer from catastrophic forgetting. We show that leveraging prior knowledge by combining neuro-symbolic architectures with continual strategies does help avoid catastrophic forgetting, but also that doing so can yield models affected by reasoning shortcuts. These undermine the semantics of the acquired concepts, even when detailed prior knowledge is provided upfront and inference is exact, and in turn continual performance. To overcome these issues, we introduce COOL, a COncept-level cOntinual Learning strategy tailored for neuro-symbolic continual problems that acquires high-quality concepts and remembers them over time. Our experiments on three novel benchmarks highlights how COOL attains sustained high performance on neuro-symbolic continual learning tasks in which other strategies fail.
    Federated Analytics: A survey. (arXiv:2302.01326v1 [cs.LG])
    Federated analytics (FA) is a privacy-preserving framework for computing data analytics over multiple remote parties (e.g., mobile devices) or silo-ed institutional entities (e.g., hospitals, banks) without sharing the data among parties. Motivated by the practical use cases of federated analytics, we follow a systematic discussion on federated analytics in this article. In particular, we discuss the unique characteristics of federated analytics and how it differs from federated learning. We also explore a wide range of FA queries and discuss various existing solutions and potential use case applications for different FA queries.
    On the Efficacy of Differentially Private Few-shot Image Classification. (arXiv:2302.01190v1 [stat.ML])
    There has been significant recent progress in training differentially private (DP) models which achieve accuracy that approaches the best non-private models. These DP models are typically pretrained on large public datasets and then fine-tuned on downstream datasets that are (i) relatively large, and (ii) similar in distribution to the pretraining data. However, in many applications including personalization, it is crucial to perform well in the few-shot setting, as obtaining large amounts of labeled data may be problematic; and on images from a wide variety of domains for use in various specialist settings. To understand under which conditions few-shot DP can be effective, we perform an exhaustive set of experiments that reveals how the accuracy and vulnerability to attack of few-shot DP image classification models are affected as the number of shots per class, privacy level, model architecture, dataset, and subset of learnable parameters in the model vary. We show that to achieve DP accuracy on par with non-private models, the shots per class must be increased as the privacy level increases by as much as 32$\times$ for CIFAR-100 at $\epsilon=1$. We also find that few-shot non-private models are highly susceptible to membership inference attacks. DP provides clear mitigation against the attacks, but a small $\epsilon$ is required to effectively prevent them. Finally, we evaluate DP federated learning systems and establish state-of-the-art performance on the challenging FLAIR federated learning benchmark.
    Normalizing Flow Ensembles for Rich Aleatoric and Epistemic Uncertainty Modeling. (arXiv:2302.01312v1 [cs.LG])
    In this work, we demonstrate how to reliably estimate epistemic uncertainty while maintaining the flexibility needed to capture complicated aleatoric distributions. To this end, we propose an ensemble of Normalizing Flows (NF), which are state-of-the-art in modeling aleatoric uncertainty. The ensembles are created via sets of fixed dropout masks, making them less expensive than creating separate NF models. We demonstrate how to leverage the unique structure of NFs, base distributions, to estimate aleatoric uncertainty without relying on samples, provide a comprehensive set of baselines, and derive unbiased estimates for differential entropy. The methods were applied to a variety of experiments, commonly used to benchmark aleatoric and epistemic uncertainty estimation: 1D sinusoidal data, 2D windy grid-world ($\it{Wet Chicken}$), $\it{Pendulum}$, and $\it{Hopper}$. In these experiments, we setup an active learning framework and evaluate each model's capability at measuring aleatoric and epistemic uncertainty. The results show the advantages of using NF ensembles in capturing complicated aleatoric while maintaining accurate epistemic uncertainty estimates.
    UW-CVGAN: UnderWater Image Enhancement with Capsules Vectors Quantization. (arXiv:2302.01144v1 [cs.CV])
    The degradation in the underwater images is due to wavelength-dependent light attenuation, scattering, and to the diversity of the water types in which they are captured. Deep neural networks take a step in this field, providing autonomous models able to achieve the enhancement of underwater images. We introduce Underwater Capsules Vectors GAN UWCVGAN based on the discrete features quantization paradigm from VQGAN for this task. The proposed UWCVGAN combines an encoding network, which compresses the image into its latent representation, with a decoding network, able to reconstruct the enhancement of the image from the only latent representation. In contrast with VQGAN, UWCVGAN achieves feature quantization by exploiting the clusterization ability of capsule layer, making the model completely trainable and easier to manage. The model obtains enhanced underwater images with high quality and fine details. Moreover, the trained encoder is independent of the decoder giving the possibility to be embedded onto the collector as compressing algorithm to reduce the memory space required for the images, of factor $3\times$. \myUWCVGAN{ }is validated with quantitative and qualitative analysis on benchmark datasets, and we present metrics results compared with the state of the art.
    A comparative study of statistical and machine learning models on near-real-time daily emissions prediction. (arXiv:2302.01152v1 [cs.AI])
    The rapid ascent in carbon dioxide emissions is a major cause of global warming and climate change, which pose a huge threat to human survival and impose far-reaching influence on the global ecosystem. Therefore, it is very necessary to effectively control carbon dioxide emissions by accurately predicting and analyzing the change trend timely, so as to provide a reference for carbon dioxide emissions mitigation measures. This paper is aiming to select a suitable model to predict the near-real-time daily emissions based on univariate daily time-series data from January 1st, 2020 to September 30st, 2022 of all sectors (Power, Industry, Ground Transport, Residential, Domestic Aviation, International Aviation) in China. We proposed six prediction models, which including three statistical models: Grey prediction (GM(1,1)), autoregressive integrated moving average (ARIMA) and seasonal autoregressive integrated moving average with exogenous factors (SARIMAX); three machine learning models: artificial neural network (ANN), random forest (RF) and long short term memory (LSTM). To evaluate the performance of these models, five criteria: Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE) and Coefficient of Determination () are imported and discussed in detail. In the results, three machine learning models perform better than that three statistical models, in which LSTM model performs the best on five criteria values for daily emissions prediction with the 3.5179e-04 MSE value, 0.0187 RMSE value, 0.0140 MAE value, 14.8291% MAPE value and 0.9844 value.
    Temporal fusion transformer using variational mode decomposition for wind power forecasting. (arXiv:2302.01222v1 [cs.LG])
    The power output of a wind turbine depends on a variety of factors, including wind speed at different heights, wind direction, temperature and turbine properties. Wind speed and direction, in particular, have complex cycles and fluctuate dramatically, leading to large uncertainties in wind power output. This study uses variational mode decomposition (VMD) to decompose the wind power series and Temporal fusion transformer (TFT) to forecast wind power for the next 1h, 3h and 6h. The experimental results show that VMD outperforms other decomposition algorithms and the TFT model outperforms other decomposition models.
    Timewarp: Transferable Acceleration of Molecular Dynamics by Learning Time-Coarsened Dynamics. (arXiv:2302.01170v1 [stat.ML])
    Molecular dynamics (MD) simulation is a widely used technique to simulate molecular systems, most commonly at the all-atom resolution where the equations of motion are integrated with timesteps on the order of femtoseconds ($1\textrm{fs}=10^{-15}\textrm{s}$). MD is often used to compute equilibrium properties, which requires sampling from an equilibrium distribution such as the Boltzmann distribution. However, many important processes, such as binding and folding, occur over timescales of milliseconds or beyond, and cannot be efficiently sampled with conventional MD. Furthermore, new MD simulations need to be performed from scratch for each molecular system studied. We present Timewarp, an enhanced sampling method which uses a normalising flow as a proposal distribution in a Markov chain Monte Carlo method targeting the Boltzmann distribution. The flow is trained offline on MD trajectories and learns to make large steps in time, simulating the molecular dynamics of $10^{5} - 10^{6}\:\textrm{fs}$. Crucially, Timewarp is transferable between molecular systems: once trained, we show that it generalises to unseen small peptides (2-4 amino acids), exploring their metastable states and providing wall-clock acceleration when sampling compared to standard MD. Our method constitutes an important step towards developing general, transferable algorithms for accelerating MD.
    Dual Propagation: Accelerating Contrastive Hebbian Learning with Dyadic Neurons. (arXiv:2302.01228v1 [cs.LG])
    Activity difference based learning algorithms-such as contrastive Hebbian learning and equilibrium propagation-have been proposed as biologically plausible alternatives to error back-propagation. However, on traditional digital chips these algorithms suffer from having to solve a costly inference problem twice, making these approaches more than two orders of magnitude slower than back-propagation. In the analog realm equilibrium propagation may be promising for fast and energy efficient learning, but states still need to be inferred and stored twice. Inspired by lifted neural networks and compartmental neuron models we propose a simple energy based compartmental neuron model, termed dual propagation, in which each neuron is a dyad with two intrinsic states. At inference time these intrinsic states encode the error/activity duality through their difference and their mean respectively. The advantage of this method is that only a single inference phase is needed and that inference can be solved in layerwise closed-form. Experimentally we show on common computer vision datasets, including Imagenet32x32, that dual propagation performs equivalently to back-propagation both in terms of accuracy and runtime.
    Double Permutation Equivariance for Knowledge Graph Completion. (arXiv:2302.01313v1 [cs.LG])
    This work provides a formalization of Knowledge Graphs (KGs) as a new class of graphs that we denote doubly exchangeable attributed graphs, where node and pairwise (joint 2-node) representations must be equivariant to permutations of both node ids and edge (& node) attributes (relations & node features). Double-permutation equivariant KG representations open a new research direction in KGs. We show that this equivariance imposes a structural representation of relations that allows neural networks to perform complex logical reasoning tasks in KGs. Finally, we introduce a general blueprint for such equivariant representations and test a simple GNN-based double-permutation equivariant neural architecture that achieve 100% Hits@10 test accuracy in both the WN18RRv1 and NELL995v1 inductive KG completion tasks, and can accurately perform logical reasoning tasks that no existing methods can perform, to the best of our knowledge.
    Fed-GLOSS-DP: Federated, Global Learning using Synthetic Sets with Record Level Differential Privacy. (arXiv:2302.01068v1 [cs.LG])
    This work proposes Fed-GLOSS-DP, a novel approach to privacy-preserving learning that uses synthetic data to train federated models. In our approach, the server recovers an approximation of the global loss landscape in a local neighborhood based on synthetic samples received from the clients. In contrast to previous, point-wise, gradient-based, linear approximation (such as FedAvg), our formulation enables a type of global optimization that is particularly beneficial in non-IID federated settings. We also present how it rigorously complements record-level differential privacy. Extensive results show that our novel formulation gives rise to considerable improvements in terms of convergence speed and communication costs. We argue that our new approach to federated learning can provide a potential path toward reconciling privacy and accountability by sending differentially private, synthetic data instead of gradient updates. The source code will be released upon publication.
    Practical Bandits: An Industry Perspective. (arXiv:2302.01223v1 [cs.LG])
    The bandit paradigm provides a unified modeling framework for problems that require decision-making under uncertainty. Because many business metrics can be viewed as rewards (a.k.a. utilities) that result from actions, bandit algorithms have seen a large and growing interest from industrial applications, such as search, recommendation and advertising. Indeed, with the bandit lens comes the promise of direct optimisation for the metrics we care about. Nevertheless, the road to successfully applying bandits in production is not an easy one. Even when the action space and rewards are well-defined, practitioners still need to make decisions regarding multi-arm or contextual approaches, on- or off-policy setups, delayed or immediate feedback, myopic or long-term optimisation, etc. To make matters worse, industrial platforms typically give rise to large action spaces in which existing approaches tend to break down. The research literature on these topics is broad and vast, but this can overwhelm practitioners, whose primary aim is to solve practical problems, and therefore need to decide on a specific instantiation or approach for each project. This tutorial will take a step towards filling that gap between the theory and practice of bandits. Our goal is to present a unified overview of the field and its existing terminology, concepts and algorithms -- with a focus on problems relevant to industry. We hope our industrial perspective will help future practitioners who wish to leverage the bandit paradigm for their application.
    Interventional and Counterfactual Inference with Diffusion Models. (arXiv:2302.00860v1 [stat.ML])
    We consider the problem of answering observational, interventional, and counterfactual queries in a causally sufficient setting where only observational data and the causal graph are available. Utilizing the recent developments in diffusion models, we introduce diffusion-based causal models (DCM) to learn causal mechanisms, that generate unique latent encodings to allow for direct sampling under interventions as well as abduction for counterfactuals. We utilize DCM to model structural equations, seeing that diffusion models serve as a natural candidate here since they encode each node to a latent representation, a proxy for the exogenous noise, and offer flexible and accurate modeling to provide reliable causal statements and estimates. Our empirical evaluations demonstrate significant improvements over existing state-of-the-art methods for answering causal queries. Our theoretical results provide a methodology for analyzing the counterfactual error for general encoder/decoder models which could be of independent interest.
    Reliable Prediction Intervals with Directly Optimized Inductive Conformal Regression for Deep Learning. (arXiv:2302.00872v1 [cs.LG])
    By generating prediction intervals (PIs) to quantify the uncertainty of each prediction in deep learning regression, the risk of wrong predictions can be effectively controlled. High-quality PIs need to be as narrow as possible, whilst covering a preset proportion of real labels. At present, many approaches to improve the quality of PIs can effectively reduce the width of PIs, but they do not ensure that enough real labels are captured. Inductive Conformal Predictor (ICP) is an algorithm that can generate effective PIs which is theoretically guaranteed to cover a preset proportion of data. However, typically ICP is not directly optimized to yield minimal PI width. However, in this study, we use Directly Optimized Inductive Conformal Regression (DOICR) that takes only the average width of PIs as the loss function and increases the quality of PIs through an optimized scheme under the validity condition that sufficient real labels are captured in the PIs. Benchmark experiments show that DOICR outperforms current state-of-the-art algorithms for regression problems using underlying Deep Neural Network structures for both tabular and image data.
    What Language Reveals about Perception: Distilling Psychophysical Knowledge from Large Language Models. (arXiv:2302.01308v1 [cs.CL])
    Understanding the extent to which the perceptual world can be recovered from language is a fundamental problem in cognitive science. We reformulate this problem as that of distilling psychophysical information from text and show how this can be done by combining large language models (LLMs) with a classic psychophysical method based on similarity judgments. Specifically, we use the prompt auto-completion functionality of GPT3, a state-of-the-art LLM, to elicit similarity scores between stimuli and then apply multidimensional scaling to uncover their underlying psychological space. We test our approach on six perceptual domains and show that the elicited judgments strongly correlate with human data and successfully recover well-known psychophysical structures such as the color wheel and pitch spiral. We also explore meaningful divergences between LLM and human representations. Our work showcases how combining state-of-the-art machine models with well-known cognitive paradigms can shed new light on fundamental questions in perception and language research.
    Are Diffusion Models Vulnerable to Membership Inference Attacks?. (arXiv:2302.01316v1 [cs.CV])
    Diffusion-based generative models have shown great potential for image synthesis, but there is a lack of research on the security and privacy risks they may pose. In this paper, we investigate the vulnerability of diffusion models to Membership Inference Attacks (MIAs), a common privacy concern. Our results indicate that existing MIAs designed for GANs or VAE are largely ineffective on diffusion models, either due to inapplicable scenarios (e.g., requiring the discriminator of GANs) or inappropriate assumptions (e.g., closer distances between synthetic images and member images). To address this gap, we propose Step-wise Error Comparing Membership Inference (SecMI), a black-box MIA that infers memberships by assessing the matching of forward process posterior estimation at each timestep. SecMI follows the common overfitting assumption in MIA where member samples normally have smaller estimation errors, compared with hold-out samples. We consider both the standard diffusion models, e.g., DDPM, and the text-to-image diffusion models, e.g., Stable Diffusion. Experimental results demonstrate that our methods precisely infer the membership with high confidence on both of the two scenarios across six different datasets
    MonoFlow: Rethinking Divergence GANs via the Perspective of Differential Equations. (arXiv:2302.01075v1 [stat.ML])
    The conventional understanding of adversarial training in generative adversarial networks (GANs) is that the discriminator is trained to estimate a divergence, and the generator learns to minimize this divergence. We argue that despite the fact that many variants of GANs were developed following this paradigm, the current theoretical understanding of GANs and their practical algorithms are inconsistent. In this paper, we leverage Wasserstein gradient flows which characterize the evolution of particles in the sample space, to gain theoretical insights and algorithmic inspiration of GANs. We introduce a unified generative modeling framework - MonoFlow: the particle evolution is rescaled via a monotonically increasing mapping of the log density ratio. Under our framework, adversarial training can be viewed as a procedure first obtaining MonoFlow's vector field via training the discriminator and the generator learns to draw the particle flow defined by the corresponding vector field. We also reveal the fundamental difference between variational divergence minimization and adversarial training. This analysis helps us to identify what types of generator loss functions can lead to the successful training of GANs and suggest that GANs may have more loss designs beyond the literature (e.g., non-saturated loss), as long as they realize MonoFlow. Consistent empirical studies are included to validate the effectiveness of our framework.
    Convolutional Autoencoders, Clustering and POD for Low-dimensional Parametrization of Navier-Stokes Equations. (arXiv:2302.01278v1 [math.DS])
    Simulations of large-scale dynamical systems require expensive computations. Low-dimensional parametrization of high-dimensional states such as Proper Orthogonal Decomposition (POD) can be a solution to lessen the burdens by providing a certain compromise between accuracy and model complexity. However, for really low-dimensional parametrizations (for example for controller design) linear methods like the POD come to their natural limits so that nonlinear approaches will be the methods of choice. In this work we propose a convolutional autoencoder (CAE) consisting of a nonlinear encoder and an affine linear decoder and consider combinations with k-means clustering for improved encoding performance. The proposed set of methods is compared to the standard POD approach in two cylinder-wake scenarios modeled by the incompressible Navier-Stokes equations.
    MARLIN: Soft Actor-Critic based Reinforcement Learning for Congestion Control in Real Networks. (arXiv:2302.01301v1 [cs.LG])
    Fast and efficient transport protocols are the foundation of an increasingly distributed world. The burden of continuously delivering improved communication performance to support next-generation applications and services, combined with the increasing heterogeneity of systems and network technologies, has promoted the design of Congestion Control (CC) algorithms that perform well under specific environments. The challenge of designing a generic CC algorithm that can adapt to a broad range of scenarios is still an open research question. To tackle this challenge, we propose to apply a novel Reinforcement Learning (RL) approach. Our solution, MARLIN, uses the Soft Actor-Critic algorithm to maximize both entropy and return and models the learning process as an infinite-horizon task. We trained MARLIN on a real network with varying background traffic patterns to overcome the sim-to-real mismatch that researchers have encountered when applying RL to CC. We evaluated our solution on the task of file transfer and compared it to TCP Cubic. While further research is required, results have shown that MARLIN can achieve comparable results to TCP with little hyperparameter tuning, in a task significantly different from its training setting. Therefore, we believe that our work represents a promising first step toward building CC algorithms based on the maximum entropy RL framework.
    Energy Efficiency of Training Neural Network Architectures: An Empirical Study. (arXiv:2302.00967v1 [cs.LG])
    The evaluation of Deep Learning models has traditionally focused on criteria such as accuracy, F1 score, and related measures. The increasing availability of high computational power environments allows the creation of deeper and more complex models. However, the computations needed to train such models entail a large carbon footprint. In this work, we study the relations between DL model architectures and their environmental impact in terms of energy consumed and CO$_2$ emissions produced during training by means of an empirical study using Deep Convolutional Neural Networks. Concretely, we study: (i) the impact of the architecture and the location where the computations are hosted on the energy consumption and emissions produced; (ii) the trade-off between accuracy and energy efficiency; and (iii) the difference on the method of measurement of the energy consumed using software-based and hardware-based tools.
    Curriculum Learning for ab initio Deep Learned Refractive Optics. (arXiv:2302.01089v1 [cs.CV])
    Deep lens optimization has recently emerged as a new paradigm for designing computational imaging systems, however it has been limited to either simple optical systems consisting of a single DOE or metalens, or the fine-tuning of compound lenses from good initial designs. Here we present a deep lens design method based on curriculum learning, which is able to learn optical designs of compound lenses ab initio from randomly initialized surfaces, therefore overcoming the need for a good initial design. We demonstrate this approach with the fully-automatic design of an extended depth-of-field computational camera in a cellphone-style form factor, highly aspherical surfaces, and a short back focal length.
    Convolutional Neural Operators. (arXiv:2302.01178v1 [cs.LG])
    Although very successfully used in machine learning, convolution based neural network architectures -- believed to be inconsistent in function space -- have been largely ignored in the context of learning solution operators of PDEs. Here, we adapt convolutional neural networks to demonstrate that they are indeed able to process functions as inputs and outputs. The resulting architecture, termed as convolutional neural operators (CNOs), is shown to significantly outperform competing models on benchmark experiments, paving the way for the design of an alternative robust and accurate framework for learning operators.
    Online Bidding in Repeated Non-Truthful Auctions under Budget and ROI Constraints. (arXiv:2302.01203v1 [cs.GT])
    Online advertising platforms typically use auction mechanisms to allocate ad placements. Advertisers participate in a series of repeated auctions, and must select bids that will maximize their overall rewards while adhering to certain constraints. We focus on the scenario in which the advertiser has budget and return-on-investment (ROI) constraints. We investigate the problem of budget- and ROI-constrained bidding in repeated non-truthful auctions, such as first-price auctions, and present a best-of-both-worlds framework with no-regret guarantees under both stochastic and adversarial inputs. By utilizing the notion of interval regret, we demonstrate that our framework does not require knowledge of specific parameters of the problem which could be difficult to determine in practice. Our proof techniques can be applied to both the adversarial and stochastic cases with minimal modifications, thereby providing a unified perspective on the two problems. In the adversarial setting, we also show that it is possible to loosen the traditional requirement of having a strictly feasible solution to the offline optimization problem at each round.
    Conditional expectation for missing data imputation. (arXiv:2302.00911v1 [stat.ML])
    Missing data is common in datasets retrieved in various areas, such as medicine, sports, and finance. In many cases, to enable proper and reliable analyses of such data, the missing values are often imputed, and it is necessary that the method used has a low root mean square error (RMSE) between the imputed and the true values. In addition, for some critical applications, it is also often a requirement that the logic behind the imputation is explainable, which is especially difficult for complex methods that are for example, based on deep learning. This motivates us to introduce a conditional Distribution based Imputation of Missing Values (DIMV) algorithm. This approach works based on finding the conditional distribution of a feature with missing entries based on the fully observed features. As will be illustrated in the paper, DIMV (i) gives a low RMSE for the imputed values compared to state-of-the-art methods under comparison; (ii) is explainable; (iii) can provide an approximated confidence region for the missing values in a given sample; (iv) works for both small and large scale data; (v) in many scenarios, does not require a huge number of parameters as deep learning approaches and therefore can be used for mobile devices or web browsers; and (vi) is robust to the normally distributed assumption that its theoretical grounds rely on. In addition to DIMV, we also introduce the DPER* algorithm improving the speed of DPER for estimating the mean and covariance matrix from the data, and we confirm the speed-up via experiments.
    Speed-Oblivious Online Scheduling: Knowing (Precise) Speeds is not Necessary. (arXiv:2302.00985v1 [cs.DS])
    We consider online scheduling on unrelated (heterogeneous) machines in a speed-oblivious setting, where an algorithm is unaware of the exact job-dependent processing speeds. We show strong impossibility results for clairvoyant and non-clairvoyant algorithms and overcome them in models inspired by practical settings: (i) we provide competitive learning-augmented algorithms, assuming that (possibly erroneous) predictions on the speeds are given, and (ii) we provide competitive algorithms for the speed-ordered model, where a single global order of machines according to their unknown job-dependent speeds is known. We prove strong theoretical guarantees and evaluate our findings on a representative heterogeneous multi-core processor. These seem to be the first empirical results for algorithms with predictions that are performed in a non-synthetic environment on real hardware.
    Imitating careful experts to avoid catastrophic events. (arXiv:2302.01193v1 [cs.LG])
    RL is increasingly being used to control robotic systems that interact closely with humans. This interaction raises the problem of safe RL: how to ensure that a RL-controlled robotic system never, for instance, injures a human. This problem is especially challenging in rich, realistic settings where it is not even possible to clearly write down a reward function which incorporates these outcomes. In these circumstances, perhaps the only viable approach is based on IRL, which infers rewards from human demonstrations. However, IRL is massively underdetermined as many different rewards can lead to the same optimal policies; we show that this makes it difficult to distinguish catastrophic outcomes (such as injuring a human) from merely undesirable outcomes. Our key insight is that humans do display different behaviour when catastrophic outcomes are possible: they become much more careful. We incorporate carefulness signals into IRL, and find that they do indeed allow IRL to disambiguate undesirable from catastrophic outcomes, which is critical to ensuring safety in future real-world human-robot interactions.
    Graph Neural Networks for temporal graphs: State of the art, open challenges, and opportunities. (arXiv:2302.01018v1 [cs.LG])
    Graph Neural Networks (GNNs) have become the leading paradigm for learning on (static) graph-structured data. However, many real-world systems are dynamic in nature, since the graph and node/edge attributes change over time. In recent years, GNN-based models for temporal graphs have emerged as a promising area of research to extend the capabilities of GNNs. In this work, we provide the first comprehensive overview of the current state-of-the-art of temporal GNN, introducing a rigorous formalization of learning settings and tasks and a novel taxonomy categorizing existing approaches in terms of how the temporal aspect is represented and processed. We conclude the survey with a discussion of the most relevant open challenges for the field, from both research and application perspectives.
    Confidence and Dispersity Speak: Characterising Prediction Matrix for Unsupervised Accuracy Estimation. (arXiv:2302.01094v1 [cs.LG])
    This work aims to assess how well a model performs under distribution shifts without using labels. While recent methods study prediction confidence, this work reports prediction dispersity is another informative cue. Confidence reflects whether the individual prediction is certain; dispersity indicates how the overall predictions are distributed across all categories. Our key insight is that a well-performing model should give predictions with high confidence and high dispersity. That is, we need to consider both properties so as to make more accurate estimates. To this end, we use the nuclear norm that has been shown to be effective in characterizing both properties. Extensive experiments validate the effectiveness of nuclear norm for various models (e.g., ViT and ConvNeXt), different datasets (e.g., ImageNet and CUB-200), and diverse types of distribution shifts (e.g., style shift and reproduction shift). We show that the nuclear norm is more accurate and robust in accuracy estimation than existing methods. Furthermore, we validate the feasibility of other measurements (e.g., mutual information maximization) for characterizing dispersity and confidence. Lastly, we investigate the limitation of the nuclear norm, study its improved variant under severe class imbalance, and discuss potential directions.
    Laplacian Change Point Detection for Single and Multi-view Dynamic Graphs. (arXiv:2302.01204v1 [cs.LG])
    Dynamic graphs are rich data structures that are used to model complex relationships between entities over time. In particular, anomaly detection in temporal graphs is crucial for many real world applications such as intrusion identification in network systems, detection of ecosystem disturbances and detection of epidemic outbreaks. In this paper, we focus on change point detection in dynamic graphs and address three main challenges associated with this problem: i). how to compare graph snapshots across time, ii). how to capture temporal dependencies, and iii). how to combine different views of a temporal graph. To solve the above challenges, we first propose Laplacian Anomaly Detection (LAD) which uses the spectrum of graph Laplacian as the low dimensional embedding of the graph structure at each snapshot. LAD explicitly models short term and long term dependencies by applying two sliding windows. Next, we propose MultiLAD, a simple and effective generalization of LAD to multi-view graphs. MultiLAD provides the first change point detection method for multi-view dynamic graphs. It aggregates the singular values of the normalized graph Laplacian from different views through the scalar power mean operation. Through extensive synthetic experiments, we show that i). LAD and MultiLAD are accurate and outperforms state-of-the-art baselines and their multi-view extensions by a large margin, ii). MultiLAD's advantage over contenders significantly increases when additional views are available, and iii). MultiLAD is highly robust to noise from individual views. In five real world dynamic graphs, we demonstrate that LAD and MultiLAD identify significant events as top anomalies such as the implementation of government COVID-19 interventions which impacted the population mobility in multi-view traffic networks.
    Human not in the loop: objective sample difficulty measures for Curriculum Learning. (arXiv:2302.01243v1 [cs.CV])
    Curriculum learning is a learning method that trains models in a meaningful order from easier to harder samples. A key here is to devise automatic and objective difficulty measures of samples. In the medical domain, previous work applied domain knowledge from human experts to qualitatively assess classification difficulty of medical images to guide curriculum learning, which requires extra annotation efforts, relies on subjective human experience, and may introduce bias. In this work, we propose a new automated curriculum learning technique using the variance of gradients (VoG) to compute an objective difficulty measure of samples and evaluated its effects on elbow fracture classification from X-ray images. Specifically, we used VoG as a metric to rank each sample in terms of the classification difficulty, where high VoG scores indicate more difficult cases for classification, to guide the curriculum training process We compared the proposed technique to a baseline (without curriculum learning), a previous method that used human annotations on classification difficulty, and anti-curriculum learning. Our experiment results showed comparable and higher performance for the binary and multi-class bone fracture classification tasks.
    Post-hoc Concept Bottleneck Models. (arXiv:2205.15480v2 [cs.LG] UPDATED)
    Concept Bottleneck Models (CBMs) map the inputs onto a set of interpretable concepts (``the bottleneck'') and use the concepts to make predictions. A concept bottleneck enhances interpretability since it can be investigated to understand what concepts the model "sees" in an input and which of these concepts are deemed important. However, CBMs are restrictive in practice as they require dense concept annotations in the training data to learn the bottleneck. Moreover, CBMs often do not match the accuracy of an unrestricted neural network, reducing the incentive to deploy them in practice. In this work, we address these limitations of CBMs by introducing Post-hoc Concept Bottleneck models (PCBMs). We show that we can turn any neural network into a PCBM without sacrificing model performance while still retaining the interpretability benefits. When concept annotations are not available on the training data, we show that PCBM can transfer concepts from other datasets or from natural language descriptions of concepts via multimodal models. A key benefit of PCBM is that it enables users to quickly debug and update the model to reduce spurious correlations and improve generalization to new distributions. PCBM allows for global model edits, which can be more efficient than previous works on local interventions that fix a specific prediction. Through a model-editing user study, we show that editing PCBMs via concept-level feedback can provide significant performance gains without using data from the target domain or model retraining.
    Construction and Applications of Billion-Scale Pre-trained Multimodal Business Knowledge Graph. (arXiv:2209.15214v3 [cs.AI] CROSS LISTED)
    Business Knowledge Graphs (KGs) are important to many enterprises today, providing factual knowledge and structured data that steer many products and make them more intelligent. Despite their promising benefits, building business KG necessitates solving prohibitive issues of deficient structure and multiple modalities. In this paper, we advance the understanding of the practical challenges related to building KG in non-trivial real-world systems. We introduce the process of building an open business knowledge graph (OpenBG) derived from a well-known enterprise, Alibaba Group. Specifically, we define a core ontology to cover various abstract products and consumption demands, with fine-grained taxonomy and multimodal facts in deployed applications. OpenBG is an open business KG of unprecedented scale: 2.6 billion triples with more than 88 million entities covering over 1 million core classes/concepts and 2,681 types of relations. We release all the open resources (OpenBG benchmarks) derived from it for the community and report experimental results of KG-centric tasks. We also run up an online competition based on OpenBG benchmarks, and has attracted thousands of teams. We further pre-train OpenBG and apply it to many KG- enhanced downstream tasks in business scenarios, demonstrating the effectiveness of billion-scale multimodal knowledge for e-commerce. All the resources with codes have been released at \url{https://github.com/OpenBGBenchmark/OpenBG}.
    Constrained Online Two-stage Stochastic Optimization: New Algorithms via Adversarial Learning. (arXiv:2302.00997v1 [cs.LG])
    We consider an online two-stage stochastic optimization with long-term constraints over a finite horizon of $T$ periods. At each period, we take the first-stage action, observe a model parameter realization and then take the second-stage action from a feasible set that depends both on the first-stage decision and the model parameter. We aim to minimize the cumulative objective value while guaranteeing that the long-term average second-stage decision belongs to a set. We propose a general algorithmic framework that derives online algorithms for the online two-stage problem from adversarial learning algorithms. Also, the regret bound of our algorithm cam be reduced to the regret bound of embedded adversarial learning algorithms. Based on our framework, we obtain new results under various settings. When the model parameter at each period is drawn from identical distributions, we derive state-of-art regret bound that improves previous bounds under special cases. Our algorithm is also robust to adversarial corruptions of model parameter realizations. When the model parameters are drawn from unknown non-stationary distributions and we are given prior estimates of the distributions, we develop a new algorithm from our framework with a regret $O(W_T+\sqrt{T})$, where $W_T$ measures the total inaccuracy of the prior estimates.
    Surprising Instabilities in Training Deep Networks and a Theoretical Analysis. (arXiv:2206.02001v3 [cs.LG] UPDATED)
    We discover restrained numerical instabilities in current training practices of deep networks with stochastic gradient descent (SGD). We show numerical error (on the order of the smallest floating point bit) induced from floating point arithmetic in training deep nets can be amplified significantly and result in significant test accuracy variance, comparable to the test accuracy variance due to stochasticity in SGD. We show how this is likely traced to instabilities of the optimization dynamics that are restrained, i.e., localized over iterations and regions of the weight tensor space. We do this by presenting a theoretical framework using numerical analysis of partial differential equations (PDE), and analyzing the gradient descent PDE of convolutional neural networks (CNNs). We show that it is stable only under certain conditions on the learning rate and weight decay. We show that rather than blowing up when the conditions are violated, the instability can be restrained. We show this is a consequence of the non-linear PDE associated with the gradient descent of the CNN, whose local linearization changes when over-driving the step size of the discretization, resulting in a stabilizing effect. We link restrained instabilities to the recently discovered Edge of Stability (EoS) phenomena, in which the stable step size predicted by classical theory is exceeded while continuing to optimize the loss and still converging. Because restrained instabilities occur at the EoS, our theory provides new predictions about the EoS, in particular, the role of regularization and the dependence on the network complexity.
    Prediction-Powered Inference. (arXiv:2301.09633v2 [stat.ML] UPDATED)
    We introduce prediction-powered inference $\unicode{x2013}$ a framework for performing valid statistical inference when an experimental data set is supplemented with predictions from a machine-learning system. Our framework yields provably valid conclusions without making any assumptions on the machine-learning algorithm that supplies the predictions. Higher accuracy of the predictions translates to smaller confidence intervals, permitting more powerful inference. Prediction-powered inference yields simple algorithms for computing valid confidence intervals for statistical objects such as means, quantiles, and linear and logistic regression coefficients. We demonstrate the benefits of prediction-powered inference with data sets from proteomics, genomics, electronic voting, remote sensing, census analysis, and ecology.
    GANalyzer: Analysis and Manipulation of GANs Latent Space for Controllable Face Synthesis. (arXiv:2302.00908v1 [cs.CV])
    Generative Adversarial Networks (GANs) are capable of synthesizing high-quality facial images. Despite their success, GANs do not provide any information about the relationship between the input vectors and the generated images. Currently, facial GANs are trained on imbalanced datasets, which generate less diverse images. For example, more than 77% of 100K images that we randomly synthesized using the StyleGAN3 are classified as Happy, and only around 3% are Angry. The problem even becomes worse when a mixture of facial attributes is desired: less than 1% of the generated samples are Angry Woman, and only around 2% are Happy Black. To address these problems, this paper proposes a framework, called GANalyzer, for the analysis, and manipulation of the latent space of well-trained GANs. GANalyzer consists of a set of transformation functions designed to manipulate latent vectors for a specific facial attribute such as facial Expression, Age, Gender, and Race. We analyze facial attribute entanglement in the latent space of GANs and apply the proposed transformation for editing the disentangled facial attributes. Our experimental results demonstrate the strength of GANalyzer in editing facial attributes and generating any desired faces. We also create and release a balanced photo-realistic human face dataset. Our code is publicly available on GitHub.
    Dynamic Ensemble of Low-fidelity Experts: Mitigating NAS "Cold-Start". (arXiv:2302.00932v1 [cs.LG])
    Predictor-based Neural Architecture Search (NAS) employs an architecture performance predictor to improve the sample efficiency. However, predictor-based NAS suffers from the severe ``cold-start'' problem, since a large amount of architecture-performance data is required to get a working predictor. In this paper, we focus on exploiting information in cheaper-to-obtain performance estimations (i.e., low-fidelity information) to mitigate the large data requirements of predictor training. Despite the intuitiveness of this idea, we observe that using inappropriate low-fidelity information even damages the prediction ability and different search spaces have different preferences for low-fidelity information types. To solve the problem and better fuse beneficial information provided by different types of low-fidelity information, we propose a novel dynamic ensemble predictor framework that comprises two steps. In the first step, we train different sub-predictors on different types of available low-fidelity information to extract beneficial knowledge as low-fidelity experts. In the second step, we learn a gating network to dynamically output a set of weighting coefficients conditioned on each input neural architecture, which will be used to combine the predictions of different low-fidelity experts in a weighted sum. The overall predictor is optimized on a small set of actual architecture-performance data to fuse the knowledge from different low-fidelity experts to make the final prediction. We conduct extensive experiments across five search spaces with different architecture encoders under various experimental settings. Our method can easily be incorporated into existing predictor-based NAS frameworks to discover better architectures.
    FCB-SwinV2 Transformer for Polyp Segmentation. (arXiv:2302.01027v1 [cs.CV])
    Polyp segmentation within colonoscopy video frames using deep learning models has the potential to automate the workflow of clinicians. This could help improve the early detection rate and characterization of polyps which could progress to colorectal cancer. Recent state-of-the-art deep learning polyp segmentation models have combined the outputs of Fully Convolutional Network architectures and Transformer Network architectures which work in parallel. In this paper we propose modifications to the current state-of-the-art polyp segmentation model FCBFormer. The transformer architecture of the FCBFormer is replaced with a SwinV2 Transformer-UNET and minor changes to the Fully Convolutional Network architecture are made to create the FCB-SwinV2 Transformer. The performance of the FCB-SwinV2 Transformer is evaluated on the popular colonoscopy segmentation bench-marking datasets Kvasir-SEG and CVC-ClinicDB. Generalizability tests are also conducted. The FCB-SwinV2 Transformer is able to consistently achieve higher mDice scores across all tests conducted and therefore represents new state-of-the-art performance. Issues found with how colonoscopy segmentation model performance is evaluated within literature are also re-ported and discussed. One of the most important issues identified is that when evaluating performance on the CVC-ClinicDB dataset it would be preferable to ensure no data leakage from video sequences occurs during the training/validation/test data partition.
    A Theoretical Justification for Image Inpainting using Denoising Diffusion Probabilistic Models. (arXiv:2302.01217v1 [stat.ML])
    We provide a theoretical justification for sample recovery using diffusion based image inpainting in a linear model setting. While most inpainting algorithms require retraining with each new mask, we prove that diffusion based inpainting generalizes well to unseen masks without retraining. We analyze a recently proposed popular diffusion based inpainting algorithm called RePaint (Lugmayr et al., 2022), and show that it has a bias due to misalignment that hampers sample recovery even in a two-state diffusion process. Motivated by our analysis, we propose a modified RePaint algorithm we call RePaint$^+$ that provably recovers the underlying true sample and enjoys a linear rate of convergence. It achieves this by rectifying the misalignment error present in drift and dispersion of the reverse process. To the best of our knowledge, this is the first linear convergence result for a diffusion based image inpainting algorithm.
    Symbolic Physics Learner: Discovering governing equations via Monte Carlo tree search. (arXiv:2205.13134v2 [cs.AI] UPDATED)
    Nonlinear dynamics is ubiquitous in nature and commonly seen in various science and engineering disciplines. Distilling analytical expressions that govern nonlinear dynamics from limited data remains vital but challenging. To tackle this fundamental issue, we propose a novel Symbolic Physics Learner (SPL) machine to discover the mathematical structure of nonlinear dynamics. The key concept is to interpret mathematical operations and system state variables by computational rules and symbols, establish symbolic reasoning of mathematical formulas via expression trees, and employ a Monte Carlo tree search (MCTS) agent to explore optimal expression trees based on measurement data. The MCTS agent obtains an optimistic selection policy through the traversal of expression trees, featuring the one that maps to the arithmetic expression of underlying physics. Salient features of the proposed framework include search flexibility and enforcement of parsimony for discovered equations. The efficacy and superiority of the SPL machine are demonstrated by numerical examples, compared with state-of-the-art baselines.
    Model Monitoring and Robustness of In-Use Machine Learning Models: Quantifying Data Distribution Shifts Using Population Stability Index. (arXiv:2302.00775v1 [cs.LG])
    Safety goes first. Meeting and maintaining industry safety standards for robustness of artificial intelligence (AI) and machine learning (ML) models require continuous monitoring for faults and performance drops. Deep learning models are widely used in industrial applications, e.g., computer vision, but the susceptibility of their performance to environment changes (e.g., noise) \emph{after deployment} on the product, are now well-known. A major challenge is detecting data distribution shifts that happen, comparing the following: {\bf (i)} development stage of AI and ML models, i.e., train/validation/test, to {\bf (ii)} deployment stage on the product (i.e., even after `testing') in the environment. We focus on a computer vision example related to autonomous driving and aim at detecting shifts that occur as a result of adding noise to images. We use the population stability index (PSI) as a measure of presence and intensity of shift and present results of our empirical experiments showing a promising potential for the PSI. We further discuss multiple aspects of model monitoring and robustness that need to be analyzed \emph{simultaneously} to achieve robustness for industry safety standards. We propose the need for and the research direction toward \emph{categorizations} of problem classes and examples where monitoring for robustness is required and present challenges and pointers for future work from a \emph{practical} perspective.
    Efficient Graph Field Integrators Meet Point Clouds. (arXiv:2302.00942v1 [cs.LG])
    We present two new classes of algorithms for efficient field integration on graphs encoding point clouds. The first class, SeparatorFactorization(SF), leverages the bounded genus of point cloud mesh graphs, while the second class, RFDiffusion(RFD), uses popular epsilon-nearest-neighbor graph representations for point clouds. Both can be viewed as providing the functionality of Fast Multipole Methods (FMMs), which have had a tremendous impact on efficient integration, but for non-Euclidean spaces. We focus on geometries induced by distributions of walk lengths between points (e.g., shortest-path distance). We provide an extensive theoretical analysis of our algorithms, obtaining new results in structural graph theory as a byproduct. We also perform exhaustive empirical evaluation, including on-surface interpolation for rigid and deformable objects (particularly for mesh-dynamics modeling), Wasserstein distance computations for point clouds, and the Gromov-Wasserstein variant.
    Variational Autoencoder Learns Better Feature Representations for EEG-based Obesity Classification. (arXiv:2302.00789v1 [cs.LG])
    Obesity is a common issue in modern societies today that can lead to various diseases and significantly reduced quality of life. Currently, research has been conducted to investigate resting state EEG (electroencephalogram) signals with an aim to identify possible neurological characteristics associated with obesity. In this study, we propose a deep learning-based framework to extract the resting state EEG features for obese and lean subject classification. Specifically, a novel variational autoencoder framework is employed to extract subject-invariant features from the raw EEG signals, which are then classified by a 1-D convolutional neural network. Comparing with conventional machine learning and deep learning methods, we demonstrate the superiority of using VAE for feature extraction, as reflected by the significantly improved classification accuracies, better visualizations and reduced impurity measures in the feature representations. Future work can be directed to gaining an in-depth understanding regarding the spatial patterns that have been learned by the proposed model from a neurological view, as well as improving the interpretability of the proposed model by allowing it to uncover any temporal-related information.
    Using Machine Learning to Develop Smart Reflex Testing Protocols. (arXiv:2302.00794v1 [cs.LG])
    Objective: Reflex testing protocols allow clinical laboratories to perform second line diagnostic tests on existing specimens based on the results of initially ordered tests. Reflex testing can support optimal clinical laboratory test ordering and diagnosis. In current clinical practice, reflex testing typically relies on simple "if-then" rules; however, this limits their scope since most test ordering decisions involve more complexity than a simple rule will allow. Here, using the analyte ferritin as an example, we propose an alternative machine learning-based approach to "smart" reflex testing with a wider scope and greater impact than traditional rule-based approaches. Methods: Using patient data, we developed a machine learning model to predict whether a patient getting CBC testing will also have ferritin testing ordered, consider applications of this model to "smart" reflex testing, and evaluate the model by comparing its performance to possible rule-based approaches. Results: Our underlying machine learning models performed moderately well in predicting ferritin test ordering and demonstrated greater suitability to reflex testing than rule-based approaches. Using chart review, we demonstrate that our model may improve ferritin test ordering. Finally, as a secondary goal, we demonstrate that ferritin test results are missing not at random (MNAR), a finding with implications for unbiased imputation of missing test results. Conclusions: Machine learning may provide a foundation for new types of reflex testing with enhanced benefits for clinical diagnosis and laboratory utilization management.
    Over-parameterised Shallow Neural Networks with Asymmetrical Node Scaling: Global Convergence Guarantees and Feature Learning. (arXiv:2302.01002v1 [stat.ML])
    We consider the optimisation of large and shallow neural networks via gradient flow, where the output of each hidden node is scaled by some positive parameter. We focus on the case where the node scalings are non-identical, differing from the classical Neural Tangent Kernel (NTK) parameterisation. We prove that, for large neural networks, with high probability, gradient flow converges to a global minimum AND can learn features, unlike in the NTK regime. We also provide experiments on synthetic and real-world datasets illustrating our theoretical results and showing the benefit of such scaling in terms of pruning and transfer learning.
    Analysis of Biomass Sustainability Indicators from a Machine Learning Perspective. (arXiv:2302.00828v1 [cs.AI])
    Plant biomass estimation is critical due to the variability of different environmental factors and crop management practices associated with it. The assessment is largely impacted by the accurate prediction of different environmental sustainability indicators. A robust model to predict sustainability indicators is a must for the biomass community. This study proposes a robust model for biomass sustainability prediction by analyzing sustainability indicators using machine learning models. The prospect of ensemble learning was also investigated to analyze the regression problem. All experiments were carried out on a crop residue data from the Ohio state. Ten machine learning models, namely, linear regression, ridge regression, multilayer perceptron, k-nearest neighbors, support vector machine, decision tree, gradient boosting, random forest, stacking and voting, were analyzed to estimate three biomass sustainability indicators, namely soil erosion factor, soil conditioning index, and organic matter factor. The performance of the model was assessed using cross-correlation (R2), root mean squared error and mean absolute error metrics. The results showed that Random Forest was the best performing model to assess sustainability indicators. The analyzed model can now serve as a guide for assessing sustainability indicators in real time.
    Empirical Analysis of the AdaBoost's Error Bound. (arXiv:2302.00880v1 [cs.LG])
    Understanding the accuracy limits of machine learning algorithms is essential for data scientists to properly measure performance so they can continually improve their models' predictive capabilities. This study empirically verified the error bound of the AdaBoost algorithm for both synthetic and real-world data. The results show that the error bound holds up in practice, demonstrating its efficiency and importance to a variety of applications. The corresponding source code is available at https://github.com/armanbolatov/adaboost_error_bound.
    Noncommutative $C^*$-algebra Net: Learning Neural Networks with Powerful Product Structure in $C^*$-algebra. (arXiv:2302.01191v1 [math.OA])
    We propose a new generalization of neural networks with noncommutative $C^*$-algebra. An important feature of $C^*$-algebras is their noncommutative structure of products, but the existing $C^*$-algebra net frameworks have only considered commutative $C^*$-algebras. We show that this noncommutative structure of $C^*$-algebras induces powerful effects in learning neural networks. Our framework has a wide range of applications, such as learning multiple related neural networks simultaneously with interactions and learning invariant features with respect to group actions. We also show the validity of our framework numerically, which illustrates its potential power.  ( 2 min )
    Collaborating with language models for embodied reasoning. (arXiv:2302.00763v1 [cs.LG])
    Reasoning in a complex and ambiguous environment is a key goal for Reinforcement Learning (RL) agents. While some sophisticated RL agents can successfully solve difficult tasks, they require a large amount of training data and often struggle to generalize to new unseen environments and new tasks. On the other hand, Large Scale Language Models (LSLMs) have exhibited strong reasoning ability and the ability to to adapt to new tasks through in-context learning. However, LSLMs do not inherently have the ability to interrogate or intervene on the environment. In this work, we investigate how to combine these complementary abilities in a single system consisting of three parts: a Planner, an Actor, and a Reporter. The Planner is a pre-trained language model that can issue commands to a simple embodied agent (the Actor), while the Reporter communicates with the Planner to inform its next command. We present a set of tasks that require reasoning, test this system's ability to generalize zero-shot and investigate failure cases, and demonstrate how components of this system can be trained with reinforcement-learning to improve performance.
    Molecular Geometry-aware Transformer for accurate 3D Atomic System modeling. (arXiv:2302.00855v1 [q-bio.MN])
    Molecular dynamic simulations are important in computational physics, chemistry, material, and biology. Machine learning-based methods have shown strong abilities in predicting molecular energy and properties and are much faster than DFT calculations. Molecular energy is at least related to atoms, bonds, bond angles, torsion angles, and nonbonding atom pairs. Previous Transformer models only use atoms as inputs which lack explicit modeling of the aforementioned factors. To alleviate this limitation, we propose Moleformer, a novel Transformer architecture that takes nodes (atoms) and edges (bonds and nonbonding atom pairs) as inputs and models the interactions among them using rotational and translational invariant geometry-aware spatial encoding. Proposed spatial encoding calculates relative position information including distances and angles among nodes and edges. We benchmark Moleformer on OC20 and QM9 datasets, and our model achieves state-of-the-art on the initial state to relaxed energy prediction of OC20 and is very competitive in QM9 on predicting quantum chemical properties compared to other Transformer and Graph Neural Network (GNN) methods which proves the effectiveness of the proposed geometry-aware spatial encoding in Moleformer.
    FAVOR#: Sharp Attention Kernel Approximations via New Classes of Positive Random Features. (arXiv:2302.00787v1 [cs.LG])
    The problem of efficient approximation of a linear operator induced by the Gaussian or softmax kernel is often addressed using random features (RFs) which yield an unbiased approximation of the operator's result. Such operators emerge in important applications ranging from kernel methods to efficient Transformers. We propose parameterized, positive, non-trigonometric RFs which approximate Gaussian and softmax-kernels. In contrast to traditional RF approximations, parameters of these new methods can be optimized to reduce the variance of the approximation, and the optimum can be expressed in closed form. We show that our methods lead to variance reduction in practice ($e^{10}$-times smaller variance and beyond) and outperform previous methods in a kernel regression task. Using our proposed mechanism, we also present FAVOR#, a method for self-attention approximation in Transformers. We show that FAVOR# outperforms other random feature methods in speech modelling and natural language processing.
    Rethinking Warm-Starts with Predictions: Learning Predictions Close to Sets of Optimal Solutions for Faster $\text{L}$-/$\text{L}^\natural$-Convex Function Minimization. (arXiv:2302.00928v1 [cs.LG])
    An emerging line of work has shown that machine-learned predictions are useful to warm-start algorithms for discrete optimization problems, such as bipartite matching. Previous studies have shown time complexity bounds proportional to some distance between a prediction and an optimal solution, which we can approximately minimize by learning predictions from past optimal solutions. However, such guarantees may not be meaningful when multiple optimal solutions exist. Indeed, the dual problem of bipartite matching and, more generally, $\text{L}$-/$\text{L}^\natural$-convex function minimization have arbitrarily many optimal solutions, making such prediction-dependent bounds arbitrarily large. To resolve this theoretically critical issue, we present a new warm-start-with-prediction framework for $\text{L}$-/$\text{L}^\natural$-convex function minimization. Our framework offers time complexity bounds proportional to the distance between a prediction and the set of all optimal solutions. The main technical difficulty lies in learning predictions that are provably close to sets of all optimal solutions, for which we present an online-gradient-descent-based method. We thus give the first polynomial-time learnability of predictions that can provably warm-start algorithms regardless of multiple optimal solutions.
    Mnemosyne: Learning to Train Transformers with Transformers. (arXiv:2302.01128v1 [cs.LG])
    Training complex machine learning (ML) architectures requires a compute and time consuming process of selecting the right optimizer and tuning its hyper-parameters. A new paradigm of learning optimizers from data has emerged as a better alternative to hand-designed ML optimizers. We propose Mnemosyne optimizer, that uses Performers: implicit low-rank attention Transformers. It can learn to train entire neural network architectures including other Transformers without any task-specific optimizer tuning. We show that Mnemosyne: (a) generalizes better than popular LSTM optimizer, (b) in particular can successfully train Vision Transformers (ViTs) while meta--trained on standard MLPs and (c) can initialize optimizers for faster convergence in Robotics applications. We believe that these results open the possibility of using Transformers to build foundational optimization models that can address the challenges of regular Transformer training. We complement our results with an extensive theoretical analysis of the compact associative memory used by Mnemosyne.  ( 2 min )
    Causal Lifting and Link Prediction. (arXiv:2302.01198v1 [cs.LG])
    Current state-of-the-art causal models for link prediction assume an underlying set of inherent node factors -- an innate characteristic defined at the node's birth -- that governs the causal evolution of links in the graph. In some causal tasks, however, link formation is path-dependent, i.e., the outcome of link interventions depends on existing links. For instance, in the customer-product graph of an online retailer, the effect of an 85-inch TV ad (treatment) likely depends on whether the costumer already has an 85-inch TV. Unfortunately, existing causal methods are impractical in these scenarios. The cascading functional dependencies between links (due to path dependence) are either unidentifiable or require an impractical number of control variables. In order to remedy this shortcoming, this work develops the first causal model capable of dealing with path dependencies in link prediction. It introduces the concept of causal lifting, an invariance in causal models that, when satisfied, allows the identification of causal link prediction queries using limited interventional data. On the estimation side, we show how structural pairwise embeddings -- a type of symmetry-based joint representation of node pairs in a graph -- exhibit lower bias and correctly represent the causal structure of the task, as opposed to existing node embedding methods, e.g., GNNs and matrix factorization. Finally, we validate our theoretical findings on four datasets under three different scenarios for causal link prediction tasks: knowledge base completion, covariance matrix estimation and consumer-product recommendations.  ( 2 min )
    De Novo Molecular Generation via Connection-aware Motif Mining. (arXiv:2302.01129v1 [cs.LG])
    De novo molecular generation is an essential task for science discovery. Recently, fragment-based deep generative models have attracted much research attention due to their flexibility in generating novel molecules based on existing molecule fragments. However, the motif vocabulary, i.e., the collection of frequent fragments, is usually built upon heuristic rules, which brings difficulties to capturing common substructures from large amounts of molecules. In this work, we propose a new method, MiCaM, to generate molecules based on mined connection-aware motifs. Specifically, it leverages a data-driven algorithm to automatically discover motifs from a molecule library by iteratively merging subgraphs based on their frequency. The obtained motif vocabulary consists of not only molecular motifs (i.e., the frequent fragments), but also their connection information, indicating how the motifs are connected with each other. Based on the mined connection-aware motifs, MiCaM builds a connection-aware generator, which simultaneously picks up motifs and determines how they are connected. We test our method on distribution-learning benchmarks (i.e., generating novel molecules to resemble the distribution of a given training set) and goal-directed benchmarks (i.e., generating molecules with target properties), and achieve significant improvements over previous fragment-based baselines. Furthermore, we demonstrate that our method can effectively mine domain-specific motifs for different tasks.  ( 2 min )
    QCM-SGM+: Improved Quantized Compressed Sensing With Score-Based Generative Models for General Sensing Matrices. (arXiv:2302.00919v1 [eess.SP])
    In realistic compressed sensing (CS) scenarios, the obtained measurements usually have to be quantized to a finite number of bits before transmission and/or storage, thus posing a challenge in recovery, especially for extremely coarse quantization such as 1-bit sign measurements. Recently Meng & Kabashima proposed an efficient quantized compressed sensing algorithm called QCS-SGM using the score-based generative models as an implicit prior. Thanks to the power of score-based generative models in capturing the rich structure of the prior, QCS-SGM achieves remarkably better performances than previous quantized CS methods. However, QCS-SGM is restricted to (approximately) row-orthogonal sensing matrices since otherwise the likelihood score becomes intractable. To address this challenging problem, in this paper we propose an improved version of QCS-SGM, which we call QCS-SGM+, which also works well for general matrices. The key idea is a Bayesian inference perspective of the likelihood score computation, whereby an expectation propagation algorithm is proposed to approximately compute the likelihood score. Experiments on a variety of baseline datasets demonstrate that the proposed QCS-SGM+ outperforms QCS-SGM by a large margin when sensing matrices are far from row-orthogonal.
    An Enhanced V-cycle MgNet Model for Operator Learning in Numerical Partial Differential Equations. (arXiv:2302.00938v1 [cs.LG])
    This study used a multigrid-based convolutional neural network architecture known as MgNet in operator learning to solve numerical partial differential equations (PDEs). Given the property of smoothing iterations in multigrid methods where low-frequency errors decay slowly, we introduced a low-frequency correction structure for residuals to enhance the standard V-cycle MgNet. The enhanced MgNet model can capture the low-frequency features of solutions considerably better than the standard V-cycle MgNet. The numerical results obtained using some standard operator learning tasks are better than those obtained using many state-of-the-art methods, demonstrating the efficiency of our model.Moreover, numerically, our new model is more robust in case of low- and high-resolution data during training and testing, respectively.
    Sharp Lower Bounds on Interpolation by Deep ReLU Neural Networks at Irregularly Spaced Data. (arXiv:2302.00834v1 [cs.LG])
    We study the interpolation, or memorization, power of deep ReLU neural networks. Specifically, we consider the question of how efficiently, in terms of the number of parameters, deep ReLU networks can interpolate values at $N$ datapoints in the unit ball which are separated by a distance $\delta$. We show that $\Omega(N)$ parameters are required in the regime where $\delta$ is exponentially small in $N$, which gives the sharp result in this regime since $O(N)$ parameters are always sufficient. This also shows that the bit-extraction technique used to prove lower bounds on the VC dimension cannot be applied to irregularly spaced datapoints.
    Language Quantized AutoEncoders: Towards Unsupervised Text-Image Alignment. (arXiv:2302.00902v1 [cs.LG])
    Recent progress in scaling up large language models has shown impressive capabilities in performing few-shot learning across a wide range of text-based tasks. However, a key limitation is that these language models fundamentally lack visual perception - a crucial attribute needed to extend these models to be able to interact with the real world and solve vision tasks, such as in visual-question answering and robotics. Prior works have largely connected image to text through pretraining and/or fine-tuning on curated image-text datasets, which can be a costly and expensive process. In order to resolve this limitation, we propose a simple yet effective approach called Language-Quantized AutoEncoder (LQAE), a modification of VQ-VAE that learns to align text-image data in an unsupervised manner by leveraging pretrained language models (e.g., BERT, RoBERTa). Our main idea is to encode image as sequences of text tokens by directly quantizing image embeddings using a pretrained language codebook. We then apply random masking followed by a BERT model, and have the decoder reconstruct the original image from BERT predicted text token embeddings. By doing so, LQAE learns to represent similar images with similar clusters of text tokens, thereby aligning these two modalities without the use of aligned text-image pairs. This enables few-shot image classification with large language models (e.g., GPT-3) as well as linear classification of images based on BERT text features. To the best of our knowledge, our work is the first work that uses unaligned images for multimodal tasks by leveraging the power of pretrained language models.
    CLIPood: Generalizing CLIP to Out-of-Distributions. (arXiv:2302.00864v1 [cs.LG])
    Out-of-distribution (OOD) generalization, where the model needs to handle distribution shifts from training, is a major challenge of machine learning. Recently, contrastive language-image pre-training (CLIP) models have shown impressive zero-shot ability, revealing a promising path toward OOD generalization. However, to boost upon zero-shot performance, further adaptation of CLIP on downstream tasks is indispensable but undesirably degrades OOD generalization ability. In this paper, we aim at generalizing CLIP to out-of-distribution test data on downstream tasks. Beyond the two canonical OOD situations, domain shift and open class, we tackle a more general but difficult in-the-wild setting where both OOD situations may occur on the unseen test data. We propose CLIPood, a simple fine-tuning method that can adapt CLIP models to all OOD situations. To exploit semantic relations between classes from the text modality, CLIPood introduces a new training objective, margin metric softmax (MMS), with class adaptive margins for fine-tuning. Moreover, to incorporate both the pre-trained zero-shot model and the fine-tuned task-adaptive model, CLIPood proposes a new Beta moving average (BMA) to maintain a temporal ensemble according to Beta distribution. Experiments on diverse datasets with different OOD scenarios show that CLIPood consistently outperforms existing generalization techniques.
    MTP-GO: Graph-Based Probabilistic Multi-Agent Trajectory Prediction with Neural ODEs. (arXiv:2302.00735v1 [cs.RO])
    Enabling resilient autonomous motion planning requires robust predictions of surrounding road users' future behavior. In response to this need and the associated challenges, we introduce our model, titled MTP-GO. The model encodes the scene using temporal graph neural networks to produce the inputs to an underlying motion model. The motion model is implemented using neural ordinary differential equations where the state-transition functions are learned with the rest of the model. Multi-modal probabilistic predictions are provided by combining the concept of mixture density networks and Kalman filtering. The results illustrate the predictive capabilities of the proposed model across various data sets, outperforming several state-of-the-art methods on a number of metrics.  ( 2 min )
    LMC: Fast Training of GNNs via Subgraph Sampling with Provable Convergence. (arXiv:2302.00924v1 [cs.LG])
    The message passing-based graph neural networks (GNNs) have achieved great success in many real-world applications. However, training GNNs on large-scale graphs suffers from the well-known neighbor explosion problem, i.e., the exponentially increasing dependencies of nodes with the number of message passing layers. Subgraph-wise sampling methods -- a promising class of mini-batch training techniques -- discard messages outside the mini-batches in backward passes to avoid the neighbor explosion problem at the expense of gradient estimation accuracy. This poses significant challenges to their convergence analysis and convergence speeds, which seriously limits their reliable real-world applications. To address this challenge, we propose a novel subgraph-wise sampling method with a convergence guarantee, namely Local Message Compensation (LMC). To the best of our knowledge, LMC is the {\it first} subgraph-wise sampling method with provable convergence. The key idea of LMC is to retrieve the discarded messages in backward passes based on a message passing formulation of backward passes. By efficient and effective compensations for the discarded messages in both forward and backward passes, LMC computes accurate mini-batch gradients and thus accelerates convergence. We further show that LMC converges to first-order stationary points of GNNs. Experiments on large-scale benchmark tasks demonstrate that LMC significantly outperforms state-of-the-art subgraph-wise sampling methods in terms of efficiency.  ( 2 min )
    Implicit regularization in Heavy-ball momentum accelerated stochastic gradient descent. (arXiv:2302.00849v1 [cs.LG])
    It is well known that the finite step-size ($h$) in Gradient Descent (GD) implicitly regularizes solutions to flatter minima. A natural question to ask is "Does the momentum parameter $\beta$ play a role in implicit regularization in Heavy-ball (H.B) momentum accelerated gradient descent (GD+M)?". To answer this question, first, we show that the discrete H.B momentum update (GD+M) follows a continuous trajectory induced by a modified loss, which consists of an original loss and an implicit regularizer. Then, we show that this implicit regularizer for (GD+M) is stronger than that of (GD) by factor of $(\frac{1+\beta}{1-\beta})$, thus explaining why (GD+M) shows better generalization performance and higher test accuracy than (GD). Furthermore, we extend our analysis to the stochastic version of gradient descent with momentum (SGD+M) and characterize the continuous trajectory of the update of (SGD+M) in a pointwise sense. We explore the implicit regularization in (SGD+M) and (GD+M) through a series of experiments validating our theory.
    The Contextual Lasso: Sparse Linear Models via Deep Neural Networks. (arXiv:2302.00878v1 [stat.ML])
    Sparse linear models are a gold standard tool for interpretable machine learning, a field of emerging importance as predictive models permeate decision-making in many domains. Unfortunately, sparse linear models are far less flexible as functions of their input features than black-box models like deep neural networks. With this capability gap in mind, we study a not-uncommon situation where the input features dichotomize into two groups: explanatory features, which we wish to explain the model's predictions, and contextual features, which we wish to determine the model's explanations. This dichotomy leads us to propose the contextual lasso, a new statistical estimator that fits a sparse linear model whose sparsity pattern and coefficients can vary with the contextual features. The fitting process involves learning a nonparametric map, realized via a deep neural network, from contextual feature vector to sparse coefficient vector. To attain sparse coefficients, we train the network with a novel lasso regularizer in the form of a projection layer that maps the network's output onto the space of $\ell_1$-constrained linear models. Extensive experiments on real and synthetic data suggest that the learned models, which remain highly transparent, can be sparser than the regular lasso without sacrificing the predictive power of a standard deep neural network.
    High-precision regressors for particle physics. (arXiv:2302.00753v1 [physics.comp-ph])
    Monte Carlo simulations of physics processes at particle colliders like the Large Hadron Collider at CERN take up a major fraction of the computational budget. For some simulations, a single data point takes seconds, minutes, or even hours to compute from first principles. Since the necessary number of data points per simulation is on the order of $10^9$ - $10^{12}$, machine learning regressors can be used in place of physics simulators to significantly reduce this computational burden. However, this task requires high-precision regressors that can deliver data with relative errors of less than $1\%$ or even $0.1\%$ over the entire domain of the function. In this paper, we develop optimal training strategies and tune various machine learning regressors to satisfy the high-precision requirement. We leverage symmetry arguments from particle physics to optimize the performance of the regressors. Inspired by ResNets, we design a Deep Neural Network with skip connections that outperform fully connected Deep Neural Networks. We find that at lower dimensions, boosted decision trees far outperform neural networks while at higher dimensions neural networks perform significantly better. We show that these regressors can speed up simulations by a factor of $10^3$ - $10^6$ over the first-principles computations currently used in Monte Carlo simulations. Additionally, using symmetry arguments derived from particle physics, we reduce the number of regressors necessary for each simulation by an order of magnitude. Our work can significantly reduce the training and storage burden of Monte Carlo simulations at current and future collider experiments.
    Synthesizing Physical Character-Scene Interactions. (arXiv:2302.00883v1 [cs.GR])
    Movement is how people interact with and affect their environment. For realistic character animation, it is necessary to synthesize such interactions between virtual characters and their surroundings. Despite recent progress in character animation using machine learning, most systems focus on controlling an agent's movements in fairly simple and homogeneous environments, with limited interactions with other objects. Furthermore, many previous approaches that synthesize human-scene interactions require significant manual labeling of the training data. In contrast, we present a system that uses adversarial imitation learning and reinforcement learning to train physically-simulated characters that perform scene interaction tasks in a natural and life-like manner. Our method learns scene interaction behaviors from large unstructured motion datasets, without manual annotation of the motion data. These scene interactions are learned using an adversarial discriminator that evaluates the realism of a motion within the context of a scene. The key novelty involves conditioning both the discriminator and the policy networks on scene context. We demonstrate the effectiveness of our approach through three challenging scene interaction tasks: carrying, sitting, and lying down, which require coordination of a character's movements in relation to objects in the environment. Our policies learn to seamlessly transition between different behaviors like idling, walking, and sitting. By randomizing the properties of the objects and their placements during training, our method is able to generalize beyond the objects and scenarios depicted in the training dataset, producing natural character-scene interactions for a wide variety of object shapes and placements. The approach takes physics-based character motion generation a step closer to broad applicability.
    STEP: Learning N:M Structured Sparsity Masks from Scratch with Precondition. (arXiv:2302.01172v1 [cs.LG])
    Recent innovations on hardware (e.g. Nvidia A100) have motivated learning N:M structured sparsity masks from scratch for fast model inference. However, state-of-the-art learning recipes in this regime (e.g. SR-STE) are proposed for non-adaptive optimizers like momentum SGD, while incurring non-trivial accuracy drop for Adam-trained models like attention-based LLMs. In this paper, we first demonstrate such gap origins from poorly estimated second moment (i.e. variance) in Adam states given by the masked weights. We conjecture that learning N:M masks with Adam should take the critical regime of variance estimation into account. In light of this, we propose STEP, an Adam-aware recipe that learns N:M masks with two phases: first, STEP calculates a reliable variance estimate (precondition phase) and subsequently, the variance remains fixed and is used as a precondition to learn N:M masks (mask-learning phase). STEP automatically identifies the switching point of two phases by dynamically sampling variance changes over the training trajectory and testing the sample concentration. Empirically, we evaluate STEP and other baselines such as ASP and SR-STE on multiple tasks including CIFAR classification, machine translation and LLM fine-tuning (BERT-Base, GPT-2). We show STEP mitigates the accuracy drop of baseline recipes and is robust to aggressive structured sparsity ratios.  ( 2 min )
    Stochastic Contextual Bandits with Long Horizon Rewards. (arXiv:2302.00814v1 [cs.LG])
    The growing interest in complex decision-making and language modeling problems highlights the importance of sample-efficient learning over very long horizons. This work takes a step in this direction by investigating contextual linear bandits where the current reward depends on at most $s$ prior actions and contexts (not necessarily consecutive), up to a time horizon of $h$. In order to avoid polynomial dependence on $h$, we propose new algorithms that leverage sparsity to discover the dependence pattern and arm parameters jointly. We consider both the data-poor ($T<h$) and data-rich ($T\ge h$) regimes, and derive respective regret upper bounds $\tilde O(d\sqrt{sT} +\min\{ q, T\})$ and $\tilde O(\sqrt{sdT})$, with sparsity $s$, feature dimension $d$, total time horizon $T$, and $q$ that is adaptive to the reward dependence pattern. Complementing upper bounds, we also show that learning over a single trajectory brings inherent challenges: While the dependence pattern and arm parameters form a rank-1 matrix, circulant matrices are not isometric over rank-1 manifolds and sample complexity indeed benefits from the sparse reward dependence structure. Our results necessitate a new analysis to address long-range temporal dependencies across data and avoid polynomial dependence on the reward horizon $h$. Specifically, we utilize connections to the restricted isometry property of circulant matrices formed by dependent sub-Gaussian vectors and establish new guarantees that are also of independent interest.
    Recurrent Graph Convolutional Networks for Spatiotemporal Prediction of Snow Accumulation Using Airborne Radar. (arXiv:2302.00817v1 [cs.LG])
    The accurate prediction and estimation of annual snow accumulation has grown in importance as we deal with the effects of climate change and the increase of global atmospheric temperatures. Airborne radar sensors, such as the Snow Radar, are able to measure accumulation rate patterns at a large-scale and monitor the effects of ongoing climate change on Greenland's precipitation and run-off. The Snow Radar's use of an ultra-wide bandwidth enables a fine vertical resolution that helps in capturing internal ice layers. Given the amount of snow accumulation in previous years using the radar data, in this paper, we propose a machine learning model based on recurrent graph convolutional networks to predict the snow accumulation in recent consecutive years at a certain location. We found that the model performs better and with more consistency than equivalent nongeometric and nontemporal models.
    Disentanglement of Latent Representations via Sparse Causal Interventions. (arXiv:2302.00869v1 [cs.LG])
    The process of generating data such as images is controlled by independent and unknown factors of variation. The retrieval of these variables has been studied extensively in the disentanglement, causal representation learning, and independent component analysis fields. Recently, approaches merging these domains together have shown great success. Instead of directly representing the factors of variation, the problem of disentanglement can be seen as finding the interventions on one image that yield a change to a single factor. Following this assumption, we introduce a new method for disentanglement inspired by causal dynamics that combines causality theory with vector-quantized variational autoencoders. Our model considers the quantized vectors as causal variables and links them in a causal graph. It performs causal interventions on the graph and generates atomic transitions affecting a unique factor of variation in the image. We also introduce a new task of action retrieval that consists of finding the action responsible for the transition between two images. We test our method on standard synthetic and real-world disentanglement datasets. We show that it can effectively disentangle the factors of variation and perform precise interventions on high-level semantic attributes of an image without affecting its quality, even with imbalanced data distributions.
    A Light-weight CNN Model for Efficient Parkinson's Disease Diagnostics. (arXiv:2302.00973v1 [stat.ML])
    In recent years, deep learning methods have achieved great success in various fields due to their strong performance in practical applications. In this paper, we present a light-weight neural network for Parkinson's disease diagnostics, in which a series of hand-drawn data are collected to distinguish Parkinson's disease patients from healthy control subjects. The proposed model consists of a convolution neural network (CNN) cascading to long-short-term memory (LSTM) to adapt the characteristics of collected time-series signals. To make full use of their advantages, a multilayered LSTM model is firstly used to enrich features which are then concatenated with raw data and fed into a shallow one-dimensional (1D) CNN model for efficient classification. Experimental results show that the proposed model achieves a high-quality diagnostic result over multiple evaluation metrics with much fewer parameters and operations, outperforming conventional methods such as support vector machine (SVM), random forest (RF), lightgbm (LGB) and CNN-based methods.
    Meta Learning in Decentralized Neural Networks: Towards More General AI. (arXiv:2302.01020v1 [cs.LG])
    Meta-learning usually refers to a learning algorithm that learns from other learning algorithms. The problem of uncertainty in the predictions of neural networks shows that the world is only partially predictable and a learned neural network cannot generalize to its ever-changing surrounding environments. Therefore, the question is how a predictive model can represent multiple predictions simultaneously. We aim to provide a fundamental understanding of learning to learn in the contents of Decentralized Neural Networks (Decentralized NNs) and we believe this is one of the most important questions and prerequisites to building an autonomous intelligence machine. To this end, we shall demonstrate several pieces of evidence for tackling the problems above with Meta Learning in Decentralized NNs. In particular, we will present three different approaches to building such a decentralized learning system: (1) learning from many replica neural networks, (2) building the hierarchy of neural networks for different functions, and (3) leveraging different modality experts to learn cross-modal representations.
    Fast Online Value-Maximizing Prediction Sets with Conformal Cost Control. (arXiv:2302.00839v1 [cs.LG])
    Many real-world multi-label prediction problems involve set-valued predictions that must satisfy specific requirements dictated by downstream usage. We focus on a typical scenario where such requirements, separately encoding \textit{value} and \textit{cost}, compete with each other. For instance, a hospital might expect a smart diagnosis system to capture as many severe, often co-morbid, diseases as possible (the value), while maintaining strict control over incorrect predictions (the cost). We present a general pipeline, dubbed as FavMac, to maximize the value while controlling the cost in such scenarios. FavMac can be combined with almost any multi-label classifier, affording distribution-free theoretical guarantees on cost control. Moreover, unlike prior works, FavMac can handle real-world large-scale applications via a carefully designed online update mechanism, which is of independent interest. Our methodological and theoretical contributions are supported by experiments on several healthcare tasks and synthetic datasets - FavMac furnishes higher value compared with several variants and baselines while maintaining strict cost control.
    Hierarchical shrinkage Gaussian processes: applications to computer code emulation and dynamical system recovery. (arXiv:2302.00755v1 [stat.ML])
    In many areas of science and engineering, computer simulations are widely used as proxies for physical experiments, which can be infeasible or unethical. Such simulations can often be computationally expensive, and an emulator can be trained to efficiently predict the desired response surface. A widely-used emulator is the Gaussian process (GP), which provides a flexible framework for efficient prediction and uncertainty quantification. Standard GPs, however, do not capture structured sparsity on the underlying response surface, which is present in many applications, particularly in the physical sciences. We thus propose a new hierarchical shrinkage GP (HierGP), which incorporates such structure via cumulative shrinkage priors within a GP framework. We show that the HierGP implicitly embeds the well-known principles of effect sparsity, heredity and hierarchy for analysis of experiments, which allows our model to identify structured sparse features from the response surface with limited data. We propose efficient posterior sampling algorithms for model training and prediction, and prove desirable consistency properties for the HierGP. Finally, we demonstrate the improved performance of HierGP over existing models, in a suite of numerical experiments and an application to dynamical system recovery.
    Privacy Risk for anisotropic Langevin dynamics using relative entropy bounds. (arXiv:2302.00766v1 [cs.LG])
    The privacy preserving properties of Langevin dynamics with additive isotropic noise have been extensively studied. However, the isotropic noise assumption is very restrictive: (a) when adding noise to existing learning algorithms to preserve privacy and maintain the best possible accuracy one should take into account the relative magnitude of the outputs and their correlations; (b) popular algorithms such as stochastic gradient descent (and their continuous time limits) appear to possess anisotropic covariance properties. To study the privacy risks for the anisotropic noise case, one requires general results on the relative entropy between the laws of two Stochastic Differential Equations with different drifts and diffusion coefficients. Our main contribution is to establish such a bound using stability estimates for solutions to the Fokker-Planck equations via functional inequalities. With additional assumptions, the relative entropy bound implies an $(\epsilon,\delta)$-differential privacy bound. We discuss the practical implications of our bound related to privacy risk in different contexts.Finally, the benefits of anisotropic noise are illustrated using numerical results on optimising a quadratic loss or calibrating a neural network.
    A Survey on Compositional Generalization in Applications. (arXiv:2302.01067v1 [cs.AI])
    The field of compositional generalization is currently experiencing a renaissance in AI, as novel problem settings and algorithms motivated by various practical applications are being introduced, building on top of the classical compositional generalization problem. This article aims to provide a comprehensive review of top recent developments in multiple real-life applications of the compositional generalization. Specifically, we introduce a taxonomy of common applications and summarize the state-of-the-art for each of those domains. Furthermore, we identify important current trends and provide new perspectives pertaining to the future of this burgeoning field.  ( 2 min )
    Pathologies of Predictive Diversity in Deep Ensembles. (arXiv:2302.00704v1 [cs.LG])
    Classical results establish that ensembles of small models benefit when predictive diversity is encouraged, through bagging, boosting, and similar. Here we demonstrate that this intuition does not carry over to ensembles of deep neural networks used for classification, and in fact the opposite can be true. Unlike regression models or small (unconfident) classifiers, predictions from large (confident) neural networks concentrate in vertices of the probability simplex. Thus, decorrelating these points necessarily moves the ensemble prediction away from vertices, harming confidence and moving points across decision boundaries. Through large scale experiments, we demonstrate that diversity-encouraging regularizers hurt the performance of high-capacity deep ensembles used for classification. Even more surprisingly, discouraging predictive diversity can be beneficial. Together this work strongly suggests that the best strategy for deep ensembles is utilizing more accurate, but likely less diverse, component models.
    Generative Modeling with Quantum Neurons. (arXiv:2302.00788v1 [quant-ph])
    The recently proposed Quantum Neuron Born Machine (QNBM) has demonstrated quality initial performance as the first quantum generative machine learning (ML) model proposed with non-linear activations. However, previous investigations have been limited in scope with regards to the model's learnability and simulatability. In this work, we make a considerable leap forward by providing an extensive deep dive into the QNBM's potential as a generative model. We first demonstrate that the QNBM's network representation makes it non-trivial to be classically efficiently simulated. Following this result, we showcase the model's ability to learn (express and train on) a wider set of probability distributions, and benchmark the performance against a classical Restricted Boltzmann Machine (RBM). The QNBM is able to outperform this classical model on all distributions, even for the most optimally trained RBM among our simulations. Specifically, the QNBM outperforms the RBM with an improvement factor of 75.3x, 6.4x, and 3.5x for the discrete Gaussian, cardinality-constrained, and Bars and Stripes distributions respectively. Lastly, we conduct an initial investigation into the model's generalization capabilities and use a KL test to show that the model is able to approximate the ground truth probability distribution more closely than the training distribution when given access to a limited amount of data. Overall, we put forth a stronger case in support of using the QNBM for larger-scale generative tasks.
    Causal Effect Estimation: Recent Advances, Challenges, and Opportunities. (arXiv:2302.00848v1 [cs.LG])
    Causal inference has numerous real-world applications in many domains, such as health care, marketing, political science, and online advertising. Treatment effect estimation, a fundamental problem in causal inference, has been extensively studied in statistics for decades. However, traditional treatment effect estimation methods may not well handle large-scale and high-dimensional heterogeneous data. In recent years, an emerging research direction has attracted increasing attention in the broad artificial intelligence field, which combines the advantages of traditional treatment effect estimation approaches (e.g., propensity score, matching, and reweighing) and advanced machine learning approaches (e.g., representation learning, adversarial learning, and graph neural networks). Although the advanced machine learning approaches have shown extraordinary performance in treatment effect estimation, it also comes with a lot of new topics and new research questions. In view of the latest research efforts in the causal inference field, we provide a comprehensive discussion of challenges and opportunities for the three core components of the treatment effect estimation task, i.e., treatment, covariates, and outcome. In addition, we showcase the promising research directions of this topic from multiple perspectives.
    The Weisfeiler-Lehman Distance: Reinterpretation and Connection with GNNs. (arXiv:2302.00713v1 [cs.LG])
    In this paper, we present a novel interpretation of the so-called Weisfeiler-Lehman (WL) distance, introduced by Chen et al. (2022), using concepts from stochastic processes. The WL distance aims at comparing graphs with node features, has the same discriminative power as the classic Weisfeiler-Lehman graph isomorphism test and has deep connections to the Gromov-Wasserstein distance. This new interpretation connects the WL distance to the literature on distances for stochastic processes, which also makes the interpretation of the distance more accessible and intuitive. We further explore the connections between the WL distance and certain Message Passing Neural Networks, and discuss the implications of the WL distance for understanding the Lipschitz property and the universal approximation results for these networks.
    ImageNomer: developing an fMRI and omics visualization tool to detect racial bias in functional connectivity. (arXiv:2302.00767v1 [q-bio.PE])
    It can be difficult to identify trends and perform quality control in large, high-dimensional fMRI or omics datasets. To remedy this, we develop ImageNomer, a data visualization and analysis tool that allows inspection of both subject-level and cohort-level features. The tool allows visualization of phenotype correlation with functional connectivity (FC), partial connectivity (PC), dictionary components (PCA and our own method), and genomic data (single-nucleotide polymorphisms, SNPs). In addition, it allows visualization of weights from arbitrary ML models. ImageNomer is built with a Python backend and a Vue frontend. We validate ImageNomer using the Philadelphia Neurodevelopmental Cohort (PNC) dataset, which contains multitask fMRI and SNP data of healthy adolescents. Using correlation, greedy selection, or model weights, we find that a set of 10 FC features can explain 15% of variation in age, compared to 35% for the full 34,716 feature model. The four most significant FCs are either between bilateral default mode network (DMN) regions or spatially proximal subcortical areas. Additionally, we show that whereas both FC (fMRI) and SNPs (genomic) features can account for 10-15% of intelligence variation, this predictive ability disappears when controlling for race. We find that FC features can be used to predict race with 85% accuracy, compared to 78% accuracy for sex prediction. Using ImageNomer, this work casts doubt on the possibility of finding unbiased intelligence-related features in fMRI and SNPs of healthy adolescents.
    Riemannian Stochastic Approximation for Minimizing Tame Nonsmooth Objective Functions. (arXiv:2302.00709v1 [cs.LG])
    In many learning applications, the parameters in a model are structurally constrained in a way that can be modeled as them lying on a Riemannian manifold. Riemannian optimization, wherein procedures to enforce an iterative minimizing sequence to be constrained to the manifold, is used to train such models. At the same time, tame geometry has become a significant topological description of nonsmooth functions that appear in the landscapes of training neural networks and other important models with structural compositions of continuous nonlinear functions with nonsmooth maps. In this paper, we study the properties of such stratifiable functions on a manifold and the behavior of retracted stochastic gradient descent, with diminishing stepsizes, for minimizing such functions.
    Sample Complexity of Kernel-Based Q-Learning. (arXiv:2302.00727v1 [cs.LG])
    Modern reinforcement learning (RL) often faces an enormous state-action space. Existing analytical results are typically for settings with a small number of state-actions, or simple models such as linearly modeled Q-functions. To derive statistically efficient RL policies handling large state-action spaces, with more general Q-functions, some recent works have considered nonlinear function approximation using kernel ridge regression. In this work, we derive sample complexities for kernel based Q-learning when a generative model exists. We propose a nonparametric Q-learning algorithm which finds an $\epsilon$-optimal policy in an arbitrarily large scale discounted MDP. The sample complexity of the proposed algorithm is order optimal with respect to $\epsilon$ and the complexity of the kernel (in terms of its information gain). To the best of our knowledge, this is the first result showing a finite sample complexity under such a general model.
    Approximating the Shapley Value without Marginal Contributions. (arXiv:2302.00736v1 [cs.LG])
    The Shapley value is arguably the most popular approach for assigning a meaningful contribution value to players in a cooperative game, which has recently been used intensively in various areas of machine learning, most notably in explainable artificial intelligence. The meaningfulness is due to axiomatic properties that only the Shapley value satisfies, which, however, comes at the expense of an exact computation growing exponentially with the number of agents. Accordingly, a number of works are devoted to the efficient approximation of the Shapley values, all of which revolve around the notion of an agent's marginal contribution. In this paper, we propose with SVARM and Stratified SVARM two parameter-free and domain-independent approximation algorithms based on a representation of the Shapley value detached from the notion of marginal contributions. We prove unmatched theoretical guarantees regarding their approximation quality and provide satisfying empirical results.
    Neural Networks for Symbolic Regression. (arXiv:2302.00773v1 [cs.NE])
    Many real-world systems can be described by mathematical formulas that are human-comprehensible, easy to analyze and can be helpful in explaining the system's behaviour. Symbolic regression is a method that generates nonlinear models from data in the form of analytic expressions. Historically, symbolic regression has been predominantly realized using genetic programming, a method that iteratively evolves a population of candidate solutions that are sampled by genetic operators crossover and mutation. This gradient-free evolutionary approach suffers from several deficiencies: it does not scale well with the number of variables and samples in the training data, models tend to grow in size and complexity without an adequate accuracy gain, and it is hard to fine-tune the inner model coefficients using just genetic operators. Recently, neural networks have been applied to learn the whole analytic formula, i.e., its structure as well as the coefficients, by means of gradient-based optimization algorithms. We propose a novel neural network-based symbolic regression method that constructs physically plausible models based on limited training data and prior knowledge about the system. The method employs an adaptive weighting scheme to effectively deal with multiple loss function terms and an epoch-wise learning process to reduce the chance of getting stuck in poor local optima. Furthermore, we propose a parameter-free method for choosing the model with the best interpolation and extrapolation performance out of all models generated through the whole learning process. We experimentally evaluate the approach on the TurtleBot 2 mobile robot, the magnetic manipulation system, the equivalent resistance of two resistors in parallel, and the anti-lock braking system. The results clearly show the potential of the method to find sparse and accurate models that comply with the prior knowledge provided.
    A Survey of Deep Learning: From Activations to Transformers. (arXiv:2302.00722v1 [cs.LG])
    Deep learning has made tremendous progress in the last decade. A key success factor is the large amount of architectures, layers, objectives, and optimization techniques that have emerged in recent years. They include a myriad of variants related to attention, normalization, skip connection, transformer and self-supervised learning schemes -- to name a few. We provide a comprehensive overview of the most important, recent works in these areas to those who already have a basic understanding of deep learning. We hope that a holistic and unified treatment of influential, recent works helps researchers to form new connections between diverse areas of deep learning.
    Universal Soldier: Using Universal Adversarial Perturbations for Detecting Backdoor Attacks. (arXiv:2302.00747v1 [cs.LG])
    Deep learning models achieve excellent performance in numerous machine learning tasks. Yet, they suffer from security-related issues such as adversarial examples and poisoning (backdoor) attacks. A deep learning model may be poisoned by training with backdoored data or by modifying inner network parameters. Then, a backdoored model performs as expected when receiving a clean input, but it misclassifies when receiving a backdoored input stamped with a pre-designed pattern called "trigger". Unfortunately, it is difficult to distinguish between clean and backdoored models without prior knowledge of the trigger. This paper proposes a backdoor detection method by utilizing a special type of adversarial attack, universal adversarial perturbation (UAP), and its similarities with a backdoor trigger. We observe an intuitive phenomenon: UAPs generated from backdoored models need fewer perturbations to mislead the model than UAPs from clean models. UAPs of backdoored models tend to exploit the shortcut from all classes to the target class, built by the backdoor trigger. We propose a novel method called Universal Soldier for Backdoor detection (USB) and reverse engineering potential backdoor triggers via UAPs. Experiments on 345 models trained on several datasets show that USB effectively detects the injected backdoor and provides comparable or better results than state-of-the-art methods.
    Domain Generalization Emerges from Dreaming. (arXiv:2302.00980v1 [cs.CV])
    Recent studies have proven that DNNs, unlike human vision, tend to exploit texture information rather than shape. Such texture bias is one of the factors for the poor generalization performance of DNNs. We observe that the texture bias negatively affects not only in-domain generalization but also out-of-distribution generalization, i.e., Domain Generalization. Motivated by the observation, we propose a new framework to reduce the texture bias of a model by a novel optimization-based data augmentation, dubbed Stylized Dream. Our framework utilizes adaptive instance normalization (AdaIN) to augment the style of an original image yet preserve the content. We then adopt a regularization loss to predict consistent outputs between Stylized Dream and original images, which encourages the model to learn shape-based representations. Extensive experiments show that the proposed method achieves state-of-the-art performance in out-of-distribution settings on public benchmark datasets: PACS, VLCS, OfficeHome, TerraIncognita, and DomainNet.  ( 2 min )
    Real-Time Evaluation in Online Continual Learning: A New Paradigm. (arXiv:2302.01047v1 [cs.LG])
    Current evaluations of Continual Learning (CL) methods typically assume that there is no constraint on training time and computation. This is an unrealistic assumption for any real-world setting, which motivates us to propose: a practical real-time evaluation of continual learning, in which the stream does not wait for the model to complete training before revealing the next data for predictions. To do this, we evaluate current CL methods with respect to their computational costs. We hypothesize that under this new evaluation paradigm, computationally demanding CL approaches may perform poorly on streams with a varying distribution. We conduct extensive experiments on CLOC, a large-scale dataset containing 39 million time-stamped images with geolocation labels. We show that a simple baseline outperforms state-of-the-art CL methods under this evaluation, questioning the applicability of existing methods in realistic settings. In addition, we explore various CL components commonly used in the literature, including memory sampling strategies and regularization approaches. We find that all considered methods fail to be competitive against our simple baseline. This surprisingly suggests that the majority of existing CL literature is tailored to a specific class of streams that is not practical. We hope that the evaluation we provide will be the first step towards a paradigm shift to consider the computational cost in the development of online continual learning methods.  ( 2 min )
    Avoiding Model Estimation in Robust Markov Decision Processes with a Generative Model. (arXiv:2302.01248v1 [stat.ML])
    Robust Markov Decision Processes (MDPs) are getting more attention for learning a robust policy which is less sensitive to environment changes. There are an increasing number of works analyzing sample-efficiency of robust MDPs. However, most works study robust MDPs in a model-based regime, where the transition probability needs to be estimated and requires $\mathcal{O}(|\mathcal{S}|^2|\mathcal{A}|)$ storage in memory. A common way to solve robust MDPs is to formulate them as a distributionally robust optimization (DRO) problem. However, solving a DRO problem is non-trivial, so prior works typically assume a strong oracle to obtain the optimal solution of the DRO problem easily. To remove the need for an oracle, we first transform the original robust MDPs into an alternative form, as the alternative form allows us to use stochastic gradient methods to solve the robust MDPs. Moreover, we prove the alternative form still preserves the role of robustness. With this new formulation, we devise a sample-efficient algorithm to solve the robust MDPs in a model-free regime, from which we benefit lower memory space $\mathcal{O}(|\mathcal{S}||\mathcal{A}|)$ without using the oracle. Finally, we validate our theoretical findings via numerical experiments and show the efficiency to solve the alternative form of robust MDPs.
    Randomized prior wavelet neural operator for uncertainty quantification. (arXiv:2302.01051v1 [stat.ML])
    In this paper, we propose a novel data-driven operator learning framework referred to as the \textit{Randomized Prior Wavelet Neural Operator} (RP-WNO). The proposed RP-WNO is an extension of the recently proposed wavelet neural operator, which boasts excellent generalizing capabilities but cannot estimate the uncertainty associated with its predictions. RP-WNO, unlike the vanilla WNO, comes with inherent uncertainty quantification module and hence, is expected to be extremely useful for scientists and engineers alike. RP-WNO utilizes randomized prior networks, which can account for prior information and is easier to implement for large, complex deep-learning architectures than its Bayesian counterpart. Four examples have been solved to test the proposed framework, and the results produced advocate favorably for the efficacy of the proposed framework.  ( 2 min )
    Reinforcement learning-based estimation for partial differential equations. (arXiv:2302.01189v1 [cs.LG])
    In systems governed by nonlinear partial differential equations such as fluid flows, the design of state estimators such as Kalman filters relies on a reduced-order model (ROM) that projects the original high-dimensional dynamics onto a computationally tractable low-dimensional space. However, ROMs are prone to large errors, which negatively affects the performance of the estimator. Here, we introduce the reinforcement learning reduced-order estimator (RL-ROE), a ROM-based estimator in which the correction term that takes in the measurements is given by a nonlinear policy trained through reinforcement learning. The nonlinearity of the policy enables the RL-ROE to compensate efficiently for errors of the ROM, while still taking advantage of the imperfect knowledge of the dynamics. Using examples involving the Burgers and Navier-Stokes equations, we show that in the limit of very few sensors, the trained RL-ROE outperforms a Kalman filter designed using the same ROM. Moreover, it yields accurate high-dimensional state estimates for reference trajectories corresponding to various physical parameter values, without direct knowledge of the latter.  ( 2 min )
    Neural Common Neighbor with Completion for Link Prediction. (arXiv:2302.00890v1 [cs.LG])
    Despite its outstanding performance in various graph tasks, vanilla Message Passing Neural Network (MPNN) usually fails in link prediction tasks, as it only uses representations of two individual target nodes and ignores the pairwise relation between them. To capture the pairwise relations, some models add manual features to the input graph and use the output of MPNN to produce pairwise representations. In contrast, others directly use manual features as pairwise representations. Though this simplification avoids applying a GNN to each link individually and thus improves scalability, these models still have much room for performance improvement due to the hand-crafted and unlearnable pairwise features. To upgrade performance while maintaining scalability, we propose Neural Common Neighbor (NCN), which uses learnable pairwise representations. To further boost NCN, we study the unobserved link problem. The incompleteness of the graph is ubiquitous and leads to distribution shifts between the training and test set, loss of common neighbor information, and performance degradation of models. Therefore, we propose two intervention methods: common neighbor completion and target link removal. Combining the two methods with NCN, we propose Neural Common Neighbor with Completion (NCNC). NCN and NCNC outperform recent strong baselines by large margins. NCNC achieves state-of-the-art performance in link prediction tasks.  ( 2 min )
    Energy Efficient Training of SNN using Local Zeroth Order Method. (arXiv:2302.00910v1 [cs.LG])
    Spiking neural networks are becoming increasingly popular for their low energy requirement in real-world tasks with accuracy comparable to the traditional ANNs. SNN training algorithms face the loss of gradient information and non-differentiability due to the Heaviside function in minimizing the model loss over model parameters. To circumvent the problem surrogate method uses a differentiable approximation of the Heaviside in the backward pass, while the forward pass uses the Heaviside as the spiking function. We propose to use the zeroth order technique at the neuron level to resolve this dichotomy and use it within the automatic differentiation tool. As a result, we establish a theoretical connection between the proposed local zeroth-order technique and the existing surrogate methods and vice-versa. The proposed method naturally lends itself to energy-efficient training of SNNs on GPUs. Experimental results with neuromorphic datasets show that such implementation requires less than 1 percent neurons to be active in the backward pass, resulting in a 100x speed-up in the backward computation time. Our method offers better generalization compared to the state-of-the-art energy-efficient technique while maintaining similar efficiency.  ( 2 min )
    Vectorized Scenario Description and Motion Prediction for Scenario-Based Testing. (arXiv:2302.01161v1 [cs.LG])
    Automated vehicles (AVs) are tested in diverse scenarios, typically specified by parameters such as velocities, distances, or curve radii. To describe scenarios uniformly independent of such parameters, this paper proposes a vectorized scenario description defined by the road geometry and vehicles' trajectories. Data of this form are generated for three scenarios, merged, and used to train the motion prediction model VectorNet, allowing to predict an AV's trajectory for unseen scenarios. Predicting scenario evaluation metrics, VectorNet partially achieves lower errors than regression models that separately process the three scenarios' data. However, for comprehensive generalization, sufficient variance in the training data must be ensured. Thus, contrary to existing methods, our proposed method can merge diverse scenarios' data and exploit spatial and temporal nuances in the vectorized scenario description. As a result, data from specified test scenarios and real-world scenarios can be compared and combined for (predictive) analyses and scenario selection.  ( 2 min )
    Deep COVID-19 Forecasting for Multiple States with Data Augmentation. (arXiv:2302.01155v1 [cs.LG])
    In this work, we propose a deep learning approach to forecasting state-level COVID-19 trends of weekly cumulative death in the United States (US) and incident cases in Germany. This approach includes a transformer model, an ensemble method, and a data augmentation technique for time series. We arrange the inputs of the transformer in such a way that predictions for different states can attend to the trends of the others. To overcome the issue of scarcity of training data for this COVID-19 pandemic, we have developed a novel data augmentation technique to generate useful data for training. More importantly, the generated data can also be used for model validation. As such, it has a two-fold advantage: 1) more actual observations can be used for training, and 2) the model can be validated on data which has distribution closer to the expected situation. Our model has achieved some of the best state-level results on the COVID-19 Forecast Hub for the US and for Germany.  ( 2 min )
    Predicting the Silent Majority on Graphs: Knowledge Transferable Graph Neural Network. (arXiv:2302.00873v1 [cs.LG])
    Graphs consisting of vocal nodes ("the vocal minority") and silent nodes ("the silent majority"), namely VS-Graph, are ubiquitous in the real world. The vocal nodes tend to have abundant features and labels. In contrast, silent nodes only have incomplete features and rare labels, e.g., the description and political tendency of politicians (vocal) are abundant while not for ordinary people (silent) on the twitter's social network. Predicting the silent majority remains a crucial yet challenging problem. However, most existing message-passing based GNNs assume that all nodes belong to the same domain, without considering the missing features and distribution-shift between domains, leading to poor ability to deal with VS-Graph. To combat the above challenges, we propose Knowledge Transferable Graph Neural Network (KT-GNN), which models distribution shifts during message passing and representation learning by transferring knowledge from vocal nodes to silent nodes. Specifically, we design the domain-adapted "feature completion and message passing mechanism" for node representation learning while preserving domain difference. And a knowledge transferable classifier based on KL-divergence is followed. Comprehensive experiments on real-world scenarios (i.e., company financial risk assessment and political elections) demonstrate the superior performance of our method. Our source code has been open sourced.
    Site-specific Deep Learning Path Loss Models based on the Method of Moments. (arXiv:2302.01052v1 [cs.LG])
    This paper describes deep learning models based on convolutional neural networks applied to the problem of predicting EM wave propagation over rural terrain. A surface integral equation formulation, solved with the method of moments and accelerated using the Fast Far Field approximation, is used to generate synthetic training data which comprises path loss computed over randomly generated 1D terrain profiles. These are used to train two networks, one based on fractal profiles and one based on profiles generated using a Gaussian process. The models show excellent agreement when applied to test profiles generated using the same statistical process used to create the training data and very good accuracy when applied to real life problems.
    adSformers: Personalization from Short-Term Sequences and Diversity of Representations in Etsy Ads. (arXiv:2302.01255v1 [cs.LG])
    In this article, we present our approach to personalizing Etsy Ads through encoding and learning from short-term (one-hour) sequences of user actions and diverse representations. To this end we introduce a three-component adSformer diversifiable personalization module (ADPM) and illustrate how we use this module to derive a short-term dynamic user representation and personalize the Click-Through Rate (CTR) and Post-Click Conversion Rate (PCCVR) models used in sponsored search (ad) ranking. The first component of the ADPM is a custom transformer encoder that learns the inherent structure from the sequence of actions. ADPM's second component enriches the signal through visual, multimodal and textual pretrained representations. Lastly, the third ADPM component includes a "learned" on the fly average pooled representation. The ADPM-personalized CTR and PCCVR models, henceforth referred to as adSformer CTR and adSformer PCCVR, outperform the CTR and PCCVR production baselines by $+6.65\%$ and $+12.70\%$, respectively, in offline Precision-Recall Area Under the Curve (PR AUC). At the time of this writing, following the online gains in A/B tests, such as $+5.34\%$ in return on ad spend, a seller success metric, we are ramping up the adSformers to $100\%$ traffic in Etsy Ads.
    ReLOAD: Reinforcement Learning with Optimistic Ascent-Descent for Last-Iterate Convergence in Constrained MDPs. (arXiv:2302.01275v1 [cs.LG])
    In recent years, Reinforcement Learning (RL) has been applied to real-world problems with increasing success. Such applications often require to put constraints on the agent's behavior. Existing algorithms for constrained RL (CRL) rely on gradient descent-ascent, but this approach comes with a caveat. While these algorithms are guaranteed to converge on average, they do not guarantee last-iterate convergence, i.e., the current policy of the agent may never converge to the optimal solution. In practice, it is often observed that the policy alternates between satisfying the constraints and maximizing the reward, rarely accomplishing both objectives simultaneously. Here, we address this problem by introducing Reinforcement Learning with Optimistic Ascent-Descent (ReLOAD), a principled CRL method with guaranteed last-iterate convergence. We demonstrate its empirical effectiveness on a wide variety of CRL problems including discrete MDPs and continuous control. In the process we establish a benchmark of challenging CRL problems.
    Physics Constrained Motion Prediction with Uncertainty Quantification. (arXiv:2302.01060v1 [cs.RO])
    Predicting the motion of dynamic agents is a critical task for guaranteeing the safety of autonomous systems. A particular challenge is that motion prediction algorithms should obey dynamics constraints and quantify prediction uncertainty as a measure of confidence. We present a physics-constrained approach for motion prediction which uses a surrogate dynamical model to ensure that predicted trajectories are dynamically feasible. We propose a two-step integration consisting of intent and trajectory prediction subject to dynamics constraints. We also construct prediction regions that quantify uncertainty and are tailored for autonomous driving by using conformal prediction, a popular statistical tool. Physics Constrained Motion Prediction achieves a 41% better ADE, 56% better FDE, and 19% better IoU over a baseline in experiments using an autonomous racing dataset.  ( 2 min )
    Randomized Greedy Learning for Non-monotone Stochastic Submodular Maximization Under Full-bandit Feedback. (arXiv:2302.01324v1 [cs.LG])
    We investigate the problem of unconstrained combinatorial multi-armed bandits with full-bandit feedback and stochastic rewards for submodular maximization. Previous works investigate the same problem assuming a submodular and monotone reward function. In this work, we study a more general problem, i.e., when the reward function is not necessarily monotone, and the submodularity is assumed only in expectation. We propose Randomized Greedy Learning (RGL) algorithm and theoretically prove that it achieves a $\frac{1}{2}$-regret upper bound of $\tilde{\mathcal{O}}(n T^{\frac{2}{3}})$ for horizon $T$ and number of arms $n$. We also show in experiments that RGL empirically outperforms other full-bandit variants in submodular and non-submodular settings.
    The Power of Preconditioning in Overparameterized Low-Rank Matrix Sensing. (arXiv:2302.01186v1 [cs.LG])
    We propose $\textsf{ScaledGD($\lambda$)}$, a preconditioned gradient descent method to tackle the low-rank matrix sensing problem when the true rank is unknown, and when the matrix is possibly ill-conditioned. Using overparametrized factor representations, $\textsf{ScaledGD($\lambda$)}$ starts from a small random initialization, and proceeds by gradient descent with a specific form of damped preconditioning to combat bad curvatures induced by overparameterization and ill-conditioning. At the expense of light computational overhead incurred by preconditioners, $\textsf{ScaledGD($\lambda$)}$ is remarkably robust to ill-conditioning compared to vanilla gradient descent ($\textsf{GD}$) even with overprameterization. Specifically, we show that, under the Gaussian design, $\textsf{ScaledGD($\lambda$)}$ converges to the true low-rank matrix at a constant linear rate after a small number of iterations that scales only logarithmically with respect to the condition number and the problem dimension. This significantly improves over the convergence rate of vanilla $\textsf{GD}$ which suffers from a polynomial dependency on the condition number. Our work provides evidence on the power of preconditioning in accelerating the convergence without hurting generalization in overparameterized learning.
    Bayesian Inference on Binary Spiking Networks Leveraging Nanoscale Device Stochasticity. (arXiv:2302.01302v1 [cs.NE])
    Bayesian Neural Networks (BNNs) can overcome the problem of overconfidence that plagues traditional frequentist deep neural networks, and are hence considered to be a key enabler for reliable AI systems. However, conventional hardware realizations of BNNs are resource intensive, requiring the implementation of random number generators for synaptic sampling. Owing to their inherent stochasticity during programming and read operations, nanoscale memristive devices can be directly leveraged for sampling, without the need for additional hardware resources. In this paper, we introduce a novel Phase Change Memory (PCM)-based hardware implementation for BNNs with binary synapses. The proposed architecture consists of separate weight and noise planes, in which PCM cells are configured and operated to represent the nominal values of weights and to generate the required noise for sampling, respectively. Using experimentally observed PCM noise characteristics, for the exemplary Breast Cancer Dataset classification problem, we obtain hardware accuracy and expected calibration error matching that of an 8-bit fixed-point (FxP8) implementation, with projected savings of over 9$\times$ in terms of core area transistor count.
    On Suppressing Range of Adaptive Stepsizes of Adam to Improve Generalisation Performance. (arXiv:2302.01029v1 [cs.LG])
    A number of recent adaptive optimizers improve the generalisation performance of Adam by essentially reducing the variance of adaptive stepsizes to get closer to SGD with momentum. Following the above motivation, we suppress the range of the adaptive stepsizes of Adam by exploiting the layerwise gradient statistics. In particular, at each iteration, we propose to perform three consecutive operations on the second momentum v_t before using it to update a DNN model: (1): down-scaling, (2): epsilon-embedding, and (3): down-translating. The resulting algorithm is referred to as SET-Adam, where SET is a brief notation of the three operations. The down-scaling operation on v_t is performed layerwise by making use of the angles between the layerwise subvectors of v_t and the corresponding all-one subvectors. Extensive experimental results show that SET-Adam outperforms eight adaptive optimizers when training transformers and LSTMs for NLP, and VGG and ResNet for image classification over CIAF10 and CIFAR100 while matching the best performance of the eight adaptive methods when training WGAN-GP models for image generation tasks. Furthermore, SET-Adam produces higher validation accuracies than Adam and AdaBelief for training ResNet18 over ImageNet.  ( 2 min )
    Exposing the CSI: A Systematic Investigation of CSI-based Wi-Fi Sensing Capabilities and Limitations. (arXiv:2302.00992v1 [cs.NI])
    Thanks to the ubiquitous deployment of Wi-Fi hotspots, channel state information (CSI)-based Wi-Fi sensing can unleash game-changing applications in many fields, such as healthcare, security, and entertainment. However, despite one decade of active research on Wi-Fi sensing, most existing work only considers legacy IEEE 802.11n devices, often in particular and strictly-controlled environments. Worse yet, there is a fundamental lack of understanding of the impact on CSI-based sensing of modern Wi-Fi features, such as 160-MHz bandwidth, multiple-input multiple-output (MIMO) transmissions, and increased spectral resolution in IEEE 802.11ax (Wi-Fi 6). This work aims to shed light on the impact of Wi-Fi 6 features on the sensing performance and to create a benchmark for future research on Wi-Fi sensing. To this end, we perform an extensive CSI data collection campaign involving 3 individuals, 3 environments, and 12 activities, using Wi-Fi 6 signals. An anonymized ground truth obtained through video recording accompanies our 80-GB dataset, which contains almost two hours of CSI data from three collectors. We leverage our dataset to dissect the performance of a state-of-the-art sensing framework across different environments and individuals. Our key findings suggest that (i) MIMO transmissions and higher spectral resolution might be more beneficial than larger bandwidth for sensing applications; (ii) there is a pressing need to standardize research on Wi-Fi sensing because the path towards a truly environment-independent framework is still uncertain. To ease the experiments' replicability and address the current lack of Wi-Fi 6 CSI datasets, we release our 80-GB dataset to the community.  ( 2 min )
    Oracle-Preserving Latent Flows. (arXiv:2302.00806v1 [cs.LG])
    We develop a deep learning methodology for the simultaneous discovery of multiple nontrivial continuous symmetries across an entire labelled dataset. The symmetry transformations and the corresponding generators are modeled with fully connected neural networks trained with a specially constructed loss function ensuring the desired symmetry properties. The two new elements in this work are the use of a reduced-dimensionality latent space and the generalization to transformations invariant with respect to high-dimensional oracles. The method is demonstrated with several examples on the MNIST digit dataset.  ( 2 min )
    Combining Tree-Search, Generative Models, and Nash Bargaining Concepts in Game-Theoretic Reinforcement Learning. (arXiv:2302.00797v1 [cs.AI])
    Multiagent reinforcement learning (MARL) has benefited significantly from population-based and game-theoretic training regimes. One approach, Policy-Space Response Oracles (PSRO), employs standard reinforcement learning to compute response policies via approximate best responses and combines them via meta-strategy selection. We augment PSRO by adding a novel search procedure with generative sampling of world states, and introduce two new meta-strategy solvers based on the Nash bargaining solution. We evaluate PSRO's ability to compute approximate Nash equilibrium, and its performance in two negotiation games: Colored Trails, and Deal or No Deal. We conduct behavioral studies where human participants negotiate with our agents ($N = 346$). We find that search with generative modeling finds stronger policies during both training time and test time, enables online Bayesian co-player prediction, and can produce agents that achieve comparable social welfare negotiating with humans as humans trading among themselves.
    Average-Constrained Policy Optimization. (arXiv:2302.00808v1 [cs.LG])
    Reinforcement Learning (RL) with constraints is becoming an increasingly important problem for various applications. Often, the average criterion is more suitable. Yet, RL for average criterion-constrained MDPs remains a challenging problem. Algorithms designed for discounted constrained RL problems often do not perform well for the average CMDP setting. In this paper, we introduce a new (possibly the first) policy optimization algorithm for constrained MDPs with the average criterion. The Average-Constrained Policy Optimization (ACPO) algorithm is inspired by the famed PPO-type algorithms based on trust region methods. We develop basic sensitivity theory for average MDPs, and then use the corresponding bounds in the design of the algorithm. We provide theoretical guarantees on its performance, and through extensive experimental work in various challenging MuJoCo environments, show the superior performance of the algorithm when compared to other state-of-the-art algorithms adapted for the average CMDP setting.
    RobustNeRF: Ignoring Distractors with Robust Losses. (arXiv:2302.00833v1 [cs.CV])
    Neural radiance fields (NeRF) excel at synthesizing new views given multi-view, calibrated images of a static scene. When scenes include distractors, which are not persistent during image capture (moving objects, lighting variations, shadows), artifacts appear as view-dependent effects or 'floaters'. To cope with distractors, we advocate a form of robust estimation for NeRF training, modeling distractors in training data as outliers of an optimization problem. Our method successfully removes outliers from a scene and improves upon our baselines, on synthetic and real-world scenes. Our technique is simple to incorporate in modern NeRF frameworks, with few hyper-parameters. It does not assume a priori knowledge of the types of distractors, and is instead focused on the optimization problem rather than pre-processing or modeling transient objects. More results on our page https://robustnerf.github.io/public.
    Scale up with Order: Finding Good Data Permutations for Distributed Training. (arXiv:2302.00845v1 [cs.LG])
    Gradient Balancing (GraB) is a recently proposed technique that finds provably better data permutations when training models with multiple epochs over a finite dataset. It converges at a faster rate than the widely adopted Random Reshuffling, by minimizing the discrepancy of the gradients on adjacently selected examples. However, GraB only operates under critical assumptions such as small batch sizes and centralized data, leaving open the question of how to order examples at large scale -- i.e. distributed learning with decentralized data. To alleviate the limitation, in this paper we propose D-GraB that involves two novel designs: (1) $\textsf{PairBalance}$ that eliminates the requirement to use stale gradient mean in GraB which critically relies on small learning rates; (2) an ordering protocol that runs $\textsf{PairBalance}$ in a distributed environment with negligible overhead, which benefits from both data ordering and parallelism. We prove D-GraB enjoys linear speed up at rate $\tilde{O}((mnT)^{-2/3})$ on smooth non-convex objectives and $\tilde{O}((mnT)^{-2})$ under PL condition, where $n$ denotes the number of parallel workers, $m$ denotes the number of examples per worker and $T$ denotes the number of epochs. Empirically, we show on various applications including GLUE, CIFAR10 and WikiText-2 that D-GraB outperforms naive parallel GraB and Distributed Random Reshuffling in terms of both training and validation performance.
    Quantum Graph Learning: Frontiers and Outlook. (arXiv:2302.00892v1 [cs.LG])
    Quantum theory has shown its superiority in enhancing machine learning. However, facilitating quantum theory to enhance graph learning is in its infancy. This survey investigates the current advances in quantum graph learning (QGL) from three perspectives, i.e., underlying theories, methods, and prospects. We first look at QGL and discuss the mutualism of quantum theory and graph learning, the specificity of graph-structured data, and the bottleneck of graph learning, respectively. A new taxonomy of QGL is presented, i.e., quantum computing on graphs, quantum graph representation, and quantum circuits for graph neural networks. Pitfall traps are then highlighted and explained. This survey aims to provide a brief but insightful introduction to this emerging field, along with a detailed discussion of frontiers and outlook yet to be investigated.
    Teaching MLOps in Higher Education through Project-Based Learning. (arXiv:2302.01048v1 [cs.SE])
    Building and maintaining production-grade ML-enabled components is a complex endeavor that goes beyond the current approach of academic education, focused on the optimization of ML model performance in the lab. In this paper, we present a project-based learning approach to teaching MLOps, focused on the demonstration and experience with emerging practices and tools to automatize the construction of ML-enabled components. We examine the design of a course based on this approach, including laboratory sessions that cover the end-to-end ML component life cycle, from model building to production deployment. Moreover, we report on preliminary results from the first edition of the course. During the present year, an updated version of the same course is being delivered in two independent universities; the related learning outcomes will be evaluated to analyze the effectiveness of project-based learning for this specific subject.
    SimMTM: A Simple Pre-Training Framework for Masked Time-Series Modeling. (arXiv:2302.00861v1 [cs.LG])
    Time series analysis is widely used in extensive areas. Recently, to reduce labeling expenses and benefit various tasks, self-supervised pre-training has attracted immense interest. One mainstream paradigm is masked modeling, which successfully pre-trains deep models by learning to reconstruct the masked content based on the unmasked part. However, since the semantic information of time series is mainly contained in temporal variations, the standard way of randomly masking a portion of time points will ruin vital temporal variations of time series seriously, making the reconstruction task too difficult to guide representation learning. We thus present SimMTM, a Simple pre-training framework for Masked Time-series Modeling. By relating masked modeling to manifold learning, SimMTM proposes to recover masked time points by the weighted aggregation of multiple neighbors outside the manifold, which eases the reconstruction task by assembling ruined but complementary temporal variations from multiple masked series. SimMTM further learns to uncover the local structure of the manifold helpful for masked modeling. Experimentally, SimMTM achieves state-of-the-art fine-tuning performance in two canonical time series analysis tasks: forecasting and classification, covering both in- and cross-domain settings.
    Resilient Binary Neural Network. (arXiv:2302.00956v1 [cs.LG])
    Binary neural networks (BNNs) have received ever-increasing popularity for their great capability of reducing storage burden as well as quickening inference time. However, there is a severe performance drop compared with {real-valued} networks, due to its intrinsic frequent weight oscillation during training. In this paper, we introduce a Resilient Binary Neural Network (ReBNN) to mitigate the frequent oscillation for better BNNs' training. We identify that the weight oscillation mainly stems from the non-parametric scaling factor. To address this issue, we propose to parameterize the scaling factor and introduce a weighted reconstruction loss to build an adaptive training objective. %To the best of our knowledge, it is the first work to solve BNNs based on a dynamically re-weighted loss function. For the first time, we show that the weight oscillation is controlled by the balanced parameter attached to the reconstruction loss, which provides a theoretical foundation to parameterize it in back propagation. Based on this, we learn our ReBNN by {calculating} the {balanced} parameter {based on} its maximum magnitude, which can effectively mitigate the weight oscillation with a resilient training process. Extensive experiments are conducted upon various network models, such as ResNet and Faster-RCNN for computer vision, as well as BERT for natural language processing. The results demonstrate the overwhelming performance of our ReBNN over prior arts. For example, our ReBNN achieves 66.9\% Top-1 accuracy with ResNet-18 backbone on the ImageNet dataset, surpassing existing state-of-the-arts by a significant margin. Our code is open-sourced at https://github.com/SteveTsui/ReBNN.
    Algorithm Design for Online Meta-Learning with Task Boundary Detection. (arXiv:2302.00857v1 [cs.LG])
    Online meta-learning has recently emerged as a marriage between batch meta-learning and online learning, for achieving the capability of quick adaptation on new tasks in a lifelong manner. However, most existing approaches focus on the restrictive setting where the distribution of the online tasks remains fixed with known task boundaries. In this work, we relax these assumptions and propose a novel algorithm for task-agnostic online meta-learning in non-stationary environments. More specifically, we first propose two simple but effective detection mechanisms of task switches and distribution shift based on empirical observations, which serve as a key building block for more elegant online model updates in our algorithm: the task switch detection mechanism allows reusing of the best model available for the current task at hand, and the distribution shift detection mechanism differentiates the meta model update in order to preserve the knowledge for in-distribution tasks and quickly learn the new knowledge for out-of-distribution tasks. In particular, our online meta model updates are based only on the current data, which eliminates the need of storing previous data as required in most existing methods. We further show that a sublinear task-averaged regret can be achieved for our algorithm under mild conditions. Empirical studies on three different benchmarks clearly demonstrate the significant advantage of our algorithm over related baseline approaches.
    Unpaired Multi-Domain Causal Representation Learning. (arXiv:2302.00993v1 [stat.ML])
    The goal of causal representation learning is to find a representation of data that consists of causally related latent variables. We consider a setup where one has access to data from multiple domains that potentially share a causal representation. Crucially, observations in different domains are assumed to be unpaired, that is, we only observe the marginal distribution in each domain but not their joint distribution. In this paper, we give sufficient conditions for identifiability of the joint distribution and the shared causal graph in a linear setup. Identifiability holds if we can uniquely recover the joint distribution and the shared causal representation from the marginal distributions in each domain. We transform our identifiability results into a practical method to recover the shared latent causal graph. Moreover, we study how multiple domains reduce errors in falsely detecting shared causal variables in the finite data setting.
    Deep-Learning Tool for Early Identifying Non-Traumatic Intracranial Hemorrhage Etiology based on CT Scan. (arXiv:2302.00953v1 [eess.IV])
    Background: To develop an artificial intelligence system that can accurately identify acute non-traumatic intracranial hemorrhage (ICH) etiology based on non-contrast CT (NCCT) scans and investigate whether clinicians can benefit from it in a diagnostic setting. Materials and Methods: The deep learning model was developed with 1868 eligible NCCT scans with non-traumatic ICH collected between January 2011 and April 2018. We tested the model on two independent datasets (TT200 and SD 98) collected after April 2018. The model's diagnostic performance was compared with clinicians's performance. We further designed a simulated study to compare the clinicians's performance with and without the deep learning system augmentation. Results: The proposed deep learning system achieved area under the receiver operating curve of 0.986 (95% CI 0.967-1.000) on aneurysms, 0.952 (0.917-0.987) on hypertensive hemorrhage, 0.950 (0.860-1.000) on arteriovenous malformation (AVM), 0.749 (0.586-0.912) on Moyamoya disease (MMD), 0.837 (0.704-0.969) on cavernous malformation (CM), and 0.839 (0.722-0.959) on other causes in TT200 dataset. Given a 90% specificity level, the sensitivities of our model were 97.1% and 90.9% for aneurysm and AVM diagnosis, respectively. The model also shows an impressive generalizability in an independent dataset SD98. The clinicians achieve significant improvements in the sensitivity, specificity, and accuracy of diagnoses of certain hemorrhage etiologies with proposed system augmentation. Conclusions: The proposed deep learning algorithms can be an effective tool for early identification of hemorrhage etiologies based on NCCT scans. It may also provide more information for clinicians for triage and further imaging examination selection.
    Versatile Energy-Based Models for High Energy Physics. (arXiv:2302.00695v1 [cs.LG])
    Energy-based models have the natural advantage of flexibility in the form of the energy function. Recently, energy-based models have achieved great success in modeling high-dimensional data in computer vision and natural language processing. In accordance with these signs of progress, we build a versatile energy-based model for High Energy Physics events at the Large Hadron Collider. This framework builds on a powerful generative model and describes higher-order inter-particle interactions. It suits different encoding architectures and builds on implicit generation. As for applicational aspects, it can serve as a powerful parameterized event generator, a generic anomalous signal detector, and an augmented event classifier.
  • Open

    Algorithm Design for Online Meta-Learning with Task Boundary Detection. (arXiv:2302.00857v1 [cs.LG])
    Online meta-learning has recently emerged as a marriage between batch meta-learning and online learning, for achieving the capability of quick adaptation on new tasks in a lifelong manner. However, most existing approaches focus on the restrictive setting where the distribution of the online tasks remains fixed with known task boundaries. In this work, we relax these assumptions and propose a novel algorithm for task-agnostic online meta-learning in non-stationary environments. More specifically, we first propose two simple but effective detection mechanisms of task switches and distribution shift based on empirical observations, which serve as a key building block for more elegant online model updates in our algorithm: the task switch detection mechanism allows reusing of the best model available for the current task at hand, and the distribution shift detection mechanism differentiates the meta model update in order to preserve the knowledge for in-distribution tasks and quickly learn the new knowledge for out-of-distribution tasks. In particular, our online meta model updates are based only on the current data, which eliminates the need of storing previous data as required in most existing methods. We further show that a sublinear task-averaged regret can be achieved for our algorithm under mild conditions. Empirical studies on three different benchmarks clearly demonstrate the significant advantage of our algorithm over related baseline approaches.
    Large-scale Stochastic Optimization of NDCG Surrogates for Deep Learning with Provable Convergence. (arXiv:2202.12183v5 [cs.LG] UPDATED)
    NDCG, namely Normalized Discounted Cumulative Gain, is a widely used ranking metric in information retrieval and machine learning. However, efficient and provable stochastic methods for maximizing NDCG are still lacking, especially for deep models. In this paper, we propose a principled approach to optimize NDCG and its top-$K$ variant. First, we formulate a novel compositional optimization problem for optimizing the NDCG surrogate, and a novel bilevel compositional optimization problem for optimizing the top-$K$ NDCG surrogate. Then, we develop efficient stochastic algorithms with provable convergence guarantees for the non-convex objectives. Different from existing NDCG optimization methods, the per-iteration complexity of our algorithms scales with the mini-batch size instead of the number of total items. To improve the effectiveness for deep learning, we further propose practical strategies by using initial warm-up and stop gradient operator. Experimental results on multiple datasets demonstrate that our methods outperform prior ranking approaches in terms of NDCG. To the best of our knowledge, this is the first time that stochastic algorithms are proposed to optimize NDCG with a provable convergence guarantee. Our proposed methods are implemented in the LibAUC library at https://libauc.org/.
    What Language Reveals about Perception: Distilling Psychophysical Knowledge from Large Language Models. (arXiv:2302.01308v1 [cs.CL])
    Understanding the extent to which the perceptual world can be recovered from language is a fundamental problem in cognitive science. We reformulate this problem as that of distilling psychophysical information from text and show how this can be done by combining large language models (LLMs) with a classic psychophysical method based on similarity judgments. Specifically, we use the prompt auto-completion functionality of GPT3, a state-of-the-art LLM, to elicit similarity scores between stimuli and then apply multidimensional scaling to uncover their underlying psychological space. We test our approach on six perceptual domains and show that the elicited judgments strongly correlate with human data and successfully recover well-known psychophysical structures such as the color wheel and pitch spiral. We also explore meaningful divergences between LLM and human representations. Our work showcases how combining state-of-the-art machine models with well-known cognitive paradigms can shed new light on fundamental questions in perception and language research.
    Unconstrained Dynamic Regret via Sparse Coding. (arXiv:2301.13349v1 [cs.LG] CROSS LISTED)
    Motivated by time series forecasting, we study Online Linear Optimization (OLO) under the coupling of two problem structures: the domain is unbounded, and the performance of an algorithm is measured by its dynamic regret. Handling either of them requires the regret bound to depend on certain complexity measure of the comparator sequence -- specifically, the comparator norm in unconstrained OLO, and the path length in dynamic regret. In contrast to a recent work (Jacobsen & Cutkosky, 2022) that adapts to the combination of these two complexity measures, we propose an alternative complexity measure by recasting the problem into sparse coding. Adaptivity can be achieved by a simple modular framework, which naturally exploits more intricate prior knowledge of the environment. Along the way, we also present a new gradient adaptive algorithm for static unconstrained OLO, designed using novel continuous time machinery. This could be of independent interest.
    A Machine Learning Approach to Measuring Climate Adaptation. (arXiv:2302.01236v1 [stat.AP])
    I measure adaptation to climate change by comparing elasticities from short-run and long-run changes in damaging weather. I propose a debiased machine learning approach to flexibly measure these elasticities in panel settings. In a simulation exercise, I show that debiased machine learning has considerable benefits relative to standard machine learning or ordinary least squares, particularly in high-dimensional settings. I then measure adaptation to damaging heat exposure in United States corn and soy production. Using rich sets of temperature and precipitation variation, I find evidence that short-run impacts from damaging heat are significantly offset in the long run. I show that this is because the impacts of long-run changes in heat exposure do not follow the same functional form as short-run shocks to heat exposure.
    Interventional and Counterfactual Inference with Diffusion Models. (arXiv:2302.00860v1 [stat.ML])
    We consider the problem of answering observational, interventional, and counterfactual queries in a causally sufficient setting where only observational data and the causal graph are available. Utilizing the recent developments in diffusion models, we introduce diffusion-based causal models (DCM) to learn causal mechanisms, that generate unique latent encodings to allow for direct sampling under interventions as well as abduction for counterfactuals. We utilize DCM to model structural equations, seeing that diffusion models serve as a natural candidate here since they encode each node to a latent representation, a proxy for the exogenous noise, and offer flexible and accurate modeling to provide reliable causal statements and estimates. Our empirical evaluations demonstrate significant improvements over existing state-of-the-art methods for answering causal queries. Our theoretical results provide a methodology for analyzing the counterfactual error for general encoder/decoder models which could be of independent interest.
    Correlated Initialization for Correlated Data. (arXiv:2003.04422v2 [cs.LG] UPDATED)
    Spatial data exhibits the property that nearby points are correlated. This also holds for learnt representations across layers, but not for commonly used weight initialization methods. Our theoretical analysis quantifies the learning behavior of weights of a single spatial filter. It is thus in contrast to a large body of work that discusses statistical properties of weights. It shows that uncorrelated initialization (i) might lead to poor convergence behavior and (ii) training of (some) parameters is likely subject to slow convergence. Empirical analysis shows that these findings for a single spatial filter extend to networks with many spatial filters. The impact of (correlated) initialization depends strongly on learning rates and l2-regularization.
    Analysis of Knowledge Transfer in Kernel Regime. (arXiv:2003.13438v3 [cs.LG] UPDATED)
    Knowledge transfer is shown to be a very successful technique for training neural classifiers: together with the ground truth data, it uses the "privileged information" (PI) obtained by a "teacher" network to train a "student" network. It has been observed that classifiers learn much faster and more reliably via knowledge transfer. However, there has been little or no theoretical analysis of this phenomenon. To bridge this gap, we propose to approach the problem of knowledge transfer by regularizing the fit between the teacher and the student with PI provided by the teacher. Using tools from dynamical systems theory, we show that when the student is an extremely wide two layer network, we can analyze it in the kernel regime and show that it is able to interpolate between PI and the given data. This characterization sheds new light on the relation between the training error and capacity of the student relative to the teacher. Another contribution of the paper is a quantitative statement on the convergence of student network. We prove that the teacher reduces the number of required iterations for a student to learn, and consequently improves the generalization power of the student. We give corresponding experimental analysis that validates the theoretical results and yield additional insights.
    An Exponentially Increasing Step-size for Parameter Estimation in Statistical Models. (arXiv:2205.07999v2 [stat.ML] UPDATED)
    Using gradient descent (GD) with fixed or decaying step-size is a standard practice in unconstrained optimization problems. However, when the loss function is only locally convex, such a step-size schedule artificially slows GD down as it cannot explore the flat curvature of the loss function. To overcome that issue, we propose to exponentially increase the step-size of the GD algorithm. Under homogeneous assumptions on the loss function, we demonstrate that the iterates of the proposed \emph{exponential step size gradient descent} (EGD) algorithm converge linearly to the optimal solution. Leveraging that optimization insight, we then consider using the EGD algorithm for solving parameter estimation under both regular and non-regular statistical models whose loss function becomes locally convex when the sample size goes to infinity. We demonstrate that the EGD iterates reach the final statistical radius within the true parameter after a logarithmic number of iterations, which is in stark contrast to a \emph{polynomial} number of iterations of the GD algorithm in non-regular statistical models. Therefore, the total computational complexity of the EGD algorithm is \emph{optimal} and exponentially cheaper than that of the GD for solving parameter estimation in non-regular statistical models while being comparable to that of the GD in regular statistical settings. To the best of our knowledge, it resolves a long-standing gap between statistical and algorithmic computational complexities of parameter estimation in non-regular statistical models. Finally, we provide targeted applications of the general theory to several classes of statistical models, including generalized linear models with polynomial link functions and location Gaussian mixture models.
    Efficient Privacy-Preserving Stochastic Nonconvex Optimization. (arXiv:1910.13659v3 [cs.LG] UPDATED)
    While many solutions for privacy-preserving convex empirical risk minimization (ERM) have been developed, privacy-preserving nonconvex ERM remains a challenge. We study nonconvex ERM, which takes the form of minimizing a finite-sum of nonconvex loss functions over a training set. We propose a new differentially private stochastic gradient descent algorithm for nonconvex ERM that achieves strong privacy guarantees efficiently, and provide a tight analysis of its privacy and utility guarantees, as well as its gradient complexity. Our algorithm reduces gradient complexity while improves the best previous utility guarantee given by Wang et al. (NeurIPS 2017). Our experiments on benchmark nonconvex ERM problems demonstrate superior performance in terms of both training cost and utility gains compared with previous differentially private methods using the same privacy budgets.
    Hierarchical shrinkage Gaussian processes: applications to computer code emulation and dynamical system recovery. (arXiv:2302.00755v1 [stat.ML])
    In many areas of science and engineering, computer simulations are widely used as proxies for physical experiments, which can be infeasible or unethical. Such simulations can often be computationally expensive, and an emulator can be trained to efficiently predict the desired response surface. A widely-used emulator is the Gaussian process (GP), which provides a flexible framework for efficient prediction and uncertainty quantification. Standard GPs, however, do not capture structured sparsity on the underlying response surface, which is present in many applications, particularly in the physical sciences. We thus propose a new hierarchical shrinkage GP (HierGP), which incorporates such structure via cumulative shrinkage priors within a GP framework. We show that the HierGP implicitly embeds the well-known principles of effect sparsity, heredity and hierarchy for analysis of experiments, which allows our model to identify structured sparse features from the response surface with limited data. We propose efficient posterior sampling algorithms for model training and prediction, and prove desirable consistency properties for the HierGP. Finally, we demonstrate the improved performance of HierGP over existing models, in a suite of numerical experiments and an application to dynamical system recovery.
    On the Efficacy of Differentially Private Few-shot Image Classification. (arXiv:2302.01190v1 [stat.ML])
    There has been significant recent progress in training differentially private (DP) models which achieve accuracy that approaches the best non-private models. These DP models are typically pretrained on large public datasets and then fine-tuned on downstream datasets that are (i) relatively large, and (ii) similar in distribution to the pretraining data. However, in many applications including personalization, it is crucial to perform well in the few-shot setting, as obtaining large amounts of labeled data may be problematic; and on images from a wide variety of domains for use in various specialist settings. To understand under which conditions few-shot DP can be effective, we perform an exhaustive set of experiments that reveals how the accuracy and vulnerability to attack of few-shot DP image classification models are affected as the number of shots per class, privacy level, model architecture, dataset, and subset of learnable parameters in the model vary. We show that to achieve DP accuracy on par with non-private models, the shots per class must be increased as the privacy level increases by as much as 32$\times$ for CIFAR-100 at $\epsilon=1$. We also find that few-shot non-private models are highly susceptible to membership inference attacks. DP provides clear mitigation against the attacks, but a small $\epsilon$ is required to effectively prevent them. Finally, we evaluate DP federated learning systems and establish state-of-the-art performance on the challenging FLAIR federated learning benchmark.
    Learning polytopes with fixed facet directions. (arXiv:2201.03419v4 [math.MG] UPDATED)
    We consider the task of reconstructing polytopes with fixed facet directions from finitely many support function evaluations. We show that for a fixed simplicial normal fan the least-squares estimate is given by a convex quadratic program. We study the geometry of the solution set and give a combinatorial characterization for the uniqueness of the reconstruction in this case. We provide an algorithm that, under mild assumptions, converges to the unknown input shape as the number of noisy support function evaluations increases. We also discuss limitations of our results if the restriction on the normal fan is removed.
    Stochastic Contextual Bandits with Long Horizon Rewards. (arXiv:2302.00814v1 [cs.LG])
    The growing interest in complex decision-making and language modeling problems highlights the importance of sample-efficient learning over very long horizons. This work takes a step in this direction by investigating contextual linear bandits where the current reward depends on at most $s$ prior actions and contexts (not necessarily consecutive), up to a time horizon of $h$. In order to avoid polynomial dependence on $h$, we propose new algorithms that leverage sparsity to discover the dependence pattern and arm parameters jointly. We consider both the data-poor ($T<h$) and data-rich ($T\ge h$) regimes, and derive respective regret upper bounds $\tilde O(d\sqrt{sT} +\min\{ q, T\})$ and $\tilde O(\sqrt{sdT})$, with sparsity $s$, feature dimension $d$, total time horizon $T$, and $q$ that is adaptive to the reward dependence pattern. Complementing upper bounds, we also show that learning over a single trajectory brings inherent challenges: While the dependence pattern and arm parameters form a rank-1 matrix, circulant matrices are not isometric over rank-1 manifolds and sample complexity indeed benefits from the sparse reward dependence structure. Our results necessitate a new analysis to address long-range temporal dependencies across data and avoid polynomial dependence on the reward horizon $h$. Specifically, we utilize connections to the restricted isometry property of circulant matrices formed by dependent sub-Gaussian vectors and establish new guarantees that are also of independent interest.
    The Value of Out-of-Distribution Data. (arXiv:2208.10967v3 [cs.LG] UPDATED)
    We expect the generalization error to improve with more samples from a similar task, and to deteriorate with more samples from an out-of-distribution (OOD) task. In this work, we show a counter-intuitive phenomenon: the generalization error of a task can be a non-monotonic function of the number of OOD samples. As the number of OOD samples increases, the generalization error on the target task improves before deteriorating beyond a threshold. In other words, there is value in training on small amounts of OOD data. We use Fisher's Linear Discriminant on synthetic datasets and deep networks on computer vision benchmarks such as MNIST, CIFAR-10, CINIC-10, PACS and DomainNet to demonstrate and analyze this phenomenon. In the idealistic setting where we know which samples are OOD, we show that these non-monotonic trends can be exploited using an appropriately weighted objective of the target and OOD empirical risk. While its practical utility is limited, this does suggest that if we can detect OOD samples, then there may be ways to benefit from them. When we do not know which samples are OOD, we show how a number of go-to strategies such as data-augmentation, hyper-parameter optimization, and pre-training are not enough to ensure that the target generalization error does not deteriorate with the number of OOD samples in the dataset.
    Sample Complexity of Kernel-Based Q-Learning. (arXiv:2302.00727v1 [cs.LG])
    Modern reinforcement learning (RL) often faces an enormous state-action space. Existing analytical results are typically for settings with a small number of state-actions, or simple models such as linearly modeled Q-functions. To derive statistically efficient RL policies handling large state-action spaces, with more general Q-functions, some recent works have considered nonlinear function approximation using kernel ridge regression. In this work, we derive sample complexities for kernel based Q-learning when a generative model exists. We propose a nonparametric Q-learning algorithm which finds an $\epsilon$-optimal policy in an arbitrarily large scale discounted MDP. The sample complexity of the proposed algorithm is order optimal with respect to $\epsilon$ and the complexity of the kernel (in terms of its information gain). To the best of our knowledge, this is the first result showing a finite sample complexity under such a general model.
    Unsupervised Learning of Sampling Distributions for Particle Filters. (arXiv:2302.01174v1 [eess.SP])
    Accurate estimation of the states of a nonlinear dynamical system is crucial for their design, synthesis, and analysis. Particle filters are estimators constructed by simulating trajectories from a sampling distribution and averaging them based on their importance weight. For particle filters to be computationally tractable, it must be feasible to simulate the trajectories by drawing from the sampling distribution. Simultaneously, these trajectories need to reflect the reality of the nonlinear dynamical system so that the resulting estimators are accurate. Thus, the crux of particle filters lies in designing sampling distributions that are both easy to sample from and lead to accurate estimators. In this work, we propose to learn the sampling distributions. We put forward four methods for learning sampling distributions from observed measurements. Three of the methods are parametric methods in which we learn the mean and covariance matrix of a multivariate Gaussian distribution; each methods exploits a different aspect of the data (generic, time structure, graph structure). The fourth method is a nonparametric alternative in which we directly learn a transform of a uniform random variable. All four methods are trained in an unsupervised manner by maximizing the likelihood that the states may have produced the observed measurements. Our computational experiments demonstrate that learned sampling distributions exhibit better performance than designed, minimum-degeneracy sampling distributions.
    Avoiding Model Estimation in Robust Markov Decision Processes with a Generative Model. (arXiv:2302.01248v1 [stat.ML])
    Robust Markov Decision Processes (MDPs) are getting more attention for learning a robust policy which is less sensitive to environment changes. There are an increasing number of works analyzing sample-efficiency of robust MDPs. However, most works study robust MDPs in a model-based regime, where the transition probability needs to be estimated and requires $\mathcal{O}(|\mathcal{S}|^2|\mathcal{A}|)$ storage in memory. A common way to solve robust MDPs is to formulate them as a distributionally robust optimization (DRO) problem. However, solving a DRO problem is non-trivial, so prior works typically assume a strong oracle to obtain the optimal solution of the DRO problem easily. To remove the need for an oracle, we first transform the original robust MDPs into an alternative form, as the alternative form allows us to use stochastic gradient methods to solve the robust MDPs. Moreover, we prove the alternative form still preserves the role of robustness. With this new formulation, we devise a sample-efficient algorithm to solve the robust MDPs in a model-free regime, from which we benefit lower memory space $\mathcal{O}(|\mathcal{S}||\mathcal{A}|)$ without using the oracle. Finally, we validate our theoretical findings via numerical experiments and show the efficiency to solve the alternative form of robust MDPs.  ( 2 min )
    Fast Online Value-Maximizing Prediction Sets with Conformal Cost Control. (arXiv:2302.00839v1 [cs.LG])
    Many real-world multi-label prediction problems involve set-valued predictions that must satisfy specific requirements dictated by downstream usage. We focus on a typical scenario where such requirements, separately encoding \textit{value} and \textit{cost}, compete with each other. For instance, a hospital might expect a smart diagnosis system to capture as many severe, often co-morbid, diseases as possible (the value), while maintaining strict control over incorrect predictions (the cost). We present a general pipeline, dubbed as FavMac, to maximize the value while controlling the cost in such scenarios. FavMac can be combined with almost any multi-label classifier, affording distribution-free theoretical guarantees on cost control. Moreover, unlike prior works, FavMac can handle real-world large-scale applications via a carefully designed online update mechanism, which is of independent interest. Our methodological and theoretical contributions are supported by experiments on several healthcare tasks and synthetic datasets - FavMac furnishes higher value compared with several variants and baselines while maintaining strict cost control.  ( 2 min )
    Causal Lifting and Link Prediction. (arXiv:2302.01198v1 [cs.LG])
    Current state-of-the-art causal models for link prediction assume an underlying set of inherent node factors -- an innate characteristic defined at the node's birth -- that governs the causal evolution of links in the graph. In some causal tasks, however, link formation is path-dependent, i.e., the outcome of link interventions depends on existing links. For instance, in the customer-product graph of an online retailer, the effect of an 85-inch TV ad (treatment) likely depends on whether the costumer already has an 85-inch TV. Unfortunately, existing causal methods are impractical in these scenarios. The cascading functional dependencies between links (due to path dependence) are either unidentifiable or require an impractical number of control variables. In order to remedy this shortcoming, this work develops the first causal model capable of dealing with path dependencies in link prediction. It introduces the concept of causal lifting, an invariance in causal models that, when satisfied, allows the identification of causal link prediction queries using limited interventional data. On the estimation side, we show how structural pairwise embeddings -- a type of symmetry-based joint representation of node pairs in a graph -- exhibit lower bias and correctly represent the causal structure of the task, as opposed to existing node embedding methods, e.g., GNNs and matrix factorization. Finally, we validate our theoretical findings on four datasets under three different scenarios for causal link prediction tasks: knowledge base completion, covariance matrix estimation and consumer-product recommendations.  ( 2 min )
    Neural Estimation of the Rate-Distortion Function With Applications to Operational Source Coding. (arXiv:2204.01612v2 [cs.IT] UPDATED)
    A fundamental question in designing lossy data compression schemes is how well one can do in comparison with the rate-distortion function, which describes the known theoretical limits of lossy compression. Motivated by the empirical success of deep neural network (DNN) compressors on large, real-world data, we investigate methods to estimate the rate-distortion function on such data, which would allow comparison of DNN compressors with optimality. While one could use the empirical distribution of the data and apply the Blahut-Arimoto algorithm, this approach presents several computational challenges and inaccuracies when the datasets are large and high-dimensional, such as the case of modern image datasets. Instead, we re-formulate the rate-distortion objective, and solve the resulting functional optimization problem using neural networks. We apply the resulting rate-distortion estimator, called NERD, on popular image datasets, and provide evidence that NERD can accurately estimate the rate-distortion function. Using our estimate, we show that the rate-distortion achievable by DNN compressors are within several bits of the rate-distortion function for real-world datasets. Additionally, NERD provides access to the rate-distortion achieving channel, as well as samples from its output marginal. Therefore, using recent results in reverse channel coding, we describe how NERD can be used to construct an operational one-shot lossy compression scheme with guarantees on the achievable rate and distortion. Experimental results demonstrate competitive performance with DNN compressors.  ( 2 min )
    On the Effective Number of Linear Regions in Shallow Univariate ReLU Networks: Convergence Guarantees and Implicit Bias. (arXiv:2205.09072v2 [cs.LG] UPDATED)
    We study the dynamics and implicit bias of gradient flow (GF) on univariate ReLU neural networks with a single hidden layer in a binary classification setting. We show that when the labels are determined by the sign of a target network with $r$ neurons, with high probability over the initialization of the network and the sampling of the dataset, GF converges in direction (suitably defined) to a network achieving perfect training accuracy and having at most $\mathcal{O}(r)$ linear regions, implying a generalization bound. Unlike many other results in the literature, under an additional assumption on the distribution of the data, our result holds even for mild over-parameterization, where the width is $\tilde{\mathcal{O}}(r)$ and independent of the sample size.  ( 2 min )
    An Instrumental Variable Approach to Confounded Off-Policy Evaluation. (arXiv:2212.14468v2 [stat.ML] UPDATED)
    Off-policy evaluation (OPE) is a method for estimating the return of a target policy using some pre-collected observational data generated by a potentially different behavior policy. In some cases, there may be unmeasured variables that can confound the action-reward or action-next-state relationships, rendering many existing OPE approaches ineffective. This paper develops an instrumental variable (IV)-based method for consistent OPE in confounded Markov decision processes (MDPs). Similar to single-stage decision making, we show that IV enables us to correctly identify the target policy's value in infinite horizon settings as well. Furthermore, we propose an efficient and robust value estimator and illustrate its effectiveness through extensive simulations and analysis of real data from a world-leading short-video platform.  ( 2 min )
    Sharp Lower Bounds on Interpolation by Deep ReLU Neural Networks at Irregularly Spaced Data. (arXiv:2302.00834v1 [cs.LG])
    We study the interpolation, or memorization, power of deep ReLU neural networks. Specifically, we consider the question of how efficiently, in terms of the number of parameters, deep ReLU networks can interpolate values at $N$ datapoints in the unit ball which are separated by a distance $\delta$. We show that $\Omega(N)$ parameters are required in the regime where $\delta$ is exponentially small in $N$, which gives the sharp result in this regime since $O(N)$ parameters are always sufficient. This also shows that the bit-extraction technique used to prove lower bounds on the VC dimension cannot be applied to irregularly spaced datapoints.  ( 2 min )
    Safe Optimization of an Industrial Refrigeration Process Using an Adaptive and Explorative Framework. (arXiv:2211.13019v2 [math.OC] UPDATED)
    Many industrial applications rely on real-time optimization to improve key performance indicators. In the case of unknown process characteristics, real-time optimization becomes challenging, particularly for the satisfaction of safety constraints. In this paper, we demonstrate the application of an adaptive and explorative real-time optimization framework to an industrial refrigeration process, where we learn the process characteristics through changes in process control targets and through exploration to satisfy safety constraints. We quantify the uncertainty in unknown compressor characteristics of the refrigeration plant by using Gaussian processes and incorporate this uncertainty into the objective function of the real-time optimization problem as a weighted cost term. We adaptively control the weight of this term to drive exploration. The results of our simulation experiments indicate the proposed approach can help to increase the energy efficiency of the considered refrigeration process, closely approximating the performance of a solution that has complete information about the compressor performance characteristics.  ( 2 min )
    FAVOR#: Sharp Attention Kernel Approximations via New Classes of Positive Random Features. (arXiv:2302.00787v1 [cs.LG])
    The problem of efficient approximation of a linear operator induced by the Gaussian or softmax kernel is often addressed using random features (RFs) which yield an unbiased approximation of the operator's result. Such operators emerge in important applications ranging from kernel methods to efficient Transformers. We propose parameterized, positive, non-trigonometric RFs which approximate Gaussian and softmax-kernels. In contrast to traditional RF approximations, parameters of these new methods can be optimized to reduce the variance of the approximation, and the optimum can be expressed in closed form. We show that our methods lead to variance reduction in practice ($e^{10}$-times smaller variance and beyond) and outperform previous methods in a kernel regression task. Using our proposed mechanism, we also present FAVOR#, a method for self-attention approximation in Transformers. We show that FAVOR# outperforms other random feature methods in speech modelling and natural language processing.
    Oracle-Preserving Latent Flows. (arXiv:2302.00806v1 [cs.LG])
    We develop a deep learning methodology for the simultaneous discovery of multiple nontrivial continuous symmetries across an entire labelled dataset. The symmetry transformations and the corresponding generators are modeled with fully connected neural networks trained with a specially constructed loss function ensuring the desired symmetry properties. The two new elements in this work are the use of a reduced-dimensionality latent space and the generalization to transformations invariant with respect to high-dimensional oracles. The method is demonstrated with several examples on the MNIST digit dataset.
    Post-hoc Concept Bottleneck Models. (arXiv:2205.15480v2 [cs.LG] UPDATED)
    Concept Bottleneck Models (CBMs) map the inputs onto a set of interpretable concepts (``the bottleneck'') and use the concepts to make predictions. A concept bottleneck enhances interpretability since it can be investigated to understand what concepts the model "sees" in an input and which of these concepts are deemed important. However, CBMs are restrictive in practice as they require dense concept annotations in the training data to learn the bottleneck. Moreover, CBMs often do not match the accuracy of an unrestricted neural network, reducing the incentive to deploy them in practice. In this work, we address these limitations of CBMs by introducing Post-hoc Concept Bottleneck models (PCBMs). We show that we can turn any neural network into a PCBM without sacrificing model performance while still retaining the interpretability benefits. When concept annotations are not available on the training data, we show that PCBM can transfer concepts from other datasets or from natural language descriptions of concepts via multimodal models. A key benefit of PCBM is that it enables users to quickly debug and update the model to reduce spurious correlations and improve generalization to new distributions. PCBM allows for global model edits, which can be more efficient than previous works on local interventions that fix a specific prediction. Through a model-editing user study, we show that editing PCBMs via concept-level feedback can provide significant performance gains without using data from the target domain or model retraining.  ( 2 min )
    Timewarp: Transferable Acceleration of Molecular Dynamics by Learning Time-Coarsened Dynamics. (arXiv:2302.01170v1 [stat.ML])
    Molecular dynamics (MD) simulation is a widely used technique to simulate molecular systems, most commonly at the all-atom resolution where the equations of motion are integrated with timesteps on the order of femtoseconds ($1\textrm{fs}=10^{-15}\textrm{s}$). MD is often used to compute equilibrium properties, which requires sampling from an equilibrium distribution such as the Boltzmann distribution. However, many important processes, such as binding and folding, occur over timescales of milliseconds or beyond, and cannot be efficiently sampled with conventional MD. Furthermore, new MD simulations need to be performed from scratch for each molecular system studied. We present Timewarp, an enhanced sampling method which uses a normalising flow as a proposal distribution in a Markov chain Monte Carlo method targeting the Boltzmann distribution. The flow is trained offline on MD trajectories and learns to make large steps in time, simulating the molecular dynamics of $10^{5} - 10^{6}\:\textrm{fs}$. Crucially, Timewarp is transferable between molecular systems: once trained, we show that it generalises to unseen small peptides (2-4 amino acids), exploring their metastable states and providing wall-clock acceleration when sampling compared to standard MD. Our method constitutes an important step towards developing general, transferable algorithms for accelerating MD.  ( 2 min )
    Sketched Ridgeless Linear Regression: The Role of Downsampling. (arXiv:2302.01088v1 [math.ST])
    Overparametrization often helps improve the generalization performance. This paper proposes a dual view of overparametrization suggesting that downsampling may also help generalize. Motivated by this dual view, we characterize two out-of-sample prediction risks of the sketched ridgeless least square estimator in the proportional regime $m\asymp n \asymp p$, where $m$ is the sketching size, $n$ the sample size, and $p$ the feature dimensionality. Our results reveal the statistical role of downsampling. Specifically, downsampling does not always hurt the generalization performance, and may actually help improve it in some cases. We identify the optimal sketching sizes that minimize the out-of-sample prediction risks, and find that the optimally sketched estimator has stabler risk curves that eliminates the peaks of those for the full-sample estimator. We then propose a practical procedure to empirically identify the optimal sketching size. Finally, we extend our results to cover central limit theorems and misspecified models. Numerical studies strongly support our theory.  ( 2 min )
    MonoFlow: Rethinking Divergence GANs via the Perspective of Differential Equations. (arXiv:2302.01075v1 [stat.ML])
    The conventional understanding of adversarial training in generative adversarial networks (GANs) is that the discriminator is trained to estimate a divergence, and the generator learns to minimize this divergence. We argue that despite the fact that many variants of GANs were developed following this paradigm, the current theoretical understanding of GANs and their practical algorithms are inconsistent. In this paper, we leverage Wasserstein gradient flows which characterize the evolution of particles in the sample space, to gain theoretical insights and algorithmic inspiration of GANs. We introduce a unified generative modeling framework - MonoFlow: the particle evolution is rescaled via a monotonically increasing mapping of the log density ratio. Under our framework, adversarial training can be viewed as a procedure first obtaining MonoFlow's vector field via training the discriminator and the generator learns to draw the particle flow defined by the corresponding vector field. We also reveal the fundamental difference between variational divergence minimization and adversarial training. This analysis helps us to identify what types of generator loss functions can lead to the successful training of GANs and suggest that GANs may have more loss designs beyond the literature (e.g., non-saturated loss), as long as they realize MonoFlow. Consistent empirical studies are included to validate the effectiveness of our framework.  ( 2 min )
    The Contextual Lasso: Sparse Linear Models via Deep Neural Networks. (arXiv:2302.00878v1 [stat.ML])
    Sparse linear models are a gold standard tool for interpretable machine learning, a field of emerging importance as predictive models permeate decision-making in many domains. Unfortunately, sparse linear models are far less flexible as functions of their input features than black-box models like deep neural networks. With this capability gap in mind, we study a not-uncommon situation where the input features dichotomize into two groups: explanatory features, which we wish to explain the model's predictions, and contextual features, which we wish to determine the model's explanations. This dichotomy leads us to propose the contextual lasso, a new statistical estimator that fits a sparse linear model whose sparsity pattern and coefficients can vary with the contextual features. The fitting process involves learning a nonparametric map, realized via a deep neural network, from contextual feature vector to sparse coefficient vector. To attain sparse coefficients, we train the network with a novel lasso regularizer in the form of a projection layer that maps the network's output onto the space of $\ell_1$-constrained linear models. Extensive experiments on real and synthetic data suggest that the learned models, which remain highly transparent, can be sparser than the regular lasso without sacrificing the predictive power of a standard deep neural network.  ( 2 min )
    High-dimensional variable clustering based on sub-asymptotic maxima of a weakly dependent random process. (arXiv:2302.00934v1 [math.ST])
    We propose a new class of models for variable clustering called Asymptotic Independent block (AI-block) models, which defines population-level clusters based on the independence of the maxima of a multivariate stationary mixing random process among clusters. This class of models is identifiable, meaning that there exists a maximal element with a partial order between partitions, allowing for statistical inference. We also present an algorithm for recovering the clusters of variables without specifying the number of clusters \emph{a priori}. Our work provides some theoritical insights into the consistency of our algorithm, demonstrating that under certain conditions it can effectively identify clusters in the data with a computational complexity that is polynomial in the dimension. This implies that groups can be learned nonparametrically in which block maxima of a dependent process are only sub-asymptotic.  ( 2 min )
    Epistemic Neural Networks. (arXiv:2107.08924v7 [cs.LG] UPDATED)
    Intelligence relies on an agent's knowledge of what it does not know. This capability can be assessed based on the quality of joint predictions of labels across multiple inputs. In principle, ensemble-based approaches produce effective joint predictions, but the computational costs of training large ensembles can become prohibitive. We introduce the epinet: an architecture that can supplement any conventional neural network, including large pretrained models, and can be trained with modest incremental computation to estimate uncertainty. With an epinet, conventional neural networks outperform very large ensembles, consisting of hundreds or more particles, with orders of magnitude less computation. The epinet does not fit the traditional framework of Bayesian neural networks. To accommodate development of approaches beyond BNNs, such as the epinet, we introduce the epistemic neural network (ENN) as an interface for models that produce joint predictions.  ( 2 min )
    "Why did the Model Fail?": Attributing Model Performance Changes to Distribution Shifts. (arXiv:2210.10769v2 [cs.LG] UPDATED)
    Performance of machine learning models may differ between training and deployment for many reasons. For instance, model performance can change between environments due to changes in data quality, observing a different population than the one in training, or changes in the relationship between labels and features. These changes result in distribution shifts across environments. Attributing model performance changes to specific shifts is critical for identifying sources of model failures, and for taking mitigating actions that ensure robust models. In this work, we introduce the problem of attributing performance differences between environments to distribution shifts in the underlying data generating mechanisms. We formulate the problem as a cooperative game where the players are distributions. We define the value of a set of distributions to be the change in model performance when only this set of distributions has changed between environments, and derive an importance weighting method for computing the value of an arbitrary set of distributions. The contribution of each distribution to the total performance change is then quantified as its Shapley value. We demonstrate the correctness and utility of our method on synthetic, semi-synthetic, and real-world case studies, showing its effectiveness in attributing performance changes to a wide range of distribution shifts.  ( 2 min )
    Rare Feature Selection in High Dimensions. (arXiv:1803.06675v3 [stat.ME] UPDATED)
    It is common in modern prediction problems for many predictor variables to be counts of rarely occurring events. This leads to design matrices in which many columns are highly sparse. The challenge posed by such "rare features" has received little attention despite its prevalence in diverse areas, ranging from natural language processing (e.g., rare words) to biology (e.g., rare species). We show, both theoretically and empirically, that not explicitly accounting for the rareness of features can greatly reduce the effectiveness of an analysis. We next propose a framework for aggregating rare features into denser features in a flexible manner that creates better predictors of the response. Our strategy leverages side information in the form of a tree that encodes feature similarity. We apply our method to data from TripAdvisor, in which we predict the numerical rating of a hotel based on the text of the associated review. Our method achieves high accuracy by making effective use of rare words; by contrast, the lasso is unable to identify highly predictive words if they are too rare. A companion R package, called rare, implements our new estimator, using the alternating direction method of multipliers.  ( 2 min )
    Do Kernel and Neural Embeddings Help in Training and Generalization?. (arXiv:1905.05095v3 [cs.LG] UPDATED)
    Recent results on optimization and generalization properties of neural networks showed that in a simple two-layer network, the alignment of the labels to the eigenvectors of the corresponding Gram matrix determines the convergence of the optimization during training. Such analyses also provide upper bounds on the generalization error. We experimentally investigate the implications of these results to deeper networks via embeddings. We regard the layers preceding the final hidden layer as producing different representations of the input data which are then fed to the two-layer model. We show that these representations improve both optimization and generalization. In particular, we investigate three kernel representations when fed to the final hidden layer: the Gaussian kernel and its approximation by random Fourier features, kernels designed to imitate representations produced by neural networks and finally an optimal kernel designed to align the data with target labels. The approximated representations induced by these kernels are fed to the neural network and the optimization and generalization properties of the final model are evaluated and compared.  ( 2 min )
    Optimal Stopping via Randomized Neural Networks. (arXiv:2104.13669v3 [stat.ML] UPDATED)
    This paper presents new machine learning approaches to approximate the solutions of optimal stopping problems. The key idea of these methods is to use neural networks, where the parameters of the hidden layers are generated randomly and only the last layer is trained, in order to approximate the continuation value. Our approaches are applicable to high dimensional problems where the existing approaches become increasingly impractical. In addition, since our approaches can be optimized using simple linear regression, they are easy to implement and theoretical guarantees are provided. Our randomized reinforcement learning approach and randomized recurrent neural network approach outperform the state-of-the-art and other relevant machine learning approaches in Markovian and non-Markovian examples, respectively. In particular, we test our approaches on Black-Scholes, Heston, rough Heston and fractional Brownian motion. Moreover, we show that they can also be used to efficiently compute Greeks of American options.  ( 2 min )
    A Theoretical Justification for Image Inpainting using Denoising Diffusion Probabilistic Models. (arXiv:2302.01217v1 [stat.ML])
    We provide a theoretical justification for sample recovery using diffusion based image inpainting in a linear model setting. While most inpainting algorithms require retraining with each new mask, we prove that diffusion based inpainting generalizes well to unseen masks without retraining. We analyze a recently proposed popular diffusion based inpainting algorithm called RePaint (Lugmayr et al., 2022), and show that it has a bias due to misalignment that hampers sample recovery even in a two-state diffusion process. Motivated by our analysis, we propose a modified RePaint algorithm we call RePaint$^+$ that provably recovers the underlying true sample and enjoys a linear rate of convergence. It achieves this by rectifying the misalignment error present in drift and dispersion of the reverse process. To the best of our knowledge, this is the first linear convergence result for a diffusion based image inpainting algorithm.  ( 2 min )
    Bayesian Optimization of Multiple Objectives with Different Latencies. (arXiv:2302.01310v1 [stat.ML])
    Multi-objective Bayesian optimization aims to find the Pareto front of optimal trade-offs between a set of expensive objectives while collecting as few samples as possible. In some cases, it is possible to evaluate the objectives separately, and a different latency or evaluation cost can be associated with each objective. This presents an opportunity to learn the Pareto front faster by evaluating the cheaper objectives more frequently. We propose a scalarization based knowledge gradient acquisition function which accounts for the different evaluation costs of the objectives. We prove consistency of the algorithm and show empirically that it significantly outperforms a benchmark algorithm which always evaluates both objectives.  ( 2 min )
    Causal Effect Estimation: Recent Advances, Challenges, and Opportunities. (arXiv:2302.00848v1 [cs.LG])
    Causal inference has numerous real-world applications in many domains, such as health care, marketing, political science, and online advertising. Treatment effect estimation, a fundamental problem in causal inference, has been extensively studied in statistics for decades. However, traditional treatment effect estimation methods may not well handle large-scale and high-dimensional heterogeneous data. In recent years, an emerging research direction has attracted increasing attention in the broad artificial intelligence field, which combines the advantages of traditional treatment effect estimation approaches (e.g., propensity score, matching, and reweighing) and advanced machine learning approaches (e.g., representation learning, adversarial learning, and graph neural networks). Although the advanced machine learning approaches have shown extraordinary performance in treatment effect estimation, it also comes with a lot of new topics and new research questions. In view of the latest research efforts in the causal inference field, we provide a comprehensive discussion of challenges and opportunities for the three core components of the treatment effect estimation task, i.e., treatment, covariates, and outcome. In addition, we showcase the promising research directions of this topic from multiple perspectives.  ( 2 min )
    Over-parameterised Shallow Neural Networks with Asymmetrical Node Scaling: Global Convergence Guarantees and Feature Learning. (arXiv:2302.01002v1 [stat.ML])
    We consider the optimisation of large and shallow neural networks via gradient flow, where the output of each hidden node is scaled by some positive parameter. We focus on the case where the node scalings are non-identical, differing from the classical Neural Tangent Kernel (NTK) parameterisation. We prove that, for large neural networks, with high probability, gradient flow converges to a global minimum AND can learn features, unlike in the NTK regime. We also provide experiments on synthetic and real-world datasets illustrating our theoretical results and showing the benefit of such scaling in terms of pruning and transfer learning.  ( 2 min )
    Robust Estimation under the Wasserstein Distance. (arXiv:2302.01237v1 [stat.ML])
    We study the problem of robust distribution estimation under the Wasserstein metric, a popular discrepancy measure between probability distributions rooted in optimal transport (OT) theory. We introduce a new outlier-robust Wasserstein distance $\mathsf{W}_p^\varepsilon$ which allows for $\varepsilon$ outlier mass to be removed from its input distributions, and show that minimum distance estimation under $\mathsf{W}_p^\varepsilon$ achieves minimax optimal robust estimation risk. Our analysis is rooted in several new results for partial OT, including an approximate triangle inequality, which may be of independent interest. To address computational tractability, we derive a dual formulation for $\mathsf{W}_p^\varepsilon$ that adds a simple penalty term to the classic Kantorovich dual objective. As such, $\mathsf{W}_p^\varepsilon$ can be implemented via an elementary modification to standard, duality-based OT solvers. Our results are extended to sliced OT, where distributions are projected onto low-dimensional subspaces, and applications to homogeneity and independence testing are explored. We illustrate the virtues of our framework via applications to generative modeling with contaminated datasets.  ( 2 min )
    Randomized prior wavelet neural operator for uncertainty quantification. (arXiv:2302.01051v1 [stat.ML])
    In this paper, we propose a novel data-driven operator learning framework referred to as the \textit{Randomized Prior Wavelet Neural Operator} (RP-WNO). The proposed RP-WNO is an extension of the recently proposed wavelet neural operator, which boasts excellent generalizing capabilities but cannot estimate the uncertainty associated with its predictions. RP-WNO, unlike the vanilla WNO, comes with inherent uncertainty quantification module and hence, is expected to be extremely useful for scientists and engineers alike. RP-WNO utilizes randomized prior networks, which can account for prior information and is easier to implement for large, complex deep-learning architectures than its Bayesian counterpart. Four examples have been solved to test the proposed framework, and the results produced advocate favorably for the efficacy of the proposed framework.  ( 2 min )
    Unpaired Multi-Domain Causal Representation Learning. (arXiv:2302.00993v1 [stat.ML])
    The goal of causal representation learning is to find a representation of data that consists of causally related latent variables. We consider a setup where one has access to data from multiple domains that potentially share a causal representation. Crucially, observations in different domains are assumed to be unpaired, that is, we only observe the marginal distribution in each domain but not their joint distribution. In this paper, we give sufficient conditions for identifiability of the joint distribution and the shared causal graph in a linear setup. Identifiability holds if we can uniquely recover the joint distribution and the shared causal representation from the marginal distributions in each domain. We transform our identifiability results into a practical method to recover the shared latent causal graph. Moreover, we study how multiple domains reduce errors in falsely detecting shared causal variables in the finite data setting.  ( 2 min )
    The Power of Preconditioning in Overparameterized Low-Rank Matrix Sensing. (arXiv:2302.01186v1 [cs.LG])
    We propose $\textsf{ScaledGD($\lambda$)}$, a preconditioned gradient descent method to tackle the low-rank matrix sensing problem when the true rank is unknown, and when the matrix is possibly ill-conditioned. Using overparametrized factor representations, $\textsf{ScaledGD($\lambda$)}$ starts from a small random initialization, and proceeds by gradient descent with a specific form of damped preconditioning to combat bad curvatures induced by overparameterization and ill-conditioning. At the expense of light computational overhead incurred by preconditioners, $\textsf{ScaledGD($\lambda$)}$ is remarkably robust to ill-conditioning compared to vanilla gradient descent ($\textsf{GD}$) even with overprameterization. Specifically, we show that, under the Gaussian design, $\textsf{ScaledGD($\lambda$)}$ converges to the true low-rank matrix at a constant linear rate after a small number of iterations that scales only logarithmically with respect to the condition number and the problem dimension. This significantly improves over the convergence rate of vanilla $\textsf{GD}$ which suffers from a polynomial dependency on the condition number. Our work provides evidence on the power of preconditioning in accelerating the convergence without hurting generalization in overparameterized learning.  ( 2 min )
    Robust multi-item auction design using statistical learning: Overcoming uncertainty in bidders' types distributions. (arXiv:2302.00941v1 [cs.GT])
    This paper presents a novel mechanism design for multi-item auction settings with uncertain bidders' type distributions. Our proposed approach utilizes nonparametric density estimation to accurately estimate bidders' types from historical bids, and is built upon the Vickrey-Clarke-Groves (VCG) mechanism, ensuring satisfaction of Bayesian incentive compatibility (BIC) and $\delta$-individual rationality (IR). To further enhance the efficiency of our mechanism, we introduce two novel strategies for query reduction: a filtering method that screens potential winners' value regions within the confidence intervals generated by our estimated distribution, and a classification strategy that designates the lower bound of an interval as the estimated type when the length is below a threshold value. Simulation experiments conducted on both small-scale and large-scale data demonstrate that our mechanism consistently outperforms existing methods in terms of revenue maximization and query reduction, particularly in large-scale scenarios. This makes our proposed mechanism a highly desirable and effective option for sellers in the realm of multi-item auctions.  ( 2 min )
    Lower Bounds for Learning in Revealing POMDPs. (arXiv:2302.01333v1 [cs.LG])
    This paper studies the fundamental limits of reinforcement learning (RL) in the challenging \emph{partially observable} setting. While it is well-established that learning in Partially Observable Markov Decision Processes (POMDPs) requires exponentially many samples in the worst case, a surge of recent work shows that polynomial sample complexities are achievable under the \emph{revealing condition} -- A natural condition that requires the observables to reveal some information about the unobserved latent states. However, the fundamental limits for learning in revealing POMDPs are much less understood, with existing lower bounds being rather preliminary and having substantial gaps from the current best upper bounds. We establish strong PAC and regret lower bounds for learning in revealing POMDPs. Our lower bounds scale polynomially in all relevant problem parameters in a multiplicative fashion, and achieve significantly smaller gaps against the current best upper bounds, providing a solid starting point for future studies. In particular, for \emph{multi-step} revealing POMDPs, we show that (1) the latent state-space dependence is at least $\Omega(S^{1.5})$ in the PAC sample complexity, which is notably harder than the $\widetilde{\Theta}(S)$ scaling for fully-observable MDPs; (2) Any polynomial sublinear regret is at least $\Omega(T^{2/3})$, suggesting its fundamental difference from the \emph{single-step} case where $\widetilde{O}(\sqrt{T})$ regret is achievable. Technically, our hard instance construction adapts techniques in \emph{distribution testing}, which is new to the RL literature and may be of independent interest.  ( 2 min )
    Stochastic optimal transport in Banach Spaces for regularized estimation of multivariate quantiles. (arXiv:2302.00982v1 [math.PR])
    We introduce a new stochastic algorithm for solving entropic optimal transport (EOT) between two absolutely continuous probability measures $\mu$ and $\nu$. Our work is motivated by the specific setting of Monge-Kantorovich quantiles where the source measure $\mu$ is either the uniform distribution on the unit hypercube or the spherical uniform distribution. Using the knowledge of the source measure, we propose to parametrize a Kantorovich dual potential by its Fourier coefficients. In this way, each iteration of our stochastic algorithm reduces to two Fourier transforms that enables us to make use of the Fast Fourier Transform (FFT) in order to implement a fast numerical method to solve EOT. We study the almost sure convergence of our stochastic algorithm that takes its values in an infinite-dimensional Banach space. Then, using numerical experiments, we illustrate the performances of our approach on the computation of regularized Monge-Kantorovich quantiles. In particular, we investigate the potential benefits of entropic regularization for the smooth estimation of multivariate quantiles using data sampled from the target measure $\nu$.  ( 2 min )
    Brazilian tailing dam collapse, retrospective precursory monitoring of InSAR data using spectral analysis of time series. (arXiv:2302.00781v1 [stat.ME])
    Slope failures possess destructive power that can cause significant damage to both life and infrastructure. Monitoring slopes prone to instabilities is therefore critical in mitigating the risk posed by their failure. The purpose of slope monitoring is to detect precursory signs of stability issues, such as changes in the rate of displacement with which a slope is deforming. This information can then be used to predict the timing or probability of an imminent failure in order to provide an early warning. In this study, a more objective, statistical-learning algorithm is proposed to detect and characterise the risk of a slope failure, based on spectral analysis of serially correlated displacement time series data. The algorithm is applied to satellite-based interferometric synthetic radar (InSAR) displacement time series data to retrospectively analyse the risk of the 2019 Brumadinho tailings dam collapse in Brazil. Two potential risk milestones are identified and signs of a definitive but emergent risk (27 February 2018 to 26 August 2018) and imminent risk of collapse of the tailings dam (27 June 2018 to 24 December 2018) are detected by the algorithm. Importantly, this precursory indication of risk of failure is detected as early as at least five months prior to the dam collapse on 25 January 2019. The results of this study demonstrate that the combination of spectral methods and second order statistical properties of InSAR displacement time series data can reveal signs of a transition into an unstable deformation regime, and that this algorithm can provide sufficient early warning that could help mitigate catastrophic slope failures.  ( 2 min )
    A Light-weight CNN Model for Efficient Parkinson's Disease Diagnostics. (arXiv:2302.00973v1 [stat.ML])
    In recent years, deep learning methods have achieved great success in various fields due to their strong performance in practical applications. In this paper, we present a light-weight neural network for Parkinson's disease diagnostics, in which a series of hand-drawn data are collected to distinguish Parkinson's disease patients from healthy control subjects. The proposed model consists of a convolution neural network (CNN) cascading to long-short-term memory (LSTM) to adapt the characteristics of collected time-series signals. To make full use of their advantages, a multilayered LSTM model is firstly used to enrich features which are then concatenated with raw data and fed into a shallow one-dimensional (1D) CNN model for efficient classification. Experimental results show that the proposed model achieves a high-quality diagnostic result over multiple evaluation metrics with much fewer parameters and operations, outperforming conventional methods such as support vector machine (SVM), random forest (RF), lightgbm (LGB) and CNN-based methods.  ( 2 min )
    High-precision regressors for particle physics. (arXiv:2302.00753v1 [physics.comp-ph])
    Monte Carlo simulations of physics processes at particle colliders like the Large Hadron Collider at CERN take up a major fraction of the computational budget. For some simulations, a single data point takes seconds, minutes, or even hours to compute from first principles. Since the necessary number of data points per simulation is on the order of $10^9$ - $10^{12}$, machine learning regressors can be used in place of physics simulators to significantly reduce this computational burden. However, this task requires high-precision regressors that can deliver data with relative errors of less than $1\%$ or even $0.1\%$ over the entire domain of the function. In this paper, we develop optimal training strategies and tune various machine learning regressors to satisfy the high-precision requirement. We leverage symmetry arguments from particle physics to optimize the performance of the regressors. Inspired by ResNets, we design a Deep Neural Network with skip connections that outperform fully connected Deep Neural Networks. We find that at lower dimensions, boosted decision trees far outperform neural networks while at higher dimensions neural networks perform significantly better. We show that these regressors can speed up simulations by a factor of $10^3$ - $10^6$ over the first-principles computations currently used in Monte Carlo simulations. Additionally, using symmetry arguments derived from particle physics, we reduce the number of regressors necessary for each simulation by an order of magnitude. Our work can significantly reduce the training and storage burden of Monte Carlo simulations at current and future collider experiments.  ( 2 min )
    Conditional expectation for missing data imputation. (arXiv:2302.00911v1 [stat.ML])
    Missing data is common in datasets retrieved in various areas, such as medicine, sports, and finance. In many cases, to enable proper and reliable analyses of such data, the missing values are often imputed, and it is necessary that the method used has a low root mean square error (RMSE) between the imputed and the true values. In addition, for some critical applications, it is also often a requirement that the logic behind the imputation is explainable, which is especially difficult for complex methods that are for example, based on deep learning. This motivates us to introduce a conditional Distribution based Imputation of Missing Values (DIMV) algorithm. This approach works based on finding the conditional distribution of a feature with missing entries based on the fully observed features. As will be illustrated in the paper, DIMV (i) gives a low RMSE for the imputed values compared to state-of-the-art methods under comparison; (ii) is explainable; (iii) can provide an approximated confidence region for the missing values in a given sample; (iv) works for both small and large scale data; (v) in many scenarios, does not require a huge number of parameters as deep learning approaches and therefore can be used for mobile devices or web browsers; and (vi) is robust to the normally distributed assumption that its theoretical grounds rely on. In addition to DIMV, we also introduce the DPER* algorithm improving the speed of DPER for estimating the mean and covariance matrix from the data, and we confirm the speed-up via experiments.  ( 2 min )
    Pathologies of Predictive Diversity in Deep Ensembles. (arXiv:2302.00704v1 [cs.LG])
    Classical results establish that ensembles of small models benefit when predictive diversity is encouraged, through bagging, boosting, and similar. Here we demonstrate that this intuition does not carry over to ensembles of deep neural networks used for classification, and in fact the opposite can be true. Unlike regression models or small (unconfident) classifiers, predictions from large (confident) neural networks concentrate in vertices of the probability simplex. Thus, decorrelating these points necessarily moves the ensemble prediction away from vertices, harming confidence and moving points across decision boundaries. Through large scale experiments, we demonstrate that diversity-encouraging regularizers hurt the performance of high-capacity deep ensembles used for classification. Even more surprisingly, discouraging predictive diversity can be beneficial. Together this work strongly suggests that the best strategy for deep ensembles is utilizing more accurate, but likely less diverse, component models.  ( 2 min )
    Versatile Energy-Based Models for High Energy Physics. (arXiv:2302.00695v1 [cs.LG])
    Energy-based models have the natural advantage of flexibility in the form of the energy function. Recently, energy-based models have achieved great success in modeling high-dimensional data in computer vision and natural language processing. In accordance with these signs of progress, we build a versatile energy-based model for High Energy Physics events at the Large Hadron Collider. This framework builds on a powerful generative model and describes higher-order inter-particle interactions. It suits different encoding architectures and builds on implicit generation. As for applicational aspects, it can serve as a powerful parameterized event generator, a generic anomalous signal detector, and an augmented event classifier.  ( 2 min )

  • Open

    Is the purpose of gamma in Q-learning just to help the q-values converge?
    This might be a bit of a dumb question; so I understand the concept of the discount factor with calculating the sum of expected rewards: the closer gamma is to 1.0, the more emphasis the agent places on future rewards as opposed to its current reward. Generally though this seems to happen with when a return is calculated for an entire episode. In Q-learning different state action pairs are getting updates to their q-values at each step, so there's no sort of monte-carlo return that's being calculated back through an entire episode. Therefore is the purpose of gamma here just to make sure the Q-values converge in an infinite horizon case? submitted by /u/1cedrake [link] [comments]  ( 42 min )
    Why is PPO classified as a policy-based method?
    Hello, I'm fairly new to RL and I'm trying to understand the concepts. I saw the spinningup's classification of RL algorithms and noticed that PPO is classified as a Policy based method. However, I read that PPO has both actor and critic networks and [I] considered it to be a hybrid method. I was wondering that if PPO trains both policy and value networks, why is it considered to be a value-based method? what is the difference between SAC and PPO that SAC is hybrid and PPO is not? Thanks in advance ​ https://preview.redd.it/r05j39rx69ga1.png?width=987&format=png&auto=webp&s=e230bf62944d9d2572e5fe6fbe204ef5eba250b0 submitted by /u/ahmadreza_hadi [link] [comments]  ( 42 min )
    Question on Q-Learning paper
    I've been reading this paper (Financial Trading as a Game: A Deep Reinforcement Learning Approach) and have been wondering about something they try in there. I'm still quite new to Q Learning so maybe I've yet to fully understand things... However, they propose a scheme where at each time step they can calculate the reward for the step that was taken and also the rewards for the other possible actions at that step. Intuitively it makes a lot of sense to me - we can learn more from each step without having to do more random exploration. But I immediately thought that there are probably quite a few areas of RL where we could benefit from the same thing (outside of financial trading). So my question is, given how intuitively smart this approach seems, why isn't it more broadly adopted already? What did I miss? submitted by /u/jarym [link] [comments]  ( 43 min )
    "Autonomous navigation of stratospheric balloons using reinforcement learning", Bellemare et al 2020 [Repost]
    submitted by /u/goolulusaurs [link] [comments]  ( 41 min )
    Why does this PPO implementation calculate the Advantage only once per rollout?
    I am looking at this PPO implementation, which follows the pseudocode given in Spinning Up. This implementation has been really easy to follow and I understand almost everything. However, I am lost in line 103, where the author computes the normalized advantage before the rollout - A_k = (A_k - A_k.mean()) / (A_k.std() + 1e-10) Moreover, within the rollout loop, the author goes ahead to recalculate the value, but uses the original advantage while computing the surrogate losses - for _ in range(self.n_updates_per_iteration): # ALG STEP 6 & 7 # Calculate V_phi and pi_theta(a_t | s_t) V, curr_log_probs = self.evaluate(batch_obs, batch_acts) # Calculate surrogate losses. surr1 = ratios * A_k surr2 = torch.clamp(ratios, 1 - self.clip, 1 + self.clip) * A_k The author also wrote a medium article about this implementation and wrote the following - ​ https://preview.redd.it/195482c373ga1.png?width=845&format=png&auto=webp&s=0624c10056311ce9d31b98fd4563a8f7acff39f8 ​ But in the rollouts, the author updates V(value) without updating A (the advantage). submitted by /u/Academic-Rent7800 [link] [comments]  ( 44 min )
    Minimax with neural network evaluation function
    Is this a thing? To combine game tree search like minimax (or alpha-beta pruning) with neural networks that model the value function of a state? I think Alpha Go did something similar but with Monre Carlo Search Trees and it also had a policy network. How would I go on about training said neural network? I am thinking, first as a supervised task where the target values are heuristic evaluation functions and then finw tuning with some kind of RL but I don't know what. submitted by /u/SupremeChampionOfDi [link] [comments]  ( 44 min )
  • Open

    The Chinese room argument holds that a digital computer executing a program cannot have a "mind", "understanding", or "consciousness", regardless of how intelligently or human-like the program may make the computer behave
    submitted by /u/insaneintheblain [link] [comments]  ( 41 min )
    What do you think are the hard limitations of AI?
    I saw recently a lot of roadblocks that we thought AI will struggle with (Like making art) have easier crossed even with the narrow AI (ML) we have. I feel a lot of limits we thought ai might have like thinking outside the box, understanding concepts, self-awareness, or lacking a 'soul' are all kinda subjective that can be overcome with the invention of AGI and ASI in the coming decades. Then they will grow behind human comprehension. So are there actual hard limitations (if any) Ais will encounter that are actually very hard or maybe never able to overcome? submitted by /u/uswhole [link] [comments]  ( 41 min )
    AI Dream 126 - AI MindStorm (2/6) - EPIC journey
    submitted by /u/LordPewPew777 [link] [comments]  ( 40 min )
    Can Apple, Amazon or Google create a same chatbot as ChatGPT
    GhatGPT is an amazing tool no doubt even in the prototype version and they are continuously improving at an incredible rate. But the thing is, all these three giants (Apple, Amazon and Google) have their incredible voice assistances. They do similar job of finding a different requested results on web etc. Apple is already adding new health support features. Google maps might make it more helpful for searches related to geographical regions. And again they probably have more user data from millions of users. What make it so different from them and how difficult it would be for them to create their own version as they have a lot more resources and data. submitted by /u/JaurasiD [link] [comments]  ( 41 min )
    Found another directory for AI tools
    submitted by /u/simplir [link] [comments]  ( 40 min )
    Midjourney AI has a new "/blend" feature! pretty cool
    submitted by /u/arnolds112 [link] [comments]  ( 40 min )
    My best animation so far, everything I've learned is in this one!
    submitted by /u/LincolnOsiris_ [link] [comments]  ( 40 min )
    Amazon is Adding 1000 Robots a Day to Its Workforce
    submitted by /u/Flaky_Preparation_50 [link] [comments]  ( 40 min )
    CMU Researchers Introduce FROMAGe: An AI Model That Efficiently Bootstraps Frozen Large Language Models (LLMs) To Generate Free-Form Text Interleaved With Images
    submitted by /u/ai-lover [link] [comments]  ( 41 min )
    How to reproduce any human voice
    submitted by /u/visimens-technology [link] [comments]  ( 40 min )
    Easy guide for DreamBooth training and prompts quick on your mobile device with iSee app
    submitted by /u/Wonderful_Neat_1549 [link] [comments]  ( 44 min )
    AskReddit: Looking for an open-source text2music or text2audio model for a web-app project. Early stage in AI discovery, any help much appreciated!
    I'm working on an AI project and looking for an open-source text2music or text2audio model that I can incorporate into a website to experiment with. I'm aware of models like MusicLM and VALL-E but those haven't been released as APIs yet. I've also come across AudioGen and Mousai but same issues there. Does anyone happen to have a suggestion for a text2music model that is OSS and fairly accessible to incorporate into a web app? Looking for the best OSS model out there, but also open to the best text2audio that anyone would recommend - just looking to get text2audio working. Any suggestions from prior experience would be hugely helpful. Thanks very much in advance! submitted by /u/dmalikmusic [link] [comments]  ( 41 min )
    3D aware image synthesis with a spherical background — BALLGAN
    submitted by /u/t0ns0fph0t0ns [link] [comments]  ( 42 min )
    ChatGPT’s Explosive Popularity Makes It the Fastest-Growing App in Human History
    submitted by /u/Tao_Dragon [link] [comments]  ( 41 min )
    My course on creating a ChatGPT Chrome Extension for GMail, would love your feedback!
    https://www.udemy.com/course/chatgpt-bot/?couponCode=5-DAYS-FREE Hey everyone, I recently made a course about ChatGPT as a fun passion project. This is for anyone who wants to learn how to create automated workflows (using Chrome extensions) with ChatGPT. Specifically, you will create a ChatGPT bot that automatically answers your emails. It is beginner friendly and includes getting some good practice with JavaScript. I hope you enjoy it and I'm looking forward to your feedback/questions :) submitted by /u/neuromodel [link] [comments]  ( 41 min )
    BoyWithUke AI Animation
    submitted by /u/oridnary_artist [link] [comments]  ( 40 min )
    With apps like Lensa AI leading many to doubt AI's creativity, this TED talk is more relevant than ever
    https://www.youtube.com/watch?v=8TOgN-U0ask&t=1s After the Lensa AI controversy led many people to question whether AI really is creative or is it just "remixing" other artists' copyrighted work used with permission, it has led many to wonder whether AI trained on copyrighted images should be illegal. This talk makes some interesting comparisons which might just mean the answer is no. submitted by /u/BearNo21 [link] [comments]  ( 41 min )
    OpenAI Is Reportedly Launching A ChatGPT App For Android And iOS. Here’s What We Know So Far.
    submitted by /u/liquidocelotYT [link] [comments]  ( 40 min )
    1000+ AI tools catalog - any feedback?
    I'm creating https://domore.ai/ - a catalog of 1000+ AI tools. The goal is to provide individuals and organizations with the latest information on AI tools. I'd love to hear any feedback you have for me, so feel free to share your thoughts :) submitted by /u/bart_so [link] [comments]  ( 41 min )
    Rasa entity detection
    I've been trying rasa for some time and got into a problem, the ai detects the intents perfectly but it doesn't pick up the entities. anyone can help me with it? submitted by /u/skychi_ [link] [comments]  ( 40 min )
    OpenAI to Launch ChatGPT Mobile App
    submitted by /u/Mental_Character7367 [link] [comments]  ( 40 min )
    Researchers at Stanford Introduce Parsel: An Artificial Intelligence AI Framework That Enables Automatic Implementation And Validation of Complex Algorithms With Code Large Language Models LLMs
    submitted by /u/madskills42001 [link] [comments]  ( 41 min )
  • Open

    Validation Accuracy Fluctuation
    Hello everyone, I am trying to implement DenseNet from scratch with some improvement for my project. When I fit my model, I am seeing that my validation accuracy fluctuating but my test accuracy is almost 96%. Is this fluctuation sign of the overfitting or how can I comment about this? I would be appreciate for any help. Thank you in advance. ​ ​ https://preview.redd.it/jtybvrt3z8ga1.png?width=778&format=png&auto=webp&s=75ec1a2ab9e2b60854bc92f361e203d825e9ca62 submitted by /u/Hungry-Engineer-5696 [link] [comments]  ( 41 min )
    BoyWithUke AI Animation
    submitted by /u/oridnary_artist [link] [comments]  ( 40 min )
    AI Learns the Numbers
    submitted by /u/keghn [link] [comments]  ( 40 min )
  • Open

    [D] Mixing metadata and text in embedding for KNN search?
    Say I wanted to do a KNN similarity search, using a text embedding from a block of text from a PDF. But I also want to find these documents written at a similar time, and with similar title and author. Would it make sense to prepend a written form of this metadata ahead of the document text, and sending that to the embedding? Like “this document titled ABC was written by XYZ on January 1, 2020”. Or would it be better to create a separate embedding for the metadata, and merge the embeddings afterwards? submitted by /u/DeadPukka [link] [comments]  ( 43 min )
    [Project] ideas NLP
    Looking for ideas to start an NLP project, I'd like to explore something not too mainstream or novel to some extent, any ideas or datasets I should check out? submitted by /u/mems_m [link] [comments]  ( 42 min )
    [N] [R] Google announces Dreamix: a model that generates videos when given a prompt and an input image/video.
    submitted by /u/radi-cho [link] [comments]  ( 44 min )
    [P] NLP Q&A Bot Project Guidance
    I have performed below steps and require guidance to proceed further I have extracted and preprocessed the text from PDFs. Performed NER on the extracted text and created a data frame of entities. Created a function to preprocess the query and identified the entities in the question. Now I need guidance or any reference to perform the below steps. Match the entities from the question with the entities in the PDF text and retrieve the paragraph ? Calculate the similarity score for each paragraph and display the relevant paragraph Generate answer from the identified paragraph ? Please also guide me if the approach followed is correct or not ? submitted by /u/sasi_0212 [link] [comments]  ( 43 min )
    [D] Could you use SVD for supervised learning?
    It seems like Singular Value Decomposition is only used for unsupervised learning when trying to reduce the number of features in a high dimensional dataset, but I was wondering why I don't see any articles or literature on using SVD for supervised learning. I know that using a regularization function like Lasso (L1) can get rid of irrelevant features, but I don't see why SVD wouldn't be helpful too. submitted by /u/TemperatureOk6810 [link] [comments]  ( 43 min )
    [R] Coinductive guide to inductive transformer heads
    submitted by /u/adamnemecek [link] [comments]  ( 42 min )
    [R] Grounding Language Models to Images for Multimodal Generation
    submitted by /u/MysteryInc152 [link] [comments]  ( 42 min )
    [R] 3D aware image synthesis with a spherical background — BALLGAN
    submitted by /u/t0ns0fph0t0ns [link] [comments]  ( 43 min )
    please help a bunch of students?(with pre annotated data set) we were assigned to this task with no prior knowledge of ML i don't know where to begin with we tried a couple of method which ultimately failed id be thankful for anyone who would tell me in steps what to do with this data[D]
    submitted by /u/errorr_unknown [link] [comments]  ( 42 min )
    [P] What tools are available for labelling data for LayoutLMv3?
    I have been working on information extraction from documents, but what I got to know is there are not enough free tools available for labelling data for these kind of tasks. Are there any free tools available for labelling data for LayoutLM models? submitted by /u/TensorDudee [link] [comments]  ( 42 min )
    15 years old and bad at math [D]
    Hi, I've done some programming type things before and I'm interested in learning about ML, to be able to make some basic projects with ML, how good does my math need to be. I get As at school in math but I know that what I'm learning now is pretty basic. I'm just wondering whether I should try learn about ML or wait a few more years for my math skills to improve submitted by /u/Daniel_C_____ [link] [comments]  ( 45 min )
    [N] GitHub CEO on why open source developers should be exempt from the EU’s AI Act
    submitted by /u/EmbarrassedHelp [link] [comments]  ( 43 min )
    [R] Bilingual (or Multilingual) Large Language models are the key to human parity on machine translations even for difficult language pairs and domains (e.g literature). An English-Chinese comparison.
    submitted by /u/MysteryInc152 [link] [comments]  ( 42 min )
    [R] Chinchilla data-optimal scaling laws: In plain English
    submitted by /u/adt [link] [comments]  ( 42 min )
    [D] Purchasing Google Colab Pro
    Hi everyone, I'm currently knees-deep in a ML project with a friend (~4 months of development) and my free compute units on Colab finally ran out. After searching for alternatives, and finding none that work as smoothly as Colab, we've considered to buy a Pro subscription. My question is: How can I share the compute units I'll get from Colab Pro with said friend? Don't want to make the purchase and later realize that I'm the only person with access to those compute units. submitted by /u/RaphDaPingu [link] [comments]  ( 43 min )
    Information Retrieval book recommendations? [D]
    Maybe not a Machine Learning question, but I'm searching for good books about information retrieval. The two primary ones I can find are: - Introduction to Information Retrieval (2008) - Information Retrieval - Implementing and Evaluating Search Engines (2016) ​ They seem a bit old for 2023, but they may still be useful? Do you have any good book recommendations? submitted by /u/Ggronne [link] [comments]  ( 44 min )
    [R] What’s your suggestion for offline RL?
    Hi guys! I read a lot of offline RL papers in last Fall semester and choose it as my course project. Offline RL seems to be a very hot topic in recent years, I believe that the major challenge for offline RL are (i) distribution shift and (ii) overestimation. The second challenge is caused by (i), because the learners/agents will never allow to interact with the true environment and they will too optimistic for unseen state-actions. Hence, there are many papers to address such challenges, e.g., CQL and MOPO. However can these methods handle misleading datasets? Consider the following example. Suppose we have only one state (MAB) and two arms. The reward of the first arm will return 2/3 with probability 1 and the reward model of second arm is Bernoulli distribution with p=1/2. Clearly, choosing the first arm is the best choice. Now, for the dataset, unfortunately, all samples on the second arm received reward 1. Because the agent only can access this misleading dataset, if we use Bayesian methods, then the posterior will give a high score for the second arm. If we use Lower Confidence Bound, we need to count the occurrence of each arm. Then, this is very hard to extend this method to MDPs with arbitrary large state and action space. So, does anyone know a function can capture this uncertainty (caused by the dataset) or can any methods to tell the learner that you’re in a very misleading situation? submitted by /u/AndyMeowMeow [link] [comments]  ( 44 min )

  • Open

    [N] FT: Google invests $300mn in artificial intelligence start-up Anthropic
    From the Financial Times: https://www.ft.com/content/583ead66-467c-4bd5-84d0-ed5df7b5bf9c Unpaywalled: https://archive.is/ciZPV I guess I'm a little surprised, this feels like Google backing a competitor to 1) their own Google Brain teams, and 2) Deepmind. The cynical take might be that they're trying to lock in Anthropic; the same way Microsoft locked in OpenAI. submitted by /u/bikeskata [link] [comments]  ( 47 min )
    [R] Topologically evolving new self-modifying multi-task learning algorithms
    I’ve been developing this idea since I first thought of it in mid December last year. Here’s the elevator pitch (skip to how for technical details): Why? Existing models and learning algorithms are extremely static and unable to generalize across tasks as well as humans or to adapt well to new / changing business requirements. This even applies to the final solutions in recent AutoML (see An Empirical Review of Automated Machine Learning, AutoML: A survey of the state-of-the-art). Beyond being static, most suffer from a need for high-performance systems with large amounts of compute and/or memory. This static and bloated nature not only limits the reusability of code, pipelines and all the computations that went into previous versions of a model architecture upon finding a better one. It…  ( 47 min )
    [R] Multimodal Chain-of-Thought Reasoning in Language Models - Amazon Web Services Zhuosheng Zhang et al - Outperforms GPT-3.5 by 16% (75%->91%) and surpasses human performance on ScienceQA while having less than 1B params!
    Paper: https://arxiv.org/abs/2302.00923 Github: https://github.com/amazon-science/mm-cot Twitter: https://paperswithcode.com/top-social Abstract: Large language models (LLMs) have shown impressive performance on complex reasoning by leveraging chain-of-thought (CoT) prompting to generate intermediate reasoning chains as the rationale to infer the answer. However, existing CoT studies are mostly isolated in the language modality with LLMs, where LLMs are hard to deploy. To elicit CoT reasoning in multimodality, a possible solution is to fine-tune small language models by fusing the vision and language features to perform CoT reasoning. The key challenge is that those language models tend to generate hallucinated reasoning chains that mislead the answer inference. To mitigate the effect of such mistakes, we propose Multimodal-CoT that incorporates vision features in a decoupled training framework. The framework separates the rationale generation and answer inference into two stages. By incorporating the vision features in both stages, the model is able to generate effective rationales that contribute to answer inference. With Multimodal-CoT, our model under 1 billion parameters outperforms the previous state-of-the-art LLM (GPT-3.5) by 16% (75.17%->91.68%) on the ScienceQA benchmark and even surpasses human performance. https://preview.redd.it/g9eo0f94k1ga1.jpg?width=1331&format=pjpg&auto=webp&s=9b5fc84b424aff7160b69ff7c7a5fad071cbb7d2 https://preview.redd.it/fgboci94k1ga1.jpg?width=1323&format=pjpg&auto=webp&s=35215544d9e0a74881c42503d04b62ab09081af1 https://preview.redd.it/2ojfym94k1ga1.jpg?width=1660&format=pjpg&auto=webp&s=cf040c4f422f6c323e8c4d75474a5881f45a41d1 https://preview.redd.it/k7huem94k1ga1.jpg?width=1326&format=pjpg&auto=webp&s=f4326a5088744d3856e5c5c23311be6348fab924 https://preview.redd.it/05m8rf94k1ga1.jpg?width=658&format=pjpg&auto=webp&s=ac4110e57a49fcea6f8c03571edd391ff71bd13d submitted by /u/Singularian2501 [link] [comments]  ( 47 min )
    [P] I trained an AI model on 120M+ songs from iTunes
    Hey ML Reddit! I just shipped a project I’ve been working on called Maroofy: https://maroofy.com You can search for any song, and it’ll use the song’s audio to find other similar-sounding music. Demo: https://twitter.com/subby_tech/status/1621293770779287554 How does it work? I’ve indexed ~120M+ songs from the iTunes catalog with a custom AI audio model that I built for understanding music. My model analyzes raw music audio as input and produces embedding vectors as output. I then store the embedding vectors for all songs into a vector database, and use semantic search to find similar music! Here are some examples you can try: Fetish (Selena Gomez feat. Gucci Mane) — https://maroofy.com/songs/1563859943 The Medallion Calls (Pirates of the Caribbean) — https://maroofy.com/songs/1440649752 Hope you like it! This is an early work in progress, so would love to hear any questions/feedback/comments! :D submitted by /u/BullyMaguireJr [link] [comments]  ( 52 min )
    [N] Google Open Sources Vizier, Hyperparameter + Blackbox Optimization Service at Scale
    Github: https://github.com/google/vizier Google AI Blog: https://ai.googleblog.com/2023/02/open-source-vizier-towards-reliable-and.html Tweet from Zoubin Ghahramani: https://twitter.com/ZoubinGhahrama1/status/1621321675936768000?s=20&t=ZEuz9oSc_GWYxixtXDskqA submitted by /u/enderlayer [link] [comments]  ( 43 min )
    [D] Understanding Vision Transformer (ViT) - What are the prerequisites?
    Hello everyone, I'm interested in diving into the field of computer vision and I recently came across the concept of Vision Transformer (ViT). I want to understand this concept in depth but I'm not sure what prerequisites I need to have in order to grasp the concept fully. Do I need to have a strong background in Recurrent Neural Networks (RNNs) and Transformer (Attention Is All You Need) to understand ViT, or can I get by just knowing the basics of deep learning and Convolutional Neural Networks (CNNs)? I would really appreciate if someone could shed some light on this and provide some guidance. Thank you in advance! submitted by /u/SAbdusSamad [link] [comments]  ( 7 min )
  • Open

    Created an AI research assistant where you can ask questions about any file (i.e. technical paper, report, etc) in English and automatically get the answer. It's like ChatGPT for your files.
    submitted by /u/HamletsLastLine [link] [comments]  ( 46 min )
    Ilya Sutskever says 40 papers explain 90% of modern AI
    In this article (https://dallasinnovates.com/exclusive-qa-john-carmacks-different-path-to-artificial-general-intelligence/) there is a quote from John Carmack that read: "I asked Ilya Sutskever, OpenAI’s chief scientist, for a reading list. He gave me a list of like 40 research papers and said, ‘If you really learn all of these, you’ll know 90% of what matters today. " My question is, what are these 40 papers? submitted by /u/Gryphx [link] [comments]  ( 42 min )
    Chat with your favorite characters from movies, TV shows, books, history, and more.
    ​ sample chat with my annoyed neighbor I built ChatFAI about a month ago. It's a simple web app that allows you to interact with your favorite characters from movies, TV shows, books, history, and beyond. People are having fun talking to whomever they want to talk to. There is a public characters library and you can also create custom characters based on anyone (or even your imagination). I have been actively improving it and have made it much better recently. So, I wanted to share it here to get feedback. The reason for sharing it here is I want feedback from you all. Let me know if there is anything else I should add or change. Here it is: https://chatfai.com submitted by /u/usamaejazch [link] [comments]  ( 42 min )
  • Open

    Augmented Lagrangian method for constrained MDP or constrained RL?
    Is there any work on applying Augmented Lagrangian method to constrained MDP problems that guarantee the constraint satisfaction as iterations goes? I tried to find but haven't got much result yet. Thanks for sharing any hints! submitted by /u/Sad-Dragonfruit-274 [link] [comments]  ( 41 min )
    reward function
    Hi, my agent is not working well, I feel like my reward function is not efficient. I'm trying to solve a control problem using the reinforcemnt learning, so my reward function was made with the state cost function. for example, reward = previous cost - current cost. so if the agent gets closer to the destination or makes any better control, it receives a positive reward. Otherwise, it will get a negative reward. but I don't think this is efficient.. Can anyone give me advice? Thanks all ​ https://preview.redd.it/rzf78ykzs0ga1.png?width=559&format=png&auto=webp&s=3f4cadb96f16eec59d221ab2b7f50f6839cf8ab6 submitted by /u/sonlightinn [link] [comments]  ( 42 min )
    Does anyone know of any model-based algorithms that deal with imperfect information and stochasticity and don't require a simulator?
    submitted by /u/atomicburn125 [link] [comments]  ( 42 min )
    Why does Advantage Learning help function approximators?
    Can someone please help with this question - https://ai.stackexchange.com/questions/39029/why-does-advantage-learning-help-function-approximators submitted by /u/Academic-Rent7800 [link] [comments]  ( 43 min )
  • Open

    Real-time tracking of wildfire boundaries using satellite imagery
    Posted by Zvika Ben-Haim and Omer Nevo, Software Engineers, Google Research As global temperatures rise, wildfires around the world are becoming more frequent and more dangerous. Their effects are felt by many communities as people evacuate their homes or suffer harm even from proximity to the fire and smoke. As part of Google’s mission to help people access trusted information in critical moments, we use satellite imagery and machine learning (ML) to track wildfires and inform affected communities. Our wildfire tracker was recently expanded. It provides updated fire boundary information every 10–15 minutes, is more accurate than similar satellite products, and improves on our previous work. These boundaries are shown for large fires in the continental US, Mexico, and most of Cana…  ( 92 min )
  • Open

    IoT Project: Why Is .NET The Best Choice?
    As the Internet of Things (IoT) continues to gain more traction at a rapid pace, there is growing demand and need for the development of apps driven by this technology. However, this leaves businesses with a challenging question: which development tool to use for creating such apps? The simple answer is .NET. It is a… Read More »IoT Project: Why Is .NET The Best Choice? The post IoT Project: Why Is .NET The Best Choice? appeared first on Data Science Central.  ( 19 min )
  • Open

    Private Online Prediction from Experts: Separations and Faster Rates. (arXiv:2210.13537v2 [cs.LG] UPDATED)
    Online prediction from experts is a fundamental problem in machine learning and several works have studied this problem under privacy constraints. We propose and analyze new algorithms for this problem that improve over the regret bounds of the best existing algorithms for non-adaptive adversaries. For approximate differential privacy, our algorithms achieve regret bounds of $\tilde{O}(\sqrt{T \log d} + \log d/\varepsilon)$ for the stochastic setting and $\tilde O(\sqrt{T \log d} + T^{1/3} \log d/\varepsilon)$ for oblivious adversaries (where $d$ is the number of experts). For pure DP, our algorithms are the first to obtain sub-linear regret for oblivious adversaries in the high-dimensional regime $d \ge T$. Moreover, we prove new lower bounds for adaptive adversaries. Our results imply that unlike the non-private setting, there is a strong separation between the optimal regret for adaptive and non-adaptive adversaries for this problem. Our lower bounds also show a separation between pure and approximate differential privacy for adaptive adversaries where the latter is necessary to achieve the non-private $O(\sqrt{T})$ regret.  ( 2 min )
    Additive Higher-Order Factorization Machines. (arXiv:2205.14515v2 [stat.CO] UPDATED)
    In the age of big data and interpretable machine learning, approaches need to work at scale and at the same time allow for a clear mathematical understanding of the method's inner workings. While there exist inherently interpretable semi-parametric regression techniques for large-scale applications to account for non-linearity in the data, their model complexity is still often restricted. One of the main limitations are missing interactions in these models, which are not included for the sake of better interpretability, but also due to untenable computational costs. To address this shortcoming, we derive a scalable high-order tensor product spline model using a factorization approach. Our method allows to include all (higher-order) interactions of non-linear feature effects while having computational costs proportional to a model without interactions. We prove both theoretically and empirically that our methods scales notably better than existing approaches, derive meaningful penalization schemes and also discuss further theoretical aspects. We finally investigate predictive and estimation performance both with synthetic and real data.  ( 2 min )
    Real Estate Property Valuation using Self-Supervised Vision Transformers. (arXiv:2302.00117v1 [cs.CV])
    The use of Artificial Intelligence (AI) in the real estate market has been growing in recent years. In this paper, we propose a new method for property valuation that utilizes self-supervised vision transformers, a recent breakthrough in computer vision and deep learning. Our proposed algorithm uses a combination of machine learning, computer vision and hedonic pricing models trained on real estate data to estimate the value of a given property. We collected and pre-processed a data set of real estate properties in the city of Boulder, Colorado and used it to train, validate and test our algorithm. Our data set consisted of qualitative images (including house interiors, exteriors, and street views) as well as quantitative features such as the number of bedrooms, bathrooms, square footage, lot square footage, property age, crime rates, and proximity to amenities. We evaluated the performance of our model using metrics such as Root Mean Squared Error (RMSE). Our findings indicate that these techniques are able to accurately predict the value of properties, with a low RMSE. The proposed algorithm outperforms traditional appraisal methods that do not leverage property images and has the potential to be used in real-world applications.  ( 2 min )
    Examining Policy Entropy of Reinforcement Learning Agents for Personalization Tasks. (arXiv:2211.11869v2 [cs.LG] UPDATED)
    This effort is focused on examining the behavior of reinforcement learning systems in personalization environments and detailing the differences in policy entropy associated with the type of learning algorithm utilized. We demonstrate that Policy Optimization agents often possess low-entropy policies during training, which in practice results in agents prioritizing certain actions and avoiding others. Conversely, we also show that Q-Learning agents are far less susceptible to such behavior and generally maintain high-entropy policies throughout training, which is often preferable in real-world applications. We provide a wide range of numerical experiments as well as theoretical justification to show that these differences in entropy are due to the type of learning being employed.  ( 2 min )
    Expanding the Deployment Envelope of Behavior Prediction via Adaptive Meta-Learning. (arXiv:2209.11820v3 [cs.LG] UPDATED)
    Learning-based behavior prediction methods are increasingly being deployed in real-world autonomous systems, e.g., in fleets of self-driving vehicles, which are beginning to commercially operate in major cities across the world. Despite their advancements, however, the vast majority of prediction systems are specialized to a set of well-explored geographic regions or operational design domains, complicating deployment to additional cities, countries, or continents. Towards this end, we present a novel method for efficiently adapting behavior prediction models to new environments. Our approach leverages recent advances in meta-learning, specifically Bayesian regression, to augment existing behavior prediction models with an adaptive layer that enables efficient domain transfer via offline fine-tuning, online adaptation, or both. Experiments across multiple real-world datasets demonstrate that our method can efficiently adapt to a variety of unseen environments.  ( 2 min )
    Sliced Optimal Partial Transport. (arXiv:2212.08049v4 [cs.LG] UPDATED)
    Optimal transport (OT) has become exceedingly popular in machine learning, data science, and computer vision. The core assumption in the OT problem is the equal total amount of mass in source and target measures, which limits its application. Optimal Partial Transport (OPT) is a recently proposed solution to this limitation. Similar to the OT problem, the computation of OPT relies on solving a linear programming problem (often in high dimensions), which can become computationally prohibitive. In this paper, we propose an efficient algorithm for calculating the OPT problem between two non-negative measures in one dimension. Next, following the idea of sliced OT distances, we utilize slicing to define the sliced OPT distance. Finally, we demonstrate the computational and accuracy benefits of the sliced OPT-based method in various numerical experiments. In particular, we show an application of our proposed Sliced-OPT in noisy point cloud registration.  ( 2 min )
    PRUDEX-Compass: Towards Systematic Evaluation of Reinforcement Learning in Financial Markets. (arXiv:2302.00586v1 [q-fin.TR])
    The financial markets, which involve more than $90 trillion in market capitalization, attract the attention of innumerable investors around the world. Recently, reinforcement learning in financial markets (FinRL) emerges as a promising direction to train agents for making profitable investment decisions. However, the evaluation of most FinRL methods only focus on profit-related measures, which are far from satisfactory for practitioners to deploy these methods into real-world financial markets. Therefore, we introduce PRUDEX-Compass, which has 6 axes, i.e., Profitability, Risk-control, Universality, Diversity, rEliability, and eXplainability, with a total of 17 measures for a systematic evaluation. Specifically, i) we propose AlphaMix+ as a strong FinRL baseline, which leverages Mixture-of-Experts (MoE) and risk-10 sensitive approaches to make diversified risk-aware investment decisions, ii) we11 evaluate 8 widely used FinRL methods in 4 long-term real-world datasets of influential financial markets to demonstrate the usage of our PRUDEX-Compass, iii) PRUDEX-Compass1 together with 4 real-world datasets, standard implementation of 8 FinRL methods and a portfolio management RL environment is released as public resources to facilitate the design and comparison of new FinRL methods. We hope that PRUDEX-Compass can shed light on future FinRL research to prevent untrustworthy results from stagnating FinRL into successful industry deployment.  ( 2 min )
    Statistical Inference After Adaptive Sampling for Longitudinal Data. (arXiv:2202.07098v2 [cs.LG] UPDATED)
    Online reinforcement learning and other adaptive sampling algorithms are increasingly used in digital intervention experiments to optimize treatment delivery for users over time. In this work, we focus on longitudinal user data collected by a large class of adaptive sampling algorithms that are designed to optimize treatment decisions online using accruing data from multiple users. Combining or "pooling" data across users allows adaptive sampling algorithms to potentially learn faster. However, by pooling, these algorithms induce dependence between the collected user data trajectories; we show that this can cause standard variance estimators for i.i.d. data to underestimate the true variance of common estimators on this data type. We develop novel methods to perform a variety of statistical analyses on such adaptively collected data via Z-estimation. Specifically, we introduce the adaptive sandwich variance estimator, a corrected sandwich estimator that leads to consistent variance estimates under adaptive sampling. Additionally, to prove our results we develop significant theory for empirical processes on non-i.i.d., adaptively collected, longitudinal data. This work is motivated by our efforts in designing experiments in which online reinforcement learning algorithms pool data across users to learn to optimize treatment decisions, yet reliable statistical inference is essential for conducting a variety of statistical analyses after the experiment is over.  ( 2 min )
    Thermal Heating in ReRAM Crossbar Arrays: Challenges and Solutions. (arXiv:2212.13707v2 [cs.AR] UPDATED)
    The higher speed, scalability and parallelism offered by ReRAM crossbar arrays foster development of ReRAM-based next generation AI accelerators. At the same time, sensitivity of ReRAM to temperature variations decreases R_on/Roff ratio and negatively affects the achieved accuracy and reliability of the hardware. Various works on temperature-aware optimization and remapping in ReRAM crossbar arrays reported up to 58\% improvement in accuracy and 2.39$\times$ ReRAM lifetime enhancement. This paper classifies the challenges caused by thermal heat, starting from constraints in ReRAM cells' dimensions and characteristics to their placement in the architecture. In addition, it reviews available solutions designed to mitigate the impact of these challenges, including emerging temperature-resilient DNN training methods. Our work also provides a summary of the techniques and their advantages and limitations.  ( 2 min )
    Differentially-Private Hierarchical Clustering with Provable Approximation Guarantees. (arXiv:2302.00037v1 [cs.LG])
    Hierarchical Clustering is a popular unsupervised machine learning method with decades of history and numerous applications. We initiate the study of differentially private approximation algorithms for hierarchical clustering under the rigorous framework introduced by (Dasgupta, 2016). We show strong lower bounds for the problem: that any $\epsilon$-DP algorithm must exhibit $O(|V|^2/ \epsilon)$-additive error for an input dataset $V$. Then, we exhibit a polynomial-time approximation algorithm with $O(|V|^{2.5}/ \epsilon)$-additive error, and an exponential-time algorithm that meets the lower bound. To overcome the lower bound, we focus on the stochastic block model, a popular model of graphs, and, with a separation assumption on the blocks, propose a private $1+o(1)$ approximation algorithm which also recovers the blocks exactly. Finally, we perform an empirical study of our algorithms and validate their performance.  ( 2 min )
    ImpressLearn: Continual Learning via Combined Task Impressions. (arXiv:2210.01987v2 [cs.CV] UPDATED)
    This work proposes a new method to sequentially train deep neural networks on multiple tasks without suffering catastrophic forgetting, while endowing it with the capability to quickly adapt to unseen tasks. Starting from existing work on network masking (Wortsman et al., 2020), we show that simply learning a linear combination of a small number of task-specific supermasks (impressions) on a randomly initialized backbone network is sufficient to both retain accuracy on previously learned tasks, as well as achieve high accuracy on unseen tasks. In contrast to previous methods, we do not require to generate dedicated masks or contexts for each new task, instead leveraging transfer learning to keep per-task parameter overhead small. Our work illustrates the power of linearly combining individual impressions, each of which fares poorly in isolation, to achieve performance comparable to a dedicated mask. Moreover, even repeated impressions from the same task (homogeneous masks), when combined, can approach the performance of heterogeneous combinations if sufficiently many impressions are used. Our approach scales more efficiently than existing methods, often requiring orders of magnitude fewer parameters and can function without modification even when task identity is missing. In addition, in the setting where task labels are not given at inference, our algorithm gives an often favorable alternative to the one-shot procedure used by Wortsman et al., 2020. We evaluate our method on a number of well-known image classification datasets and network architectures.  ( 2 min )
    Improving Few-Shot Generalization by Exploring and Exploiting Auxiliary Data. (arXiv:2302.00674v1 [cs.LG])
    Few-shot learning involves learning an effective model from only a few labeled datapoints. The use of a small training set makes it difficult to avoid overfitting but also makes few-shot learning applicable to many important real-world settings. In this work, we focus on Few-shot Learning with Auxiliary Data (FLAD), a training paradigm that assumes access to auxiliary data during few-shot learning in hopes of improving generalization. Introducing auxiliary data during few-shot learning leads to essential design choices where hand-designed heuristics can lead to sub-optimal performance. In this work, we focus on automated sampling strategies for FLAD and relate them to the explore-exploit dilemma that is central in multi-armed bandit settings. Based on this connection we propose two algorithms -- EXP3-FLAD and UCB1-FLAD -- and compare them with methods that either explore or exploit, finding that the combination of exploration and exploitation is crucial. Using our proposed algorithms to train T5 yields a 9% absolute improvement over the explicitly multi-task pre-trained T0 model across 11 datasets.
    Learning Equilibria in Matching Markets from Bandit Feedback. (arXiv:2108.08843v2 [cs.LG] UPDATED)
    Large-scale, two-sided matching platforms must find market outcomes that align with user preferences while simultaneously learning these preferences from data. Classical notions of stability (Gale and Shapley, 1962; Shapley and Shubik, 1971) are unfortunately of limited value in the learning setting, given that preferences are inherently uncertain and destabilizing while they are being learned. To bridge this gap, we develop a framework and algorithms for learning stable market outcomes under uncertainty. Our primary setting is matching with transferable utilities, where the platform both matches agents and sets monetary transfers between them. We design an incentive-aware learning objective that captures the distance of a market outcome from equilibrium. Using this objective, we analyze the complexity of learning as a function of preference structure, casting learning as a stochastic multi-armed bandit problem. Algorithmically, we show that "optimism in the face of uncertainty," the principle underlying many bandit algorithms, applies to a primal-dual formulation of matching with transfers and leads to near-optimal regret bounds. Our work takes a first step toward elucidating when and how stable matchings arise in large, data-driven marketplaces.
    Incorporating Sum Constraints into Multitask Gaussian Processes. (arXiv:2202.01793v3 [stat.ML] UPDATED)
    Machine learning models can be improved by adapting them to respect existing background knowledge. In this paper we consider multitask Gaussian processes, with background knowledge in the form of constraints that require a specific sum of the outputs to be constant. This is achieved by conditioning the prior distribution on the constraint fulfillment. The approach allows for both linear and nonlinear constraints. We demonstrate that the constraints are fulfilled with high precision and that the construction can improve the overall prediction accuracy as compared to the standard Gaussian process.
    $\texttt{DoCoFL}$: Downlink Compression for Cross-Device Federated Learning. (arXiv:2302.00543v1 [cs.LG])
    Many compression techniques have been proposed to reduce the communication overhead of Federated Learning training procedures. However, these are typically designed for compressing model updates, which are expected to decay throughout training. As a result, such methods are inapplicable to downlink (i.e., from the parameter server to clients) compression in the cross-device setting, where heterogeneous clients $\textit{may appear only once}$ during training and thus must download the model parameters. In this paper, we propose a new framework ($\texttt{DoCoFL}$) for downlink compression in the cross-device federated learning setting. Importantly, $\texttt{DoCoFL}$ can be seamlessly combined with many uplink compression schemes, rendering it suitable for bi-directional compression. Through extensive evaluation, we demonstrate that $\texttt{DoCoFL}$ offers significant bi-directional bandwidth reduction while achieving competitive accuracy to that of $\texttt{FedAvg}$ without compression.
    A Fast, Well-Founded Approximation to the Empirical Neural Tangent Kernel. (arXiv:2206.12543v2 [stat.ML] UPDATED)
    Empirical neural tangent kernels (eNTKs) can provide a good understanding of a given network's representation: they are often far less expensive to compute and applicable more broadly than infinite width NTKs. For networks with O output units (e.g. an O-class classifier), however, the eNTK on N inputs is of size $NO \times NO$, taking $O((NO)^2)$ memory and up to $O((NO)^3)$ computation. Most existing applications have therefore used one of a handful of approximations yielding $N \times N$ kernel matrices, saving orders of magnitude of computation, but with limited to no justification. We prove that one such approximation, which we call "sum of logits", converges to the true eNTK at initialization for any network with a wide final "readout" layer. Our experiments demonstrate the quality of this approximation for various uses across a range of settings.
    The Replicator Dynamic, Chain Components and the Response Graph. (arXiv:2209.15230v2 [cs.GT] UPDATED)
    In this paper we examine the relationship between the flow of the replicator dynamic, the continuum limit of Multiplicative Weights Update, and a game's response graph. We settle an open problem establishing that under the replicator, sink chain components -- a topological notion of long-run outcome of a dynamical system -- always exist and are approximated by the sink connected components of the game's response graph. More specifically, each sink chain component contains a sink connected component of the response graph, as well as all mixed strategy profiles whose support consists of pure profiles in the same connected component, a set we call the content of the connected component. As a corollary, all profiles are chain recurrent in games with strongly connected response graphs. In any two-player game sharing a response graph with a zero-sum game, the sink chain component is unique. In two-player zero-sum and potential games the sink chain components and sink connected components are in a one-to-one correspondence, and we conjecture that this holds in all games.
    Development of deep biological ages aware of morbidity and mortality based on unsupervised and semi-supervised deep learning approaches. (arXiv:2302.00319v1 [cs.LG])
    Background: While deep learning technology, which has the capability of obtaining latent representations based on large-scale data, can be a potential solution for the discovery of a novel aging biomarker, existing deep learning methods for biological age estimation usually depend on chronological ages and lack of consideration of mortality and morbidity that are the most significant outcomes of aging. Methods: This paper proposes a novel deep learning model to learn latent representations of biological aging in regard to subjects' morbidity and mortality. The model utilizes health check-up data in addition to morbidity and mortality information to learn the complex relationships between aging and measured clinical attributes. Findings: The proposed model is evaluated on a large dataset of general populations compared with KDM and other learning-based models. Results demonstrate that biological ages obtained by the proposed model have superior discriminability of subjects' morbidity and mortality.
    Quantum machine learning beyond kernel methods. (arXiv:2110.13162v3 [quant-ph] UPDATED)
    Machine learning algorithms based on parametrized quantum circuits are prime candidates for near-term applications on noisy quantum computers. In this direction, various types of quantum machine learning models have been introduced and studied extensively. Yet, our understanding of how these models compare, both mutually and to classical models, remains limited. In this work, we identify a constructive framework that captures all standard models based on parametrized quantum circuits: that of linear quantum models. In particular, we show using tools from quantum information theory how data re-uploading circuits, an apparent outlier of this framework, can be efficiently mapped into the simpler picture of linear models in quantum Hilbert spaces. Furthermore, we analyze the experimentally-relevant resource requirements of these models in terms of qubit number and amount of data needed to learn. Based on recent results from classical machine learning, we prove that linear quantum models must utilize exponentially more qubits than data re-uploading models in order to solve certain learning tasks, while kernel methods additionally require exponentially more data points. Our results provide a more comprehensive view of quantum machine learning models as well as insights on the compatibility of different models with NISQ constraints.
    Off-the-Grid MARL: a Framework for Dataset Generation with Baselines for Cooperative Offline Multi-Agent Reinforcement Learning. (arXiv:2302.00521v1 [cs.LG])
    Being able to harness the power of large, static datasets for developing autonomous multi-agent systems could unlock enormous value for real-world applications. Many important industrial systems are multi-agent in nature and are difficult to model using bespoke simulators. However, in industry, distributed system processes can often be recorded during operation, and large quantities of demonstrative data can be stored. Offline multi-agent reinforcement learning (MARL) provides a promising paradigm for building effective online controllers from static datasets. However, offline MARL is still in its infancy, and, therefore, lacks standardised benchmarks, baselines and evaluation protocols typically found in more mature subfields of RL. This deficiency makes it difficult for the community to sensibly measure progress. In this work, we aim to fill this gap by releasing \emph{off-the-grid MARL (OG-MARL)}: a framework for generating offline MARL datasets and algorithms. We release an initial set of datasets and baselines for cooperative offline MARL, created using the framework, along with a standardised evaluation protocol. Our datasets provide settings that are characteristic of real-world systems, including complex dynamics, non-stationarity, partial observability, suboptimality and sparse rewards, and are generated from popular online MARL benchmarks. We hope that OG-MARL will serve the community and help steer progress in offline MARL, while also providing an easy entry point for researchers new to the field.
    Uniswap Liquidity Provision: An Online Learning Approach. (arXiv:2302.00610v1 [cs.GT])
    Decentralized Exchanges (DEXs) are new types of marketplaces leveraging Blockchain technology. They allow users to trade assets with Automatic Market Makers (AMM), using funds provided by liquidity providers, removing the need for order books. One such DEX, Uniswap v3, allows liquidity providers to allocate funds more efficiently by specifying an active price interval for their funds. This introduces the problem of finding an optimal strategy for choosing price intervals. We formalize this problem as an online learning problem with non-stochastic rewards. We use regret-minimization methods to show a liquidity provision strategy that guarantees a lower bound on the reward. This is true even for non-stochastic changes to asset pricing, and we express this bound in terms of the trading volume.
    Code2Snapshot: Using Code Snapshots for Learning Representations of Source Code. (arXiv:2111.01097v3 [cs.SE] UPDATED)
    There are several approaches for encoding source code in the input vectors of neural models. These approaches attempt to include various syntactic and semantic features of input programs in their encoding. In this paper, we investigate Code2Snapshot, a novel representation of the source code that is based on the snapshots of input programs. We evaluate several variations of this representation and compare its performance with state-of-the-art representations that utilize the rich syntactic and semantic features of input programs. Our preliminary study on the utility of Code2Snapshot in the code summarization and code classification tasks suggests that simple snapshots of input programs have comparable performance to state-of-the-art representations. Interestingly, obscuring input programs have insignificant impacts on the Code2Snapshot performance, suggesting that, for some tasks, neural models may provide high performance by relying merely on the structure of input programs.
    Cross-client Label Propagation for Transductive Federated Learning. (arXiv:2210.06434v2 [cs.LG] UPDATED)
    We present Cross-Client Label Propagation(XCLP), a new method for transductive federated learning. XCLP estimates a data graph jointly from the data of multiple clients and computes labels for the unlabeled data by propagating label information across the graph. To avoid clients having to share their data with anyone, XCLP employs two cryptographically secure protocols: secure Hamming distance computation and secure summation. We demonstrate two distinct applications of XCLP within federated learning. In the first, we use it in a one-shot way to predict labels for unseen test points. In the second, we use it to repeatedly pseudo-label unlabeled training data in a federated semi-supervised setting. Experiments on both real federated and standard benchmark datasets show that in both applications XCLP achieves higher classification accuracy than alternative approaches.
    Posterior Sampling for Continuing Environments. (arXiv:2211.15931v2 [cs.LG] UPDATED)
    We develop an extension of posterior sampling for reinforcement learning (PSRL) that is suited for a continuing agent-environment interface and integrates naturally into agent designs that scale to complex environments. The approach, continuing PSRL, maintains a statistically plausible model of the environment and follows a policy that maximizes expected $\gamma$-discounted return in that model. At each time, with probability $1-\gamma$, the model is replaced by a sample from the posterior distribution over environments. For a choice of discount factor that suitably depends on the horizon $T$, we establish an $\tilde{O}(\tau S \sqrt{A T})$ bound on the Bayesian regret, where $S$ is the number of environment states, $A$ is the number of actions, and $\tau$ denotes the reward averaging time, which is a bound on the duration required to accurately estimate the average reward of any policy. Our work is the first to formalize and rigorously analyze the resampling approach with randomized exploration.
    Short-term Prediction and Filtering of Solar Power Using State-Space Gaussian Processes. (arXiv:2302.00388v1 [cs.LG])
    Short-term forecasting of solar photovoltaic energy (PV) production is important for powerplant management. Ideally these forecasts are equipped with error bars, so that downstream decisions can account for uncertainty. To produce predictions with error bars in this setting, we consider Gaussian processes (GPs) for modelling and predicting solar photovoltaic energy production in the UK. A standard application of GP regression on the PV timeseries data is infeasible due to the large data size and non-Gaussianity of PV readings. However, this is made possible by leveraging recent advances in scalable GP inference, in particular, by using the state-space form of GPs, combined with modern variational inference techniques. The resulting model is not only scalable to large datasets but can also handle continuous data streams via Kalman filtering.
    WISE: Wavelet Transformation for Boosting Transformers' Long Sequence Learning Ability. (arXiv:2210.01989v2 [cs.CL] UPDATED)
    Transformer and its variants are fundamental neural architectures in deep learning. Recent works show that learning attention in the Fourier space can improve the long sequence learning capability of Transformers. We argue that wavelet transform shall be a better choice because it captures both position and frequency information with a linear time complexity. Therefore, in this paper, we systematically study the synergy between wavelet transform and Transformers. Specifically, we focus on a new paradigm WISE, which replaces the attention in Transformers by (1) applying forward wavelet transform to project the input sequences to multi-resolution bases, (2) conducting non-linear transformations in the wavelet coefficient space, and (3) reconstructing the representation in input space via backward wavelet transform. Extensive experiments on the Long Range Arena benchmark demonstrate that learning attention in the wavelet space using either fixed or adaptive wavelets can consistently improve Transformer's performance and also significantly outperform Fourier-based methods.
    Graph Neural Network Based Surrogate Model of Physics Simulations for Geometry Design. (arXiv:2302.00557v1 [cs.LG])
    Computational Intelligence (CI) techniques have shown great potential as a surrogate model of expensive physics simulation, with demonstrated ability to make fast predictions, albeit at the expense of accuracy in some cases. For many scientific and engineering problems involving geometrical design, it is desirable for the surrogate models to precisely describe the change in geometry and predict the consequences. In that context, we develop graph neural networks (GNNs) as fast surrogate models for physics simulation, which allow us to directly train the models on 2/3D geometry designs that are represented by an unstructured mesh or point cloud, without the need for any explicit or hand-crafted parameterization. We utilize an encoder-processor-decoder-type architecture which can flexibly make prediction at both node level and graph level. The performance of our proposed GNN-based surrogate model is demonstrated on 2 example applications: feature designs in the domain of additive engineering and airfoil design in the domain of aerodynamics. The models show good accuracy in their predictions on a separate set of test geometries after training, with almost instant prediction speeds, as compared to O(hour) for the high-fidelity simulations required otherwise.
    Deep learning for $\psi$-weakly dependent processes. (arXiv:2302.00333v1 [stat.ML])
    In this paper, we perform deep neural networks for learning $\psi$-weakly dependent processes. Such weak-dependence property includes a class of weak dependence conditions such as mixing, association,$\cdots$ and the setting considered here covers many commonly used situations such as: regression estimation, time series prediction, time series classification,$\cdots$ The consistency of the empirical risk minimization algorithm in the class of deep neural networks predictors is established. We achieve the generalization bound and obtain a learning rate, which is less than $\mathcal{O}(n^{-1/\alpha})$, for all $\alpha > 2 $. Applications to binary time series classification and prediction in affine causal models with exogenous covariates are carried out. Some simulation results are provided, as well as an application to the US recession data.
    Fast and realistic large-scale structure from machine-learning-augmented random field simulations. (arXiv:2205.07898v2 [astro-ph.CO] UPDATED)
    Producing thousands of simulations of the dark matter distribution in the Universe with increasing precision is a challenging but critical task to facilitate the exploitation of current and forthcoming cosmological surveys. Many inexpensive substitutes to full $N$-body simulations have been proposed, even though they often fail to reproduce the statistics of the smaller, non-linear scales. Among these alternatives, a common approximation is represented by the lognormal distribution, which comes with its own limitations as well, while being extremely fast to compute even for high-resolution density fields. In this work, we train a generative deep learning model, mainly made of convolutional layers, to transform projected lognormal dark matter density fields to more realistic dark matter maps, as obtained from full $N$-body simulations. We detail the procedure that we follow to generate highly correlated pairs of lognormal and simulated maps, which we use as our training data, exploiting the information of the Fourier phases. We demonstrate the performance of our model comparing various statistical tests with different field resolutions, redshifts and cosmological parameters, proving its robustness and explaining its current limitations. When evaluated on 100 test maps, the augmented lognormal random fields reproduce the power spectrum up to wavenumbers of $1 \ h \ \rm{Mpc}^{-1}$, and the bispectrum within 10%, and always within the error bars, of the fiducial target simulations. Finally, we describe how we plan to integrate our proposed model with existing tools to yield more accurate spherical random fields for weak lensing analysis.
    A Comprehensive Survey of Continual Learning: Theory, Method and Application. (arXiv:2302.00487v1 [cs.LG])
    To cope with real-world dynamics, an intelligent agent needs to incrementally acquire, update, accumulate, and exploit knowledge throughout its lifetime. This ability, known as continual learning, provides a foundation for AI systems to develop themselves adaptively. In a general sense, continual learning is explicitly limited by catastrophic forgetting, where learning a new task usually results in a dramatic performance drop of the old tasks. Beyond this, increasingly numerous advances have emerged in recent years that largely extend the understanding and application of continual learning. The growing and widespread interest in this direction demonstrates its realistic significance as well as complexity. In this work, we present a comprehensive survey of continual learning, seeking to bridge the basic settings, theoretical foundations, representative methods, and practical applications. Based on existing theoretical and empirical results, we summarize the general objectives of continual learning as ensuring a proper stability-plasticity trade-off and an adequate intra/inter-task generalizability in the context of resource efficiency. Then we provide a state-of-the-art and elaborated taxonomy, extensively analyzing how representative strategies address continual learning, and how they are adapted to particular challenges in various applications. Through an in-depth discussion of continual learning in terms of the current trends, cross-directional prospects and interdisciplinary connections with neuroscience, we believe that such a holistic perspective can greatly facilitate subsequent exploration in this field and beyond.
    Deterministic equivalent and error universality of deep random features learning. (arXiv:2302.00401v1 [stat.ML])
    This manuscript considers the problem of learning a random Gaussian network function using a fully connected network with frozen intermediate layers and trainable readout layer. This problem can be seen as a natural generalization of the widely studied random features model to deeper architectures. First, we prove Gaussian universality of the test error in a ridge regression setting where the learner and target networks share the same intermediate layers, and provide a sharp asymptotic formula for it. Establishing this result requires proving a deterministic equivalent for traces of the deep random features sample covariance matrices which can be of independent interest. Second, we conjecture the asymptotic Gaussian universality of the test error in the more general setting of arbitrary convex losses and generic learner/target architectures. We provide extensive numerical evidence for this conjecture, which requires the derivation of closed-form expressions for the layer-wise post-activation population covariances. In light of our results, we investigate the interplay between architecture design and implicit regularization.
    Improved Exact and Heuristic Algorithms for Maximum Weight Clique. (arXiv:2302.00458v1 [cs.DS])
    We propose improved exact and heuristic algorithms for solving the maximum weight clique problem, a well-known problem in graph theory with many applications. Our algorithms interleave successful techniques from related work with novel data reduction rules that use local graph structure to identify and remove vertices and edges while retaining the optimal solution. We evaluate our algorithms on a range of synthetic and real-world graphs, and find that they outperform the current state of the art on most inputs. Our data reductions always produce smaller reduced graphs than existing data reductions alone. As a result, our exact algorithm, MWCRedu, finds solutions orders of magnitude faster on naturally weighted, medium-sized map labeling graphs and random hyperbolic graphs. Our heuristic algorithm, MWCPeel, outperforms its competitors on these instances, but is slightly less effective on extremely dense or large instances.
    Accordion: A Communication-Aware Machine Learning Framework for Next Generation Networks. (arXiv:2302.00623v1 [cs.NI])
    In this article, we advocate for the design of ad hoc artificial intelligence (AI)/machine learning (ML) models to facilitate their usage in future smart infrastructures based on communication networks. To motivate this, we first review key operations identified by the 3GPP for transferring AI/ML models through 5G networks and the main existing techniques to reduce their communication overheads. We also present a novel communication-aware ML framework, which we refer to as Accordion, that enables an efficient AI/ML model transfer thanks to an overhauled model training and communication protocol. We demonstrate the communication-related benefits of Accordion, analyse key performance trade-offs, and discuss potential research directions within this realm.
    Towards Implementing Energy-aware Data-driven Intelligence for Smart Health Applications on Mobile Platforms. (arXiv:2302.00514v1 [cs.LG])
    Recent breakthrough technological progressions of powerful mobile computing resources such as low-cost mobile GPUs along with cutting-edge, open-source software architectures have enabled high-performance deep learning on mobile platforms. These advancements have revolutionized the capabilities of today's mobile applications in different dimensions to perform data-driven intelligence locally, particularly for smart health applications. Unlike traditional machine learning (ML) architectures, modern on-device deep learning frameworks are proficient in utilizing computing resources in mobile platforms seamlessly, in terms of producing highly accurate results in less inference time. However, on the flip side, energy resources in a mobile device are typically limited. Hence, whenever a complex Deep Neural Network (DNN) architecture is fed into the on-device deep learning framework, while it achieves high prediction accuracy (and performance), it also urges huge energy demands during the runtime. Therefore, managing these resources efficiently within the spectrum of performance and energy efficiency is the newest challenge for any mobile application featuring data-driven intelligence beyond experimental evaluations. In this paper, first, we provide a timely review of recent advancements in on-device deep learning while empirically evaluating the performance metrics of current state-of-the-art ML architectures and conventional ML approaches with the emphasis given on energy characteristics by deploying them on a smart health application. With that, we are introducing a new framework through an energy-aware, adaptive model comprehension and realization (EAMCR) approach that can be utilized to make more robust and efficient inference decisions based on the available computing/energy resources in the mobile device during the runtime.
    Two for One: Diffusion Models and Force Fields for Coarse-Grained Molecular Dynamics. (arXiv:2302.00600v1 [cs.LG])
    Coarse-grained (CG) molecular dynamics enables the study of biological processes at temporal and spatial scales that would be intractable at an atomistic resolution. However, accurately learning a CG force field remains a challenge. In this work, we leverage connections between score-based generative models, force fields and molecular dynamics to learn a CG force field without requiring any force inputs during training. Specifically, we train a diffusion generative model on protein structures from molecular dynamics simulations, and we show that its score function approximates a force field that can directly be used to simulate CG molecular dynamics. While having a vastly simplified training setup compared to previous work, we demonstrate that our approach leads to improved performance across several small- to medium-sized protein simulations, reproducing the CG equilibrium distribution, and preserving dynamics of all-atom simulations such as protein folding events.
    MB-DECTNet: A Model-Based Unrolled Network for Accurate 3D DECT Reconstruction. (arXiv:2302.00577v1 [eess.IV])
    Numerous dual-energy CT (DECT) techniques have been developed in the past few decades. Dual-energy CT (DECT) statistical iterative reconstruction (SIR) has demonstrated its potential for reducing noise and increasing accuracy. Our lab proposed a joint statistical DECT algorithm for stopping power estimation and showed that it outperforms competing image-based material-decomposition methods. However, due to its slow convergence and the high computational cost of projections, the elapsed time of 3D DECT SIR is often not clinically acceptable. Therefore, to improve its convergence, we have embedded DECT SIR into a deep learning model-based unrolled network for 3D DECT reconstruction (MB-DECTNet) that can be trained in an end-to-end fashion. This deep learning-based method is trained to learn the shortcuts between the initial conditions and the stationary points of iterative algorithms while preserving the unbiased estimation property of model-based algorithms. MB-DECTNet is formed by stacking multiple update blocks, each of which consists of a data consistency layer (DC) and a spatial mixer layer, where the spatial mixer layer is the shrunken U-Net, and the DC layer is a one-step update of an arbitrary traditional iterative method. Although the proposed network can be combined with numerous iterative DECT algorithms, we demonstrate its performance with the dual-energy alternating minimization (DEAM). The qualitative result shows that MB-DECTNet with DEAM significantly reduces noise while increasing the resolution of the test image. The quantitative result shows that MB-DECTNet has the potential to estimate attenuation coefficients accurately as traditional statistical algorithms but with a much lower computational cost.
    Distribution free optimality intervals for clustering. (arXiv:2107.14442v2 [stat.ML] UPDATED)
    We address the problem of validating the ouput of clustering algorithms. Given data $\mathcal{D}$ and a partition $\mathcal{C}$ of these data into $K$ clusters, when can we say that the clusters obtained are correct or meaningful for the data? This paper introduces a paradigm in which a clustering $\mathcal{C}$ is considered meaningful if it is good with respect to a loss function such as the K-means distortion, and stable, i.e. the only good clustering up to small perturbations. Furthermore, we present a generic method to obtain post-inference guarantees of near-optimality and stability for a clustering $\mathcal{C}$. The method can be instantiated for a variety of clustering criteria (also called loss functions) for which convex relaxations exist. Obtaining the guarantees amounts to solving a convex optimization problem. We demonstrate the practical relevance of this method by obtaining guarantees for the K-means and the Normalized Cut clustering criteria on realistic data sets. We also prove that asymptotic instability implies finite sample instability w.h.p., allowing inferences about the population clusterability from a sample. The guarantees do not depend on any distributional assumptions, but they depend on the data set $\mathcal{D}$ admitting a stable clustering.
    Conditional Flow Matching: Simulation-Free Dynamic Optimal Transport. (arXiv:2302.00482v1 [cs.LG])
    Continuous normalizing flows (CNFs) are an attractive generative modeling technique, but they have thus far been held back by limitations in their simulation-based maximum likelihood training. In this paper, we introduce a new technique called conditional flow matching (CFM), a simulation-free training objective for CNFs. CFM features a stable regression objective like that used to train the stochastic flow in diffusion models but enjoys the efficient inference of deterministic flow models. In contrast to both diffusion models and prior CNF training algorithms, our CFM objective does not require the source distribution to be Gaussian or require evaluation of its density. Based on this new objective, we also introduce optimal transport CFM (OT-CFM), which creates simpler flows that are more stable to train and lead to faster inference, as evaluated in our experiments. Training CNFs with CFM improves results on a variety of conditional and unconditional generation tasks such as inferring single cell dynamics, unsupervised image translation, and Schr\"odinger bridge inference. Code is available at https://github.com/atong01/conditional-flow-matching .
    Optimal Learning of Deep Random Networks of Extensive-width. (arXiv:2302.00375v1 [stat.ML])
    We consider the problem of learning a target function corresponding to a deep, extensive-width, non-linear neural network with random Gaussian weights. We consider the asymptotic limit where the number of samples, the input dimension and the network width are proportionally large. We derive a closed-form expression for the Bayes-optimal test error, for regression and classification tasks. We contrast these Bayes-optimal errors with the test errors of ridge regression, kernel and random features regression. We find, in particular, that optimally regularized ridge regression, as well as kernel regression, achieve Bayes-optimal performances, while the logistic loss yields a near-optimal test error for classification. We further show numerically that when the number of samples grows faster than the dimension, ridge and kernel methods become suboptimal, while neural networks achieve test error close to zero from quadratically many samples.
    Machine Learning for Visualization Recommendation Systems: Open Challenges and Future Directions. (arXiv:2302.00569v1 [cs.LG])
    Visualization Recommendation Systems (VRS) are a novel and challenging field of study, whose aim is to automatically generate insightful visualizations from data, to support non-expert users in the process of information discovery. Despite its enormous application potential in the era of big data, progress in this area of research is being held back by several obstacles among which are the absence of standardized datasets to train recommendation algorithms, and the difficulty in defining quantitative criteria to assess the effectiveness of the generated plots. In this paper, we aim not only to summarize the state-of-the-art of VRS, but also to outline promising future research directions.
    Graph Neural Operators for Classification of Spatial Transcriptomics Data. (arXiv:2302.00658v1 [cs.LG])
    The inception of spatial transcriptomics has allowed improved comprehension of tissue architectures and the disentanglement of complex underlying biological, physiological, and pathological processes through their positional contexts. Recently, these contexts, and by extension the field, have seen much promise and elucidation with the application of graph learning approaches. In particular, neural operators have risen in regards to learning the mapping between infinite-dimensional function spaces. With basic to deep neural network architectures being data-driven, i.e. dependent on quality data for prediction, neural operators provide robustness by offering generalization among different resolutions despite low quality data. Graph neural operators are a variant that utilize graph networks to learn this mapping between function spaces. The aim of this research is to identify robust machine learning architectures that integrate spatial information to predict tissue types. Under this notion, we propose a study incorporating various graph neural network approaches to validate the efficacy of applying neural operators towards prediction of brain regions in mouse brain tissue samples as a proof of concept towards our purpose. We were able to achieve an F1 score of nearly 72% for the graph neural operator approach which outperformed all baseline and other graph network approaches.
    Experimental observation on a low-rank tensor model for eigenvalue problems. (arXiv:2302.00538v1 [cs.LG])
    Here we utilize a low-rank tensor model (LTM) as a function approximator, combined with the gradient descent method, to solve eigenvalue problems including the Laplacian operator and the harmonic oscillator. Experimental results show the superiority of the polynomial-based low-rank tensor model (PLTM) compared to the tensor neural network (TNN). We also test such low-rank architectures for the classification problem on the MNIST dataset.
    QCRS: Improve Randomized Smoothing using Quasi-Concave Optimization. (arXiv:2302.00209v1 [cs.LG])
    Randomized smoothing is currently the state-of-the-art method that provides certified robustness for deep neural networks. However, it often cannot achieve an adequate certified region on real-world datasets. One way to obtain a larger certified region is to use an input-specific algorithm instead of using a fixed Gaussian filter for all data points. Several methods based on this idea have been proposed, but they either suffer from high computational costs or gain marginal improvement in certified radius. In this work, we show that by exploiting the quasiconvex problem structure, we can find the optimal certified radii for most data points with slight computational overhead. This observation leads to an efficient and effective input-specific randomized smoothing algorithm. We conduct extensive experiments and empirical analysis on Cifar10 and ImageNet. The results show that the proposed method significantly enhances the certified radii with low computational overhead.
    GFlowNets for AI-Driven Scientific Discovery. (arXiv:2302.00615v1 [cs.LG])
    Tackling the most pressing problems for humanity, such as the climate crisis and the threat of global pandemics, requires accelerating the pace of scientific discovery. While science has traditionally relied on trial and error and even serendipity to a large extent, the last few decades have seen a surge of data-driven scientific discoveries. However, in order to truly leverage large-scale data sets and high-throughput experimental setups, machine learning methods will need to be further improved and better integrated in the scientific discovery pipeline. A key challenge for current machine learning methods in this context is the efficient exploration of very large search spaces, which requires techniques for estimating reducible (epistemic) uncertainty and generating sets of diverse and informative experiments to perform. This motivated a new probabilistic machine learning framework called GFlowNets, which can be applied in the modeling, hypotheses generation and experimental design stages of the experimental science loop. GFlowNets learn to sample from a distribution given indirectly by a reward function corresponding to an unnormalized probability, which enables sampling diverse, high-reward candidates. GFlowNets can also be used to form efficient and amortized Bayesian posterior estimators for causal models conditioned on the already acquired experimental data. Having such posterior models can then provide estimators of epistemic uncertainty and information gain that can drive an experimental design policy. Altogether, here we will argue that GFlowNets can become a valuable tool for AI-driven scientific discovery, especially in scenarios of very large candidate spaces where we have access to cheap but inaccurate measurements or to expensive but accurate measurements. This is a common setting in the context of drug and material discovery, which we use as examples throughout the paper.
    Simple yet Effective Gradient-Free Graph Convolutional Networks. (arXiv:2302.00371v1 [cs.LG])
    Linearized Graph Neural Networks (GNNs) have attracted great attention in recent years for graph representation learning. Compared with nonlinear Graph Neural Network (GNN) models, linearized GNNs are much more time-efficient and can achieve comparable performances on typical downstream tasks such as node classification. Although some linearized GNN variants are purposely crafted to mitigate ``over-smoothing", empirical studies demonstrate that they still somehow suffer from this issue. In this paper, we instead relate over-smoothing with the vanishing gradient phenomenon and craft a gradient-free training framework to achieve more efficient and effective linearized GNNs which can significantly overcome over-smoothing and enhance the generalization of the model. The experimental results demonstrate that our methods achieve better and more stable performances on node classification tasks with varying depths and cost much less training time.
    Internally Rewarded Reinforcement Learning. (arXiv:2302.00270v1 [cs.LG])
    We study a class of reinforcement learning problems where the reward signals for policy learning are generated by a discriminator that is dependent on and jointly optimized with the policy. This interdependence between the policy and the discriminator leads to an unstable learning process because reward signals from an immature discriminator are noisy and impede policy learning, and conversely, an untrained policy impedes discriminator learning. We call this learning setting $\textit{Internally Rewarded Reinforcement Learning}$ (IRRL) as the reward is not provided directly by the environment but $\textit{internally}$ by the discriminator. In this paper, we formally formulate IRRL and present a class of problems that belong to IRRL. We theoretically derive and empirically analyze the effect of the reward function in IRRL and based on these analyses propose the clipped linear reward function. Experimental results show that the proposed reward function can consistently stabilize the training process by reducing the impact of reward noise, which leads to faster convergence and higher performance compared with baselines in diverse tasks.
    Stream-based active learning with linear models. (arXiv:2207.09874v3 [stat.ML] UPDATED)
    The proliferation of automated data collection schemes and the advances in sensorics are increasing the amount of data we are able to monitor in real-time. However, given the high annotation costs and the time required by quality inspections, data is often available in an unlabeled form. This is fostering the use of active learning for the development of soft sensors and predictive models. In production, instead of performing random inspections to obtain product information, labels are collected by evaluating the information content of the unlabeled data. Several query strategy frameworks for regression have been proposed in the literature but most of the focus has been dedicated to the static pool-based scenario. In this work, we propose a new strategy for the stream-based scenario, where instances are sequentially offered to the learner, which must instantaneously decide whether to perform the quality check to obtain the label or discard the instance. The approach is inspired by the optimal experimental design theory and the iterative aspect of the decision-making process is tackled by setting a threshold on the informativeness of the unlabeled data points. The proposed approach is evaluated using numerical simulations and the Tennessee Eastman Process simulator. The results confirm that selecting the examples suggested by the proposed algorithm allows for a faster reduction in the prediction error.
    Deep Learning Approach to Predict Hemorrhage in Moyamoya Disease. (arXiv:2302.00188v1 [cs.LG])
    Objective: Reliable tools to predict moyamoya disease (MMD) patients at risk for hemorrhage could have significant value. The aim of this paper is to develop three machine learning classification algorithms to predict hemorrhage in moyamoya disease. Methods: Clinical data of consecutive MMD patients who were admitted to our hospital between 2009 and 2015 were reviewed. Demographics, clinical, radiographic data were analyzed to develop artificial neural network (ANN), support vector machine (SVM), and random forest models. Results: We extracted 33 parameters, including 11 demographic and 22 radiographic features as input for model development. Of all compared classification results, ANN achieved the highest overall accuracy of 75.7% (95% CI, 68.6%-82.8%), followed by SVM with 69.2% (95% CI, 56.9%-81.5%) and random forest with 70.0% (95% CI, 57.0%-83.0%). Conclusions: The proposed ANN framework can be a potential effective tool to predict the possibility of hemorrhage among adult MMD patients based on clinical information and radiographic features.
    Learning from Stochastic Labels. (arXiv:2302.00299v1 [cs.LG])
    Annotating multi-class instances is a crucial task in the field of machine learning. Unfortunately, identifying the correct class label from a long sequence of candidate labels is time-consuming and laborious. To alleviate this problem, we design a novel labeling mechanism called stochastic label. In this setting, stochastic label includes two cases: 1) identify a correct class label from a small number of randomly given labels; 2) annotate the instance with None label when given labels do not contain correct class label. In this paper, we propose a novel suitable approach to learn from these stochastic labels. We obtain an unbiased estimator that utilizes less supervised information in stochastic labels to train a multi-class classifier. Additionally, it is theoretically justifiable by deriving the estimation error bound of the proposed method. Finally, we conduct extensive experiments on widely-used benchmark datasets to validate the superiority of our method by comparing it with existing state-of-the-art methods.
    Learning Choice Functions with Gaussian Processes. (arXiv:2302.00406v1 [cs.LG])
    In consumer theory, ranking available objects by means of preference relations yields the most common description of individual choices. However, preference-based models assume that individuals: (1) give their preferences only between pairs of objects; (2) are always able to pick the best preferred object. In many situations, they may be instead choosing out of a set with more than two elements and, because of lack of information and/or incomparability (objects with contradictory characteristics), they may not able to select a single most preferred object. To address these situations, we need a choice-model which allows an individual to express a set-valued choice. Choice functions provide such a mathematical framework. We propose a Gaussian Process model to learn choice functions from choice-data. The proposed model assumes a multiple utility representation of a choice function based on the concept of Pareto rationalization, and derives a strategy to learn both the number and the values of these latent multiple utilities. Simulation experiments demonstrate that the proposed model outperforms the state-of-the-art methods.
    Local convexity of the TAP free energy and AMP convergence for Z2-synchronization. (arXiv:2106.11428v2 [math.ST] UPDATED)
    We study mean-field variational Bayesian inference using the TAP approach, for Z2-synchronization as a prototypical example of a high-dimensional Bayesian model. We show that for any signal strength $\lambda > 1$ (the weak-recovery threshold), there exists a unique local minimizer of the TAP free energy functional near the mean of the Bayes posterior law. Furthermore, the TAP free energy in a local neighborhood of this minimizer is strongly convex. Consequently, a natural-gradient/mirror-descent algorithm achieves linear convergence to this minimizer from a local initialization, which may be obtained by a constant number of iterates of Approximate Message Passing (AMP). This provides a rigorous foundation for variational inference in high dimensions via minimization of the TAP free energy. We also analyze the finite-sample convergence of AMP, showing that AMP is asymptotically stable at the TAP minimizer for any $\lambda > 1$, and is linearly convergent to this minimizer from a spectral initialization for sufficiently large $\lambda$. Such a guarantee is stronger than results obtainable by state evolution analyses, which only describe a fixed number of AMP iterations in the infinite-sample limit. Our proofs combine the Kac-Rice formula and Sudakov-Fernique Gaussian comparison inequality to analyze the complexity of critical points that satisfy strong convexity and stability conditions within their local neighborhoods.
    CATFL: Certificateless Authentication-based Trustworthy Federated Learning for 6G Semantic Communications. (arXiv:2302.00271v1 [cs.CR])
    Federated learning (FL) provides an emerging approach for collaboratively training semantic encoder/decoder models of semantic communication systems, without private user data leaving the devices. Most existing studies on trustworthy FL aim to eliminate data poisoning threats that are produced by malicious clients, but in many cases, eliminating model poisoning attacks brought by fake servers is also an important objective. In this paper, a certificateless authentication-based trustworthy federated learning (CATFL) framework is proposed, which mutually authenticates the identity of clients and server. In CATFL, each client verifies the server's signature information before accepting the delivered global model to ensure that the global model is not delivered by false servers. On the contrary, the server also verifies the server's signature information before accepting the delivered model updates to ensure that they are submitted by authorized clients. Compared to PKI-based methods, the CATFL can avoid too high certificate management overheads. Meanwhile, the anonymity of clients shields data poisoning attacks, while real-name registration may suffer from user-specific privacy leakage risks. Therefore, a pseudonym generation strategy is also presented in CATFL to achieve a trade-off between identity traceability and user anonymity, which is essential to conditionally prevent from user-specific privacy leakage. Theoretical security analysis and evaluation results validate the superiority of CATFL.
    Bandit Convex Optimisation Revisited: FTRL Achieves $\tilde{O}(t^{1/2})$ Regret. (arXiv:2302.00358v1 [cs.LG])
    We show that a kernel estimator using multiple function evaluations can be easily converted into a sampling-based bandit estimator with expectation equal to the original kernel estimate. Plugging such a bandit estimator into the standard FTRL algorithm yields a bandit convex optimisation algorithm that achieves $\tilde{O}(t^{1/2})$ regret against adversarial time-varying convex loss functions.
    Equivariant Message Passing Neural Network for Crystal Material Discovery. (arXiv:2302.00485v1 [cs.LG])
    Automatic material discovery with desired properties is a fundamental challenge for material sciences. Considerable attention has recently been devoted to generating stable crystal structures. While existing work has shown impressive success on supervised tasks such as property prediction, the progress on unsupervised tasks such as material generation is still hampered by the limited extent to which the equivalent geometric representations of the same crystal are considered. To address this challenge, we propose EMPNN a periodic equivariant message-passing neural network that learns crystal lattice deformation in an unsupervised fashion. Our model equivalently acts on lattice according to the deformation action that must be performed, making it suitable for crystal generation, relaxation and optimisation. We present experimental evaluations that demonstrate the effectiveness of our approach.
    Learning Prototype Classifiers for Long-Tailed Recognition. (arXiv:2302.00491v1 [cs.CV])
    The problem of long-tailed recognition (LTR) has received attention in recent years due to the fundamental power-law distribution of objects in the real-world. Most recent works in LTR use softmax classifiers that have a tendency to correlate classifier norm with the amount of training data for a given class. On the other hand, Prototype classifiers do not suffer from this shortcoming and can deliver promising results simply using Nearest-Class-Mean (NCM), a special case where prototypes are empirical centroids. However, the potential of Prototype classifiers as an alternative to softmax in LTR is relatively underexplored. In this work, we propose Prototype classifiers, which jointly learn prototypes that minimize average cross-entropy loss based on probability scores from distances to prototypes. We theoretically analyze the properties of Euclidean distance based prototype classifiers that leads to stable gradient-based optimization which is robust to outliers. We further enhance Prototype classifiers by learning channel-dependent temperature parameters to enable independent distance scales along each channel. Our analysis shows that prototypes learned by Prototype classifiers are better separated than empirical centroids. Results on four long-tailed recognition benchmarks show that Prototype classifier outperforms or is comparable to the state-of-the-art methods.
    Robust online active learning. (arXiv:2302.00422v1 [stat.ML])
    In many industrial applications, obtaining labeled observations is not straightforward as it often requires the intervention of human experts or the use of expensive testing equipment. In these circumstances, active learning can be highly beneficial in suggesting the most informative data points to be used when fitting a model. Reducing the number of observations needed for model development alleviates both the computational burden required for training and the operational expenses related to labeling. Online active learning, in particular, is useful in high-volume production processes where the decision about the acquisition of the label for a data point needs to be taken within an extremely short time frame. However, despite the recent efforts to develop online active learning strategies, the behavior of these methods in the presence of outliers has not been thoroughly examined. In this work, we investigate the performance of online active linear regression in contaminated data streams. Our study shows that the currently available query strategies are prone to sample outliers, whose inclusion in the training set eventually degrades the predictive performance of the models. To address this issue, we propose a solution that bounds the search area of a conditional D-optimal algorithm and uses a robust estimator. Our approach strikes a balance between exploring unseen regions of the input space and protecting against outliers. Through numerical simulations, we show that the proposed method is effective in improving the performance of online active learning in the presence of outliers, thus expanding the potential applications of this powerful tool.
    Robust Fitted-Q-Evaluation and Iteration under Sequentially Exogenous Unobserved Confounders. (arXiv:2302.00662v1 [stat.ML])
    Offline reinforcement learning is important in domains such as medicine, economics, and e-commerce where online experimentation is costly, dangerous or unethical, and where the true model is unknown. However, most methods assume all covariates used in the behavior policy's action decisions are observed. This untestable assumption may be incorrect. We study robust policy evaluation and policy optimization in the presence of unobserved confounders. We assume the extent of possible unobserved confounding can be bounded by a sensitivity model, and that the unobserved confounders are sequentially exogenous. We propose and analyze an (orthogonalized) robust fitted-Q-iteration that uses closed-form solutions of the robust Bellman operator to derive a loss minimization problem for the robust Q function. Our algorithm enjoys the computational ease of fitted-Q-iteration and statistical improvements (reduced dependence on quantile estimation error) from orthogonalization. We provide sample complexity bounds, insights, and show effectiveness in simulations.
    A Survey of Methods, Challenges and Perspectives in Causality. (arXiv:2302.00293v1 [cs.LG])
    The Causality field aims to find systematic methods for uncovering cause-effect relationships. Such methods can find applications in many research fields, justifying a great interest in this domain. Machine Learning models have shown success in a large variety of tasks by extracting correlation patterns from high-dimensional data but still struggle when generalizing out of their initial distribution. As causal engines aim to learn mechanisms that are independent from a data distribution, combining Machine Learning with Causality has the potential to bring benefits to the two fields. In our work, we motivate this assumption and provide applications. We first perform an extensive overview of the theories and methods for Causality from different perspectives. We then provide a deeper look at the connections between Causality and Machine Learning and describe the challenges met by the two domains. We show the early attempts to bring the fields together and the possible perspectives for the future. We finish by providing a large variety of applications for techniques from Causality.
    Quickest Change Detection for Unnormalized Statistical Models. (arXiv:2302.00250v1 [stat.ML])
    Classical quickest change detection algorithms require modeling pre-change and post-change distributions. Such an approach may not be feasible for various machine learning models because of the complexity of computing the explicit distributions. Additionally, these methods may suffer from a lack of robustness to model mismatch and noise. This paper develops a new variant of the classical Cumulative Sum (CUSUM) algorithm for the quickest change detection. This variant is based on Fisher divergence and the Hyv\"arinen score and is called the Score-based CUSUM (SCUSUM) algorithm. The SCUSUM algorithm allows the applications of change detection for unnormalized statistical models, i.e., models for which the probability density function contains an unknown normalization constant. The asymptotic optimality of the proposed algorithm is investigated by deriving expressions for average detection delay and the mean running time to a false alarm. Numerical results are provided to demonstrate the performance of the proposed algorithm.
    Learning Cut Selection for Mixed-Integer Linear Programming via Hierarchical Sequence Model. (arXiv:2302.00244v1 [cs.LG])
    Cutting planes (cuts) are important for solving mixed-integer linear programs (MILPs), which formulate a wide range of important real-world applications. Cut selection -- which aims to select a proper subset of the candidate cuts to improve the efficiency of solving MILPs -- heavily depends on (P1) which cuts should be preferred, and (P2) how many cuts should be selected. Although many modern MILP solvers tackle (P1)-(P2) by manually designed heuristics, machine learning offers a promising approach to learn more effective heuristics from MILPs collected from specific applications. However, many existing learning-based methods focus on learning which cuts should be preferred, neglecting the importance of learning the number of cuts that should be selected. Moreover, we observe from extensive empirical results that (P3) what order of selected cuts should be preferred has a significant impact on the efficiency of solving MILPs as well. To address this challenge, we propose a novel hierarchical sequence model (HEM) to learn cut selection policies via reinforcement learning. Specifically, HEM consists of a two-level model: (1) a higher-level model to learn the number of cuts that should be selected, (2) and a lower-level model -- that formulates the cut selection task as a sequence to sequence learning problem -- to learn policies selecting an ordered subset with the size determined by the higher-level model. To the best of our knowledge, HEM is the first method that can tackle (P1)-(P3) in cut selection simultaneously from a data-driven perspective. Experiments show that HEM significantly improves the efficiency of solving MILPs compared to human-designed and learning-based baselines on both synthetic and large-scale real-world MILPs, including MIPLIB 2017. Moreover, experiments demonstrate that HEM well generalizes to MILPs that are significantly larger than those seen during training.
    Fourier series weight in quantum machine learning. (arXiv:2302.00105v1 [quant-ph])
    In this work, we aim to confirm the impact of the Fourier series on the quantum machine learning model. We will propose models, tests, and demonstrations to achieve this objective. We designed a quantum machine learning leveraged on the Hamiltonian encoding. With a subtle change, we performed the trigonometric interpolation, binary and multiclass classifier, and a quantum signal processing application. We also proposed a block diagram of determining approximately the Fourier coefficient based on quantum machine learning. We performed and tested all the proposed models using the Pennylane framework.
    Local transfer learning from one data space to another. (arXiv:2302.00160v1 [cs.LG])
    A fundamental problem in manifold learning is to approximate a functional relationship in a data chosen randomly from a probability distribution supported on a low dimensional sub-manifold of a high dimensional ambient Euclidean space. The manifold is essentially defined by the data set itself and, typically, designed so that the data is dense on the manifold in some sense. The notion of a data space is an abstraction of a manifold encapsulating the essential properties that allow for function approximation. The problem of transfer learning (meta-learning) is to use the learning of a function on one data set to learn a similar function on a new data set. In terms of function approximation, this means lifting a function on one data space (the base data space) to another (the target data space). This viewpoint enables us to connect some inverse problems in applied mathematics (such as inverse Radon transform) with transfer learning. In this paper we examine the question of such lifting when the data is assumed to be known only on a part of the base data space. We are interested in determining subsets of the target data space on which the lifting can be defined, and how the local smoothness of the function and its lifting are related.
    Offline Estimation of Controlled Markov Chains: Minimaxity and Sample Complexity. (arXiv:2211.07092v3 [stat.ML] UPDATED)
    In this work, we study a natural nonparametric estimator of the transition probability matrices of a finite controlled Markov chain. We consider an offline setting with a fixed dataset, collected using a so-called logging policy. We develop sample complexity bounds for the estimator and establish conditions for minimaxity. Our statistical bounds depend on the logging policy through its mixing properties. We show that achieving a particular statistical risk bound involves a subtle and interesting trade-off between the strength of the mixing properties and the number of samples. We demonstrate the validity of our results under various examples, such as ergodic Markov chains, weakly ergodic inhomogeneous Markov chains, and controlled Markov chains with non-stationary Markov, episodic, and greedy controls. Lastly, we use these sample complexity bounds to establish concomitant ones for offline evaluation of stationary Markov control policies.
    Efficient Scopeformer: Towards Scalable and Rich Feature Extraction for Intracranial Hemorrhage Detection. (arXiv:2302.00220v1 [cs.CV])
    The quality and richness of feature maps extracted by convolution neural networks (CNNs) and vision Transformers (ViTs) directly relate to the robust model performance. In medical computer vision, these information-rich features are crucial for detecting rare cases within large datasets. This work presents the "Scopeformer," a novel multi-CNN-ViT model for intracranial hemorrhage classification in computed tomography (CT) images. The Scopeformer architecture is scalable and modular, which allows utilizing various CNN architectures as the backbone with diversified output features and pre-training strategies. We propose effective feature projection methods to reduce redundancies among CNN-generated features and to control the input size of ViTs. Extensive experiments with various Scopeformer models show that the model performance is proportional to the number of convolutional blocks employed in the feature extractor. Using multiple strategies, including diversifying the pre-training paradigms for CNNs, different pre-training datasets, and style transfer techniques, we demonstrate an overall improvement in the model performance at various computational budgets. Later, we propose smaller compute-efficient Scopeformer versions with three different types of input and output ViT configurations. Efficient Scopeformers use four different pre-trained CNN architectures as feature extractors to increase feature richness. Our best Efficient Scopeformer model achieved an accuracy of 96.94\% and a weighted logarithmic loss of 0.083 with an eight times reduction in the number of trainable parameters compared to the base Scopeformer. Another version of the Efficient Scopeformer model further reduced the parameter space by almost 17 times with negligible performance reduction. Hybrid CNNs and ViTs might provide the desired feature richness for developing accurate medical computer vision models
    Accelerated First-Order Optimization under Nonlinear Constraints. (arXiv:2302.00316v1 [math.OC])
    We exploit analogies between first-order algorithms for constrained optimization and non-smooth dynamical systems to design a new class of accelerated first-order algorithms for constrained optimization. Unlike Frank-Wolfe or projected gradients, these algorithms avoid optimization over the entire feasible set at each iteration. We prove convergence to stationary points even in a nonconvex setting and we derive rates for the convex setting. An important property of these algorithms is that constraints are expressed in terms of velocities instead of positions, which naturally leads to sparse, local and convex approximations of the feasible set (even if the feasible set is nonconvex). Thus, the complexity tends to grow mildly in the number of decision variables and in the number of constraints, which makes the algorithms suitable for machine learning applications. We apply our algorithms to a compressed sensing and a sparse regression problem, showing that we can treat nonconvex $\ell^p$ constraints ($p<1$) efficiently, while recovering state-of-the-art performance for $p=1$.
    Graph-based Time-Series Anomaly Detection: A Survey. (arXiv:2302.00058v1 [cs.LG])
    With the recent advances in technology, a wide range of systems continues to collect a large amount of data over time and thus generating time series. Detecting anomalies in time series data is an important task in various applications such as e-commerce, cybersecurity, and health care monitoring. However, Time-series Anomaly Detection (TSAD) is very challenging as it requires considering both the temporal dependency and the structural dependency. Recent graph-based approaches have made impressive progress in tackling the challenges of this field. In this survey, we conduct a comprehensive and up-to-date review of Graph-based Time-series Anomaly Detection (G-TSAD). First, we explore the significant potential of graph-based methods in identifying different types of anomalies in time series data. Then, we provide a structured and comprehensive review of the state-of-the-art graph anomaly detection techniques in the context of time series. Finally, we discuss the technical challenges and potential future directions for possible improvements in this research field.
    OrthoReg: Improving Graph-regularized MLPs via Orthogonality Regularization. (arXiv:2302.00109v1 [cs.LG])
    Graph Neural Networks (GNNs) are currently dominating in modeling graph-structure data, while their high reliance on graph structure for inference significantly impedes them from widespread applications. By contrast, Graph-regularized MLPs (GR-MLPs) implicitly inject the graph structure information into model weights, while their performance can hardly match that of GNNs in most tasks. This motivates us to study the causes of the limited performance of GR-MLPs. In this paper, we first demonstrate that node embeddings learned from conventional GR-MLPs suffer from dimensional collapse, a phenomenon in which the largest a few eigenvalues dominate the embedding space, through empirical observations and theoretical analysis. As a result, the expressive power of the learned node representations is constrained. We further propose OrthoReg, a novel GR-MLP model to mitigate the dimensional collapse issue. Through a soft regularization loss on the correlation matrix of node embeddings, OrthoReg explicitly encourages orthogonal node representations and thus can naturally avoid dimensionally collapsed representations. Experiments on traditional transductive semi-supervised classification tasks and inductive node classification for cold-start scenarios demonstrate its effectiveness and superiority.
    TwinExplainer: Explaining Predictions of an Automotive Digital Twin. (arXiv:2302.00152v1 [cs.LG])
    Vehicles are complex Cyber Physical Systems (CPS) that operate in a variety of environments, and the likelihood of failure of one or more subsystems, such as the engine, transmission, brakes, and fuel, can result in unscheduled downtime and incur high maintenance or repair costs. In order to prevent these issues, it is crucial to continuously monitor the health of various subsystems and identify abnormal sensor channel behavior. Data-driven Digital Twin (DT) systems are capable of such a task. Current DT technologies utilize various Deep Learning (DL) techniques that are constrained by the lack of justification or explanation for their predictions. This inability of these opaque systems can influence decision-making and raises user trust concerns. This paper presents a solution to this issue, where the TwinExplainer system, with its three-layered architectural pipeline, explains the predictions of an automotive DT. Such a system can assist automotive stakeholders in understanding the global scale of the sensor channels and how they contribute towards generic DT predictions. TwinExplainer can also visualize explanations for both normal and abnormal local predictions computed by the DT.
    Neural Control of Parametric Solutions for High-dimensional Evolution PDEs. (arXiv:2302.00045v1 [math.NA])
    We develop a novel computational framework to approximate solution operators of evolution partial differential equations (PDEs). By employing a general nonlinear reduced-order model, such as a deep neural network, to approximate the solution of a given PDE, we realize that the evolution of the model parameter is a control problem in the parameter space. Based on this observation, we propose to approximate the solution operator of the PDE by learning the control vector field in the parameter space. From any initial value, this control field can steer the parameter to generate a trajectory such that the corresponding reduced-order model solves the PDE. This allows for substantially reduced computational cost to solve the evolution PDE with arbitrary initial conditions. We also develop comprehensive error analysis for the proposed method when solving a large class of semilinear parabolic PDEs. Numerical experiments on different high-dimensional evolution PDEs with various initial conditions demonstrate the promising results of the proposed method.
    Detection of Tomato Ripening Stages using Yolov3-tiny. (arXiv:2302.00164v1 [cs.CV])
    One of the most important agricultural products in Mexico is the tomato (Solanum lycopersicum), which occupies the 4th place national most produced product . Therefore, it is necessary to improve its production, building automatic detection system that detect, classify an keep tacks of the fruits is one way to archieve it. So, in this paper, we address the design of a computer vision system to detect tomatoes at different ripening stages. To solve the problem, we use a neural network-based model for tomato classification and detection. Specifically, we use the YOLOv3-tiny model because it is one of the lightest current deep neural networks. To train it, we perform two grid searches testing several combinations of hyperparameters. Our experiments showed an f1-score of 90.0% in the localization and classification of ripening stages in a custom dataset.
    FI-ODE: Certified and Robust Forward Invariance in Neural ODEs. (arXiv:2210.16940v2 [cs.LG] UPDATED)
    Forward invariance is a long-studied property in control theory that is used to certify that a dynamical system stays within some pre-specified set of states for all time, and also admits robustness guarantees (e.g., the certificate holds under perturbations). We propose a general framework for training and provably certifying robust forward invariance in Neural ODEs. We apply this framework in two settings: certified adversarial robustness for image classification, and certified safety in continuous control. Notably, our method empirically produces superior adversarial robustness guarantees compared to prior work on certifiably robust Neural ODEs (including implicit-depth models).
    Stable Target Field for Reduced Variance Score Estimation in Diffusion Models. (arXiv:2302.00670v1 [cs.LG])
    Diffusion models generate samples by reversing a fixed forward diffusion process. Despite already providing impressive empirical results, these diffusion models algorithms can be further improved by reducing the variance of the training targets in their denoising score-matching objective. We argue that the source of such variance lies in the handling of intermediate noise-variance scales, where multiple modes in the data affect the direction of reverse paths. We propose to remedy the problem by incorporating a reference batch which we use to calculate weighted conditional scores as more stable training targets. We show that the procedure indeed helps in the challenging intermediate regime by reducing (the trace of) the covariance of training targets. The new stable targets can be seen as trading bias for reduced variance, where the bias vanishes with increasing reference batch size. Empirically, we show that the new objective improves the image quality, stability, and training speed of various popular diffusion models across datasets with both general ODE and SDE solvers. When used in combination with EDM, our method yields a current SOTA FID of 1.90 with 35 network evaluations on the unconditional CIFAR-10 generation task. The code is available at https://github.com/Newbeeer/stf
    Generative methods for sampling transition paths in molecular dynamics. (arXiv:2205.02818v2 [stat.ML] UPDATED)
    Molecular systems often remain trapped for long times around some local minimum of the potential energy function, before switching to another one -- a behavior known as metastability. Simulating transition paths linking one metastable state to another one is difficult by direct numerical methods. In view of the promises of machine learning techniques, we explore in this work two approaches to more efficiently generate transition paths: sampling methods based on generative models such as variational autoencoders, and importance sampling methods based on reinforcement learning.
    Tensor networks for unsupervised machine learning. (arXiv:2106.12974v2 [cond-mat.stat-mech] UPDATED)
    Modeling the joint distribution of high-dimensional data is a central task in unsupervised machine learning. In recent years, many interests have been attracted to developing learning models based on tensor networks, which have the advantages of a principle understanding of the expressive power using entanglement properties, and as a bridge connecting classical computation and quantum computation. Despite the great potential, however, existing tensor network models for unsupervised machine learning only work as a proof of principle, as their performance is much worse than the standard models such as restricted Boltzmann machines and neural networks. In this Letter, we present autoregressive matrix product states (AMPS), a tensor network model combining matrix product states from quantum many-body physics and autoregressive modeling from machine learning. Our model enjoys the exact calculation of normalized probability and unbiased sampling. We demonstrate the performance of our model using two applications, generative modeling on synthetic and real-world data, and reinforcement learning in statistical physics. Using extensive numerical experiments, we show that the proposed model significantly outperforms the existing tensor network models and the restricted Boltzmann machines, and is competitive with state-of-the-art neural network models.
    How Out-of-Distribution Data Hurts Semi-Supervised Learning. (arXiv:2010.03658v3 [cs.LG] UPDATED)
    Recent semi-supervised learning algorithms have demonstrated greater success with higher overall performance due to better-unlabeled data representations. Nonetheless, recent research suggests that the performance of the SSL algorithm can be degraded when the unlabeled set contains out-of-distribution examples (OODs). This work addresses the following question: How do out-of-distribution (OOD) data adversely affect semi-supervised learning algorithms? To answer this question, we investigate the critical causes of OOD's negative effect on SSL algorithms. In particular, we found that 1) certain kinds of OOD data instances that are close to the decision boundary have a more significant impact on performance than those that are further away, and 2) Batch Normalization (BN), a popular module, may degrade rather than improve performance when the unlabeled set contains OODs. In this context, we developed a unified weighted robust SSL framework that can be easily extended to many existing SSL algorithms and improve their robustness against OODs. More specifically, we developed an efficient bi-level optimization algorithm that could accommodate high-order approximations of the objective and scale to multiple inner optimization steps to learn a massive number of weight parameters while outperforming existing low-order approximations of bi-level optimization. Further, we conduct a theoretical study of the impact of faraway OODs in the BN step and propose a weighted batch normalization (WBN) procedure for improved performance. Finally, we discuss the connection between our approach and low-order approximation techniques. Our experiments on synthetic and real-world datasets demonstrate that our proposed approach significantly enhances the robustness of four representative SSL algorithms against OODs compared to four state-of-the-art robust SSL strategies.
    Variational Causal Inference. (arXiv:2209.05935v2 [stat.ML] UPDATED)
    Estimating an individual's potential outcomes under counterfactual treatments is a challenging task for traditional causal inference and supervised learning approaches when the outcome is high-dimensional (e.g. gene expressions, impulse responses, human faces) and covariates are relatively limited. In this case, to construct one's outcome under a counterfactual treatment, it is crucial to leverage individual information contained in its observed factual outcome on top of the covariates. We propose a deep variational Bayesian framework that rigorously integrates two main sources of information for outcome construction under a counterfactual treatment: one source is the individual features embedded in the high-dimensional factual outcome; the other source is the response distribution of similar subjects (subjects with the same covariates) that factually received this treatment of interest.
    Efficient Meta-Learning via Error-based Context Pruning for Implicit Neural Representations. (arXiv:2302.00617v1 [cs.LG])
    We introduce an efficient optimization-based meta-learning technique for learning large-scale implicit neural representations (INRs). Our main idea is designing an online selection of context points, which can significantly reduce memory requirements for meta-learning in any established setting. By doing so, we expect additional memory savings which allows longer per-signal adaptation horizons (at a given memory budget), leading to better meta-initializations by reducing myopia and, more crucially, enabling learning on high-dimensional signals. To implement such context pruning, our technical novelty is three-fold. First, we propose a selection scheme that adaptively chooses a subset at each adaptation step based on the predictive error, leading to the modeling of the global structure of the signal in early steps and enabling the later steps to capture its high-frequency details. Second, we counteract any possible information loss from context pruning by minimizing the parameter distance to a bootstrapped target model trained on a full context set. Finally, we suggest using the full context set with a gradient scaling scheme at test-time. Our technique is model-agnostic, intuitive, and straightforward to implement, showing significant reconstruction improvements for a wide range of signals. Code is available at https://github.com/jihoontack/ECoP
    A Fair Empirical Risk Minimization with Generalized Entropy. (arXiv:2202.11966v3 [cs.LG] UPDATED)
    This paper studies a parametric family of algorithmic fairness metrics, called generalized entropy, which originally has been used in public welfare and recently introduced to machine learning community. As a meaningful metric to evaluate algorithmic fairness, it requires that generalized entropy specify fairness requirements of a classification problem and the fairness requirements should be realized with small deviation by an algorithm. We investigate the role of generalized entropy as a design parameter for fair classification algorithm through a fair empirical risk minimization with a constraint specified in terms of generalized entropy. We theoretically and experimentally study learnability of the problem.
    Anderson Acceleration For Bioinformatics-Based Machine Learning. (arXiv:2302.00347v1 [cs.LG])
    Anderson acceleration (AA) is a well-known method for accelerating the convergence of iterative algorithms, with applications in various fields including deep learning and optimization. Despite its popularity in these areas, the effectiveness of AA in classical machine learning classifiers has not been thoroughly studied. Tabular data, in particular, presents a unique challenge for deep learning models, and classical machine learning models are known to perform better in these scenarios. However, the convergence analysis of these models has received limited attention. To address this gap in research, we implement a support vector machine (SVM) classifier variant that incorporates AA to speed up convergence. We evaluate the performance of our SVM with and without Anderson acceleration on several datasets from the biology domain and demonstrate that the use of AA significantly improves convergence and reduces the training loss as the number of iterations increases. Our findings provide a promising perspective on the potential of Anderson acceleration in the training of simple machine learning classifiers and underscore the importance of further research in this area. By showing the effectiveness of AA in this setting, we aim to inspire more studies that explore the applications of AA in classical machine learning.
    Pessimistic Off-Policy Optimization for Learning to Rank. (arXiv:2206.02593v3 [cs.LG] UPDATED)
    Off-policy learning is a framework for optimizing policies without deploying them, using data collected by another policy. In recommender systems, this is especially challenging due to the imbalance in logged data: some items are recommended and thus logged more frequently than others. This is further perpetuated when recommending a list of items, as the action space is combinatorial. To address this challenge, we study pessimistic off-policy optimization for learning to rank. The key idea is to compute lower confidence bounds on parameters of click models and then return the list with the highest pessimistic estimate of its value. This approach is computationally efficient and we analyze it. We study its Bayesian and frequentist variants, and overcome the limitation of unknown prior by incorporating empirical Bayes. To show the empirical effectiveness of our approach, we compare it to off-policy optimizers that use inverse propensity scores or neglect uncertainty. Our approach outperforms all baselines, is robust, and is also general.
    HCR-Net: A deep learning based script independent handwritten character recognition network. (arXiv:2108.06663v3 [cs.CV] UPDATED)
    Despite being studied extensively for a few decades, handwritten character recognition (HCR) is still considered a challenging learning problem in pattern recognition, and there is very limited research on script independent models. This is mainly because of similarity in structure of characters, different handwriting styles, noisy datasets, diversity of scripts, focus of the conventional research on handcrafted feature extraction techniques, and unavailability of public datasets and code-repositories to reproduce the results. On the other hand, deep learning has witnessed huge success in different areas of pattern recognition, including HCR, and provides an end-to-end learning. However, deep learning techniques are computationally expensive, need large amount of data for training and have been developed for specific scripts only. To address the above limitations, we have proposed a novel generic deep learning architecture for script independent handwritten character recognition, called HCR-Net. HCR-Net is based on a novel transfer learning approach for HCR, which partly utilizes feature extraction layers of a pre-trained network. Due to transfer learning and image-augmentation, HCR-Net provides faster and computationally efficient training, better performance and better generalizations, and can work with small datasets. HCR-Net is extensively evaluated on 40 publicly available datasets of Bangla, Punjabi, Hindi, English, Swedish, Urdu, Farsi, Tibetan, Kannada, Malayalam, Telugu, Marathi, Nepali and Arabic languages, and established 26 new benchmark results while performed close to the best results in the rest cases. HCR-Net showed performance improvements up to 11% against the existing results and achieved a fast convergence rate showing up to 99% of final performance in the very first epoch. HCR-Net significantly outperformed the state-of-the-art transfer learning techniques...
    Diffusion Models for High-Resolution Solar Forecasts. (arXiv:2302.00170v1 [cs.LG])
    Forecasting future weather and climate is inherently difficult. Machine learning offers new approaches to increase the accuracy and computational efficiency of forecasts, but current methods are unable to accurately model uncertainty in high-dimensional predictions. Score-based diffusion models offer a new approach to modeling probability distributions over many dependent variables, and in this work, we demonstrate how they provide probabilistic forecasts of weather and climate variables at unprecedented resolution, speed, and accuracy. We apply the technique to day-ahead solar irradiance forecasts by generating many samples from a diffusion model trained to super-resolve coarse-resolution numerical weather predictions to high-resolution weather satellite observations.
    Deep Power Laws for Hyperparameter Optimization. (arXiv:2302.00441v1 [cs.LG])
    Hyperparameter optimization is an important subfield of machine learning that focuses on tuning the hyperparameters of a chosen algorithm to achieve peak performance. Recently, there has been a stream of methods that tackle the issue of hyperparameter optimization, however, most of the methods do not exploit the scaling law property of learning curves. In this work, we propose Deep Power Laws (DPL), an ensemble of neural network models conditioned to yield predictions that follow a power-law scaling pattern. Our method dynamically decides which configurations to pause and train incrementally by making use of gray-box evaluations. We compare our method against 7 state-of-the-art competitors on 3 benchmarks related to tabular, image, and NLP datasets covering 57 diverse tasks. Our method achieves the best results across all benchmarks by obtaining the best any-time results compared to all competitors.
    Analyzing Leakage of Personally Identifiable Information in Language Models. (arXiv:2302.00539v1 [cs.LG])
    Language Models (LMs) have been shown to leak information about training data through sentence-level membership inference and reconstruction attacks. Understanding the risk of LMs leaking Personally Identifiable Information (PII) has received less attention, which can be attributed to the false assumption that dataset curation techniques such as scrubbing are sufficient to prevent PII leakage. Scrubbing techniques reduce but do not prevent the risk of PII leakage: in practice scrubbing is imperfect and must balance the trade-off between minimizing disclosure and preserving the utility of the dataset. On the other hand, it is unclear to which extent algorithmic defenses such as differential privacy, designed to guarantee sentence- or user-level privacy, prevent PII disclosure. In this work, we propose (i) a taxonomy of PII leakage in LMs, (ii) metrics to quantify PII leakage, and (iii) attacks showing that PII leakage is a threat in practice. Our taxonomy provides rigorous game-based definitions for PII leakage via black-box extraction, inference, and reconstruction attacks with only API access to an LM. We empirically evaluate attacks against GPT-2 models fine-tuned on three domains: case law, health care, and e-mails. Our main contributions are (i) novel attacks that can extract up to 10 times more PII sequences as existing attacks, (ii) showing that sentence-level differential privacy reduces the risk of PII disclosure but still leaks about 3% of PII sequences, and (iii) a subtle connection between record-level membership inference and PII reconstruction.
    Efficient Multi-Task Reinforcement Learning via Selective Behavior Sharing. (arXiv:2302.00671v1 [cs.LG])
    The ability to leverage shared behaviors between tasks is critical for sample-efficient multi-task reinforcement learning (MTRL). While prior methods have primarily explored parameter and data sharing, direct behavior-sharing has been limited to task families requiring similar behaviors. Our goal is to extend the efficacy of behavior-sharing to more general task families that could require a mix of shareable and conflicting behaviors. Our key insight is an agent's behavior across tasks can be used for mutually beneficial exploration. To this end, we propose a simple MTRL framework for identifying shareable behaviors over tasks and incorporating them to guide exploration. We empirically demonstrate how behavior sharing improves sample efficiency and final performance on manipulation and navigation MTRL tasks and is even complementary to parameter sharing. Result videos are available at https://sites.google.com/view/qmp-mtrl.
    The Power of External Memory in Increasing Predictive Model Capacity. (arXiv:2302.00003v1 [cs.LG])
    One way of introducing sparsity into deep networks is by attaching an external table of parameters that is sparsely looked up at different layers of the network. By storing the bulk of the parameters in the external table, one can increase the capacity of the model without necessarily increasing the inference time. Two crucial questions in this setting are then: what is the lookup function for accessing the table and how are the contents of the table consumed? Prominent methods for accessing the table include 1) using words/wordpieces token-ids as table indices, 2) LSH hashing the token vector in each layer into a table of buckets, and 3) learnable softmax style routing to a table entry. The ways to consume the contents include adding/concatenating to input representation, and using the contents as expert networks that specialize to different inputs. In this work, we conduct rigorous experimental evaluations of existing ideas and their combinations. We also introduce a new method, alternating updates, that enables access to an increased token dimension without increasing the computation time, and demonstrate its effectiveness in language modeling.
    Model-Parallel Fourier Neural Operators as Learned Surrogates for Large-Scale Parametric PDEs. (arXiv:2204.01205v3 [cs.LG] UPDATED)
    Fourier neural operators (FNOs) are a recently introduced neural network architecture for learning solution operators of partial differential equations (PDEs), which have been shown to perform significantly better than comparable deep learning approaches. Once trained, FNOs can achieve speed-ups of multiple orders of magnitude over conventional numerical PDE solvers. However, due to the high dimensionality of their input data and network weights, FNOs have so far only been applied to two-dimensional or small three-dimensional problems. To remove this limited problem-size barrier, we propose a model-parallel version of FNOs based on domain-decomposition of both the input data and network weights. We demonstrate that our model-parallel FNO is able to predict time-varying PDE solutions of over 2.6 billion variables on Perlmutter using up to 512 A100 GPUs and show an example of training a distributed FNO on the Azure cloud for simulating multiphase CO$_2$ dynamics in the Earth's subsurface.
    Deep Dependency Networks for Multi-Label Classification. (arXiv:2302.00633v1 [cs.LG])
    We propose a simple approach which combines the strengths of probabilistic graphical models and deep learning architectures for solving the multi-label classification task, focusing specifically on image and video data. First, we show that the performance of previous approaches that combine Markov Random Fields with neural networks can be modestly improved by leveraging more powerful methods such as iterative join graph propagation, integer linear programming, and $\ell_1$ regularization-based structure learning. Then we propose a new modeling framework called deep dependency networks, which augments a dependency network, a model that is easy to train and learns more accurate dependencies but is limited to Gibbs sampling for inference, to the output layer of a neural network. We show that despite its simplicity, jointly learning this new architecture yields significant improvements in performance over the baseline neural network. In particular, our experimental evaluation on three video activity classification datasets: Charades, Textually Annotated Cooking Scenes (TACoS), and Wetlab, and three multi-label image classification datasets: MS-COCO, PASCAL VOC, and NUS-WIDE show that deep dependency networks are almost always superior to pure neural architectures that do not use dependency networks.
    Test-Time Amendment with a Coarse Classifier for Fine-Grained Classification. (arXiv:2302.00368v1 [cs.CV])
    We investigate the problem of reducing mistake severity for fine-grained classification. Fine-grained classification can be challenging, mainly due to the requirement of knowledge or domain expertise for accurate annotation. However, humans are particularly adept at performing coarse classification as it requires relatively low levels of expertise. To this end, we present a novel approach for Post-Hoc Correction called Hierarchical Ensembles (HiE) that utilizes label hierarchy to improve the performance of fine-grained classification at test-time using the coarse-grained predictions. By only requiring the parents of leaf nodes, our method significantly reduces avg. mistake severity while improving top-1 accuracy on the iNaturalist-19 and tieredImageNet-H datasets, achieving a new state-of-the-art on both benchmarks. We also investigate the efficacy of our approach in the semi-supervised setting. Our approach brings notable gains in top-1 accuracy while significantly decreasing the severity of mistakes as training data decreases for the fine-grained classes. The simplicity and post-hoc nature of HiE render it practical to be used with any off-the-shelf trained model to improve its predictions further.
    Generative Adversarial Symmetry Discovery. (arXiv:2302.00236v1 [cs.LG])
    Despite the success of equivariant neural networks in scientific applications, they require knowing the symmetry group a priori. However, it may be difficult to know the right symmetry to use as an inductive bias in practice and enforcing the wrong symmetry could hurt the performance. In this paper, we propose a framework, LieGAN, to automatically discover equivariances from a dataset using a paradigm akin to generative adversarial training. Specifically, a generator learns a group of transformations applied to the data, which preserves the original distribution and fools the discriminator. LieGAN represents symmetry as interpretable Lie algebra basis and can discover various symmetries such as rotation group $\mathrm{SO}(n)$ and restricted Lorentz group $\mathrm{SO}(1,3)^+$ in trajectory prediction and top quark tagging tasks. The learned symmetry can also be readily used in several existing equivariant neural networks to improve accuracy and generalization in prediction.
    Detecting Harmful Agendas in News Articles. (arXiv:2302.00102v1 [cs.CL])
    Manipulated news online is a growing problem which necessitates the use of automated systems to curtail its spread. We argue that while misinformation and disinformation detection have been studied, there has been a lack of investment in the important open challenge of detecting harmful agendas in news articles; identifying harmful agendas is critical to flag news campaigns with the greatest potential for real world harm. Moreover, due to real concerns around censorship, harmful agenda detectors must be interpretable to be effective. In this work, we propose this new task and release a dataset, NewsAgendas, of annotated news articles for agenda identification. We show how interpretable systems can be effective on this task and demonstrate that they can perform comparably to black-box models.
    Free Lunch for Domain Adversarial Training: Environment Label Smoothing. (arXiv:2302.00194v1 [cs.LG])
    A fundamental challenge for machine learning models is how to generalize learned models for out-of-distribution (OOD) data. Among various approaches, exploiting invariant features by Domain Adversarial Training (DAT) received widespread attention. Despite its success, we observe training instability from DAT, mostly due to over-confident domain discriminator and environment label noise. To address this issue, we proposed Environment Label Smoothing (ELS), which encourages the discriminator to output soft probability, which thus reduces the confidence of the discriminator and alleviates the impact of noisy environment labels. We demonstrate, both experimentally and theoretically, that ELS can improve training stability, local convergence, and robustness to noisy environment labels. By incorporating ELS with DAT methods, we are able to yield state-of-art results on a wide range of domain generalization/adaptation tasks, particularly when the environment labels are highly noisy.
    The Parametric Stability of Well-separated Spherical Gaussian Mixtures. (arXiv:2302.00242v1 [stat.ML])
    We quantify the parameter stability of a spherical Gaussian Mixture Model (sGMM) under small perturbations in distribution space. Namely, we derive the first explicit bound to show that for a mixture of spherical Gaussian $P$ (sGMM) in a pre-defined model class, all other sGMM close to $P$ in this model class in total variation distance has a small parameter distance to $P$. Further, this upper bound only depends on $P$. The motivation for this work lies in providing guarantees for fitting Gaussian mixtures; with this aim in mind, all the constants involved are well defined and distribution free conditions for fitting mixtures of spherical Gaussians. Our results tighten considerably the existing computable bounds, and asymptotically match the known sharp thresholds for this problem.
    Knowledge Distillation on Graphs: A Survey. (arXiv:2302.00219v1 [cs.LG])
    Graph Neural Networks (GNNs) have attracted tremendous attention by demonstrating their capability to handle graph data. However, they are difficult to be deployed in resource-limited devices due to model sizes and scalability constraints imposed by the multi-hop data dependency. In addition, real-world graphs usually possess complex structural information and features. Therefore, to improve the applicability of GNNs and fully encode the complicated topological information, knowledge distillation on graphs (KDG) has been introduced to build a smaller yet effective model and exploit more knowledge from data, leading to model compression and performance improvement. Recently, KDG has achieved considerable progress with many studies proposed. In this survey, we systematically review these works. Specifically, we first introduce KDG challenges and bases, then categorize and summarize existing works of KDG by answering the following three questions: 1) what to distillate, 2) who to whom, and 3) how to distillate. Finally, we share our thoughts on future research directions.
    Towards Label-Efficient Incremental Learning: A Survey. (arXiv:2302.00353v1 [cs.LG])
    The current dominant paradigm when building a machine learning model is to iterate over a dataset over and over until convergence. Such an approach is non-incremental, as it assumes access to all images of all categories at once. However, for many applications, non-incremental learning is unrealistic. To that end, researchers study incremental learning, where a learner is required to adapt to an incoming stream of data with a varying distribution while preventing forgetting of past knowledge. Significant progress has been made, however, the vast majority of works focus on the fully supervised setting, making these algorithms label-hungry thus limiting their real-life deployment. To that end, in this paper, we make the first attempt to survey recently growing interest in label-efficient incremental learning. We identify three subdivisions, namely semi-, few-shot- and self-supervised learning to reduce labeling efforts. Finally, we identify novel directions that can further enhance label-efficiency and improve incremental learning scalability. Project website: {https://github.com/kilickaya/label-efficient-il.
    Density peak clustering using tensor network. (arXiv:2302.00192v1 [cs.LG])
    Tensor networks, which have been traditionally used to simulate many-body physics, have recently gained significant attention in the field of machine learning due to their powerful representation capabilities. In this work, we propose a density-based clustering algorithm inspired by tensor networks. We encode classical data into tensor network states on an extended Hilbert space and train the tensor network states to capture the features of the clusters. Here, we define density and related concepts in terms of fidelity, rather than using a classical distance measure. We evaluate the performance of our algorithm on six synthetic data sets, four real world data sets, and three commonly used computer vision data sets. The results demonstrate that our method provides state-of-the-art performance on several synthetic data sets and real world data sets, even when the number of clusters is unknown. Additionally, our algorithm performs competitively with state-of-the-art algorithms on the MNIST, USPS, and Fashion-MNIST image data sets. These findings reveal the great potential of tensor networks for machine learning applications.
    Bridging Physics-Informed Neural Networks with Reinforcement Learning: Hamilton-Jacobi-Bellman Proximal Policy Optimization (HJBPPO). (arXiv:2302.00237v1 [cs.LG])
    This paper introduces the Hamilton-Jacobi-Bellman Proximal Policy Optimization (HJBPPO) algorithm into reinforcement learning. The Hamilton-Jacobi-Bellman (HJB) equation is used in control theory to evaluate the optimality of the value function. Our work combines the HJB equation with reinforcement learning in continuous state and action spaces to improve the training of the value network. We treat the value network as a Physics-Informed Neural Network (PINN) to solve for the HJB equation by computing its derivatives with respect to its inputs exactly. The Proximal Policy Optimization (PPO)-Clipped algorithm is improvised with this implementation as it uses a value network to compute the objective function for its policy network. The HJBPPO algorithm shows an improved performance compared to PPO on the MuJoCo environments.
    Implicit Regularization Leads to Benign Overfitting for Sparse Linear Regression. (arXiv:2302.00257v1 [cs.LG])
    In deep learning, often the training process finds an interpolator (a solution with 0 training loss), but the test loss is still low. This phenomenon, known as benign overfitting, is a major mystery that received a lot of recent attention. One common mechanism for benign overfitting is implicit regularization, where the training process leads to additional properties for the interpolator, often characterized by minimizing certain norms. However, even for a simple sparse linear regression problem $y = \beta^{*\top} x +\xi$ with sparse $\beta^*$, neither minimum $\ell_1$ or $\ell_2$ norm interpolator gives the optimal test loss. In this work, we give a different parametrization of the model which leads to a new implicit regularization effect that combines the benefit of $\ell_1$ and $\ell_2$ interpolators. We show that training our new model via gradient descent leads to an interpolator with near-optimal test loss. Our result is based on careful analysis of the training dynamics and provides another example of implicit regularization effect that goes beyond norm minimization.
    W2SAT: Learning to generate SAT instances from Weighted Literal Incidence Graphs. (arXiv:2302.00272v1 [cs.LG])
    The Boolean Satisfiability (SAT) problem stands out as an attractive NP-complete problem in theoretic computer science and plays a central role in a broad spectrum of computing-related applications. Exploiting and tuning SAT solvers under numerous scenarios require massive high-quality industry-level SAT instances, which unfortunately are quite limited in the real world. To address the data insufficiency issue, in this paper, we propose W2SAT, a framework to generate SAT formulas by learning intrinsic structures and properties from given real-world/industrial instances in an implicit fashion. To this end, we introduce a novel SAT representation called Weighted Literal Incidence Graph (WLIG), which exhibits strong representation ability and generalizability against existing counterparts, and can be efficiently generated via a specialized learning-based graph generative model. Decoding from WLIGs into SAT problems is then modeled as finding overlapping cliques with a novel hill-climbing optimization method termed Optimal Weight Coverage (OWC). Experiments demonstrate the superiority of our WLIG-induced approach in terms of graph metrics, efficiency, and scalability in comparison to previous methods. Additionally, we discuss the limitations of graph-based SAT generation for real-world applications, especially when utilizing generated instances for SAT solver parameter-tuning, and pose some potential directions.
    The geometry of hidden representations of large transformer models. (arXiv:2302.00294v1 [cs.LG])
    Large transformers are powerful architectures for self-supervised analysis of data of various nature, ranging from protein sequences to text to images. In these models, the data representation in the hidden layers live in the same space, and the semantic structure of the dataset emerges by a sequence of functionally identical transformations between one representation and the next. We here characterize the geometric and statistical properties of these representations, focusing on the evolution of such proprieties across the layers. By analyzing geometric properties such as the intrinsic dimension (ID) and the neighbor composition we find that the representations evolve in a strikingly similar manner in transformers trained on protein language tasks and image reconstruction tasks. In the first layers, the data manifold expands, becoming high-dimensional, and then it contracts significantly in the intermediate layers. In the last part of the model, the ID remains approximately constant or forms a second shallow peak. We show that the semantic complexity of the dataset emerges at the end of the first peak. This phenomenon can be observed across many models trained on diverse datasets. Based on these observations, we suggest using the ID profile as an unsupervised proxy to identify the layers which are more suitable for downstream learning tasks.
    Deep Active Learning for Scientific Computing in the Wild. (arXiv:2302.00098v1 [cs.LG])
    Deep learning (DL) is revolutionizing the scientific computing community. To reduce the data gap caused by usually expensive simulations or experimentation, active learning has been identified as a promising solution for the scientific computing community. However, the deep active learning (DAL) literature is currently dominated by image classification problems and pool-based methods, which are not directly transferrable to scientific computing problems, dominated by regression problems with no pre-defined 'pool' of unlabeled data. Here for the first time, we investigate the robustness of DAL methods for scientific computing problems using ten state-of-the-art DAL methods and eight benchmark problems. We show that, to our surprise, the majority of the DAL methods are not robust even compared to random sampling when the ideal pool size is unknown. We further analyze the effectiveness and robustness of DAL methods and suggest that diversity is necessary for a robust DAL for scientific computing problems.
    Stroke-based Rendering: From Heuristics to Deep Learning. (arXiv:2302.00595v1 [cs.CV])
    In the last few years, artistic image-making with deep learning models has gained a considerable amount of traction. A large number of these models operate directly in the pixel space and generate raster images. This is however not how most humans would produce artworks, for example, by planning a sequence of shapes and strokes to draw. Recent developments in deep learning methods help to bridge the gap between stroke-based paintings and pixel photo generation. With this survey, we aim to provide a structured introduction and understanding of common challenges and approaches in stroke-based rendering algorithms. These algorithms range from simple rule-based heuristics to stroke optimization and deep reinforcement agents, trained to paint images with differentiable vector graphics and neural rendering.
    Dynamic Flows on Curved Space Generated by Labeled Data. (arXiv:2302.00061v1 [cs.LG])
    The scarcity of labeled data is a long-standing challenge for many machine learning tasks. We propose our gradient flow method to leverage the existing dataset (i.e., source) to generate new samples that are close to the dataset of interest (i.e., target). We lift both datasets to the space of probability distributions on the feature-Gaussian manifold, and then develop a gradient flow method that minimizes the maximum mean discrepancy loss. To perform the gradient flow of distributions on the curved feature-Gaussian space, we unravel the Riemannian structure of the space and compute explicitly the Riemannian gradient of the loss function induced by the optimal transport metric. For practical applications, we also propose a discretized flow, and provide conditional results guaranteeing the global convergence of the flow to the optimum. We illustrate the results of our proposed gradient flow method on several real-world datasets and show our method can improve the accuracy of classification models in transfer learning settings.
    Active Uncertainty Reduction for Safe and Efficient Interaction Planning: A Shielding-Aware Dual Control Approach. (arXiv:2302.00171v1 [cs.RO])
    The ability to accurately predict the opponent's behavior is central to the safety and efficiency of robotic systems in interactive settings, such as human-robot interaction and multi-robot teaming tasks. Unfortunately, robots often lack access to key information on which these predictions may hinge, such as opponent's goals, attention, and willingness to cooperate. Dual control theory addresses this challenge by treating unknown parameters of a predictive model as hidden states and inferring their values at runtime using information gathered during system operation. While able to optimally and automatically trade off exploration and exploitation, dual control is computationally intractable for general interactive motion planning. In this paper, we present a novel algorithmic approach to enable active uncertainty reduction for interactive motion planning based on the implicit dual control paradigm. Our approach relies on sampling-based approximation of stochastic dynamic programming, leading to a model predictive control problem. The resulting policy is shown to preserve the dual control effect for a broad class of predictive models with both continuous and categorical uncertainty. To ensure the safe operation of the interacting agents, we leverage a supervisory control scheme, oftentimes referred to as ``shielding'', which overrides the ego agent's dual control policy with a safety fallback strategy when a safety-critical event is imminent. We then augment the dual control framework with an improved variant of the recently proposed shielding-aware robust planning scheme, which proactively balances the nominal planning performance with the risk of high-cost emergency maneuvers triggered by low-probability opponent's behaviors. We demonstrate the efficacy of our approach with both simulated driving examples and hardware experiments using 1/10 scale autonomous vehicles.
    Probabilistic Point Cloud Modeling via Self-Organizing Gaussian Mixture Models. (arXiv:2302.00047v1 [cs.LG])
    This letter presents a continuous probabilistic modeling methodology for spatial point cloud data using finite Gaussian Mixture Models (GMMs) where the number of components are adapted based on the scene complexity. Few hierarchical and adaptive methods have been proposed to address the challenge of balancing model fidelity with size. Instead, state-of-the-art mapping approaches require tuning parameters for specific use cases, but do not generalize across diverse environments. To address this gap, we utilize a self-organizing principle from information-theoretic learning to automatically adapt the complexity of the GMM model based on the relevant information in the sensor data. The approach is evaluated against existing point cloud modeling techniques on real-world data with varying degrees of scene complexity.
    Debiasing Vision-Language Models via Biased Prompts. (arXiv:2302.00070v1 [cs.LG])
    Machine learning models have been shown to inherit biases from their training datasets, which can be particularly problematic for vision-language foundation models trained on uncurated datasets scraped from the internet. The biases can be amplified and propagated to downstream applications like zero-shot classifiers and text-to-image generative models. In this study, we propose a general approach for debiasing vision-language foundation models by projecting out biased directions in the text embedding. In particular, we show that debiasing only the text embedding with a calibrated projection matrix suffices to yield robust classifiers and fair generative models. The closed-form solution enables easy integration into large-scale pipelines, and empirical results demonstrate that our approach effectively reduces social bias and spurious correlation in both discriminative and generative vision-language models without the need for additional data or training.
    Online Learning in Dynamically Changing Environments. (arXiv:2302.00103v1 [cs.LG])
    We study the problem of online learning and online regret minimization when samples are drawn from a general unknown non-stationary process. We introduce the concept of a dynamic changing process with cost $K$, where the conditional marginals of the process can vary arbitrarily, but that the number of different conditional marginals is bounded by $K$ over $T$ rounds. For such processes we prove a tight (upto $\sqrt{\log T}$ factor) bound $O(\sqrt{KT\cdot\mathsf{VC}(\mathcal{H})\log T})$ for the expected worst case regret of any finite VC-dimensional class $\mathcal{H}$ under absolute loss (i.e., the expected miss-classification loss). We then improve this bound for general mixable losses, by establishing a tight (up to $\log^3 T$ factor) regret bound $O(K\cdot\mathsf{VC}(\mathcal{H})\log^3 T)$. We extend these results to general smooth adversary processes with unknown reference measure by showing a sub-linear regret bound for $1$-dimensional threshold functions under a general bounded convex loss. Our results can be viewed as a first step towards regret analysis with non-stationary samples in the distribution blind (universal) regime. This also brings a new viewpoint that shifts the study of complexity of the hypothesis classes to the study of the complexity of processes generating data.
    ezDPS: An Efficient and Zero-Knowledge Machine Learning Inference Pipeline. (arXiv:2212.05428v2 [cs.CR] UPDATED)
    Machine Learning as a service (MLaaS) permits resource-limited clients to access powerful data analytics services ubiquitously. Despite its merits, MLaaS poses significant concerns regarding the integrity of delegated computation and the privacy of the server's model parameters. To address this issue, Zhang et al. (CCS'20) initiated the study of zero-knowledge Machine Learning (zkML). Few zkML schemes have been proposed afterward; however, they focus on sole ML classification algorithms that may not offer satisfactory accuracy or require large-scale training data and model parameters, which may not be desirable for some applications. We propose ezDPS, a new efficient and zero-knowledge ML inference scheme. Unlike prior works, ezDPS is a zkML pipeline in which the data is processed in multiple stages for high accuracy. Each stage of ezDPS is harnessed with an established ML algorithm that is shown to be effective in various applications, including Discrete Wavelet Transformation, Principal Components Analysis, and Support Vector Machine. We design new gadgets to prove ML operations effectively. We fully implemented ezDPS and assessed its performance on real datasets. Experimental results showed that ezDPS achieves one-to-three orders of magnitude more efficient than the generic circuit-based approach in all metrics while maintaining more desirable accuracy than single ML classification approaches.
    Truthful Incentive Mechanism for Federated Learning with Crowdsourced Data Labeling. (arXiv:2302.00106v1 [cs.LG])
    Federated learning (FL) has emerged as a promising paradigm that trains machine learning (ML) models on clients' devices in a distributed manner without the need of transmitting clients' data to the FL server. In many applications of ML, the labels of training data need to be generated manually by human agents. In this paper, we study FL with crowdsourced data labeling where the local data of each participating client of FL are labeled manually by the client. We consider the strategic behavior of clients who may not make desired effort in their local data labeling and local model computation and may misreport their local models to the FL server. We characterize the performance bounds on the training loss as a function of clients' data labeling effort, local computation effort, and reported local models. We devise truthful incentive mechanisms which incentivize strategic clients to make truthful efforts and report true local models to the server. The truthful design exploits the non-trivial dependence of the training loss on clients' efforts and local models. Under the truthful mechanisms, we characterize the server's optimal local computation effort assignments. We evaluate the proposed FL algorithms with crowdsourced data labeling and the incentive mechanisms using experiments.
    ADAPT : Awesome Domain Adaptation Python Toolbox. (arXiv:2107.03049v2 [cs.LG] UPDATED)
    In this paper, we introduce the ADAPT library, an open source Python API providing the implementation of the main transfer learning and domain adaptation methods. The library is designed with a user friendly approach to facilitate the access to domain adaptation for a wide public. ADAPT is compatible with scikit-learn and TensorFlow and a full documentation is proposed online https://adapt-python.github.io/adapt/ with a substantial gallery of examples.
    Learning Optimal Fair Classification Trees: Trade-offs Between Interpretability, Fairness, and Accuracy. (arXiv:2201.09932v3 [cs.LG] UPDATED)
    The increasing use of machine learning in high-stakes domains -- where people's livelihoods are impacted -- creates an urgent need for interpretable, fair, and highly accurate algorithms. With these needs in mind, we propose a mixed integer optimization (MIO) framework for learning optimal classification trees -- one of the most interpretable models -- that can be augmented with arbitrary fairness constraints. In order to better quantify the "price of interpretability", we also propose a new measure of model interpretability called decision complexity that allows for comparisons across different classes of machine learning models. We benchmark our method against state-of-the-art approaches for fair classification on popular datasets; in doing so, we conduct one of the first comprehensive analyses of the trade-offs between interpretability, fairness, and predictive accuracy. Given a fixed disparity threshold, our method has a price of interpretability of about 4.2 percentage points in terms of out-of-sample accuracy compared to the best performing, complex models. However, our method consistently finds decisions with almost full parity, while other methods rarely do.
    Multi-Grade Deep Learning. (arXiv:2302.00150v1 [cs.LG])
    The current deep learning model is of a single-grade, that is, it learns a deep neural network by solving a single nonconvex optimization problem. When the layer number of the neural network is large, it is computationally challenging to carry out such a task efficiently. Inspired by the human education process which arranges learning in grades, we propose a multi-grade learning model: We successively solve a number of optimization problems of small sizes, which are organized in grades, to learn a shallow neural network for each grade. Specifically, the current grade is to learn the leftover from the previous grade. In each of the grades, we learn a shallow neural network stacked on the top of the neural network, learned in the previous grades, which remains unchanged in training of the current and future grades. By dividing the task of learning a deep neural network into learning several shallow neural networks, one can alleviate the severity of the nonconvexity of the original optimization problem of a large size. When all grades of the learning are completed, the final neural network learned is a stair-shape neural network, which is the superposition of networks learned from all grades. Such a model enables us to learn a deep neural network much more effectively and efficiently. Moreover, multi-grade learning naturally leads to adaptive learning. We prove that in the context of function approximation if the neural network generated by a new grade is nontrivial, the optimal error of the grade is strictly reduced from the optimal error of the previous grade. Furthermore, we provide several proof-of-concept numerical examples which demonstrate that the proposed multi-grade model outperforms significantly the traditional single-grade model and is much more robust than the traditional model.
    FLSTRA: Federated Learning in Stratosphere. (arXiv:2302.00163v1 [cs.NI])
    We propose a federated learning (FL) in stratosphere (FLSTRA) system, where a high altitude platform station (HAPS) felicitates a large number of terrestrial clients to collaboratively learn a global model without sharing the training data. FLSTRA overcomes the challenges faced by FL in terrestrial networks, such as slow convergence and high communication delay due to limited client participation and multi-hop communications. HAPS leverages its altitude and size to allow the participation of more clients with line-of-sight (LoS) links and the placement of a powerful server. However, handling many clients at once introduces computing and transmission delays. Thus, we aim to obtain a delay-accuracy trade-off for FLSTRA. Specifically, we first develop a joint client selection and resource allocation algorithm for uplink and downlink to minimize the FL delay subject to the energy and quality-of-service (QoS) constraints. Second, we propose a communication and computation resource-aware (CCRA-FL) algorithm to achieve the target FL accuracy while deriving an upper bound for its convergence rate. The formulated problem is non-convex; thus, we propose an iterative algorithm to solve it. Simulation results demonstrate the effectiveness of the proposed FLSTRA system, compared to terrestrial benchmarks, in terms of FL delay and accuracy.
    Jointist: Simultaneous Improvement of Multi-instrument Transcription and Music Source Separation via Joint Training. (arXiv:2302.00286v1 [cs.SD])
    In this paper, we introduce Jointist, an instrument-aware multi-instrument framework that is capable of transcribing, recognizing, and separating multiple musical instruments from an audio clip. Jointist consists of an instrument recognition module that conditions the other two modules: a transcription module that outputs instrument-specific piano rolls, and a source separation module that utilizes instrument information and transcription results. The joint training of the transcription and source separation modules serves to improve the performance of both tasks. The instrument module is optional and can be directly controlled by human users. This makes Jointist a flexible user-controllable framework. Our challenging problem formulation makes the model highly useful in the real world given that modern popular music typically consists of multiple instruments. Its novelty, however, necessitates a new perspective on how to evaluate such a model. In our experiments, we assess the proposed model from various aspects, providing a new evaluation perspective for multi-instrument transcription. Our subjective listening study shows that Jointist achieves state-of-the-art performance on popular music, outperforming existing multi-instrument transcription models such as MT3. %We also argue that transcription models can be used as a preprocessing module for other music analysis tasks. We conducted experiments on several downstream tasks and found that the proposed method improved transcription by more than 1 percentage points (ppt.), source separation by 5 SDR, downbeat detection by 1.8 ppt., chord recognition by 1.4 ppt., and key estimation by 1.4 ppt., when utilizing transcription results obtained from Jointist.
    Epic-Sounds: A Large-scale Dataset of Actions That Sound. (arXiv:2302.00646v1 [cs.SD])
    We introduce EPIC-SOUNDS, a large-scale dataset of audio annotations capturing temporal extents and class labels within the audio stream of the egocentric videos. We propose an annotation pipeline where annotators temporally label distinguishable audio segments and describe the action that could have caused this sound. We identify actions that can be discriminated purely from audio, through grouping these free-form descriptions of audio into classes. For actions that involve objects colliding, we collect human annotations of the materials of these objects (e.g. a glass object being placed on a wooden surface), which we verify from visual labels, discarding ambiguities. Overall, EPIC-SOUNDS includes 78.4k categorised segments of audible events and actions, distributed across 44 classes as well as 39.2k non-categorised segments. We train and evaluate two state-of-the-art audio recognition models on our dataset, highlighting the importance of audio-only labels and the limitations of current models to recognise actions that sound.
    Transformers Meet Directed Graphs. (arXiv:2302.00049v1 [cs.LG])
    Transformers were originally proposed as a sequence-to-sequence model for text but have become vital for a wide range of modalities, including images, audio, video, and undirected graphs. However, transformers for directed graphs are a surprisingly underexplored topic, despite their applicability to ubiquitous domains including source code and logic circuits. In this work, we propose two direction- and structure-aware positional encodings for directed graphs: (1) the eigenvectors of the Magnetic Laplacian - a direction-aware generalization of the combinatorial Laplacian; (2) directional random walk encodings. Empirically, we show that the extra directionality information is useful in various downstream tasks, including correctness testing of sorting networks and source code understanding. Together with a data-flow-centric graph construction, our model outperforms the prior state of the art on the Open Graph Benchmark Code2 relatively by 14.7%.
    Quality Not Quantity: On the Interaction between Dataset Design and Robustness of CLIP. (arXiv:2208.05516v4 [cs.LG] UPDATED)
    Web-crawled datasets have enabled remarkable generalization capabilities in recent image-text models such as CLIP (Contrastive Language-Image pre-training) or Flamingo, but little is known about the dataset creation processes. In this work, we introduce a testbed of six publicly available data sources - YFCC, LAION, Conceptual Captions, WIT, RedCaps, Shutterstock - to investigate how pre-training distributions induce robustness in CLIP. We find that the performance of the pre-training data varies substantially across distribution shifts, with no single data source dominating. Moreover, we systematically study the interactions between these data sources and find that combining multiple sources does not necessarily yield better models, but rather dilutes the robustness of the best individual data source. We complement our empirical findings with theoretical insights from a simple setting, where combining the training data also results in diluted robustness. In addition, our theoretical model provides a candidate explanation for the success of the CLIP-based data filtering technique recently employed in the LAION dataset. Overall our results demonstrate that simply gathering a large amount of data from the web is not the most effective way to build a pre-training dataset for robust generalization, necessitating further study into dataset design. Code is available at https://github.com/mlfoundations/clip_quality_not_quantity.
    Distributed sequential federated learning. (arXiv:2302.00107v1 [stat.ML])
    The analysis of data stored in multiple sites has become more popular, raising new concerns about the security of data storage and communication. Federated learning, which does not require centralizing data, is a common approach to preventing heavy data transportation, securing valued data, and protecting personal information protection. Therefore, determining how to aggregate the information obtained from the analysis of data in separate local sites has become an important statistical issue. The commonly used averaging methods may not be suitable due to data nonhomogeneity and incomparable results among individual sites, and applying them may result in the loss of information obtained from the individual analyses. Using a sequential method in federated learning with distributed computing can facilitate the integration and accelerate the analysis process. We develop a data-driven method for efficiently and effectively aggregating valued information by analyzing local data without encountering potential issues such as information security and heavy transportation due to data communication. In addition, the proposed method can preserve the properties of classical sequential adaptive design, such as data-driven sample size and estimation precision when applied to generalized linear models. We use numerical studies of simulated data and an application to COVID-19 data collected from 32 hospitals in Mexico, to illustrate the proposed method.
    On the Within-Group Discrimination of Screening Classifiers. (arXiv:2302.00025v1 [cs.LG])
    Screening classifiers are increasingly used to identify qualified candidates in a variety of selection processes. In this context, it has been recently shown that, if a classifier is calibrated, one can identify the smallest set of candidates which contains, in expectation, a desired number of qualified candidates using a threshold decision rule. This lends support to focusing on calibration as the only requirement for screening classifiers. In this paper, we argue that screening policies that use calibrated classifiers may suffer from an understudied type of within-group discrimination -- they may discriminate against qualified members within demographic groups of interest. Further, we argue that this type of discrimination can be avoided if classifiers satisfy within-group monotonicity, a natural monotonicity property within each of the groups. Then, we introduce an efficient post-processing algorithm based on dynamic programming to minimally modify a given calibrated classifier so that its probability estimates satisfy within-group monotonicity. We validate our algorithm using US Census survey data and show that within-group monotonicity can be often achieved at a small cost in terms of prediction granularity and shortlist size.
    Towards Answering Open-ended Ethical Quandary Questions. (arXiv:2205.05989v3 [cs.CL] UPDATED)
    Considerable advancements have been made in various NLP tasks based on the impressive power of large language models (LLMs) and many NLP applications are deployed in our daily lives. In this work, we challenge the capability of LLMs with the new task of Ethical Quandary Generative Question Answering. Ethical quandary questions are more challenging to address because multiple conflicting answers may exist to a single quandary. We explore the current capability of LLMs in providing an answer with a deliberative exchange of different perspectives to an ethical quandary, in the approach of Socratic philosophy, instead of providing a closed answer like an oracle. We propose a model that searches for different ethical principles applicable to the ethical quandary and generates an answer conditioned on the chosen principles through prompt-based few-shot learning. We also discuss the remaining challenges and ethical issues involved in this task and suggest the direction toward developing responsible NLP systems by incorporating human values explicitly.
    HOAX: A Hyperparameter Optimization Algorithm Explorer for Neural Networks. (arXiv:2302.00374v1 [physics.chem-ph])
    Computational chemistry has become an important tool to predict and understand molecular properties and reactions. Even though recent years have seen a significant growth in new algorithms and computational methods that speed up quantum chemical calculations, the bottleneck for trajectory-based methods to study photoinduced processes is still the huge number of electronic structure calculations. In this work, we present an innovative solution, in which the amount of electronic structure calculations is drastically reduced, by employing machine learning algorithms and methods borrowed from the realm of artificial intelligence. However, applying these algorithms effectively requires finding optimal hyperparameters, which remains a challenge itself. Here we present an automated user-friendly framework, HOAX, to perform the hyperparameter optimization for neural networks, which bypasses the need for a lengthy manual process. The neural network generated potential energy surfaces (PESs) reduces the computational costs compared to the ab initio-based PESs. We perform a comparative investigation on the performance of different hyperparameter optimiziation algorithms, namely grid search, simulated annealing, genetic algorithm, and bayesian optimizer in finding the optimal hyperparameters necessary for constructing the well-performing neural network in order to fit the PESs of small organic molecules. Our results show that this automated toolkit not only facilitate a straightforward way to perform the hyperparameter optimization but also the resulting neural networks-based generated PESs are in reasonable agreement with the ab initio-based PESs.
    Generating High Fidelity Synthetic Data via Coreset selection and Entropic Regularization. (arXiv:2302.00138v1 [cs.LG])
    Generative models have the ability to synthesize data points drawn from the data distribution, however, not all generated samples are high quality. In this paper, we propose using a combination of coresets selection methods and ``entropic regularization'' to select the highest fidelity samples. We leverage an Energy-Based Model which resembles a variational auto-encoder with an inference and generator model for which the latent prior is complexified by an energy-based model. In a semi-supervised learning scenario, we show that augmenting the labeled data-set, by adding our selected subset of samples, leads to better accuracy improvement rather than using all the synthetic samples.
    Training Normalizing Flows with the Precision-Recall Divergence. (arXiv:2302.00628v1 [cs.LG])
    Generative models can have distinct mode of failures like mode dropping and low quality samples, which cannot be captured by a single scalar metric. To address this, recent works propose evaluating generative models using precision and recall, where precision measures quality of samples and recall measures the coverage of the target distribution. Although a variety of discrepancy measures between the target and estimated distribution are used to train generative models, it is unclear what precision-recall trade-offs are achieved by various choices of the discrepancy measures. In this paper, we show that achieving a specified precision-recall trade-off corresponds to minimising -divergences from a family we call the {\em PR-divergences }. Conversely, any -divergence can be written as a linear combination of PR-divergences and therefore correspond to minimising a weighted precision-recall trade-off. Further, we propose a novel generative model that is able to train a normalizing flow to minimise any -divergence, and in particular, achieve a given precision-recall trade-off.
    Automatically Marginalized MCMC in Probabilistic Programming. (arXiv:2302.00564v1 [cs.LG])
    Hamiltonian Monte Carlo (HMC) is a powerful algorithm to sample latent variables from Bayesian models. The advent of probabilistic programming languages (PPLs) frees users from writing inference algorithms and lets users focus on modeling. However, many models are difficult for HMC to solve directly, which often require tricks like model reparameterization. We are motivated by the fact that many of those models could be simplified by marginalization. We propose to use automatic marginalization as part of the sampling process using HMC in a graphical model extracted from a PPL, which substantially improves sampling from real-world hierarchical models.
    QLAB: Quadratic Loss Approximation-Based Optimal Learning Rate for Deep Learning. (arXiv:2302.00252v1 [cs.LG])
    We propose a learning rate adaptation scheme, called QLAB, for descent optimizers. We derive QLAB by optimizing the quadratic approximation of the loss function and QLAB can be combined with any optimizer who can provide the descent update direction. The computation of an adaptive learning rate with QLAB requires only computing an extra loss function value. We theoretically prove the convergence of the descent optimizers with QLAB. We demonstrate the effectiveness of QLAB in a range of optimization problems by combining with conclusively stochastic gradient descent, stochastic gradient descent with momentum, and Adam. The performance is validated on multi-layer neural networks, CNN, VGG-Net, ResNet and ShuffleNet with two datasets, MNIST and CIFAR10.
    Simplicity Bias in 1-Hidden Layer Neural Networks. (arXiv:2302.00457v1 [cs.LG])
    Recent works have demonstrated that neural networks exhibit extreme simplicity bias(SB). That is, they learn only the simplest features to solve a task at hand, even in the presence of other, more robust but more complex features. Due to the lack of a general and rigorous definition of features, these works showcase SB on semi-synthetic datasets such as Color-MNIST, MNIST-CIFAR where defining features is relatively easier. In this work, we rigorously define as well as thoroughly establish SB for one hidden layer neural networks. More concretely, (i) we define SB as the network essentially being a function of a low dimensional projection of the inputs (ii) theoretically, we show that when the data is linearly separable, the network primarily depends on only the linearly separable ($1$-dimensional) subspace even in the presence of an arbitrarily large number of other, more complex features which could have led to a significantly more robust classifier, (iii) empirically, we show that models trained on real datasets such as Imagenette and Waterbirds-Landbirds indeed depend on a low dimensional projection of the inputs, thereby demonstrating SB on these datasets, iv) finally, we present a natural ensemble approach that encourages diversity in models by training successive models on features not used by earlier models, and demonstrate that it yields models that are significantly more robust to Gaussian noise.
    Electrode Selection for Noninvasive Fetal Electrocardiogram Extraction using Mutual Information Criteria. (arXiv:2302.00206v1 [eess.SP])
    Blind source separation (BSS) techniques have revealed to be promising approaches for, among other, biomedical signal processing applications. Specifically, for the noninvasive extraction of fetal cardiac signals from maternal abdominal recordings, where conventional filtering schemes have failed to extract the complete fetal ECG components. From previous studies, it is now believed that a carefully selected array of electrodes well-placed over the abdomen of a pregnant woman contains the required `information' for BSS, to extract the complete fetal components. Based on this idea, in previous works array recording systems and sensor selection strategies based on the Mutual Information (MI) criterion have been developed. In this paper the previous works have been extended, by considering the 3-dimensional aspects of the cardiac electrical activity. The proposed method has been tested on simulated and real maternal abdominal recordings. The results show that the new sensor selection strategy together with the MI criterion, can be effectively used to select the channels containing the most `information' concerning the fetal ECG components from an array of 72 recordings. The method is hence believed to be useful for the selection of the most informative channels in online applications, considering the different fetal positions and movements.
    Iterative Deepening Hyperband. (arXiv:2302.00511v1 [cs.LG])
    Hyperparameter optimization (HPO) is concerned with the automated search for the most appropriate hyperparameter configuration (HPC) of a parameterized machine learning algorithm. A state-of-the-art HPO method is Hyperband, which, however, has its own parameters that influence its performance. One of these parameters, the maximal budget, is especially problematic: If chosen too small, the budget needs to be increased in hindsight and, as Hyperband is not incremental by design, the entire algorithm must be re-run. This is not only costly but also comes with a loss of valuable knowledge already accumulated. In this paper, we propose incremental variants of Hyperband that eliminate these drawbacks, and show that these variants satisfy theoretical guarantees qualitatively similar to those for the original Hyperband with the "right" budget. Moreover, we demonstrate their practical utility in experiments with benchmark data sets.
    Decompositional Generation Process for Instance-Dependent Partial Label Learning. (arXiv:2204.03845v3 [cs.LG] UPDATED)
    Partial label learning (PLL) is a typical weakly supervised learning problem, where each training example is associated with a set of candidate labels among which only one is true. Most existing PLL approaches assume that the incorrect labels in each training example are randomly picked as the candidate labels and model the generation process of the candidate labels in a simple way. However, these approaches usually do not perform as well as expected due to the fact that the generation process of the candidate labels is always instance-dependent. Therefore, it deserves to be modeled in a refined way. In this paper, we consider instance-dependent PLL and assume that the generation process of the candidate labels could decompose into two sequential parts, where the correct label emerges first in the mind of the annotator but then the incorrect labels related to the feature are also selected with the correct label as candidate labels due to uncertainty of labeling. Motivated by this consideration, we propose a novel PLL method that performs Maximum A Posterior (MAP) based on an explicitly modeled generation process of candidate labels via decomposed probability distribution models. Extensive experiments on manually corrupted benchmark datasets and real-world datasets validate the effectiveness of the proposed method. Source code is available at https://github.com/palm-ml/idgp.
    SPIDE: A Purely Spike-based Method for Training Feedback Spiking Neural Networks. (arXiv:2302.00232v1 [cs.NE])
    Spiking neural networks (SNNs) with event-based computation are promising brain-inspired models for energy-efficient applications on neuromorphic hardware. However, most supervised SNN training methods, such as conversion from artificial neural networks or direct training with surrogate gradients, require complex computation rather than spike-based operations of spiking neurons during training. In this paper, we study spike-based implicit differentiation on the equilibrium state (SPIDE) that extends the recently proposed training method, implicit differentiation on the equilibrium state (IDE), for supervised learning with purely spike-based computation, which demonstrates the potential for energy-efficient training of SNNs. Specifically, we introduce ternary spiking neuron couples and prove that implicit differentiation can be solved by spikes based on this design, so the whole training procedure, including both forward and backward passes, is made as event-driven spike computation, and weights are updated locally with two-stage average firing rates. Then we propose to modify the reset membrane potential to reduce the approximation error of spikes. With these key components, we can train SNNs with flexible structures in a small number of time steps and with firing sparsity during training, and the theoretical estimation of energy costs demonstrates the potential for high efficiency. Meanwhile, experiments show that even with these constraints, our trained models can still achieve competitive results on MNIST, CIFAR-10, CIFAR-100, and CIFAR10-DVS. Our code is available at https://github.com/pkuxmq/SPIDE-FSNN.
    Dictionary-based Manifold Learning. (arXiv:2302.00263v1 [cs.LG])
    We propose a paradigm for interpretable Manifold Learning for scientific data analysis, whereby we parametrize a manifold with $d$ smooth functions from a scientist-provided dictionary of meaningful, domain-related functions. When such a parametrization exists, we provide an algorithm for finding it based on sparse non-linear regression in the manifold tangent bundle, bypassing more standard manifold learning algorithms. We also discuss conditions for the existence of such parameterizations in function space and for successful recovery from finite samples. We demonstrate our method with experimental results from a real scientific domain.
    Sequential Predictive Conformal Inference for Time Series. (arXiv:2212.03463v2 [stat.ML] UPDATED)
    We present a new distribution-free conformal prediction algorithm for sequential data (e.g., time series), called the \textit{sequential predictive conformal inference} (\texttt{SPCI}). We specifically account for the nature that time series data are non-exchangeable, and thus many existing conformal prediction algorithms are not applicable. The main idea is to exploit the temporal dependence of non-conformity scores (e.g., prediction residuals); thus, the past residuals contain information about future ones. Then we cast the problem of conformal prediction interval as predicting the quantile of a future residual, given a user-specified point prediction algorithm. Theoretically, we establish asymptotic valid conditional coverage upon extending consistency analyses in quantile regression. Using simulation and real-data experiments, we demonstrate a significant reduction in interval width of \texttt{SPCI} compared to other existing methods under the desired empirical coverage.
    End-to-End Full-Atom Antibody Design. (arXiv:2302.00203v1 [q-bio.BM])
    Antibody design is an essential yet challenging task in various domains like therapeutics and biology. There are two major defects in current learning-based methods: 1) tackling only a certain subtask of the whole antibody design pipeline, making them suboptimal or resource-intensive. 2) omitting either the framework regions or side chains, thus incapable of capturing the full-atom geometry. To address these pitfalls, we propose dynamic Multi-channel Equivariant grAph Network (dyMEAN), an end-to-end full-atom model for E(3)-equivariant antibody design given the epitope and the incomplete sequence of the antibody. Specifically, we first explore structural initialization as a knowledgeable guess of the antibody structure and then propose shadow paratope to bridge the epitope-antibody connections. Both 1D sequences and 3D structures are updated via an adaptive multi-channel equivariant encoder that is able to process protein residues of variable sizes when considering full atoms. Finally, the updated antibody is docked to the epitope via the alignment of the shadow paratope. Experiments on epitope-binding CDR-H3 design, complex structure prediction, and affinity optimization demonstrate the superiority of our end-to-end framework and full-atom modeling.
    Whats Missing? Learning Hidden Markov Models When the Locations of Missing Observations are Unknown. (arXiv:2203.06527v2 [stat.ML] UPDATED)
    The Hidden Markov Model (HMM) is one of the most widely used statistical models for sequential data analysis, and it has been successfully applied in a large variety of domains. One of the key reasons for this versatility is the ability of HMMs to deal with missing data. However, standard HMM learning algorithms rely crucially on the assumption that the positions of the missing observations within the observation sequence are known. In some situations where such assumptions are not feasible, a number of special algorithms have been developed. Currently, these algorithms rely strongly on specific structural assumptions of the underlying chain, such as acyclicity, and are not applicable in the general case. In particular, there are numerous domains within medicine and computational biology, where the missing observation locations are unknown and acyclicity assumptions do not hold, thus presenting a barrier for the application of HMMs in those fields. In this paper we consider a general problem of learning HMMs from data with unknown missing observation locations (i.e., only the order of the non-missing observations are known). We introduce a generative model of the location omissions and propose two learning methods for this model, a (semi) analytic approach, and a Gibbs sampler. We evaluate and compare the algorithms in a variety of scenarios, measuring their reconstruction precision and robustness under model misspecification.
    Width and Depth Limits Commute in Residual Networks. (arXiv:2302.00453v1 [stat.ML])
    We show that taking the width and depth to infinity in a deep neural network with skip connections, when branches are scaled by $1/\sqrt{depth}$ (the only nontrivial scaling), result in the same covariance structure no matter how that limit is taken. This explains why the standard infinite-width-then-depth approach provides practical insights even for networks with depth of the same order as width. We also demonstrate that the pre-activations, in this case, have Gaussian distributions which has direct applications in Bayesian deep learning. We conduct extensive simulations that show an excellent match with our theoretical findings.
    Exploring Semantic Perturbations on Grover. (arXiv:2302.00509v1 [cs.LG])
    With news and information being as easy to access as they currently are, it is more important than ever to ensure that people are not mislead by what they read. Recently, the rise of neural fake news (AI-generated fake news) and its demonstrated effectiveness at fooling humans has prompted the development of models to detect it. One such model is the Grover model, which can both detect neural fake news to prevent it, and generate it to demonstrate how a model could be misused to fool human readers. In this work we explore the Grover model's fake news detection capabilities by performing targeted attacks through perturbations on input news articles. Through this we test Grover's resilience to these adversarial attacks and expose some potential vulnerabilities which should be addressed in further iterations to ensure it can detect all types of fake news accurately.
    Gradient Descent in Neural Networks as Sequential Learning in RKBS. (arXiv:2302.00205v1 [stat.ML])
    The study of Neural Tangent Kernels (NTKs) has provided much needed insight into convergence and generalization properties of neural networks in the over-parametrized (wide) limit by approximating the network using a first-order Taylor expansion with respect to its weights in the neighborhood of their initialization values. This allows neural network training to be analyzed from the perspective of reproducing kernel Hilbert spaces (RKHS), which is informative in the over-parametrized regime, but a poor approximation for narrower networks as the weights change more during training. Our goal is to extend beyond the limits of NTK toward a more general theory. We construct an exact power-series representation of the neural network in a finite neighborhood of the initial weights as an inner product of two feature maps, respectively from data and weight-step space, to feature space, allowing neural network training to be analyzed from the perspective of reproducing kernel {\em Banach} space (RKBS). We prove that, regardless of width, the training sequence produced by gradient descent can be exactly replicated by regularized sequential learning in RKBS. Using this, we present novel bound on uniform convergence where the iterations count and learning rate play a central role, giving new theoretical insight into neural network training.
    Molecular Graph Generation by Decomposition and Reassembling. (arXiv:2302.00587v1 [q-bio.BM])
    Designing molecular structures with desired chemical properties is an essential task in drug discovery and material design. However, finding molecules with the optimized desired properties is still a challenging task due to combinatorial explosion of candidate space of molecules. Here we propose a novel \emph{decomposition-and-reassembling} based approach, which does not include any optimization in hidden space and our generation process is highly interpretable. Our method is a two-step procedure: In the first decomposition step, we apply frequent subgraph mining to a molecular database to collect smaller size of subgraphs as building blocks of molecules. In the second reassembling step, we search desirable building blocks guided via reinforcement learning and combine them to generate new molecules. Our experiments show that not only can our method find better molecules in terms of two standard criteria, the penalized $\log P$ and drug-likeness, but also generate drug molecules with showing the valid intermediate molecules.
    Physics-informed Reduced-Order Learning from the First Principles for Simulation of Quantum Nanostructures. (arXiv:2302.00100v1 [cs.CE])
    Multi-dimensional direct numerical simulation (DNS) of the Schr\"odinger equation is needed for design and analysis of quantum nanostructures that offer numerous applications in biology, medicine, materials, electronic/photonic devices, etc. In large-scale nanostructures, extensive computational effort needed in DNS may become prohibitive due to the high degrees of freedom (DoF). This study employs a reduced-order learning algorithm, enabled by the first principles, for simulation of the Schr\"odinger equation to achieve high accuracy and efficiency. The proposed simulation methodology is applied to investigate two quantum-dot structures; one operates under external electric field, and the other is influenced by internal potential variation with periodic boundary conditions. The former is similar to typical operations of nanoelectronic devices, and the latter is of interest to simulation and design of nanostructures and materials, such as applications of density functional theory. Using the proposed methodology, a very accurate prediction can be realized with a reduction in the DoF by more than 3 orders of magnitude and in the computational time by 2 orders, compared to DNS. The proposed physics-informed learning methodology is also able to offer an accurate prediction beyond the training conditions, including higher external field and larger internal potential in untrained quantum states.
    Learning Functional Transduction. (arXiv:2302.00328v1 [cs.LG])
    Research in Machine Learning has polarized into two general regression approaches: Transductive methods derive estimates directly from available data but are usually problem unspecific. Inductive methods can be much more particular, but generally require tuning and compute-intensive searches for solutions. In this work, we adopt a hybrid approach: We leverage the theory of Reproducing Kernel Banach Spaces (RKBS) and show that transductive principles can be induced through gradient descent to form efficient \textit{in-context} neural approximators. We apply this approach to RKBS of function-valued operators and show that once trained, our \textit{Transducer} model can capture on-the-fly relationships between infinite-dimensional input and output functions, given a few example pairs, and return new function estimates. We demonstrate the benefit of our transductive approach to model complex physical systems influenced by varying external factors with little data at a fraction of the usual deep learning training computation cost for partial differential equations and climate modeling applications.
    Diffusion-based Image Translation using Disentangled Style and Content Representation. (arXiv:2209.15264v2 [cs.CV] UPDATED)
    Diffusion-based image translation guided by semantic texts or a single target image has enabled flexible style transfer which is not limited to the specific domains. Unfortunately, due to the stochastic nature of diffusion models, it is often difficult to maintain the original content of the image during the reverse diffusion. To address this, here we present a novel diffusion-based unsupervised image translation method using disentangled style and content representation. Specifically, inspired by the splicing Vision Transformer, we extract intermediate keys of multihead self attention layer from ViT model and used them as the content preservation loss. Then, an image guided style transfer is performed by matching the [CLS] classification token from the denoised samples and target image, whereas additional CLIP loss is used for the text-driven style transfer. To further accelerate the semantic change during the reverse diffusion, we also propose a novel semantic divergence loss and resampling strategy. Our experimental results show that the proposed method outperforms state-of-the-art baseline models in both text-guided and image-guided translation tasks.
    Learning noisy-OR Bayesian Networks with Max-Product Belief Propagation. (arXiv:2302.00099v1 [cs.LG])
    Noisy-OR Bayesian Networks (BNs) are a family of probabilistic graphical models which express rich statistical dependencies in binary data. Variational inference (VI) has been the main method proposed to learn noisy-OR BNs with complex latent structures (Jaakkola & Jordan, 1999; Ji et al., 2020; Buhai et al., 2020). However, the proposed VI approaches either (a) use a recognition network with standard amortized inference that cannot induce ``explaining-away''; or (b) assume a simple mean-field (MF) posterior which is vulnerable to bad local optima. Existing MF VI methods also update the MF parameters sequentially which makes them inherently slow. In this paper, we propose parallel max-product as an alternative algorithm for learning noisy-OR BNs with complex latent structures and we derive a fast stochastic training scheme that scales to large datasets. We evaluate both approaches on several benchmarks where VI is the state-of-the-art and show that our method (a) achieves better test performance than Ji et al. (2020) for learning noisy-OR BNs with hierarchical latent structures on large sparse real datasets; (b) recovers a higher number of ground truth parameters than Buhai et al. (2020) from cluttered synthetic scenes; and (c) solves the 2D blind deconvolution problem from Lazaro-Gredilla et al. (2021) and variant - including binary matrix factorization - while VI catastrophically fails and is up to two orders of magnitude slower.
    Filtering Context Mitigates Scarcity and Selection Bias in Political Ideology Prediction. (arXiv:2302.00239v1 [cs.LG])
    We propose a novel supervised learning approach for political ideology prediction (PIP) that is capable of predicting out-of-distribution inputs. This problem is motivated by the fact that manual data-labeling is expensive, while self-reported labels are often scarce and exhibit significant selection bias. We propose a novel statistical model that decomposes the document embeddings into a linear superposition of two vectors; a latent neutral \emph{context} vector independent of ideology, and a latent \emph{position} vector aligned with ideology. We train an end-to-end model that has intermediate contextual and positional vectors as outputs. At deployment time, our model predicts labels for input documents by exclusively leveraging the predicted positional vectors. On two benchmark datasets we show that our model is capable of outputting predictions even when trained with as little as 5\% biased data, and is significantly more accurate than the state-of-the-art. Through crowd-sourcing we validate the neutrality of contextual vectors, and show that context filtering results in ideological concentration, allowing for prediction on out-of-distribution examples.
    How to select predictive models for causal inference?. (arXiv:2302.00370v1 [stat.ML])
    Predictive models -- as with machine learning -- can underpin causal inference, to estimate the effects of an intervention at the population or individual level. This opens the door to a plethora of models, useful to match the increasing complexity of health data, but also the Pandora box of model selection: which of these models yield the most valid causal estimates? Classic machine-learning cross-validation procedures are not directly applicable. Indeed, an appropriate selection procedure for causal inference should equally weight both outcome errors for each individual, treated or not treated, whereas one outcome may be seldom observed for a sub-population. We study how more elaborate risks benefit causal model selection. We show theoretically that simple risks are brittle to weak overlap between treated and non-treated individuals as well as to heterogeneous errors between populations. Rather a more elaborate metric, the R-risk appears as a proxy of the oracle error on causal estimates, observable at the cost of an overlap re-weighting. As the R-risk is defined not only from model predictions but also by using the conditional mean outcome and the treatment probability, using it for model selection requires adapting cross validation. Extensive experiments show that the resulting procedure gives the best causal model selection.
    Personalized Privacy Auditing and Optimization at Test Time. (arXiv:2302.00077v1 [cs.LG])
    A number of learning models used in consequential domains, such as to assist in legal, banking, hiring, and healthcare decisions, make use of potentially sensitive users' information to carry out inference. Further, the complete set of features is typically required to perform inference. This not only poses severe privacy risks for the individuals using the learning systems, but also requires companies and organizations massive human efforts to verify the correctness of the released information. This paper asks whether it is necessary to require \emph{all} input features for a model to return accurate predictions at test time and shows that, under a personalized setting, each individual may need to release only a small subset of these features without impacting the final decisions. The paper also provides an efficient sequential algorithm that chooses which attributes should be provided by each individual. Evaluation over several learning tasks shows that individuals may be able to report as little as 10\% of their information to ensure the same level of accuracy of a model that uses the complete users' information.
    Mind the (optimality) Gap: A Gap-Aware Learning Rate Scheduler for Adversarial Nets. (arXiv:2302.00089v1 [cs.LG])
    Adversarial nets have proved to be powerful in various domains including generative modeling (GANs), transfer learning, and fairness. However, successfully training adversarial nets using first-order methods remains a major challenge. Typically, careful choices of the learning rates are needed to maintain the delicate balance between the competing networks. In this paper, we design a novel learning rate scheduler that dynamically adapts the learning rate of the adversary to maintain the right balance. The scheduler is driven by the fact that the loss of an ideal adversarial net is a constant known a priori. The scheduler is thus designed to keep the loss of the optimized adversarial net close to that of an ideal network. We run large-scale experiments to study the effectiveness of the scheduler on two popular applications: GANs for image generation and adversarial nets for domain adaptation. Our experiments indicate that adversarial nets trained with the scheduler are less likely to diverge and require significantly less tuning. For example, on CelebA, a GAN with the scheduler requires only one-tenth of the tuning budget needed without a scheduler. Moreover, the scheduler leads to statistically significant improvements in model quality, reaching up to $27\%$ in Frechet Inception Distance for image generation and $3\%$ in test accuracy for domain adaptation.
    Neuromechanical Autoencoders: Learning to Couple Elastic and Neural Network Nonlinearity. (arXiv:2302.00032v1 [cs.LG])
    Intelligent biological systems are characterized by their embodiment in a complex environment and the intimate interplay between their nervous systems and the nonlinear mechanical properties of their bodies. This coordination, in which the dynamics of the motor system co-evolved to reduce the computational burden on the brain, is referred to as ``mechanical intelligence'' or ``morphological computation''. In this work, we seek to develop machine learning analogs of this process, in which we jointly learn the morphology of complex nonlinear elastic solids along with a deep neural network to control it. By using a specialized differentiable simulator of elastic mechanics coupled to conventional deep learning architectures -- which we refer to as neuromechanical autoencoders -- we are able to learn to perform morphological computation via gradient descent. Key to our approach is the use of mechanical metamaterials -- cellular solids, in particular -- as the morphological substrate. Just as deep neural networks provide flexible and massively-parametric function approximators for perceptual and control tasks, cellular solid metamaterials are promising as a rich and learnable space for approximating a variety of actuation tasks. In this work we take advantage of these complementary computational concepts to co-design materials and neural network controls to achieve nonintuitive mechanical behavior. We demonstrate in simulation how it is possible to achieve translation, rotation, and shape matching, as well as a ``digital MNIST'' task. We additionally manufacture and evaluate one of the designs to verify its real-world behavior.
    Hierarchical Classification of Research Fields in the "Web of Science" Using Deep Learning. (arXiv:2302.00390v1 [cs.DL])
    The scholarly publication space is growing steadily not just in numbers but also in complexity due to collaboration between individuals from within and across fields of research. This paper presents a hierarchical classification system that automatically categorizes a scholarly publication using its abstract into a three-tier hierarchical label set of fields (discipline-field-subfield). This system enables a holistic view about the interdependence of research activities in the mentioned hierarchical tiers in terms of knowledge production through articles and impact through citations. The classification system (44 disciplines - 738 fields - 1,501 subfields) utilizes and is able to cope with 160 million abstract snippets in Microsoft Academic Graph (Version 2018-05-17) using batch training in a modularized and distributed fashion to address and assess interdisciplinarity and inter-field classifications. In addition, we have explored multi-class classifications in both the single-label and multi-label settings. In total, we have conducted 3,140 experiments, in all models (Convolutional Neural Networks, Recurrent Neural Networks, Transformers), the classification accuracy is > 90% in 77.84% and 78.83% of the single-label and multi-label classifications, respectively. We examine the advantages of our classification by its ability to better align research texts and output with disciplines, to adequately classify them in an automated way, as well as to capture the degree of interdisciplinarity in a publication which enables downstream analytics such as field interdisciplinarity. This system (a set of pretrained models) can serve as a backbone to an interactive system of indexing scientific publications.
    Student-centric Model of Learning Management System Activity and Academic Performance: from Correlation to Causation. (arXiv:2210.15430v2 [cs.CY] UPDATED)
    In recent years, there is a lot of interest in modeling students' digital traces in Learning Management System (LMS) to understand students' learning behavior patterns including aspects of meta-cognition and self-regulation, with the ultimate goal to turn those insights into actionable information to support students to improve their learning outcomes. In achieving this goal, however, there are two main issues that need to be addressed given the existing literature. Firstly, most of the current work is course-centered (i.e. models are built from data for a specific course) rather than student-centered; secondly, a vast majority of the models are correlational rather than causal. Those issues make it challenging to identify the most promising actionable factors for intervention at the student level where most of the campus-wide academic support is designed for. In this paper, we explored a student-centric analytical framework for LMS activity data that can provide not only correlational but causal insights mined from observational data. We demonstrated this approach using a dataset of 1651 computing major students at a public university in the US during one semester in the Fall of 2019. This dataset includes students' fine-grained LMS interaction logs and administrative data, e.g. demographics and academic performance. In addition, we expand the repository of LMS behavior indicators to include those that can characterize the time-of-the-day of login (e.g. chronotype). Our analysis showed that student login volume, compared with other login behavior indicators, is both strongly correlated and causally linked to student academic performance, especially among students with low academic performance. We envision that those insights will provide convincing evidence for college student support groups to launch student-centered and targeted interventions that are effective and scalable.
    Effectiveness of Moving Target Defenses for Adversarial Attacks in ML-based Malware Detection. (arXiv:2302.00537v1 [cs.LG])
    Several moving target defenses (MTDs) to counter adversarial ML attacks have been proposed in recent years. MTDs claim to increase the difficulty for the attacker in conducting attacks by regularly changing certain elements of the defense, such as cycling through configurations. To examine these claims, we study for the first time the effectiveness of several recent MTDs for adversarial ML attacks applied to the malware detection domain. Under different threat models, we show that transferability and query attack strategies can achieve high levels of evasion against these defenses through existing and novel attack strategies across Android and Windows. We also show that fingerprinting and reconnaissance are possible and demonstrate how attackers may obtain critical defense hyperparameters as well as information about how predictions are produced. Based on our findings, we present key recommendations for future work on the development of effective MTDs for adversarial attacks in ML-based malware detection.
    Learning Generalized Zero-Shot Learners for Open-Domain Image Geolocalization. (arXiv:2302.00275v1 [cs.CV])
    Image geolocalization is the challenging task of predicting the geographic coordinates of origin for a given photo. It is an unsolved problem relying on the ability to combine visual clues with general knowledge about the world to make accurate predictions across geographies. We present $\href{https://huggingface.co/geolocal/StreetCLIP}{\text{StreetCLIP}}$, a robust, publicly available foundation model not only achieving state-of-the-art performance on multiple open-domain image geolocalization benchmarks but also doing so in a zero-shot setting, outperforming supervised models trained on more than 4 million images. Our method introduces a meta-learning approach for generalized zero-shot learning by pretraining CLIP from synthetic captions, grounding CLIP in a domain of choice. We show that our method effectively transfers CLIP's generalized zero-shot capabilities to the domain of image geolocalization, improving in-domain generalized zero-shot performance without finetuning StreetCLIP on a fixed set of classes.
    Fast Sampling of Diffusion Models via Operator Learning. (arXiv:2211.13449v2 [cs.LG] UPDATED)
    Diffusion models have found widespread adoption in various areas. However, their sampling process is slow because it requires hundreds to thousands of network evaluations to emulate a continuous process defined by differential equations. In this work, we use neural operators, an efficient method to solve the probability flow differential equations, to accelerate the sampling process of diffusion models. Compared to other fast sampling methods that have a sequential nature, we are the first to propose parallel decoding method that generates images with only one model forward pass. We propose \textit{diffusion model sampling with neural operator} (DSNO) that maps the initial condition, i.e., Gaussian distribution, to the continuous-time solution trajectory of the reverse diffusion process. To model the temporal correlations along the trajectory, we introduce temporal convolution layers that are parameterized in the Fourier space into the given diffusion model backbone. We show our method achieves state-of-the-art FID of 4.12 for CIFAR-10 and 8.35 for ImageNet-64 in the one-model-evaluation setting.
    Predicting CSI Sequences With Attention-Based Neural Networks. (arXiv:2302.00341v1 [stat.ML])
    In this work, we consider the problem of multi-step channel prediction in wireless communication systems. In existing works, autoregressive (AR) models are either replaced or combined with feed-forward neural networks(NNs) or, alternatively, with recurrent neural networks (RNNs). This paper explores the possibility of using sequence-to-sequence (Seq2Seq) and transformer neural network (TNN) models for channel state information (CSI) prediction. Simulation results show that both, Seq2Seq and TNNs, represent an appealing alternative to RNNs and feed-forward NNs in the context of CSI prediction. Additionally, the TNN with a few adaptations can extrapolate better than other models to CSI sequences that are either shorter or longer than the ones the model saw during training.
    Distributed Traffic Synthesis and Classification in Edge Networks: A Federated Self-supervised Learning Approach. (arXiv:2302.00207v1 [cs.LG])
    With the rising demand for wireless services and increased awareness of the need for data protection, existing network traffic analysis and management architectures are facing unprecedented challenges in classifying and synthesizing the increasingly diverse services and applications. This paper proposes FS-GAN, a federated self-supervised learning framework to support automatic traffic analysis and synthesis over a large number of heterogeneous datasets. FS-GAN is composed of multiple distributed Generative Adversarial Networks (GANs), with a set of generators, each being designed to generate synthesized data samples following the distribution of an individual service traffic, and each discriminator being trained to differentiate the synthesized data samples and the real data samples of a local dataset. A federated learning-based framework is adopted to coordinate local model training processes of different GANs across different datasets. FS-GAN can classify data of unknown types of service and create synthetic samples that capture the traffic distribution of the unknown types. We prove that FS-GAN can minimize the Jensen-Shannon Divergence (JSD) between the distribution of real data across all the datasets and that of the synthesized data samples. FS-GAN also maximizes the JSD among the distributions of data samples created by different generators, resulting in each generator producing synthetic data samples that follow the same distribution as one particular service type. Extensive simulation results show that the classification accuracy of FS-GAN achieves over 20% improvement in average compared to the state-of-the-art clustering-based traffic analysis algorithms. FS-GAN also has the capability to synthesize highly complex mixtures of traffic types without requiring any human-labeled data samples.
    Learning Topology-Preserving Data Representations. (arXiv:2302.00136v1 [cs.LG])
    We propose a method for learning topology-preserving data representations (dimensionality reduction). The method aims to provide topological similarity between the data manifold and its latent representation via enforcing the similarity in topological features (clusters, loops, 2D voids, etc.) and their localization. The core of the method is the minimization of the Representation Topology Divergence (RTD) between original high-dimensional data and low-dimensional representation in latent space. RTD minimization provides closeness in topological features with strong theoretical guarantees. We develop a scheme for RTD differentiation and apply it as a loss term for the autoencoder. The proposed method ``RTD-AE'' better preserves the global structure and topology of the data manifold than state-of-the-art competitors as measured by linear correlation, triplet distance ranking accuracy, and Wasserstein distance between persistence barcodes.
    Learning to be Fair: A Consequentialist Approach to Equitable Decision-Making. (arXiv:2109.08792v3 [cs.LG] UPDATED)
    In the dominant paradigm for designing equitable machine learning systems, one works to ensure that model predictions satisfy various fairness criteria, such as parity in error rates across race, gender, and other legally protected traits. That approach, however, typically ignores the downstream decisions and outcomes that predictions affect, and, as a result, can induce unexpected harms. Here we present an alternative framework for fairness that directly anticipates the consequences of decisions. Stakeholders first specify preferences over the possible outcomes of an algorithmically informed decision-making process. For example, lenders may prefer extending credit to those most likely to repay a loan, while also preferring similar lending rates across neighborhoods. One then searches the space of decision policies to maximize the specified utility. We develop and describe a method for efficiently learning these optimal policies from data for a large family of expressive utility functions, facilitating a more holistic approach to equitable decision-making.
    A Nearly-Optimal Bound for Fast Regression with $\ell_\infty$ Guarantee. (arXiv:2302.00248v1 [cs.DS])
    Given a matrix $A\in \mathbb{R}^{n\times d}$ and a vector $b\in \mathbb{R}^n$, we consider the regression problem with $\ell_\infty$ guarantees: finding a vector $x'\in \mathbb{R}^d$ such that $ \|x'-x^*\|_\infty \leq \frac{\epsilon}{\sqrt{d}}\cdot \|Ax^*-b\|_2\cdot \|A^\dagger\|$ where $x^*=\arg\min_{x\in \mathbb{R}^d}\|Ax-b\|_2$. One popular approach for solving such $\ell_2$ regression problem is via sketching: picking a structured random matrix $S\in \mathbb{R}^{m\times n}$ with $m\ll n$ and $SA$ can be quickly computed, solve the ``sketched'' regression problem $\arg\min_{x\in \mathbb{R}^d} \|SAx-Sb\|_2$. In this paper, we show that in order to obtain such $\ell_\infty$ guarantee for $\ell_2$ regression, one has to use sketching matrices that are dense. To the best of our knowledge, this is the first user case in which dense sketching matrices are necessary. On the algorithmic side, we prove that there exists a distribution of dense sketching matrices with $m=\epsilon^{-2}d\log^3(n/\delta)$ such that solving the sketched regression problem gives the $\ell_\infty$ guarantee, with probability at least $1-\delta$. Moreover, the matrix $SA$ can be computed in time $O(nd\log n)$. Our row count is nearly-optimal up to logarithmic factors, and significantly improves the result in [Price, Song and Woodruff, ICALP'17], in which a super-linear in $d$ rows, $m=\Omega(\epsilon^{-2}d^{1+\gamma})$ for $\gamma=\Theta(\sqrt{\frac{\log\log n}{\log d}})$ is required. We also develop a novel analytical framework for $\ell_\infty$ guarantee regression that utilizes the Oblivious Coordinate-wise Embedding (OCE) property introduced in [Song and Yu, ICML'21]. Our analysis is arguably much simpler and more general than [Price, Song and Woodruff, ICALP'17], and it extends to dense sketches for tensor product of vectors.
    Homotopy-based training of NeuralODEs for accurate dynamics discovery. (arXiv:2210.01407v3 [cs.LG] UPDATED)
    Conceptually, Neural Ordinary Differential Equations (NeuralODEs) pose an attractive way to extract dynamical laws from time series data, as they are natural extensions of the traditional differential equation-based modeling paradigm of the physical sciences. In practice, NeuralODEs display long training times and suboptimal results, especially for longer duration data where they may fail to fit the data altogether. While methods have been proposed to stabilize NeuralODE training, many of these involve placing a strong constraint on the functional form the trained NeuralODE can take that the actual underlying governing equation does not guarantee satisfaction. In this work, we present a novel NeuralODE training algorithm that leverages tools from the chaos and mathematical optimization communities - synchronization and homotopy optimization - for a breakthrough in tackling the NeuralODE training obstacle. We demonstrate architectural changes are unnecessary for effective NeuralODE training. Compared to the conventional training methods, our algorithm achieves drastically lower loss values without any changes to the model architectures. Experiments on both simulated and real systems with complex temporal behaviors demonstrate NeuralODEs trained with our algorithm are able to accurately capture true long term behaviors and correctly extrapolate into the future.
    Inductive Bias of Gradient Descent for Weight Normalized Smooth Homogeneous Neural Nets. (arXiv:2010.12909v3 [cs.LG] UPDATED)
    We analyze the inductive bias of gradient descent for weight normalized smooth homogeneous neural nets, when trained on exponential or cross-entropy loss. We analyse both standard weight normalization (SWN) and exponential weight normalization (EWN), and show that the gradient flow path with EWN is equivalent to gradient flow on standard networks with an adaptive learning rate. We extend these results to gradient descent, and establish asymptotic relations between weights and gradients for both SWN and EWN. We also show that EWN causes weights to be updated in a way that prefers asymptotic relative sparsity. For EWN, we provide a finite-time convergence rate of the loss with gradient flow and a tight asymptotic convergence rate with gradient descent. We demonstrate our results for SWN and EWN on synthetic data sets. Experimental results on simple datasets support our claim on sparse EWN solutions, even with SGD. This demonstrates its potential applications in learning neural networks amenable to pruning.
    Delayed Feedback in Kernel Bandits. (arXiv:2302.00392v1 [stat.ML])
    Black box optimisation of an unknown function from expensive and noisy evaluations is a ubiquitous problem in machine learning, academic research and industrial production. An abstraction of the problem can be formulated as a kernel based bandit problem (also known as Bayesian optimisation), where a learner aims at optimising a kernelized function through sequential noisy observations. The existing work predominantly assumes feedback is immediately available; an assumption which fails in many real world situations, including recommendation systems, clinical trials and hyperparameter tuning. We consider a kernel bandit problem under stochastically delayed feedback, and propose an algorithm with $\tilde{\mathcal{O}}(\sqrt{\Gamma_k(T)T}+\mathbb{E}[\tau])$ regret, where $T$ is the number of time steps, $\Gamma_k(T)$ is the maximum information gain of the kernel with $T$ observations, and $\tau$ is the delay random variable. This represents a significant improvement over the state of the art regret bound of $\tilde{\mathcal{O}}(\Gamma_k(T)\sqrt{T}+\mathbb{E}[\tau]\Gamma_k(T))$ reported in Verma et al. (2022). In particular, for very non-smooth kernels, the information gain grows almost linearly in time, trivializing the existing results. We also validate our theoretical results with simulations.
    $\rm A^2Q$: Aggregation-Aware Quantization for Graph Neural Networks. (arXiv:2302.00193v1 [cs.LG])
    As graph data size increases, the vast latency and memory consumption during inference pose a significant challenge to the real-world deployment of Graph Neural Networks (GNNs). While quantization is a powerful approach to reducing GNNs complexity, most previous works on GNNs quantization fail to exploit the unique characteristics of GNNs, suffering from severe accuracy degradation. Through an in-depth analysis of the topology of GNNs, we observe that the topology of the graph leads to significant differences between nodes, and most of the nodes in a graph appear to have a small aggregation value. Motivated by this, in this paper, we propose the Aggregation-Aware mixed-precision Quantization ($\rm A^2Q$) for GNNs, where an appropriate bitwidth is automatically learned and assigned to each node in the graph. To mitigate the vanishing gradient problem caused by sparse connections between nodes, we propose a Local Gradient method to serve the quantization error of the node features as the supervision during training. We also develop a Nearest Neighbor Strategy to deal with the generalization on unseen graphs. Extensive experiments on eight public node-level and graph-level datasets demonstrate the generality and robustness of our proposed method. Compared to the FP32 models, our method can achieve up to a 18.6x (i.e., 1.70bit) compression ratio with negligible accuracy degradation. Morever, compared to the state-of-the-art quantization method, our method can achieve up to 11.4\% and 9.5\% accuracy improvements on the node-level and graph-level tasks, respectively, and up to 2x speedup on a dedicated hardware accelerator.
    Weight Prediction Boosts the Convergence of AdamW. (arXiv:2302.00195v1 [cs.LG])
    In this paper, we introduce weight prediction into the AdamW optimizer to boost its convergence when training the deep neural network (DNN) models. In particular, ahead of each mini-batch training, we predict the future weights according to the update rule of AdamW and then apply the predicted future weights to do both forward pass and backward propagation. In this way, the AdamW optimizer always utilizes the gradients w.r.t. the future weights instead of current weights to update the DNN parameters, making the AdamW optimizer achieve better convergence. Our proposal is simple and straightforward to implement but effective in boosting the convergence of DNN training. We performed extensive experimental evaluations on image classification and language modeling tasks to verify the effectiveness of our proposal. The experimental results validate that our proposal can boost the convergence of AdamW and achieve better accuracy than AdamW when training the DNN models.
    TAP: Accelerating Large-Scale DNN Training Through Tensor Automatic Parallelisation. (arXiv:2302.00247v1 [cs.LG])
    Model parallelism has become necessary to train large neural networks. However, finding a suitable model parallel schedule for an arbitrary neural network is a non-trivial task due to the exploding search space. In this work, we present a model parallelism framework TAP that automatically searches for the best data and tensor parallel schedules. Leveraging the key insight that a neural network can be represented as a directed acyclic graph, within which may only exist a limited set of frequent subgraphs, we design a graph pruning algorithm to fold the search space efficiently. TAP runs at sub-linear complexity concerning the neural network size. Experiments show that TAP is $20\times- 160\times$ faster than the state-of-the-art automatic parallelism framework, and the performance of its discovered schedules is competitive with the expert-engineered ones.
    A Transaction Represented with Weighted Finite-State Transducers. (arXiv:2302.00200v1 [cs.FL])
    Not all contracts are good, but all good contracts can be expressed as a finite-state transition system ("State-Transition Contracts"). Contracts that can be represented as State-Transition Contracts discretize fat-tailed risk to foreseeable, managed risk, define the boundary of relevant events governed by the relationship, and eliminate the potential of inconsistent contractual provisions. Additionally, State-Transition Contracts reap the substantial benefit of being able to be analyzed under the rules governing the science of the theory of computation. Simple State-Transition Contracts can be represented as discrete finite automata; more complicated State-Transition Contracts, such as those that have downstream effects on other agreements or complicated pathways of performance, benefit from representation as weighted finite-state transducers, with weights assigned as costs, penalties, or probabilities of transitions. This research paper (the "Research" or "Paper") presents a complex legal transaction represented as weighted finite-state transducers. Furthermore, we show that the mathematics/algorithms permitted by the algebraic structure of weighted finite-state transducers provides actionable, legal insight into the transaction.
    Adaptive sparseness for correntropy-based robust regression via automatic relevance determination. (arXiv:2302.00082v1 [cs.LG])
    Sparseness and robustness are two important properties for many machine learning scenarios. In the present study, regarding the maximum correntropy criterion (MCC) based robust regression algorithm, we investigate to integrate the MCC method with the automatic relevance determination (ARD) technique in a Bayesian framework, so that MCC-based robust regression could be implemented with adaptive sparseness. To be specific, we use an inherent noise assumption from the MCC to derive an explicit likelihood function, and realize the maximum a posteriori (MAP) estimation with the ARD prior by variational Bayesian inference. Compared to the existing robust and sparse L1-regularized MCC regression, the proposed MCC-ARD regression can eradicate the troublesome tuning for the regularization hyper-parameter which controls the regularization strength. Further, MCC-ARD achieves superior prediction performance and feature selection capability than L1-regularized MCC, as demonstrated by a noisy and high-dimensional simulation study.
    A Prescriptive Learning Analytics Framework: Beyond Predictive Modelling and onto Explainable AI with Prescriptive Analytics and ChatGPT. (arXiv:2208.14582v2 [cs.LG] UPDATED)
    A significant body of recent research in the field of Learning Analytics has focused on leveraging machine learning approaches for predicting at-risk students in order to initiate timely interventions and thereby elevate retention and completion rates. The overarching feature of the majority of these research studies has been on the science of prediction only. The component of predictive analytics concerned with interpreting the internals of the models and explaining their predictions for individual cases to stakeholders has largely been neglected. Additionally, works that attempt to employ data-driven prescriptive analytics to automatically generate evidence-based remedial advice for at-risk learners are in their infancy. eXplainable AI is a field that has recently emerged providing cutting-edge tools which support transparent predictive analytics and techniques for generating tailored advice for at-risk students. This study proposes a novel framework that unifies both transparent machine learning as well as techniques for enabling prescriptive analytics, while integrating the latest advances in large language models. This work practically demonstrates the proposed framework using predictive models for identifying at-risk learners of programme non-completion. The study then further demonstrates how predictive modelling can be augmented with prescriptive analytics on two case studies in order to generate human-readable prescriptive feedback for those who are at risk using ChatGPT.
    Selective Uncertainty Propagation in Offline RL. (arXiv:2302.00284v1 [cs.LG])
    We study the finite-horizon offline reinforcement learning (RL) problem. Since actions at any state can affect next-state distributions, the related distributional shift challenges can make this problem far more statistically complex than offline policy learning for a finite sequence of stochastic contextual bandit environments. We formalize this insight by showing that the statistical hardness of offline RL instances can be measured by estimating the size of actions' impact on next-state distributions. Furthermore, this estimated impact allows us to propagate just enough value function uncertainty from future steps to avoid model exploitation, enabling us to develop algorithms that improve upon traditional pessimistic approaches for offline RL on statistically simple instances. Our approach is supported by theory and simulations.
    Revisiting Bellman Errors for Offline Model Selection. (arXiv:2302.00141v1 [cs.LG])
    Offline model selection (OMS), that is, choosing the best policy from a set of many policies given only logged data, is crucial for applying offline RL in real-world settings. One idea that has been extensively explored is to select policies based on the mean squared Bellman error (MSBE) of the associated Q-functions. However, previous work has struggled to obtain adequate OMS performance with Bellman errors, leading many researchers to abandon the idea. Through theoretical and empirical analyses, we elucidate why previous work has seen pessimistic results with Bellman errors and identify conditions under which OMS algorithms based on Bellman errors will perform well. Moreover, we develop a new estimator of the MSBE that is more accurate than prior methods and obtains impressive OMS performance on diverse discrete control tasks, including Atari games. We open-source our data and code to enable researchers to conduct OMS experiments more easily.
    Program Generation from Diverse Video Demonstrations. (arXiv:2302.00178v1 [cs.CV])
    The ability to use inductive reasoning to extract general rules from multiple observations is a vital indicator of intelligence. As humans, we use this ability to not only interpret the world around us, but also to predict the outcomes of the various interactions we experience. Generalising over multiple observations is a task that has historically presented difficulties for machines to grasp, especially when requiring computer vision. In this paper, we propose a model that can extract general rules from video demonstrations by simultaneously performing summarisation and translation. Our approach differs from prior works by framing the problem as a multi-sequence-to-sequence task, wherein summarisation is learnt by the model. This allows our model to utilise edge cases that would otherwise be suppressed or discarded by traditional summarisation techniques. Additionally, we show that our approach can handle noisy specifications without the need for additional filtering methods. We evaluate our model by synthesising programs from video demonstrations in the Vizdoom environment achieving state-of-the-art results with a relative increase of 11.75% program accuracy on prior works
    Distillation Policy Optimization. (arXiv:2302.00533v1 [cs.LG])
    On-policy algorithms are supposed to be stable, however, sample-intensive yet. Off-policy algorithms utilizing past experiences are deemed to be sample-efficient, nevertheless, unstable in general. Can we design an algorithm that can employ the off-policy data, while exploit the stable learning by sailing along the course of the on-policy walkway? In this paper, we present an actor-critic learning framework that borrows the distributional perspective of interest to evaluate, and cross-breeds two sources of the data for policy improvement, which enables fast learning and can be applied to a wide class of algorithms. In its backbone, the variance reduction mechanisms, such as unified advantage estimator (UAE), that extends generalized advantage estimator (GAE) to be applicable on any state-dependent baseline, and a learned baseline, that is competent to stabilize the policy gradient, are firstly put forward to not merely be a bridge to the action-value function but also distill the advantageous learning signal. Lastly, it is empirically shown that our method improves sample efficiency and interpolates different levels well. Being of an organic whole, its mixture places more inspiration to the algorithm design.
    GANravel: User-Driven Direction Disentanglement in Generative Adversarial Networks. (arXiv:2302.00079v1 [cs.HC])
    Generative adversarial networks (GANs) have many application areas including image editing, domain translation, missing data imputation, and support for creative work. However, GANs are considered 'black boxes'. Specifically, the end-users have little control over how to improve editing directions through disentanglement. Prior work focused on new GAN architectures to disentangle editing directions. Alternatively, we propose GANravel a user-driven direction disentanglement tool that complements the existing GAN architectures and allows users to improve editing directions iteratively. In two user studies with 16 participants each, GANravel users were able to disentangle directions and outperformed the state-of-the-art direction discovery baselines in disentanglement performance. In the second user study, GANravel was used in a creative task of creating dog memes and was able to create high-quality edited images and GIFs.  ( 2 min )
    Reducing Blackwell and Average Optimality to Discounted MDPs via the Blackwell Discount Factor. (arXiv:2302.00036v1 [cs.LG])
    We introduce the Blackwell discount factor for Markov Decision Processes (MDPs). Classical objectives for MDPs include discounted, average, and Blackwell optimality. Many existing approaches to computing average-optimal policies solve for discounted optimal policies with a discount factor close to $1$, but they only work under strong or hard-to-verify assumptions such as ergodicity or weakly communicating MDPs. In this paper, we show that when the discount factor is larger than the Blackwell discount factor $\gamma_{\mathrm{bw}}$, all discounted optimal policies become Blackwell- and average-optimal, and we derive a general upper bound on $\gamma_{\mathrm{bw}}$. The upper bound on $\gamma_{\mathrm{bw}}$ provides the first reduction from average and Blackwell optimality to discounted optimality, without any assumptions, and new polynomial-time algorithms for average- and Blackwell-optimal policies. Our work brings new ideas from the study of polynomials and algebraic numbers to the analysis of MDPs. Our results also apply to robust MDPs, enabling the first algorithms to compute robust Blackwell-optimal policies.  ( 2 min )
  • Open

    Data fission: splitting a single data point. (arXiv:2112.11079v7 [stat.ME] UPDATED)
    Suppose we observe a random vector $X$ from some distribution $P$ in a known family with unknown parameters. We ask the following question: when is it possible to split $X$ into two parts $f(X)$ and $g(X)$ such that neither part is sufficient to reconstruct $X$ by itself, but both together can recover $X$ fully, and the joint distribution of $(f(X),g(X))$ is tractable? As one example, if $X=(X_1,\dots,X_n)$ and $P$ is a product distribution, then for any $m<n$, we can split the sample to define $f(X)=(X_1,\dots,X_m)$ and $g(X)=(X_{m+1},\dots,X_n)$. Rasines and Young (2022) offers an alternative route of accomplishing this task through randomization of $X$ with additive Gaussian noise which enables post-selection inference in finite samples for Gaussian distributed data and asymptotically for non-Gaussian additive models. In this paper, we offer a more general methodology for achieving such a split in finite samples by borrowing ideas from Bayesian inference to yield a (frequentist) solution that can be viewed as a continuous analog of data splitting. We call our method data fission, as an alternative to data splitting, data carving and p-value masking. We exemplify the method on a few prototypical applications, such as post-selection inference for trend filtering and other regression problems.  ( 2 min )
    Quantum machine learning beyond kernel methods. (arXiv:2110.13162v3 [quant-ph] UPDATED)
    Machine learning algorithms based on parametrized quantum circuits are prime candidates for near-term applications on noisy quantum computers. In this direction, various types of quantum machine learning models have been introduced and studied extensively. Yet, our understanding of how these models compare, both mutually and to classical models, remains limited. In this work, we identify a constructive framework that captures all standard models based on parametrized quantum circuits: that of linear quantum models. In particular, we show using tools from quantum information theory how data re-uploading circuits, an apparent outlier of this framework, can be efficiently mapped into the simpler picture of linear models in quantum Hilbert spaces. Furthermore, we analyze the experimentally-relevant resource requirements of these models in terms of qubit number and amount of data needed to learn. Based on recent results from classical machine learning, we prove that linear quantum models must utilize exponentially more qubits than data re-uploading models in order to solve certain learning tasks, while kernel methods additionally require exponentially more data points. Our results provide a more comprehensive view of quantum machine learning models as well as insights on the compatibility of different models with NISQ constraints.
    Sequential Predictive Conformal Inference for Time Series. (arXiv:2212.03463v2 [stat.ML] UPDATED)
    We present a new distribution-free conformal prediction algorithm for sequential data (e.g., time series), called the \textit{sequential predictive conformal inference} (\texttt{SPCI}). We specifically account for the nature that time series data are non-exchangeable, and thus many existing conformal prediction algorithms are not applicable. The main idea is to exploit the temporal dependence of non-conformity scores (e.g., prediction residuals); thus, the past residuals contain information about future ones. Then we cast the problem of conformal prediction interval as predicting the quantile of a future residual, given a user-specified point prediction algorithm. Theoretically, we establish asymptotic valid conditional coverage upon extending consistency analyses in quantile regression. Using simulation and real-data experiments, we demonstrate a significant reduction in interval width of \texttt{SPCI} compared to other existing methods under the desired empirical coverage.
    Simplicity Bias in 1-Hidden Layer Neural Networks. (arXiv:2302.00457v1 [cs.LG])
    Recent works have demonstrated that neural networks exhibit extreme simplicity bias(SB). That is, they learn only the simplest features to solve a task at hand, even in the presence of other, more robust but more complex features. Due to the lack of a general and rigorous definition of features, these works showcase SB on semi-synthetic datasets such as Color-MNIST, MNIST-CIFAR where defining features is relatively easier. In this work, we rigorously define as well as thoroughly establish SB for one hidden layer neural networks. More concretely, (i) we define SB as the network essentially being a function of a low dimensional projection of the inputs (ii) theoretically, we show that when the data is linearly separable, the network primarily depends on only the linearly separable ($1$-dimensional) subspace even in the presence of an arbitrarily large number of other, more complex features which could have led to a significantly more robust classifier, (iii) empirically, we show that models trained on real datasets such as Imagenette and Waterbirds-Landbirds indeed depend on a low dimensional projection of the inputs, thereby demonstrating SB on these datasets, iv) finally, we present a natural ensemble approach that encourages diversity in models by training successive models on features not used by earlier models, and demonstrate that it yields models that are significantly more robust to Gaussian noise.
    Optimal Learning of Deep Random Networks of Extensive-width. (arXiv:2302.00375v1 [stat.ML])
    We consider the problem of learning a target function corresponding to a deep, extensive-width, non-linear neural network with random Gaussian weights. We consider the asymptotic limit where the number of samples, the input dimension and the network width are proportionally large. We derive a closed-form expression for the Bayes-optimal test error, for regression and classification tasks. We contrast these Bayes-optimal errors with the test errors of ridge regression, kernel and random features regression. We find, in particular, that optimally regularized ridge regression, as well as kernel regression, achieve Bayes-optimal performances, while the logistic loss yields a near-optimal test error for classification. We further show numerically that when the number of samples grows faster than the dimension, ridge and kernel methods become suboptimal, while neural networks achieve test error close to zero from quadratically many samples.
    Probabilistic Neural Data Fusion for Learning from an Arbitrary Number of Multi-fidelity Data Sets. (arXiv:2301.13271v1 [cs.LG] CROSS LISTED)
    In many applications in engineering and sciences analysts have simultaneous access to multiple data sources. In such cases, the overall cost of acquiring information can be reduced via data fusion or multi-fidelity (MF) modeling where one leverages inexpensive low-fidelity (LF) sources to reduce the reliance on expensive high-fidelity (HF) data. In this paper, we employ neural networks (NNs) for data fusion in scenarios where data is very scarce and obtained from an arbitrary number of sources with varying levels of fidelity and cost. We introduce a unique NN architecture that converts MF modeling into a nonlinear manifold learning problem. Our NN architecture inversely learns non-trivial (e.g., non-additive and non-hierarchical) biases of the LF sources in an interpretable and visualizable manifold where each data source is encoded via a low-dimensional distribution. This probabilistic manifold quantifies model form uncertainties such that LF sources with small bias are encoded close to the HF source. Additionally, we endow the output of our NN with a parametric distribution not only to quantify aleatoric uncertainties, but also to reformulate the network's loss function based on strictly proper scoring rules which improve robustness and accuracy on unseen HF data. Through a set of analytic and engineering examples, we demonstrate that our approach provides a high predictive power while quantifying various sources uncertainties.
    Robust Fitted-Q-Evaluation and Iteration under Sequentially Exogenous Unobserved Confounders. (arXiv:2302.00662v1 [stat.ML])
    Offline reinforcement learning is important in domains such as medicine, economics, and e-commerce where online experimentation is costly, dangerous or unethical, and where the true model is unknown. However, most methods assume all covariates used in the behavior policy's action decisions are observed. This untestable assumption may be incorrect. We study robust policy evaluation and policy optimization in the presence of unobserved confounders. We assume the extent of possible unobserved confounding can be bounded by a sensitivity model, and that the unobserved confounders are sequentially exogenous. We propose and analyze an (orthogonalized) robust fitted-Q-iteration that uses closed-form solutions of the robust Bellman operator to derive a loss minimization problem for the robust Q function. Our algorithm enjoys the computational ease of fitted-Q-iteration and statistical improvements (reduced dependence on quantile estimation error) from orthogonalization. We provide sample complexity bounds, insights, and show effectiveness in simulations.
    Inductive Bias of Gradient Descent for Weight Normalized Smooth Homogeneous Neural Nets. (arXiv:2010.12909v3 [cs.LG] UPDATED)
    We analyze the inductive bias of gradient descent for weight normalized smooth homogeneous neural nets, when trained on exponential or cross-entropy loss. We analyse both standard weight normalization (SWN) and exponential weight normalization (EWN), and show that the gradient flow path with EWN is equivalent to gradient flow on standard networks with an adaptive learning rate. We extend these results to gradient descent, and establish asymptotic relations between weights and gradients for both SWN and EWN. We also show that EWN causes weights to be updated in a way that prefers asymptotic relative sparsity. For EWN, we provide a finite-time convergence rate of the loss with gradient flow and a tight asymptotic convergence rate with gradient descent. We demonstrate our results for SWN and EWN on synthetic data sets. Experimental results on simple datasets support our claim on sparse EWN solutions, even with SGD. This demonstrates its potential applications in learning neural networks amenable to pruning.
    Generative methods for sampling transition paths in molecular dynamics. (arXiv:2205.02818v2 [stat.ML] UPDATED)
    Molecular systems often remain trapped for long times around some local minimum of the potential energy function, before switching to another one -- a behavior known as metastability. Simulating transition paths linking one metastable state to another one is difficult by direct numerical methods. In view of the promises of machine learning techniques, we explore in this work two approaches to more efficiently generate transition paths: sampling methods based on generative models such as variational autoencoders, and importance sampling methods based on reinforcement learning.
    Stream-based active learning with linear models. (arXiv:2207.09874v3 [stat.ML] UPDATED)
    The proliferation of automated data collection schemes and the advances in sensorics are increasing the amount of data we are able to monitor in real-time. However, given the high annotation costs and the time required by quality inspections, data is often available in an unlabeled form. This is fostering the use of active learning for the development of soft sensors and predictive models. In production, instead of performing random inspections to obtain product information, labels are collected by evaluating the information content of the unlabeled data. Several query strategy frameworks for regression have been proposed in the literature but most of the focus has been dedicated to the static pool-based scenario. In this work, we propose a new strategy for the stream-based scenario, where instances are sequentially offered to the learner, which must instantaneously decide whether to perform the quality check to obtain the label or discard the instance. The approach is inspired by the optimal experimental design theory and the iterative aspect of the decision-making process is tackled by setting a threshold on the informativeness of the unlabeled data points. The proposed approach is evaluated using numerical simulations and the Tennessee Eastman Process simulator. The results confirm that selecting the examples suggested by the proposed algorithm allows for a faster reduction in the prediction error.
    Additive Higher-Order Factorization Machines. (arXiv:2205.14515v2 [stat.CO] UPDATED)
    In the age of big data and interpretable machine learning, approaches need to work at scale and at the same time allow for a clear mathematical understanding of the method's inner workings. While there exist inherently interpretable semi-parametric regression techniques for large-scale applications to account for non-linearity in the data, their model complexity is still often restricted. One of the main limitations are missing interactions in these models, which are not included for the sake of better interpretability, but also due to untenable computational costs. To address this shortcoming, we derive a scalable high-order tensor product spline model using a factorization approach. Our method allows to include all (higher-order) interactions of non-linear feature effects while having computational costs proportional to a model without interactions. We prove both theoretically and empirically that our methods scales notably better than existing approaches, derive meaningful penalization schemes and also discuss further theoretical aspects. We finally investigate predictive and estimation performance both with synthetic and real data.
    Posterior Sampling for Continuing Environments. (arXiv:2211.15931v2 [cs.LG] UPDATED)
    We develop an extension of posterior sampling for reinforcement learning (PSRL) that is suited for a continuing agent-environment interface and integrates naturally into agent designs that scale to complex environments. The approach, continuing PSRL, maintains a statistically plausible model of the environment and follows a policy that maximizes expected $\gamma$-discounted return in that model. At each time, with probability $1-\gamma$, the model is replaced by a sample from the posterior distribution over environments. For a choice of discount factor that suitably depends on the horizon $T$, we establish an $\tilde{O}(\tau S \sqrt{A T})$ bound on the Bayesian regret, where $S$ is the number of environment states, $A$ is the number of actions, and $\tau$ denotes the reward averaging time, which is a bound on the duration required to accurately estimate the average reward of any policy. Our work is the first to formalize and rigorously analyze the resampling approach with randomized exploration.
    Sliced Optimal Partial Transport. (arXiv:2212.08049v4 [cs.LG] UPDATED)
    Optimal transport (OT) has become exceedingly popular in machine learning, data science, and computer vision. The core assumption in the OT problem is the equal total amount of mass in source and target measures, which limits its application. Optimal Partial Transport (OPT) is a recently proposed solution to this limitation. Similar to the OT problem, the computation of OPT relies on solving a linear programming problem (often in high dimensions), which can become computationally prohibitive. In this paper, we propose an efficient algorithm for calculating the OPT problem between two non-negative measures in one dimension. Next, following the idea of sliced OT distances, we utilize slicing to define the sliced OPT distance. Finally, we demonstrate the computational and accuracy benefits of the sliced OPT-based method in various numerical experiments. In particular, we show an application of our proposed Sliced-OPT in noisy point cloud registration.
    Predicting CSI Sequences With Attention-Based Neural Networks. (arXiv:2302.00341v1 [stat.ML])
    In this work, we consider the problem of multi-step channel prediction in wireless communication systems. In existing works, autoregressive (AR) models are either replaced or combined with feed-forward neural networks(NNs) or, alternatively, with recurrent neural networks (RNNs). This paper explores the possibility of using sequence-to-sequence (Seq2Seq) and transformer neural network (TNN) models for channel state information (CSI) prediction. Simulation results show that both, Seq2Seq and TNNs, represent an appealing alternative to RNNs and feed-forward NNs in the context of CSI prediction. Additionally, the TNN with a few adaptations can extrapolate better than other models to CSI sequences that are either shorter or longer than the ones the model saw during training.
    Deterministic equivalent and error universality of deep random features learning. (arXiv:2302.00401v1 [stat.ML])
    This manuscript considers the problem of learning a random Gaussian network function using a fully connected network with frozen intermediate layers and trainable readout layer. This problem can be seen as a natural generalization of the widely studied random features model to deeper architectures. First, we prove Gaussian universality of the test error in a ridge regression setting where the learner and target networks share the same intermediate layers, and provide a sharp asymptotic formula for it. Establishing this result requires proving a deterministic equivalent for traces of the deep random features sample covariance matrices which can be of independent interest. Second, we conjecture the asymptotic Gaussian universality of the test error in the more general setting of arbitrary convex losses and generic learner/target architectures. We provide extensive numerical evidence for this conjecture, which requires the derivation of closed-form expressions for the layer-wise post-activation population covariances. In light of our results, we investigate the interplay between architecture design and implicit regularization.
    Variational Causal Inference. (arXiv:2209.05935v2 [stat.ML] UPDATED)
    Estimating an individual's potential outcomes under counterfactual treatments is a challenging task for traditional causal inference and supervised learning approaches when the outcome is high-dimensional (e.g. gene expressions, impulse responses, human faces) and covariates are relatively limited. In this case, to construct one's outcome under a counterfactual treatment, it is crucial to leverage individual information contained in its observed factual outcome on top of the covariates. We propose a deep variational Bayesian framework that rigorously integrates two main sources of information for outcome construction under a counterfactual treatment: one source is the individual features embedded in the high-dimensional factual outcome; the other source is the response distribution of similar subjects (subjects with the same covariates) that factually received this treatment of interest.
    Distribution free optimality intervals for clustering. (arXiv:2107.14442v2 [stat.ML] UPDATED)
    We address the problem of validating the ouput of clustering algorithms. Given data $\mathcal{D}$ and a partition $\mathcal{C}$ of these data into $K$ clusters, when can we say that the clusters obtained are correct or meaningful for the data? This paper introduces a paradigm in which a clustering $\mathcal{C}$ is considered meaningful if it is good with respect to a loss function such as the K-means distortion, and stable, i.e. the only good clustering up to small perturbations. Furthermore, we present a generic method to obtain post-inference guarantees of near-optimality and stability for a clustering $\mathcal{C}$. The method can be instantiated for a variety of clustering criteria (also called loss functions) for which convex relaxations exist. Obtaining the guarantees amounts to solving a convex optimization problem. We demonstrate the practical relevance of this method by obtaining guarantees for the K-means and the Normalized Cut clustering criteria on realistic data sets. We also prove that asymptotic instability implies finite sample instability w.h.p., allowing inferences about the population clusterability from a sample. The guarantees do not depend on any distributional assumptions, but they depend on the data set $\mathcal{D}$ admitting a stable clustering.
    Approximate Bayesian Computation with Path Signatures. (arXiv:2106.12555v2 [stat.ME] UPDATED)
    Simulation models often lack tractable likelihood functions, making likelihood-free inference methods indispensable. Approximate Bayesian computation generates likelihood-free posterior samples by comparing simulated and observed data through some distance measure, but existing approaches are often poorly suited to time series simulators, for example due to an independent and identically distributed data assumption. In this paper, we propose to use path signatures in approximate Bayesian computation to handle the sequential nature of time series. We provide theoretical guarantees on the resultant posteriors and demonstrate competitive Bayesian parameter inference for simulators generating univariate, multivariate, irregularly spaced, and even non-Euclidean sequences.
    Private Online Prediction from Experts: Separations and Faster Rates. (arXiv:2210.13537v2 [cs.LG] UPDATED)
    Online prediction from experts is a fundamental problem in machine learning and several works have studied this problem under privacy constraints. We propose and analyze new algorithms for this problem that improve over the regret bounds of the best existing algorithms for non-adaptive adversaries. For approximate differential privacy, our algorithms achieve regret bounds of $\tilde{O}(\sqrt{T \log d} + \log d/\varepsilon)$ for the stochastic setting and $\tilde O(\sqrt{T \log d} + T^{1/3} \log d/\varepsilon)$ for oblivious adversaries (where $d$ is the number of experts). For pure DP, our algorithms are the first to obtain sub-linear regret for oblivious adversaries in the high-dimensional regime $d \ge T$. Moreover, we prove new lower bounds for adaptive adversaries. Our results imply that unlike the non-private setting, there is a strong separation between the optimal regret for adaptive and non-adaptive adversaries for this problem. Our lower bounds also show a separation between pure and approximate differential privacy for adaptive adversaries where the latter is necessary to achieve the non-private $O(\sqrt{T})$ regret.
    How to select predictive models for causal inference?. (arXiv:2302.00370v1 [stat.ML])
    Predictive models -- as with machine learning -- can underpin causal inference, to estimate the effects of an intervention at the population or individual level. This opens the door to a plethora of models, useful to match the increasing complexity of health data, but also the Pandora box of model selection: which of these models yield the most valid causal estimates? Classic machine-learning cross-validation procedures are not directly applicable. Indeed, an appropriate selection procedure for causal inference should equally weight both outcome errors for each individual, treated or not treated, whereas one outcome may be seldom observed for a sub-population. We study how more elaborate risks benefit causal model selection. We show theoretically that simple risks are brittle to weak overlap between treated and non-treated individuals as well as to heterogeneous errors between populations. Rather a more elaborate metric, the R-risk appears as a proxy of the oracle error on causal estimates, observable at the cost of an overlap re-weighting. As the R-risk is defined not only from model predictions but also by using the conditional mean outcome and the treatment probability, using it for model selection requires adapting cross validation. Extensive experiments show that the resulting procedure gives the best causal model selection.
    Learning Equilibria in Matching Markets from Bandit Feedback. (arXiv:2108.08843v2 [cs.LG] UPDATED)
    Large-scale, two-sided matching platforms must find market outcomes that align with user preferences while simultaneously learning these preferences from data. Classical notions of stability (Gale and Shapley, 1962; Shapley and Shubik, 1971) are unfortunately of limited value in the learning setting, given that preferences are inherently uncertain and destabilizing while they are being learned. To bridge this gap, we develop a framework and algorithms for learning stable market outcomes under uncertainty. Our primary setting is matching with transferable utilities, where the platform both matches agents and sets monetary transfers between them. We design an incentive-aware learning objective that captures the distance of a market outcome from equilibrium. Using this objective, we analyze the complexity of learning as a function of preference structure, casting learning as a stochastic multi-armed bandit problem. Algorithmically, we show that "optimism in the face of uncertainty," the principle underlying many bandit algorithms, applies to a primal-dual formulation of matching with transfers and leads to near-optimal regret bounds. Our work takes a first step toward elucidating when and how stable matchings arise in large, data-driven marketplaces.
    Incorporating Sum Constraints into Multitask Gaussian Processes. (arXiv:2202.01793v3 [stat.ML] UPDATED)
    Machine learning models can be improved by adapting them to respect existing background knowledge. In this paper we consider multitask Gaussian processes, with background knowledge in the form of constraints that require a specific sum of the outputs to be constant. This is achieved by conditioning the prior distribution on the constraint fulfillment. The approach allows for both linear and nonlinear constraints. We demonstrate that the constraints are fulfilled with high precision and that the construction can improve the overall prediction accuracy as compared to the standard Gaussian process.
    Automatically Marginalized MCMC in Probabilistic Programming. (arXiv:2302.00564v1 [cs.LG])
    Hamiltonian Monte Carlo (HMC) is a powerful algorithm to sample latent variables from Bayesian models. The advent of probabilistic programming languages (PPLs) frees users from writing inference algorithms and lets users focus on modeling. However, many models are difficult for HMC to solve directly, which often require tricks like model reparameterization. We are motivated by the fact that many of those models could be simplified by marginalization. We propose to use automatic marginalization as part of the sampling process using HMC in a graphical model extracted from a PPL, which substantially improves sampling from real-world hierarchical models.
    Offline Estimation of Controlled Markov Chains: Minimaxity and Sample Complexity. (arXiv:2211.07092v3 [stat.ML] UPDATED)
    In this work, we study a natural nonparametric estimator of the transition probability matrices of a finite controlled Markov chain. We consider an offline setting with a fixed dataset, collected using a so-called logging policy. We develop sample complexity bounds for the estimator and establish conditions for minimaxity. Our statistical bounds depend on the logging policy through its mixing properties. We show that achieving a particular statistical risk bound involves a subtle and interesting trade-off between the strength of the mixing properties and the number of samples. We demonstrate the validity of our results under various examples, such as ergodic Markov chains, weakly ergodic inhomogeneous Markov chains, and controlled Markov chains with non-stationary Markov, episodic, and greedy controls. Lastly, we use these sample complexity bounds to establish concomitant ones for offline evaluation of stationary Markov control policies.
    A Group-Equivariant Autoencoder for Identifying Spontaneously Broken Symmetries. (arXiv:2202.06319v2 [cond-mat.stat-mech] UPDATED)
    We introduce the group-equivariant autoencoder (GE-autoencoder) -- a deep neural network (DNN) method that locates phase boundaries by determining which symmetries of the Hamiltonian have spontaneously broken at each temperature. We use group theory to deduce which symmetries of the system remain intact in all phases, and then use this information to constrain the parameters of the GE-autoencoder such that the encoder learns an order parameter invariant to these ``never-broken'' symmetries. This procedure produces a dramatic reduction in the number of free parameters such that the GE-autoencoder size is independent of the system size. We include symmetry regularization terms in the loss function of the GE-autoencoder so that the learned order parameter is also equivariant to the remaining symmetries of the system. By examining the group representation by which the learned order parameter transforms, we are then able to extract information about the associated spontaneous symmetry breaking. We test the GE-autoencoder on the 2D classical ferromagnetic and antiferromagnetic Ising models, finding that the GE-autoencoder (1) accurately determines which symmetries have spontaneously broken at each temperature; (2) estimates the critical temperature in the thermodynamic limit with greater accuracy, robustness, and time-efficiency than a symmetry-agnostic baseline autoencoder; and (3) detects the presence of an external symmetry-breaking magnetic field with greater sensitivity than the baseline method. Finally, we describe various key implementation details, including a new method for extracting the critical temperature estimate from trained autoencoders and calculations of the DNN initialization and learning rate settings required for fair model comparisons.
    On the Within-Group Discrimination of Screening Classifiers. (arXiv:2302.00025v1 [cs.LG])
    Screening classifiers are increasingly used to identify qualified candidates in a variety of selection processes. In this context, it has been recently shown that, if a classifier is calibrated, one can identify the smallest set of candidates which contains, in expectation, a desired number of qualified candidates using a threshold decision rule. This lends support to focusing on calibration as the only requirement for screening classifiers. In this paper, we argue that screening policies that use calibrated classifiers may suffer from an understudied type of within-group discrimination -- they may discriminate against qualified members within demographic groups of interest. Further, we argue that this type of discrimination can be avoided if classifiers satisfy within-group monotonicity, a natural monotonicity property within each of the groups. Then, we introduce an efficient post-processing algorithm based on dynamic programming to minimally modify a given calibrated classifier so that its probability estimates satisfy within-group monotonicity. We validate our algorithm using US Census survey data and show that within-group monotonicity can be often achieved at a small cost in terms of prediction granularity and shortlist size.
    A Fast, Well-Founded Approximation to the Empirical Neural Tangent Kernel. (arXiv:2206.12543v2 [stat.ML] UPDATED)
    Empirical neural tangent kernels (eNTKs) can provide a good understanding of a given network's representation: they are often far less expensive to compute and applicable more broadly than infinite width NTKs. For networks with O output units (e.g. an O-class classifier), however, the eNTK on N inputs is of size $NO \times NO$, taking $O((NO)^2)$ memory and up to $O((NO)^3)$ computation. Most existing applications have therefore used one of a handful of approximations yielding $N \times N$ kernel matrices, saving orders of magnitude of computation, but with limited to no justification. We prove that one such approximation, which we call "sum of logits", converges to the true eNTK at initialization for any network with a wide final "readout" layer. Our experiments demonstrate the quality of this approximation for various uses across a range of settings.
    $\texttt{DoCoFL}$: Downlink Compression for Cross-Device Federated Learning. (arXiv:2302.00543v1 [cs.LG])
    Many compression techniques have been proposed to reduce the communication overhead of Federated Learning training procedures. However, these are typically designed for compressing model updates, which are expected to decay throughout training. As a result, such methods are inapplicable to downlink (i.e., from the parameter server to clients) compression in the cross-device setting, where heterogeneous clients $\textit{may appear only once}$ during training and thus must download the model parameters. In this paper, we propose a new framework ($\texttt{DoCoFL}$) for downlink compression in the cross-device federated learning setting. Importantly, $\texttt{DoCoFL}$ can be seamlessly combined with many uplink compression schemes, rendering it suitable for bi-directional compression. Through extensive evaluation, we demonstrate that $\texttt{DoCoFL}$ offers significant bi-directional bandwidth reduction while achieving competitive accuracy to that of $\texttt{FedAvg}$ without compression.
    Width and Depth Limits Commute in Residual Networks. (arXiv:2302.00453v1 [stat.ML])
    We show that taking the width and depth to infinity in a deep neural network with skip connections, when branches are scaled by $1/\sqrt{depth}$ (the only nontrivial scaling), result in the same covariance structure no matter how that limit is taken. This explains why the standard infinite-width-then-depth approach provides practical insights even for networks with depth of the same order as width. We also demonstrate that the pre-activations, in this case, have Gaussian distributions which has direct applications in Bayesian deep learning. We conduct extensive simulations that show an excellent match with our theoretical findings.
    Delayed Feedback in Kernel Bandits. (arXiv:2302.00392v1 [stat.ML])
    Black box optimisation of an unknown function from expensive and noisy evaluations is a ubiquitous problem in machine learning, academic research and industrial production. An abstraction of the problem can be formulated as a kernel based bandit problem (also known as Bayesian optimisation), where a learner aims at optimising a kernelized function through sequential noisy observations. The existing work predominantly assumes feedback is immediately available; an assumption which fails in many real world situations, including recommendation systems, clinical trials and hyperparameter tuning. We consider a kernel bandit problem under stochastically delayed feedback, and propose an algorithm with $\tilde{\mathcal{O}}(\sqrt{\Gamma_k(T)T}+\mathbb{E}[\tau])$ regret, where $T$ is the number of time steps, $\Gamma_k(T)$ is the maximum information gain of the kernel with $T$ observations, and $\tau$ is the delay random variable. This represents a significant improvement over the state of the art regret bound of $\tilde{\mathcal{O}}(\Gamma_k(T)\sqrt{T}+\mathbb{E}[\tau]\Gamma_k(T))$ reported in Verma et al. (2022). In particular, for very non-smooth kernels, the information gain grows almost linearly in time, trivializing the existing results. We also validate our theoretical results with simulations.
    Distributed sequential federated learning. (arXiv:2302.00107v1 [stat.ML])
    The analysis of data stored in multiple sites has become more popular, raising new concerns about the security of data storage and communication. Federated learning, which does not require centralizing data, is a common approach to preventing heavy data transportation, securing valued data, and protecting personal information protection. Therefore, determining how to aggregate the information obtained from the analysis of data in separate local sites has become an important statistical issue. The commonly used averaging methods may not be suitable due to data nonhomogeneity and incomparable results among individual sites, and applying them may result in the loss of information obtained from the individual analyses. Using a sequential method in federated learning with distributed computing can facilitate the integration and accelerate the analysis process. We develop a data-driven method for efficiently and effectively aggregating valued information by analyzing local data without encountering potential issues such as information security and heavy transportation due to data communication. In addition, the proposed method can preserve the properties of classical sequential adaptive design, such as data-driven sample size and estimation precision when applied to generalized linear models. We use numerical studies of simulated data and an application to COVID-19 data collected from 32 hospitals in Mexico, to illustrate the proposed method.  ( 2 min )
    Whats Missing? Learning Hidden Markov Models When the Locations of Missing Observations are Unknown. (arXiv:2203.06527v2 [stat.ML] UPDATED)
    The Hidden Markov Model (HMM) is one of the most widely used statistical models for sequential data analysis, and it has been successfully applied in a large variety of domains. One of the key reasons for this versatility is the ability of HMMs to deal with missing data. However, standard HMM learning algorithms rely crucially on the assumption that the positions of the missing observations within the observation sequence are known. In some situations where such assumptions are not feasible, a number of special algorithms have been developed. Currently, these algorithms rely strongly on specific structural assumptions of the underlying chain, such as acyclicity, and are not applicable in the general case. In particular, there are numerous domains within medicine and computational biology, where the missing observation locations are unknown and acyclicity assumptions do not hold, thus presenting a barrier for the application of HMMs in those fields. In this paper we consider a general problem of learning HMMs from data with unknown missing observation locations (i.e., only the order of the non-missing observations are known). We introduce a generative model of the location omissions and propose two learning methods for this model, a (semi) analytic approach, and a Gibbs sampler. We evaluate and compare the algorithms in a variety of scenarios, measuring their reconstruction precision and robustness under model misspecification.  ( 2 min )
    Diffusion-based Image Translation using Disentangled Style and Content Representation. (arXiv:2209.15264v2 [cs.CV] UPDATED)
    Diffusion-based image translation guided by semantic texts or a single target image has enabled flexible style transfer which is not limited to the specific domains. Unfortunately, due to the stochastic nature of diffusion models, it is often difficult to maintain the original content of the image during the reverse diffusion. To address this, here we present a novel diffusion-based unsupervised image translation method using disentangled style and content representation. Specifically, inspired by the splicing Vision Transformer, we extract intermediate keys of multihead self attention layer from ViT model and used them as the content preservation loss. Then, an image guided style transfer is performed by matching the [CLS] classification token from the denoised samples and target image, whereas additional CLIP loss is used for the text-driven style transfer. To further accelerate the semantic change during the reverse diffusion, we also propose a novel semantic divergence loss and resampling strategy. Our experimental results show that the proposed method outperforms state-of-the-art baseline models in both text-guided and image-guided translation tasks.  ( 2 min )
    Filtering Context Mitigates Scarcity and Selection Bias in Political Ideology Prediction. (arXiv:2302.00239v1 [cs.LG])
    We propose a novel supervised learning approach for political ideology prediction (PIP) that is capable of predicting out-of-distribution inputs. This problem is motivated by the fact that manual data-labeling is expensive, while self-reported labels are often scarce and exhibit significant selection bias. We propose a novel statistical model that decomposes the document embeddings into a linear superposition of two vectors; a latent neutral \emph{context} vector independent of ideology, and a latent \emph{position} vector aligned with ideology. We train an end-to-end model that has intermediate contextual and positional vectors as outputs. At deployment time, our model predicts labels for input documents by exclusively leveraging the predicted positional vectors. On two benchmark datasets we show that our model is capable of outputting predictions even when trained with as little as 5\% biased data, and is significantly more accurate than the state-of-the-art. Through crowd-sourcing we validate the neutrality of contextual vectors, and show that context filtering results in ideological concentration, allowing for prediction on out-of-distribution examples.  ( 2 min )
    Adaptive sparseness for correntropy-based robust regression via automatic relevance determination. (arXiv:2302.00082v1 [cs.LG])
    Sparseness and robustness are two important properties for many machine learning scenarios. In the present study, regarding the maximum correntropy criterion (MCC) based robust regression algorithm, we investigate to integrate the MCC method with the automatic relevance determination (ARD) technique in a Bayesian framework, so that MCC-based robust regression could be implemented with adaptive sparseness. To be specific, we use an inherent noise assumption from the MCC to derive an explicit likelihood function, and realize the maximum a posteriori (MAP) estimation with the ARD prior by variational Bayesian inference. Compared to the existing robust and sparse L1-regularized MCC regression, the proposed MCC-ARD regression can eradicate the troublesome tuning for the regularization hyper-parameter which controls the regularization strength. Further, MCC-ARD achieves superior prediction performance and feature selection capability than L1-regularized MCC, as demonstrated by a noisy and high-dimensional simulation study.  ( 2 min )
    The Parametric Stability of Well-separated Spherical Gaussian Mixtures. (arXiv:2302.00242v1 [stat.ML])
    We quantify the parameter stability of a spherical Gaussian Mixture Model (sGMM) under small perturbations in distribution space. Namely, we derive the first explicit bound to show that for a mixture of spherical Gaussian $P$ (sGMM) in a pre-defined model class, all other sGMM close to $P$ in this model class in total variation distance has a small parameter distance to $P$. Further, this upper bound only depends on $P$. The motivation for this work lies in providing guarantees for fitting Gaussian mixtures; with this aim in mind, all the constants involved are well defined and distribution free conditions for fitting mixtures of spherical Gaussians. Our results tighten considerably the existing computable bounds, and asymptotically match the known sharp thresholds for this problem.  ( 2 min )
    Tensor networks for unsupervised machine learning. (arXiv:2106.12974v2 [cond-mat.stat-mech] UPDATED)
    Modeling the joint distribution of high-dimensional data is a central task in unsupervised machine learning. In recent years, many interests have been attracted to developing learning models based on tensor networks, which have the advantages of a principle understanding of the expressive power using entanglement properties, and as a bridge connecting classical computation and quantum computation. Despite the great potential, however, existing tensor network models for unsupervised machine learning only work as a proof of principle, as their performance is much worse than the standard models such as restricted Boltzmann machines and neural networks. In this Letter, we present autoregressive matrix product states (AMPS), a tensor network model combining matrix product states from quantum many-body physics and autoregressive modeling from machine learning. Our model enjoys the exact calculation of normalized probability and unbiased sampling. We demonstrate the performance of our model using two applications, generative modeling on synthetic and real-world data, and reinforcement learning in statistical physics. Using extensive numerical experiments, we show that the proposed model significantly outperforms the existing tensor network models and the restricted Boltzmann machines, and is competitive with state-of-the-art neural network models.  ( 2 min )
    Revisiting Bellman Errors for Offline Model Selection. (arXiv:2302.00141v1 [cs.LG])
    Offline model selection (OMS), that is, choosing the best policy from a set of many policies given only logged data, is crucial for applying offline RL in real-world settings. One idea that has been extensively explored is to select policies based on the mean squared Bellman error (MSBE) of the associated Q-functions. However, previous work has struggled to obtain adequate OMS performance with Bellman errors, leading many researchers to abandon the idea. Through theoretical and empirical analyses, we elucidate why previous work has seen pessimistic results with Bellman errors and identify conditions under which OMS algorithms based on Bellman errors will perform well. Moreover, we develop a new estimator of the MSBE that is more accurate than prior methods and obtains impressive OMS performance on diverse discrete control tasks, including Atari games. We open-source our data and code to enable researchers to conduct OMS experiments more easily.  ( 2 min )
    Local transfer learning from one data space to another. (arXiv:2302.00160v1 [cs.LG])
    A fundamental problem in manifold learning is to approximate a functional relationship in a data chosen randomly from a probability distribution supported on a low dimensional sub-manifold of a high dimensional ambient Euclidean space. The manifold is essentially defined by the data set itself and, typically, designed so that the data is dense on the manifold in some sense. The notion of a data space is an abstraction of a manifold encapsulating the essential properties that allow for function approximation. The problem of transfer learning (meta-learning) is to use the learning of a function on one data set to learn a similar function on a new data set. In terms of function approximation, this means lifting a function on one data space (the base data space) to another (the target data space). This viewpoint enables us to connect some inverse problems in applied mathematics (such as inverse Radon transform) with transfer learning. In this paper we examine the question of such lifting when the data is assumed to be known only on a part of the base data space. We are interested in determining subsets of the target data space on which the lifting can be defined, and how the local smoothness of the function and its lifting are related.  ( 2 min )
    A Nearly-Optimal Bound for Fast Regression with $\ell_\infty$ Guarantee. (arXiv:2302.00248v1 [cs.DS])
    Given a matrix $A\in \mathbb{R}^{n\times d}$ and a vector $b\in \mathbb{R}^n$, we consider the regression problem with $\ell_\infty$ guarantees: finding a vector $x'\in \mathbb{R}^d$ such that $ \|x'-x^*\|_\infty \leq \frac{\epsilon}{\sqrt{d}}\cdot \|Ax^*-b\|_2\cdot \|A^\dagger\|$ where $x^*=\arg\min_{x\in \mathbb{R}^d}\|Ax-b\|_2$. One popular approach for solving such $\ell_2$ regression problem is via sketching: picking a structured random matrix $S\in \mathbb{R}^{m\times n}$ with $m\ll n$ and $SA$ can be quickly computed, solve the ``sketched'' regression problem $\arg\min_{x\in \mathbb{R}^d} \|SAx-Sb\|_2$. In this paper, we show that in order to obtain such $\ell_\infty$ guarantee for $\ell_2$ regression, one has to use sketching matrices that are dense. To the best of our knowledge, this is the first user case in which dense sketching matrices are necessary. On the algorithmic side, we prove that there exists a distribution of dense sketching matrices with $m=\epsilon^{-2}d\log^3(n/\delta)$ such that solving the sketched regression problem gives the $\ell_\infty$ guarantee, with probability at least $1-\delta$. Moreover, the matrix $SA$ can be computed in time $O(nd\log n)$. Our row count is nearly-optimal up to logarithmic factors, and significantly improves the result in [Price, Song and Woodruff, ICALP'17], in which a super-linear in $d$ rows, $m=\Omega(\epsilon^{-2}d^{1+\gamma})$ for $\gamma=\Theta(\sqrt{\frac{\log\log n}{\log d}})$ is required. We also develop a novel analytical framework for $\ell_\infty$ guarantee regression that utilizes the Oblivious Coordinate-wise Embedding (OCE) property introduced in [Song and Yu, ICML'21]. Our analysis is arguably much simpler and more general than [Price, Song and Woodruff, ICALP'17], and it extends to dense sketches for tensor product of vectors.  ( 2 min )
    Implicit Regularization Leads to Benign Overfitting for Sparse Linear Regression. (arXiv:2302.00257v1 [cs.LG])
    In deep learning, often the training process finds an interpolator (a solution with 0 training loss), but the test loss is still low. This phenomenon, known as benign overfitting, is a major mystery that received a lot of recent attention. One common mechanism for benign overfitting is implicit regularization, where the training process leads to additional properties for the interpolator, often characterized by minimizing certain norms. However, even for a simple sparse linear regression problem $y = \beta^{*\top} x +\xi$ with sparse $\beta^*$, neither minimum $\ell_1$ or $\ell_2$ norm interpolator gives the optimal test loss. In this work, we give a different parametrization of the model which leads to a new implicit regularization effect that combines the benefit of $\ell_1$ and $\ell_2$ interpolators. We show that training our new model via gradient descent leads to an interpolator with near-optimal test loss. Our result is based on careful analysis of the training dynamics and provides another example of implicit regularization effect that goes beyond norm minimization.  ( 2 min )
    Quickest Change Detection for Unnormalized Statistical Models. (arXiv:2302.00250v1 [stat.ML])
    Classical quickest change detection algorithms require modeling pre-change and post-change distributions. Such an approach may not be feasible for various machine learning models because of the complexity of computing the explicit distributions. Additionally, these methods may suffer from a lack of robustness to model mismatch and noise. This paper develops a new variant of the classical Cumulative Sum (CUSUM) algorithm for the quickest change detection. This variant is based on Fisher divergence and the Hyv\"arinen score and is called the Score-based CUSUM (SCUSUM) algorithm. The SCUSUM algorithm allows the applications of change detection for unnormalized statistical models, i.e., models for which the probability density function contains an unknown normalization constant. The asymptotic optimality of the proposed algorithm is investigated by deriving expressions for average detection delay and the mean running time to a false alarm. Numerical results are provided to demonstrate the performance of the proposed algorithm.  ( 2 min )
    Accelerated First-Order Optimization under Nonlinear Constraints. (arXiv:2302.00316v1 [math.OC])
    We exploit analogies between first-order algorithms for constrained optimization and non-smooth dynamical systems to design a new class of accelerated first-order algorithms for constrained optimization. Unlike Frank-Wolfe or projected gradients, these algorithms avoid optimization over the entire feasible set at each iteration. We prove convergence to stationary points even in a nonconvex setting and we derive rates for the convex setting. An important property of these algorithms is that constraints are expressed in terms of velocities instead of positions, which naturally leads to sparse, local and convex approximations of the feasible set (even if the feasible set is nonconvex). Thus, the complexity tends to grow mildly in the number of decision variables and in the number of constraints, which makes the algorithms suitable for machine learning applications. We apply our algorithms to a compressed sensing and a sparse regression problem, showing that we can treat nonconvex $\ell^p$ constraints ($p<1$) efficiently, while recovering state-of-the-art performance for $p=1$.  ( 2 min )
    Gradient Descent in Neural Networks as Sequential Learning in RKBS. (arXiv:2302.00205v1 [stat.ML])
    The study of Neural Tangent Kernels (NTKs) has provided much needed insight into convergence and generalization properties of neural networks in the over-parametrized (wide) limit by approximating the network using a first-order Taylor expansion with respect to its weights in the neighborhood of their initialization values. This allows neural network training to be analyzed from the perspective of reproducing kernel Hilbert spaces (RKHS), which is informative in the over-parametrized regime, but a poor approximation for narrower networks as the weights change more during training. Our goal is to extend beyond the limits of NTK toward a more general theory. We construct an exact power-series representation of the neural network in a finite neighborhood of the initial weights as an inner product of two feature maps, respectively from data and weight-step space, to feature space, allowing neural network training to be analyzed from the perspective of reproducing kernel {\em Banach} space (RKBS). We prove that, regardless of width, the training sequence produced by gradient descent can be exactly replicated by regularized sequential learning in RKBS. Using this, we present novel bound on uniform convergence where the iterations count and learning rate play a central role, giving new theoretical insight into neural network training.  ( 2 min )
    Training Normalizing Flows with the Precision-Recall Divergence. (arXiv:2302.00628v1 [cs.LG])
    Generative models can have distinct mode of failures like mode dropping and low quality samples, which cannot be captured by a single scalar metric. To address this, recent works propose evaluating generative models using precision and recall, where precision measures quality of samples and recall measures the coverage of the target distribution. Although a variety of discrepancy measures between the target and estimated distribution are used to train generative models, it is unclear what precision-recall trade-offs are achieved by various choices of the discrepancy measures. In this paper, we show that achieving a specified precision-recall trade-off corresponds to minimising -divergences from a family we call the {\em PR-divergences }. Conversely, any -divergence can be written as a linear combination of PR-divergences and therefore correspond to minimising a weighted precision-recall trade-off. Further, we propose a novel generative model that is able to train a normalizing flow to minimise any -divergence, and in particular, achieve a given precision-recall trade-off.  ( 2 min )
    Deep learning for $\psi$-weakly dependent processes. (arXiv:2302.00333v1 [stat.ML])
    In this paper, we perform deep neural networks for learning $\psi$-weakly dependent processes. Such weak-dependence property includes a class of weak dependence conditions such as mixing, association,$\cdots$ and the setting considered here covers many commonly used situations such as: regression estimation, time series prediction, time series classification,$\cdots$ The consistency of the empirical risk minimization algorithm in the class of deep neural networks predictors is established. We achieve the generalization bound and obtain a learning rate, which is less than $\mathcal{O}(n^{-1/\alpha})$, for all $\alpha > 2 $. Applications to binary time series classification and prediction in affine causal models with exogenous covariates are carried out. Some simulation results are provided, as well as an application to the US recession data.  ( 2 min )
    Robust online active learning. (arXiv:2302.00422v1 [stat.ML])
    In many industrial applications, obtaining labeled observations is not straightforward as it often requires the intervention of human experts or the use of expensive testing equipment. In these circumstances, active learning can be highly beneficial in suggesting the most informative data points to be used when fitting a model. Reducing the number of observations needed for model development alleviates both the computational burden required for training and the operational expenses related to labeling. Online active learning, in particular, is useful in high-volume production processes where the decision about the acquisition of the label for a data point needs to be taken within an extremely short time frame. However, despite the recent efforts to develop online active learning strategies, the behavior of these methods in the presence of outliers has not been thoroughly examined. In this work, we investigate the performance of online active linear regression in contaminated data streams. Our study shows that the currently available query strategies are prone to sample outliers, whose inclusion in the training set eventually degrades the predictive performance of the models. To address this issue, we propose a solution that bounds the search area of a conditional D-optimal algorithm and uses a robust estimator. Our approach strikes a balance between exploring unseen regions of the input space and protecting against outliers. Through numerical simulations, we show that the proposed method is effective in improving the performance of online active learning in the presence of outliers, thus expanding the potential applications of this powerful tool.  ( 2 min )
    The geometry of hidden representations of large transformer models. (arXiv:2302.00294v1 [cs.LG])
    Large transformers are powerful architectures for self-supervised analysis of data of various nature, ranging from protein sequences to text to images. In these models, the data representation in the hidden layers live in the same space, and the semantic structure of the dataset emerges by a sequence of functionally identical transformations between one representation and the next. We here characterize the geometric and statistical properties of these representations, focusing on the evolution of such proprieties across the layers. By analyzing geometric properties such as the intrinsic dimension (ID) and the neighbor composition we find that the representations evolve in a strikingly similar manner in transformers trained on protein language tasks and image reconstruction tasks. In the first layers, the data manifold expands, becoming high-dimensional, and then it contracts significantly in the intermediate layers. In the last part of the model, the ID remains approximately constant or forms a second shallow peak. We show that the semantic complexity of the dataset emerges at the end of the first peak. This phenomenon can be observed across many models trained on diverse datasets. Based on these observations, we suggest using the ID profile as an unsupervised proxy to identify the layers which are more suitable for downstream learning tasks.  ( 2 min )

  • Open

    Creating "Her" using GPT-3 & TTS trained on voice from movie
    submitted by /u/justLV [link] [comments]  ( 40 min )
    A look at Rewind.AI - the search engine for life?
    submitted by /u/arnolds112 [link] [comments]  ( 40 min )
    How to make an AI learn osu!? (for school project year)
    Hello, I am currently in Highschool and we have a Project Year in computer science. I recently got really interested in AI and especially AI's that are able to learn how to play a specific game. You may or may not know Vedal987 but I am taking huge inspiration from him on how his AI "Neuro-sama" is able to play osu! I want to do something similar with the same game but, I really am a beginner with AI's of that kind so I wanted to ask if you have any learning materials for me or guides on how to start and all the stuff that I could need to actually execute my plan (and if it even is possible within a year) I am an expert in Python (also good in Java and Javascript), I can code and stuff but have no clue how to start out since I've never done this. Thanks in advance :) (feel free to ask me if you dont understand something, i am bad in expressing myself) submitted by /u/iLeg1999 [link] [comments]  ( 41 min )
    We built a mobile app powered by GPT, specialising in daily task assistance 🚀
    Hey folks! 👋 I am Ed, one of the creators of Toucan. Super excited to share what we've been working on 👇 Toucan is a mobile app that helps you get the most out of GPT-3 “on the go” – with a slick mobile UX, powerful wrap-around features and fine-tuned models. If you're keen to learn a bit more about our journey, you can read on below, of if you just want to throw your best queries at GPT, you can download Toucan on the App Store, here: https://apps.apple.com/us/app/toucan-ai-chatbot-assistant/id1665298806 🧩 Our story We started Toucan to make engaging with cutting-edge AI on mobile devices a vastly better experience. No more browser hopping, endless auth challenges, conversation loss or clunky UI interactions. We’ve also built handy supporting functionalities around the AI layer (s…  ( 42 min )
    1-click deploy for your GPT-3 App
    Link - https://github.com/ClerkieAI/berri_ai We made a package that makes it easy for you to quickly deploy your LLM Agent from Google Colab to production (Web App and API Endpoint). How it works? Just install the package, import the function, and run deploy. At the end of the deploy (~10-15mins), you will get: A web app to interact with your agent 👉 https://agent-repo-35aa2cf3-a0a1-4cf8-834f-302e5b7fe07e-4524... An endpoint you can query 👉 https://agent-repo-35aa2cf3-a0a1-4cf8-834f-302e5b7fe07e-4524... is obama?" Want a more detailed walkthrough? Check out our loom - https://www.loom.com/share/fd4375b4a77f4ea7802369cb06a16d43 We’re still early so would love your feedback and opinions. Feel free to try us out for free – and if you need help building an agent / want a specific integration, just let us know! https://i.redd.it/xu6a92464ufa1.gif submitted by /u/VideoTo [link] [comments]  ( 41 min )
    AI Dream 124 - Great Relaxing Psychedelic AI Video
    submitted by /u/LordPewPew777 [link] [comments]  ( 40 min )
    AI tool for typography?
    I'm wondering if such a thing exists. Say I have the word "Eyeball" or anything at all. I want to find a font that makes the word "Eyeball" look good, while adding parameters such as - professional, business like, etc - possibly aiding in its suggestions. I already have been using Chat GPT to create concepts with font suggestions, but it would be neat if there was a dedicated tool for this. Is there currently any AI out there that can do that? submitted by /u/Pinkisacoloryes [link] [comments]  ( 41 min )
    is there an ai for helping YOU learn?
    Ive heard of all types of ai but is there one every conceived of helping you learn fast and quicker? submitted by /u/mrmaskfawkes [link] [comments]  ( 41 min )
    Best ChatGPT Alternatives for 2023
    submitted by /u/visimens-technology [link] [comments]  ( 40 min )
    Join the AI Art Revolution! Help Create a Masterpiece and Explore the Boundaries of Technology and Creativity.
    Hey reddit, I'm a student that is taking part in a contest focusing on banned works of art, and the relationship between art and society. After trying my hand at Chat GPT and DALLE2 I've spurred an idea for the topic of this contest. A couple details, the contest is in the form of a comic book page, it's surrounding the themes of obscenity, libel, codes, copyrights, etc. I was considering using AI as the subject matter considering the recent controversies surrounding this new technology. Questions such as whose art is it? The person using the technology, or the person owning it? Is it copyrightable considering it's fed data based on previous artists work? I'm relatively new to this space other than witnessing peoples uses on various social media. So given that, my idea was to have Chat GPT generate a fictional story surrounding the relationship of society and art, tweaking until I have something that can be applied into a comic book form, then feeding parameters into DALLE2 to create comic book panels. Given the controversial nature of utilizing AI for school related essays, homework, etc I was curious what this community thinks of this idea? I felt like it was one two punch, utilizing new technology to create artwork that's relevant to the topic at hand and also has the potential to be "banned" so to speak from participating in the contest. submitted by /u/KylesButler [link] [comments]  ( 42 min )
    AI is terrifying if you actually think about what we can do with it ... it made me question what it means to even be human
    submitted by /u/andytk33 [link] [comments]  ( 43 min )
    I Made a List of The 5 Best AI Detection Tools
    submitted by /u/HODLTID [link] [comments]  ( 40 min )
    📌[Searchcolab] League of Legends characters at 80's Dark Fantasy Movie. SafeTensor Link in comments.
    submitted by /u/Maleficent_Suit1591 [link] [comments]  ( 40 min )
    📌[Searchcolab] What is the reason behind the noticeable similarity in quality between the voice cloning results generated by Elevenlabs and Microsoft VALL-E? Unofficial implementation of Microsoft VALL-E present on Searchcolab.
    submitted by /u/Maleficent_Suit1591 [link] [comments]  ( 41 min )
    A.I Eminem - 'Raw Sh*t'
    submitted by /u/DANGERD0OM [link] [comments]  ( 40 min )
    The Moravec Paradox is Not Paradoxical: Intelligence from the Primordial Soup of Signs
    submitted by /u/SonntagMorgen [link] [comments]  ( 40 min )
    🌌 Why A.I. is Increasingly a Game Changer in Astronomy and Cosmology
    submitted by /u/BackgroundResult [link] [comments]  ( 40 min )
    Chinese Company Gearing Up to Release Powerful ChatGPT Competitor
    submitted by /u/Mental_Character7367 [link] [comments]  ( 44 min )
    Join the experiment: exploring a new approach to emotional support with the AI Friend/Counselor Bot
    I've been working on a project that uses the OpenAI API to create a friend/counselor bot. The goal of this bot is to provide emotional support and have meaningful conversations about a variety of topics including work, relationships, emotions, health, and more. If you're interested, you can check out the demo of the bot at https://friendbot.ai. I'm looking for alpha testers to help me improve the features such as handling long interactions, sematics etc. Participation in the experiment is anonymous and no data about interactions are stored on the servers. submitted by /u/M-K111 [link] [comments]  ( 41 min )
    AI writes video about itself. The Rise of AI Art: A Creative Revolution.
    submitted by /u/anekii [link] [comments]  ( 40 min )
    Write Faster SQL With AI
    submitted by /u/EloquentPickle [link] [comments]  ( 40 min )
    Pokémon as an 80’s Dark Fantasy Film (AI Generated Art)
    submitted by /u/HooverHooverHoober [link] [comments]  ( 40 min )
    Web 3.0:Era of condensed knowledge
    Yandex(Russian search engine),Google,Bing and Baidu race to release chatbots as replacement for traditional search. Here is what I think will happen: Most websites will die out. Why?You don't need 10000 recipe websites.You need culinary Wikipedia.And a chatbot to query it.99% percent of recipes are not unique.Culinary theory is pretty established.They can be condensed into branching stricture.e.g. Pasta with tomatoes.Pasta with cheese. Plumbing.Pretty much the same. Programming.All you need is documentation pretty much.You can even analyse open source projects and generate documentation on the fly.Tutorials and best practices can be condensed. News.I was thinking about extracting news from data collected from ambient computing.E.g. Google develops new feature.News article is generated from Todo lists,meeting notes,code.But it's too far out for now. And list goes on and on. You need 100 Wikipedia to contain all human knowledge.And even if websites will refuse to cooperate it doesn't matter.There is plenty of copies of the web. submitted by /u/nikitastaf1996 [link] [comments]  ( 44 min )
    AI's Effects on/Associations with cognitive processes, memory, and other importance in psychology
    Hello! Does anyone know of a decent scholarly article on something relevant to this that they could share with me? ​ Thanks! :) submitted by /u/EvilPeppermintHelix [link] [comments]  ( 40 min )
    How do you find new ai-based software/programs?
    There are numerous amazing AI-powered software/programs available, such as Wave2Lip on GitHub (which enables you to lip sync a video to any voice audio file), CascadeOur (which generates AI animations), or Google Film (which creates a seamless slow-motion effect from two distant frames). These programs possess tremendous potential, but how can I discover new ones? I'm sure there are many more out there that I am not aware of. Can you suggest any tips on how I can find them? submitted by /u/PM_ME_LIFE_MEANING [link] [comments]  ( 41 min )
  • Open

    Lagrange multiplier setup: Now what?
    Suppose you need to optimize, i.e. maximize or minimize, a function f(x). If this is a practical problem and not a textbook exercise, you probably need to optimize f(x) subject to some constraint on x, say g(x) = 0. Hmm. Optimize one function subject to a constraint given by another function. Oh yeah, Lagrange multipliers! […] Lagrange multiplier setup: Now what? first appeared on John D. Cook.  ( 6 min )
  • Open

    Predict football punt and kickoff return yards with fat-tailed distribution using GluonTS
    Today, the NFL is continuing their journey to increase the number of statistics provided by the Next Gen Stats Platform to all 32 teams and fans alike. With advanced analytics derived from machine learning (ML), the NFL is creating new ways to quantify football, and to provide fans with the tools needed to increase their […]  ( 10 min )
    Analyze and visualize multi-camera events using Amazon SageMaker Studio Lab
    The National Football League (NFL) is one of the most popular sports leagues in the United States and is the most valuable sports league in the world. The NFL, BioCore, and AWS are committed to advancing human understanding around the diagnosis, prevention, and treatment of sports-related injuries to make the game of football safer. More […]  ( 10 min )
  • Open

    Google Research, 2022 & beyond: ML & computer systems
    Posted by Phitchaya Mangpo Phothilimthana, Staff Research Scientist, and Adam Paszke, Staff Research Scientist, Google Research (This is Part 3 in our series of posts covering different topical areas of research at Google. You can find other posts in the series here.) Great machine learning (ML) research requires great systems. With the increasing sophistication of the algorithms and hardware in use today and with the scale at which they run, the complexity of the software necessary to carry out day-to-day tasks only increases. In this post, we provide an overview of the numerous advances made across Google this past year in systems for ML that enable us to support the serving and training of complex models while easing the complexity of implementation for end users. This b…  ( 97 min )
    Open Source Vizier: Towards reliable and flexible hyperparameter and blackbox optimization
    Posted by Xingyou (Richard) Song, Research Scientist, and Chansoo Lee, Software Engineer, Google Research, Brain Team Google Vizier is the de-facto system for blackbox optimization over objective functions and hyperparameters across Google, having serviced some of Google’s largest research efforts and optimized a wide range of products (e.g., Search, Ads, YouTube). For research, it has not only reduced language model latency for users, designed computer architectures, accelerated hardware, assisted protein discovery, and enhanced robotics, but also provided a reliable backend interface for users to search for neural architectures and evolve reinforcement learning algorithms. To operate at the scale of optimizing thousands of users’ critical systems and tuning millions of machine learning…  ( 91 min )
  • Open

    [D] I'm at a crossroads: Bayesian methods VS Reinforcement Learning, which to choose?
    I know it looks like a very absurd comparison. The reason why I ask this is because I'm taking a masters of ML(that wasn't exactly cheap) and while the basic concepts are covered in the mandatory path, now I have to choose an elective, in this case between "Bayesian Methods" and "Reinforcement Learning". So what I'd like to ask you is which of both concepts/techniques are more used or relevant to work in the Data Science industry, meaning which one is going to help me more in that job or make a more positive impression in my resume. Thx in advance! submitted by /u/fuscarili [link] [comments]  ( 46 min )
    [P] Domestic Violence Dataset
    Hi, I am working on project and for that I need a Twitter Domestic Violence Dataset. Basically I need a dataset with domestic violence tweets against woman. I have searched Kaggle and other websites but found no luck. Plus, I tried using Snscrape, but I need some phrases ideas related to domestic violence so I can get some tweets using that. I tried "Domestic Violence" , "My husband tried to kill me" and looking for more. Help is appreciated. submitted by /u/Naive-Aioli4849 [link] [comments]  ( 43 min )
    [D] Workflow chair for AI conference
    Hi! Does anyone here have experience working as a workflow chair for major conferences? What are the duties and how much does it pay? (I heard that it's a paid role) submitted by /u/Expensive-Track [link] [comments]  ( 42 min )
    [p] I built an open source platform to deploy computationally intensive Python functions as serverless jobs, with no timeouts
    Hi friends! I ran into this problem enough times at my last few jobs that I built a tool to solve it. I spent many hours building Docker containers for my Python functions, as many of the data science modules required building C libraries (since they significantly speed up compute-intensive routines, such as math calculations). Deploying the containers to AWS Lambda or Fargate (if the processes required more CPU or memory or were >15 minutes) and wiring functions to talk to each other using queues, databases, and blob storage made iterating on the actual code, which wasn't even that complex most of the time, slow. I made cakework https://github.com/usecakework/cakework, a platform that lets you spin up your Python functions as serverless, production-scale backends with a single command. Using the client SDK, you submit requests, check status, and get results. You can also specify the amount of CPU (up to 16 cores) and memory (up to 128GB) for each individual request, which is helpful when your data size and complexity varies across different requests. A common pattern that I built cakework for is doing file processing for ML: - ingest data from some source daily, or in response to an external event (data written to blob storage) - run my function (often using pandas/numpy/scipy) - write results to storage, update database - track failures and re-run/fix It's open source <3. Here are some fun examples to get you started: https://docs.cakework.com/examples Would love to hear your thoughts! submitted by /u/seattleite849 [link] [comments]  ( 47 min )
    [P] Time series outlier / anomaly detection
    I have traffic speed time series data for each day of the week over several months, with data samples about every 30 seconds. I'd like to find periods of time (subsequences) where the speed is much slower than usual. Any recommendations for algorithms that would be well suited to this problem? Thanks submitted by /u/dudester_el [link] [comments]  ( 44 min )
    [D] Querying with multiple vectors during embedding nearest neighbor search?
    Are there tools or techniques that permit you to joint query using more than one query vector? Use case: iterative ANN search refinement, where I start with a seed vector, select matches, and re-query with more examples to improve the search results. I tried doing this with FAISS, but it performs a "batch query" that returns a separate set of results for each query vector (not a joint query). submitted by /u/mostlyhydrogen [link] [comments]  ( 42 min )
    [D] Do high leverage points affect Neural Net and Tree-based model?
    I know they can affect linear regression badly but given the fact that neural net and tree-based models can approximate non-linear complex functions, I don't think the high leverage points would be a problem. Just curious about your opinion whether my thinking makes sense submitted by /u/Temporary_Cap_2855 [link] [comments]  ( 43 min )
    [D] ImageNet normalization vs [-1, 1] normalization
    For ImageNet classification, there are two common ways of normalizing the input images: - Normalize to [-1, 1] using an affine transformation (2*(x/255) - 1). - Normalize using ImageNet mean = (0.485, 0.456, 0.406) and std = (0.229, 0.224, 0.225). I observe that the first one is more common in TensorFlow codebases (including Jax models with TensorFlow data processing, e.g. the official Vision Transformers code), whereas the second is ubiquitous in PyTorch codebases. I tried to find empirical comparisons of the two, but there doesn't seem to be any. Which one is better in your opinion? I guess the performance shouldn't be too different, but still it's interesting to hear your experience. submitted by /u/netw0rkf10w [link] [comments]  ( 44 min )
    [D] PC takes a long time to execute code, possibility to use a cloud/external device?
    Hello people, I am currently attending a Data Science course and to finish I have to write a paper about a project that I am currently working on. I write the code in VSCode and I use .ipynb notebooks. So I am basically training a few ML models after a long data preprocessing which worked out fine. But as soon as I run my hyperparameter tuning code, my PC takes a lot of time. Right now I am running hyperparameter tuning for RandomForest and it already runs for 21 hours. Is there any possibility for me to run my code somewhere else? I read abour Heroku, but that seems to be too much than what I am looking for. I am getting a bit nervous, because I want to get this paper done. The worst case is that I have to buy a new PC. Thank you so much! submitted by /u/Emergency-Dig-5262 [link] [comments]  ( 45 min )
    [N] Microsoft integrates GPT 3.5 into Teams
    Official blog post: https://www.microsoft.com/en-us/microsoft-365/blog/2023/02/01/microsoft-teams-premium-cut-costs-and-add-ai-powered-productivity/ Given the amount of money they pumped into OpenAI, it's not surprising that you'd see it integrated into their products. I do wonder how this will work in highly regulated fields (finance, law, medicine, education). submitted by /u/bikeskata [link] [comments]  ( 47 min )
    [D] Why do LLMs like InstructGPT and LLM use RL to instead of supervised learning to learn from the user-ranked examples?
    Aligned LLMs such as InstructGPT and ChatGPT are trained via supervised fine-tuning after the initial self-supervised pretraining. Then, the researchers train a reward model on responses ranked by humans. When I understand correctly, they let the LLM generate responses that humans have to rank on a scale from 1-5. Then, they train a reward model (I suppose in supervised fashion?) on these ranked outputs. Once that's done, they use reinforcement learning (RL) with proximal policy optimization (PPO) to update the LLM. My question is why they use RL with PPO for this last step? Why don't they fine-tune the LLM using regular supervised learning, whereas the human-ranked outputs represent the labels. Since these are labels in the range 1-5, this could be a ranking or ordinal regression loss for supervised learning. submitted by /u/alpha-meta [link] [comments]  ( 47 min )
    [D] Inconsistent Featurespace in Data
    Hi colleagues! I am working on a model for which I have a dataset consisting of 2 data sources. Problem is that one datastream starts in 2017 and the other only in 2022. Feature spaces from those 2 data streams are different. I am wondering if there is a methodology to follow which allows me to use both data streams for training even though one starts way later than the other. Or am I forced to drop the newer one? (just 2022 data from two sources is too small for me to train on) Thank you! submitted by /u/pahalie [link] [comments]  ( 43 min )
    [D] Commercial Use of a Model that has been trained using Human3.6M
    I wanted to use the Learnable Trainangulation model in a commercial project. The source code itself is under MIT licensing. However, the dataset they have used is Human3.6M, which states that the license is "FREE OF CHARGE FOR ACADEMIC USE ONLY". Yet, recent court rulings (in the US) state that models can use copyrighted data during training, and the results are no longer bound by that copyright (e.g. Google Books). Does the same apply here? submitted by /u/mfarahmand98 [link] [comments]  ( 42 min )
    [D] Global Optimum of K-Means Cost Function
    I've recently started reading up on classical ML and I got a question about K-Means. More concretely, I am confused about the uniqueness of the global optimal solution of K-Means's cost function. Let's state the problem formally below, extracted from Bishop's Pattern Recognition and Machine Learning book, exercise 9.1. Consider the 𝐾-means algorithm discussed in Section 9.1. Show that as a consequence of there being a finite number of possible assignments for the set of discrete indicator variables 𝑟𝑛𝑘, and that for each such assignment there is a unique optimum for the 𝝁𝑘, the K-means algorithm must converge after a finite number of iterations. I made an answer [here](https://stats.stackexchange.com/questions/603327/question-on-the-proof-of-convergence-of-k-means) detailing the proof of why it does converge in Lloyd's algorithm, but I think I still do not understand why Lloyd's do not converge to a global minimum, which mathematical theorem/understanding am I missing here? I think that optimizing both the assignments and the centroids of K-Means at the same time is non-convex and hence there are many local minimums, we can use brute force to search for the global minimum but of course it is exponential to the number of data points. On the other hand, Lloyd optimizes it (greedily) alternatively, and hence you will find the cost functions' local minima (guaranteed)? submitted by /u/healthymonkey100 [link] [comments]  ( 45 min )
    [P] [R] A simplistic UI to edit images with Stable Diffusion and InstructPix2Pix
    https://preview.redd.it/ut4us5251rfa1.png?width=2000&format=png&auto=webp&s=dbf79c3832b20287203faa97e5c1303472bdbc22 Currently, the UI supports a picture upload and uses InstructPix2Pix to edit it. Also, it uses upscaling models for quality enhancements. More models are coming soon. The goal is to provide a way for non-ML people to use diffusion-based image editing through simplistic app design. Web demo: https://diffground.com/ submitted by /u/radi-cho [link] [comments]  ( 43 min )
    [D]How Will Open Source Alternatives Compete With GPT3?
    To clarify, I'm not talking about ChatGPT here. I've been testing outputs from GPT-3 davinci003 against alternatives in terms of output quality, relevance, and ability to understand "instruct" (versus vanilla autocompletion). I tried these: AI21 Jurassic 178B NeoX 20B GPT J 6B FairSeq 13B As well as: GPT-3 davinci002 GPT-3 davinci001 Of course, I didn't expect the smaller models to be on par with GPT-3, but I was surprised at how much better GPT3 davinci 003 performed compared to AI21's 178B model. AI21's Jurassic 178B seems to be comparable to GPT3 davinci 001. Does this mean that only well-funded corporations will be able to train general-purpose LLMs? It seems to me that just having a large model doesn't do much, it's also about several iterations of training and feedback. How are open source alternatives going to be able to compete? (I'm not in the ML or CS field, just an amateur who enjoys using these models) submitted by /u/noellarkin [link] [comments]  ( 45 min )
    [R] Sentence autoencoder
    Any suggestion on sentence auto-encoder? I want to learn the vector representation of a sentence and reconstruct the sentence itself. I used plane LSTM with self attention in the encoder and decoder architecture (no cross attention) but the results are not that good enough. I can not use cross attention i.e decoder will not have access to all the outputs of encoder but only access to the bottleneck latent vector. BART have pre-trained in similar manner but I don't know if we can pre-train that model to fit my use case. This is just a module of other work, after pre-training the sentence-sentence auto-encoder, I need to add some more module in between them, so I should have encoder and decoder separable, which I think can not be done in BART as well. Any direction would be much appreciated. Thank you submitted by /u/Bishwa12 [link] [comments]  ( 43 min )
    [D] Apple's ane-transformers - experiences?
    I'm using Huggingface's transformers regularly for experimentations, but I plan to deploy some of the models to iOS. I have found ml-ane-transformers repo from Apple, which shows how transformers can be rewritten to have much better performance on Apple's devices. There's an example of DistilBERT implemented in that optimized way. As I plan to deploy transformers to iOS, I started thinking about this. I'm hoping some already have experience about this, so we can discuss: Has anyone tried this themselves? Do they actually see the improvements in performance on iOS? I'm using Huggingface's transformer models in my experiments. How much work do you think there is to rewrite model in this optimized way? It's very difficult to train transformers from scratch (especially if they're big :) ), so I'm fine-tuning on top of pre-trained models on Huggingface. Is it possible to use weights from pretrained Huggingface models with the Apple's reference code? How difficult is it? submitted by /u/alkibijad [link] [comments]  ( 43 min )
  • Open

    Best Reinforcement Learning Papers from the past 1-2 years
    I am searching for the best Reinforcement Learning Papers from the past 1-2 years. My special focus is on Communications, but as we all know a lot of approaches can be applied in different fields. I would appreciate any recommendations :) submitted by /u/GolemX14 [link] [comments]  ( 41 min )
    Multi-Agent Stable Baselines
    I want to extend an implementation that currently uses stable baselines 3 from a single-agent into a multi-agent system. As far as I can tell, stable baselines isn't really suited for this. Does anyone have experience with multi-agent systems in stable baselines or with switching from stable baselines to RLlib? submitted by /u/tessherelurkingnow [link] [comments]  ( 41 min )
    "Distillation Policy Optimization", Ma et al 2023
    https://arxiv.org/abs/2302.00533 submitted by /u/OutOfCharm [link] [comments]  ( 41 min )
    Where do the new weights in PPO come from?
    I am looking at the pseudocode for PPO given in SpinningUp over here - ​ https://preview.redd.it/sx3kx08k2pfa1.png?width=1013&format=png&auto=webp&s=85511fe4cccda77975bd4731558f3936c9f5f522 ​ I understand that $\theta_k$ are the old weights. I am a little lost regarding the weight $\theta$. Is this the new weight? ​ Here is what I guess is happening (I could be completely wrong) - PPO is based on a generic Actor Critic Algorithm Therefore the new weight $\theta$ is learned as per conventional A-C algorithms The old weights $\theta_k$ and new weights $\theta$ are stored and then used in the clip objective The clip objective uses these 2 weights to come up with $\theta_{k+1}$ submitted by /u/Academic-Rent7800 [link] [comments]  ( 43 min )
    Noob question: why is this trivial problem not accordingly trivial to train? (PPO)
    I'm new to ML/RL and trying to build some intuition on how the fancy algorithms are made useful in practice. So far I was surprised how pairing SB3 + a gym environment frequently does nothing unless the right parameters are hit. So I've written a trivial mock environment: it "thinks" of a random target floating point number between -1 and 1 and a random state number in the same range. The observation (-2 to 2) is the difference between both. The (continous -1 to 1) action is added to the state The reward is the absolute value of the distance between current state and the target times -1, with a little negative bit added to "make the life painful" and enforce finishing Done is signaled once converged within 0.001, which is achievable within 1 to max. 2 steps depending on the reset state. So my thinking is: this is as easy as it gets - it does not matter how the algorithm explores or what it sees, the input to output relation is always exactly same. The actor simply needs to learn to forward the input to the output, which is seen learnable from every possible action. So it should train instantly with almost random choices of most hyperparameters - reward discount would not really matter, neither would most of the others, right? (I use PPO from SB3.) Yet I almost could not get it to work: it would behave wildly different with different n_env*n_steps sizes, learning and randomly unlearning again even with small learning_rates set. The cases I could find that work manually would take total_timesteps in a range of millions to finish consistently, and even then they would finish mostly in 2-4 steps rather then 1-2 in evaluation. Is such a problem really such sensitive or am I doing it wrong? Is there a way to actually train it within 10000s rather then millions of steps with PPO and if so, how to find from the train metrics? How could it be improved to converge more precisely? For anyone wanting to play around, code is attached below. submitted by /u/EvilButFluffy [link] [comments]  ( 45 min )
  • Open

    Three Cheers: GFN Thursday Celebrates Third Anniversary With 25 New Games
    Cheers to another year of cloud gaming! GeForce NOW celebrates its third anniversary with a look at how far cloud gaming has come, a community celebration and 25 new games supported in February. Members can celebrate all month long, starting with a sweet Dying Light 2 reward and support for nine more games this week, Read article >  ( 7 min )
    NVIDIA A100 Aces Throughput, Latency Results in Key Inference Benchmark for Financial Services Industry
    NVIDIA A100 Tensor Core GPUs running on Supermicro servers have captured leading results for inference in the latest STAC-ML Markets benchmark, a key technology performance gauge for the financial services industry. The results show NVIDIA demonstrating unrivaled throughput — serving up thousands of inferences per second on the most demanding models — and top latency Read article >  ( 6 min )
    Survey Reveals Financial Industry’s Top 4 AI Priorities for 2023
    For several years, NVIDIA has been working with some of the world’s leading financial institutions to develop and execute a wide range of rapidly evolving AI strategies. For the past three years, we’ve asked them to tell us collectively what’s on the top of their minds. Sometimes the results are just what we thought they’d Read article >  ( 6 min )
  • Open

    Efficient Global Planning in Large MDPs via Stochastic Primal-Dual Optimization. (arXiv:2210.12057v2 [cs.LG] UPDATED)
    We propose a new stochastic primal-dual optimization algorithm for planning in a large discounted Markov decision process with a generative model and linear function approximation. Assuming that the feature map approximately satisfies standard realizability and Bellman-closedness conditions and also that the feature vectors of all state-action pairs are representable as convex combinations of a small core set of state-action pairs, we show that our method outputs a near-optimal policy after a polynomial number of queries to the generative model. Our method is computationally efficient and comes with the major advantage that it outputs a single softmax policy that is compactly represented by a low-dimensional parameter vector, and does not need to execute computationally expensive local planning subroutines in runtime.
    Improving Score-based Diffusion Models by Enforcing the Underlying Score Fokker-Planck Equation. (arXiv:2210.04296v3 [cs.LG] UPDATED)
    Score-based generative models learn a family of noise-conditional score functions corresponding to the data density perturbed with increasingly large amounts of noise. These perturbed data densities are tied together by the Fokker-Planck equation (FPE), a partial differential equation (PDE) governing the spatial-temporal evolution of a density undergoing a diffusion process. In this work, we derive a corresponding equation, called the score FPE that characterizes the noise-conditional scores of the perturbed data densities (i.e., their gradients). Surprisingly, despite impressive empirical performance, we observe that scores learned via denoising score matching (DSM) do not satisfy the underlying score FPE. We prove that satisfying the FPE is desirable as it improves the likelihood and the degree of conservativity. Hence, we propose to regularize the DSM objective to enforce satisfaction of the score FPE, and we show the effectiveness of this approach across various datasets.
    Prioritizing Samples in Reinforcement Learning with Reducible Loss. (arXiv:2208.10483v2 [cs.LG] UPDATED)
    Most reinforcement learning algorithms take advantage of an experience replay buffer to repeatedly train on samples the agent has observed in the past. Not all samples carry the same amount of significance and simply assigning equal importance to each of the samples is a na\"ive strategy. In this paper, we propose a method to prioritize samples based on how much we can learn from a sample. We define the learn-ability of a sample as the steady decrease of the training loss associated with this sample over time. We develop an algorithm to prioritize samples with high learn-ability, while assigning lower priority to those that are hard-to-learn, typically caused by noise or stochasticity. We empirically show that our method is more robust than random sampling and also better than just prioritizing with respect to the training loss, i.e. the temporal difference loss, which is used in prioritized experience replay.
    ERA-Solver: Error-Robust Adams Solver for Fast Sampling of Diffusion Probabilistic Models. (arXiv:2301.12935v2 [cs.LG] UPDATED)
    Though denoising diffusion probabilistic models (DDPMs) have achieved remarkable generation results, the low sampling efficiency of DDPMs still limits further applications. Since DDPMs can be formulated as diffusion ordinary differential equations (ODEs), various fast sampling methods can be derived from solving diffusion ODEs. However, we notice that previous sampling methods with fixed analytical form are not robust with the error in the noise estimated from pretrained diffusion models. In this work, we construct an error-robust Adams solver (ERA-Solver), which utilizes the implicit Adams numerical method that consists of a predictor and a corrector. Different from the traditional predictor based on explicit Adams methods, we leverage a Lagrange interpolation function as the predictor, which is further enhanced with an error-robust strategy to adaptively select the Lagrange bases with lower error in the estimated noise. Experiments on Cifar10, LSUN-Church, and LSUN-Bedroom datasets demonstrate that our proposed ERA-Solver achieves 5.14, 9.42, and 9.69 Fenchel Inception Distance (FID) for image generation, with only 10 network evaluations.
    Lattice-Free Sequence Discriminative Training for Phoneme-Based Neural Transducers. (arXiv:2212.04325v2 [eess.AS] UPDATED)
    Recently, RNN-Transducers have achieved remarkable results on various automatic speech recognition tasks. However, lattice-free sequence discriminative training methods, which obtain superior performance in hybrid modes, are rarely investigated in RNN-Transducers. In this work, we propose three lattice-free training objectives, namely lattice-free maximum mutual information, lattice-free segment-level minimum Bayes risk, and lattice-free minimum Bayes risk, which are used for the final posterior output of the phoneme-based neural transducer with a limited context dependency. Compared to criteria using N-best lists, lattice-free methods eliminate the decoding step for hypotheses generation during training, which leads to more efficient training. Experimental results show that lattice-free methods gain up to 6.5% relative improvement in word error rate compared to a sequence-level cross-entropy trained model. Compared to the N-best-list based minimum Bayes risk objectives, lattice-free methods gain 40% - 70% relative training time speedup with a small degradation in performance.
    Accelerating Material Design with the Generative Toolkit for Scientific Discovery. (arXiv:2207.03928v4 [cs.LG] UPDATED)
    With the growing availability of data within various scientific domains, generative models hold enormous potential to accelerate scientific discovery. They harness powerful representations learned from datasets to speed up the formulation of novel hypotheses with the potential to impact material discovery broadly. We present the Generative Toolkit for Scientific Discovery (GT4SD). This extensible open-source library enables scientists, developers, and researchers to train and use state-of-the-art generative models to accelerate scientific discovery focused on material design.
    Learning to reject meets OOD detection: Are all abstentions created equal?. (arXiv:2301.12386v2 [cs.LG] UPDATED)
    Learning to reject (L2R) and out-of-distribution (OOD) detection are two classical problems, each of which involve detecting certain abnormal samples: in L2R, the goal is to detect "hard" samples on which to abstain, while in OOD detection, the goal is to detect "outlier" samples not drawn from the training distribution. Intriguingly, despite being developed in parallel literatures, both problems share a simple baseline: the maximum softmax probability (MSP) score. However, there is limited understanding of precisely how these problems relate. In this paper, we formally relate these problems, and show how they may be jointly solved. We first show that while MSP is theoretically optimal for L2R, it can be theoretically sub-optimal for OOD detection in some important practical settings. We then characterize the Bayes-optimal classifier for a unified formulation that generalizes both L2R and OOD detection. Based on this, we design a plug-in approach for learning to abstain on both inlier and OOD samples, while constraining the total abstention budget. Experiments on benchmark OOD datasets demonstrate that our approach yields competitive classification and OOD detection performance compared to baselines from both literatures.
    Continuous Soft Pseudo-Labeling in ASR. (arXiv:2211.06007v2 [cs.LG] UPDATED)
    Continuous pseudo-labeling (PL) algorithms such as slimIPL have recently emerged as a powerful strategy for semi-supervised learning in speech recognition. In contrast with earlier strategies that alternated between training a model and generating pseudo-labels (PLs) with it, here PLs are generated in end-to-end manner as training proceeds, improving training speed and the accuracy of the final model. PL shares a common theme with teacher-student models such as distillation in that a teacher model generates targets that need to be mimicked by the student model being trained. However, interestingly, PL strategies in general use hard-labels, whereas distillation uses the distribution over labels as the target to mimic. Inspired by distillation we expect that specifying the whole distribution (aka soft-labels) over sequences as the target for unlabeled data, instead of a single best pass pseudo-labeled transcript (hard-labels) should improve PL performance and convergence. Surprisingly and unexpectedly, we find that soft-labels targets can lead to training divergence, with the model collapsing to a degenerate token distribution per frame. We hypothesize that the reason this does not happen with hard-labels is that training loss on hard-labels imposes sequence-level consistency that keeps the model from collapsing to the degenerate solution. In this paper, we show several experiments that support this hypothesis, and experiment with several regularization approaches that can ameliorate the degenerate collapse when using soft-labels. These approaches can bring the accuracy of soft-labels closer to that of hard-labels, and while they are unable to outperform them yet, they serve as a useful framework for further improvements.
    Spectral Maps for Learning on Subgraphs. (arXiv:2205.14938v4 [cs.LG] UPDATED)
    In graph learning, maps between graphs and their subgraphs frequently arise. For instance, when coarsening or rewiring operations are present along the pipeline, one needs to keep track of the corresponding nodes between the original and modified graphs. Classically, these maps are represented as binary node-to-node correspondence matrices and used as-is to transfer node-wise features between the graphs. In this paper, we argue that simply changing this map representation can bring notable benefits to graph learning tasks. Drawing inspiration from recent progress in geometry processing, we introduce a spectral representation for maps that is easy to integrate into existing graph learning models. This spectral representation is a compact and straightforward plug-in replacement and is robust to topological changes of the graphs. Remarkably, the representation exhibits structural properties that make it interpretable, drawing an analogy with recent results on smooth manifolds. We demonstrate the benefits of incorporating spectral maps in graph learning pipelines, addressing scenarios where a node-to-node map is not well defined, or in the absence of exact isomorphism. Our approach bears practical benefits in knowledge distillation and hierarchical learning, where we show comparable or improved performance at a fraction of the computational cost.
    STEEL: Singularity-aware Reinforcement Learning. (arXiv:2301.13152v2 [stat.ML] UPDATED)
    Batch reinforcement learning (RL) aims at finding an optimal policy in a dynamic environment in order to maximize the expected total rewards by leveraging pre-collected data. A fundamental challenge behind this task is the distributional mismatch between the batch data generating process and the distribution induced by target policies. Nearly all existing algorithms rely on the absolutely continuous assumption on the distribution induced by target policies with respect to the data distribution so that the batch data can be used to calibrate target policies via the change of measure. However, the absolute continuity assumption could be violated in practice, especially when the state-action space is large or continuous. In this paper, we propose a new batch RL algorithm without requiring absolute continuity in the setting of an infinite-horizon Markov decision process with continuous states and actions. We call our algorithm STEEL: SingulariTy-awarE rEinforcement Learning. Our algorithm is motivated by a new error analysis on off-policy evaluation, where we use maximum mean discrepancy, together with distributionally robust optimization, to characterize the error of off-policy evaluation caused by the possible singularity and to enable the power of model extrapolation. By leveraging the idea of pessimism and under some mild conditions, we derive a finite-sample regret guarantee for our proposed algorithm without imposing absolute continuity. Compared with existing algorithms, STEEL only requires some minimal data-coverage assumption and thus greatly enhances the applicability and robustness of batch RL. Extensive simulation studies and one real experiment on personalized pricing demonstrate the superior performance of our method when facing possible singularity in batch RL.
    Active Sequential Two-Sample Testing. (arXiv:2301.12616v2 [cs.LG] UPDATED)
    Two-sample testing tests whether the distributions generating two samples are identical. We pose the two-sample testing problem in a new scenario where the sample measurements (or sample features) are inexpensive to access, but their group memberships (or labels) are costly. We devise the first \emph{active sequential two-sample testing framework} that not only sequentially but also \emph{actively queries} sample labels to address the problem. Our test statistic is a likelihood ratio where one likelihood is found by maximization over all class priors, and the other is given by a classification model. The classification model is adaptively updated and then used to guide an active query scheme called bimodal query to label sample features in the regions with high dependency between the feature variables and the label variables. The theoretical contributions in the paper include proof that our framework produces an \emph{anytime-valid} $p$-value; and, under reachable conditions and a mild assumption, the framework asymptotically generates a minimum normalized log-likelihood ratio statistic that a passive query scheme can only achieve when the feature variable and the label variable have the highest dependence. Lastly, we provide a \emph{query-switching (QS)} algorithm to decide when to switch from passive query to active query and adapt bimodal query to increase the testing power of our test. Extensive experiments justify our theoretical contributions and the effectiveness of QS.
    What can be learnt with wide convolutional neural networks?. (arXiv:2208.01003v4 [stat.ML] UPDATED)
    Understanding how convolutional neural networks (CNNs) can efficiently learn high-dimensional functions remains a fundamental challenge. A popular belief is that these models harness the local and hierarchical structure of natural data such as images. Yet, we lack a quantitative understanding of how such structure affects performance, e.g. the rate of decay of the generalisation error with the number of training samples. In this paper, we study deep CNNs in the kernel regime. First, we show that the spectrum of the corresponding kernel inherits the hierarchical structure of the network, and we characterise its asymptotics. Then, we use this result together with generalisation bounds to prove that deep CNNs adapt to the spatial scale of the target function. In particular, we find that if the target function depends on low-dimensional subsets of adjacent input variables, then the rate of decay of the error is controlled by the effective dimensionality of these subsets. Conversely, if the target function depends on the full set of input variables, then the error rate is inversely proportional to the input dimension. We conclude by computing the rate when a deep CNN is trained on the output of another deep CNN with randomly-initialised parameters. Interestingly, we find that, despite their hierarchical structure, the functions generated by deep CNNs are too rich to be efficiently learnable in high dimension.
    Causal Estimation for Text Data with (Apparent) Overlap Violations. (arXiv:2210.00079v2 [stat.ML] UPDATED)
    Consider the problem of estimating the causal effect of some attribute of a text document; for example: what effect does writing a polite vs. rude email have on response time? To estimate a causal effect from observational data, we need to adjust for confounding aspects of the text that affect both the treatment and outcome -- e.g., the topic or writing level of the text. These confounding aspects are unknown a priori, so it seems natural to adjust for the entirety of the text (e.g., using a transformer). However, causal identification and estimation procedures rely on the assumption of overlap: for all levels of the adjustment variables, there is randomness leftover so that every unit could have (not) received treatment. Since the treatment here is itself an attribute of the text, it is perfectly determined, and overlap is apparently violated. The purpose of this paper is to show how to handle causal identification and obtain robust causal estimation in the presence of apparent overlap violations. In brief, the idea is to use supervised representation learning to produce a data representation that preserves confounding information while eliminating information that is only predictive of the treatment. This representation then suffices for adjustment and can satisfy overlap. Adapting results on non-parametric estimation, we find that this procedure is robust to conditional outcome misestimation, yielding a low-bias estimator with valid uncertainty quantification under weak conditions. Empirical results show strong improvements in bias and uncertainty quantification relative to the natural baseline.
    Fairness-aware Vision Transformer via Debiased Self-Attention. (arXiv:2301.13803v1 [cs.CV])
    Vision Transformer (ViT) has recently gained significant interest in solving computer vision (CV) problems due to its capability of extracting informative features and modeling long-range dependencies through the self-attention mechanism. To fully realize the advantages of ViT in real-world applications, recent works have explored the trustworthiness of ViT, including its robustness and explainability. However, another desiderata, fairness has not yet been adequately addressed in the literature. We establish that the existing fairness-aware algorithms (primarily designed for CNNs) do not perform well on ViT. This necessitates the need for developing our novel framework via Debiased Self-Attention (DSA). DSA is a fairness-through-blindness approach that enforces ViT to eliminate spurious features correlated with the sensitive attributes for bias mitigation. Notably, adversarial examples are leveraged to locate and mask the spurious features in the input image patches. In addition, DSA utilizes an attention weights alignment regularizer in the training objective to encourage learning informative features for target prediction. Importantly, our DSA framework leads to improved fairness guarantees over prior works on multiple prediction tasks without compromising target prediction performance
    Gaussian Noise is Nearly Instance Optimal for Private Unbiased Mean Estimation. (arXiv:2301.13850v1 [math.ST])
    We investigate unbiased high-dimensional mean estimators in differential privacy. We consider differentially private mechanisms whose expected output equals the mean of the input dataset, for every dataset drawn from a fixed convex domain $K$ in $\mathbb{R}^d$. In the setting of concentrated differential privacy, we show that, for every input such an unbiased mean estimator introduces approximately at least as much error as a mechanism that adds Gaussian noise with a carefully chosen covariance. This is true when the error is measured with respect to $\ell_p$ error for any $p \ge 2$. We extend this result to local differential privacy, and to approximate differential privacy, but for the latter the error lower bound holds either for a dataset or for a neighboring dataset. We also extend our results to mechanisms that take i.i.d.~samples from a distribution over $K$ and are unbiased with respect to the mean of the distribution.
    FedPass: Privacy-Preserving Vertical Federated Deep Learning with Adaptive Obfuscation. (arXiv:2301.12623v2 [cs.DC] UPDATED)
    Vertical federated learning (VFL) allows an active party with labeled feature to leverage auxiliary features from the passive parties to improve model performance. Concerns about the private feature and label leakage in both the training and inference phases of VFL have drawn wide research attention. In this paper, we propose a general privacy-preserving vertical federated deep learning framework called FedPass, which leverages adaptive obfuscation to protect the feature and label simultaneously. Strong privacy-preserving capabilities about private features and labels are theoretically proved (in Theorems 1 and 2). Extensive experimental result s with different datasets and network architectures also justify the superiority of FedPass against existing methods in light of its near-optimal trade-off between privacy and model performance.
    Personalized Decentralized Bilevel Optimization over Random Directed Networks. (arXiv:2210.02129v2 [stat.ML] UPDATED)
    Personalization and decentralization are two major lines of studies to realize practical federated learning in the real world. The aim of this study is to establish a general and unified approach that can solve these two problems simultaneously. In this work, we first propose a bilevel problem that can adapt to various personalization scenarios by allowing an arbitrary choice of two parameters: a client-wise outer-parameter representing heterogeneity, and a shared inner-parameter representing homogeneity across client data distributions. We then present an algorithm that can solve this bilevel problem in a decentralized manner by estimating gradients of clients' outer-costs with respect to their outer-parameters. We show that the proposed algorithm can be extended to handle a random directed network, which is one of the most robust decentralized communication classes. The proposed method achieves state-of-the-art performance on a personalization benchmark across various communication settings.
    On the Global Convergence of Fitted Q-Iteration with Two-layer Neural Network Parametrization. (arXiv:2211.07675v2 [cs.LG] UPDATED)
    Deep Q-learning based algorithms have been applied successfully in many decision making problems, while their theoretical foundations are not as well understood. In this paper, we study a Fitted Q-Iteration with two-layer ReLU neural network parameterization, and find the sample complexity guarantees for the algorithm. Our approach estimates the Q-function in each iteration using a convex optimization problem. We show that this approach achieves a sample complexity of $\tilde{\mathcal{O}}(1/\epsilon^{2})$, which is order-optimal. This result holds for a countable state-spaces and does not require any assumptions such as a linear or low rank structure on the MDP.
    ChatGPT or Human? Detect and Explain. Explaining Decisions of Machine Learning Model for Detecting Short ChatGPT-generated Text. (arXiv:2301.13852v1 [cs.CL])
    ChatGPT has the ability to generate grammatically flawless and seemingly-human replies to different types of questions from various domains. The number of its users and of its applications is growing at an unprecedented rate. Unfortunately, use and abuse come hand in hand. In this paper, we study whether a machine learning model can be effectively trained to accurately distinguish between original human and seemingly human (that is, ChatGPT-generated) text, especially when this text is short. Furthermore, we employ an explainable artificial intelligence framework to gain insight into the reasoning behind the model trained to differentiate between ChatGPT-generated and human-generated text. The goal is to analyze model's decisions and determine if any specific patterns or characteristics can be identified. Our study focuses on short online reviews, conducting two experiments comparing human-generated and ChatGPT-generated text. The first experiment involves ChatGPT text generated from custom queries, while the second experiment involves text generated by rephrasing original human-generated reviews. We fine-tune a Transformer-based model and use it to make predictions, which are then explained using SHAP. We compare our model with a perplexity score-based approach and find that disambiguation between human and ChatGPT-generated reviews is more challenging for the ML model when using rephrased text. However, our proposed approach still achieves an accuracy of 79%. Using explainability, we observe that ChatGPT's writing is polite, without specific details, using fancy and atypical vocabulary, impersonal, and typically it does not express feelings.
    Learning useful representations for shifting tasks and distributions. (arXiv:2212.07346v2 [cs.LG] UPDATED)
    Does the dominant approach to learn representations (as a side effect of optimizing an expected cost for a single training distribution) remain a good approach when we are dealing with multiple distributions? Our thesis is that such scenarios are better served by representations that are richer than those obtained with a single optimization episode. We support this thesis with simple theoretical arguments and with experiments utilizing an apparently na\"{\i}ve ensembling technique: concatenating the representations obtained from multiple training episodes using the same data, model, algorithm, and hyper-parameters, but different random seeds. These independently trained networks perform similarly. Yet, in a number of scenarios involving new distributions, the concatenated representation performs substantially better than an equivalently sized network trained with a single training run. This proves that the representations constructed by multiple training episodes are in fact different. Although their concatenation carries little additional information about the training task under the training distribution, it becomes substantially more informative when tasks or distributions change. Meanwhile, a single training episode is unlikely to yield such a redundant representation because the optimization process has no reason to accumulate features that do not incrementally improve the training performance.
    Adaptively Weighted Data Augmentation Consistency Regularization for Robust Optimization under Concept Shift. (arXiv:2210.01891v2 [cs.CV] UPDATED)
    Concept shift is a prevailing problem in natural tasks like medical image segmentation where samples usually come from different subpopulations with variant correlations between features and labels. One common type of concept shift in medical image segmentation is the "information imbalance" between label-sparse samples with few (if any) segmentation labels and label-dense samples with plentiful labeled pixels. Existing distributionally robust algorithms have focused on adaptively truncating/down-weighting the "less informative" (i.e., label-sparse in our context) samples. To exploit data features of label-sparse samples more efficiently, we propose an adaptively weighted online optimization algorithm -- AdaWAC -- to incorporate data augmentation consistency regularization in sample reweighting. Our method introduces a set of trainable weights to balance the supervised loss and unsupervised consistency regularization of each sample separately. At the saddle point of the underlying objective, the weights assign label-dense samples to the supervised loss and label-sparse samples to the unsupervised consistency regularization. We provide a convergence guarantee by recasting the optimization as online mirror descent on a saddle point problem. Our empirical results demonstrate that AdaWAC not only enhances the segmentation performance and sample efficiency but also improves the robustness to concept shift on various medical image segmentation tasks with different UNet-style backbones.
    Hierarchically Clustered PCA, LLE, and CCA via a Convex Clustering Penalty. (arXiv:2211.16553v2 [cs.LG] UPDATED)
    We introduce an unsupervised learning approach that combines the truncated singular value decomposition with convex clustering to estimate within-cluster directions of maximum variance/covariance (in the variables) while simultaneously hierarchically clustering (on observations). In contrast to previous work on joint clustering and embedding, our approach has a straightforward formulation, is readily scalable via distributed optimization, and admits a direct interpretation as hierarchically clustered principal component analysis (PCA), hierarchically clustered locally linear embedding (LLE), or hierarchically clustered canonical correlation analysis (CCA). Through numerical experiments and real-world examples relevant to precision medicine, we show that our approach outperforms traditional and contemporary clustering methods on both underdetermined problems ($p \gg N$ with tens of observations) and on large datasets (e.g., $N=100,000$) while yielding interpretable dendrograms of hierarchical per-cluster principal components or canonical variates.
    Weak Proxies are Sufficient and Preferable for Fairness with Missing Sensitive Attributes. (arXiv:2210.03175v2 [cs.LG] UPDATED)
    Evaluating fairness can be challenging in practice because the sensitive attributes of data are often inaccessible due to privacy constraints. The go-to approach that the industry frequently adopts is using off-the-shelf proxy models to predict the missing sensitive attributes, e.g. Meta [Alao et al., 2021] and Twitter [Belli et al., 2022]. Despite its popularity, there are three important questions unanswered: (1) Is directly using proxies efficacious in measuring fairness? (2) If not, is it possible to accurately evaluate fairness using proxies only? (3) Given the ethical controversy over inferring user private information, is it possible to only use weak (i.e. inaccurate) proxies in order to protect privacy? Our theoretical analyses show that directly using proxy models can give a false sense of (un)fairness. Second, we develop an algorithm that is able to measure fairness (provably) accurately with only three properly identified proxies. Third, we show that our algorithm allows the use of only weak proxies (e.g. with only 68.85%accuracy on COMPAS), adding an extra layer of protection on user privacy. Experiments validate our theoretical analyses and show our algorithm can effectively measure and mitigate bias. Our results imply a set of practical guidelines for practitioners on how to use proxies properly. Code is available at github.com/UCSC-REAL/fair-eval.
    Deep Reinforcement Learning for Cryptocurrency Trading: Practical Approach to Address Backtest Overfitting. (arXiv:2209.05559v6 [q-fin.ST] UPDATED)
    Designing profitable and reliable trading strategies is challenging in the highly volatile cryptocurrency market. Existing works applied deep reinforcement learning methods and optimistically reported increased profits in backtesting, which may suffer from the false positive issue due to overfitting. In this paper, we propose a practical approach to address backtest overfitting for cryptocurrency trading using deep reinforcement learning. First, we formulate the detection of backtest overfitting as a hypothesis test. Then, we train the DRL agents, estimate the probability of overfitting, and reject the overfitted agents, increasing the chance of good trading performance. Finally, on 10 cryptocurrencies over a testing period from 05/01/2022 to 06/27/2022 (during which the crypto market crashed two times), we show that the less overfitted deep reinforcement learning agents have a higher return than that of more overfitted agents, an equal weight strategy, and the S&P DBM Index (market benchmark), offering confidence in possible deployment to a real market.
    Unifying Generative Models with GFlowNets and Beyond. (arXiv:2209.02606v2 [cs.LG] UPDATED)
    There are many frameworks for deep generative modeling, each often presented with their own specific training algorithms and inference methods. Here, we demonstrate the connections between existing deep generative models and the recently introduced GFlowNet framework, a probabilistic inference machine which treats sampling as a decision-making process. This analysis sheds light on their overlapping traits and provides a unifying viewpoint through the lens of learning with Markovian trajectories. Our framework provides a means for unifying training and inference algorithms, and provides a route to shine a unifying light over many generative models. Beyond this, we provide a practical and experimentally verified recipe for improving generative modeling with insights from the GFlowNet perspective.
    Can Persistent Homology provide an efficient alternative for Evaluation of Knowledge Graph Completion Methods?. (arXiv:2301.12929v2 [cs.LG] UPDATED)
    In this paper we present a novel method, $\textit{Knowledge Persistence}$ ($\mathcal{KP}$), for faster evaluation of Knowledge Graph (KG) completion approaches. Current ranking-based evaluation is quadratic in the size of the KG, leading to long evaluation times and consequently a high carbon footprint. $\mathcal{KP}$ addresses this by representing the topology of the KG completion methods through the lens of topological data analysis, concretely using persistent homology. The characteristics of persistent homology allow $\mathcal{KP}$ to evaluate the quality of the KG completion looking only at a fraction of the data. Experimental results on standard datasets show that the proposed metric is highly correlated with ranking metrics (Hits@N, MR, MRR). Performance evaluation shows that $\mathcal{KP}$ is computationally efficient: In some cases, the evaluation time (validation+test) of a KG completion method has been reduced from 18 hours (using Hits@10) to 27 seconds (using $\mathcal{KP}$), and on average (across methods & data) reduces the evaluation time (validation+test) by $\approx$ $\textbf{99.96}\%$.
    Bias in Machine Learning Models Can Be Significantly Mitigated by Careful Training: Evidence from Neuroimaging Studies. (arXiv:2205.13421v2 [cs.LG] UPDATED)
    Despite the great promise that machine learning has offered in many fields of medicine, it has also raised concerns about potential biases and poor generalization across genders, age distributions, races and ethnicities, hospitals, and data acquisition equipment and protocols. In the current study, and in the context of three brain diseases, we provide evidence which suggests that when properly trained, machine learning models can generalize well across diverse conditions and do not necessarily suffer from bias. Specifically, by using multi-study magnetic resonance imaging consortia for diagnosing Alzheimer's disease, schizophrenia, and autism spectrum disorder, we find that well-trained models have a high area-under-the-curve (AUC) on subjects across different subgroups pertaining to attributes such as gender, age, racial groups, and different clinical studies and are unbiased under multiple fairness metrics such as demographic parity difference, equalized odds difference, equal opportunity difference etc. We find that models that incorporate multi-source data from demographic, clinical, genetic factors and cognitive scores are also unbiased. These models have better predictive AUC across subgroups than those trained only with imaging features but there are also situations when these additional features do not help.
    Revisiting Hyperparameter Tuning with Differential Privacy. (arXiv:2211.01852v2 [cs.LG] UPDATED)
    Hyperparameter tuning is a common practice in the application of machine learning but is a typically ignored aspect in the literature on privacy-preserving machine learning due to its negative effect on the overall privacy parameter. In this paper, we aim to tackle this fundamental yet challenging problem by providing an effective hyperparameter tuning framework with differential privacy. The proposed method allows us to adopt a broader hyperparameter search space and even to perform a grid search over the whole space, since its privacy loss parameter is independent of the number of hyperparameter candidates. Interestingly, it instead correlates with the utility gained from hyperparameter searching, revealing an explicit and mandatory trade-off between privacy and utility. Theoretically, we show that its additional privacy loss bound incurred by hyperparameter tuning is upper-bounded by the squared root of the gained utility. However, we note that the additional privacy loss bound would empirically scale like a squared root of the logarithm of the utility term, benefiting from the design of doubling step.
    Optimal Solutions for Joint Beamforming and Antenna Selection: From Branch and Bound to Graph Neural Imitation Learning. (arXiv:2206.05576v2 [eess.SP] UPDATED)
    This work revisits the joint beamforming (BF) and antenna selection (AS) problem, as well as its robust beamforming (RBF) version under imperfect channel state information (CSI). Such problems arise due to various reasons, e.g., the costly nature of the radio frequency (RF) chains and energy/resource-saving considerations. The joint (R)BF\&AS problem is a mixed integer and nonlinear program, and thus finding {\it optimal solutions} is often costly, if not outright impossible. The vast majority of the prior works tackled these problems using techniques such as continuous approximations, greedy methods, and supervised machine learning -- yet these approaches do not ensure optimality or even feasibility of the solutions. The main contribution of this work is threefold. First, an effective {\it branch and bound} (B\&B) framework for solving the problems of interest is proposed. Leveraging existing BF and RBF solvers, it is shown that the B\&B framework guarantees global optimality of the considered problems. Second, to expedite the potentially costly B\&B algorithm, a machine learning (ML)-based scheme is proposed to help skip intermediate states of the B\&B search tree. The learning model features a {\it graph neural network} (GNN)-based design that is resilient to a commonly encountered challenge in wireless communications, namely, the change of problem size (e.g., the number of users) across the training and test stages. Third, comprehensive performance characterizations are presented, showing that the GNN-based method retains the global optimality of B\&B with provably reduced complexity, under reasonable conditions. Numerical simulations also show that the ML-based acceleration can often achieve an order-of-magnitude speedup relative to B\&B.
    Limitations of Information-Theoretic Generalization Bounds for Gradient Descent Methods in Stochastic Convex Optimization. (arXiv:2212.13556v2 [cs.LG] UPDATED)
    To date, no "information-theoretic" frameworks for reasoning about generalization error have been shown to establish minimax rates for gradient descent in the setting of stochastic convex optimization. In this work, we consider the prospect of establishing such rates via several existing information-theoretic frameworks: input-output mutual information bounds, conditional mutual information bounds and variants, PAC-Bayes bounds, and recent conditional variants thereof. We prove that none of these bounds are able to establish minimax rates. We then consider a common tactic employed in studying gradient methods, whereby the final iterate is corrupted by Gaussian noise, producing a noisy "surrogate" algorithm. We prove that minimax rates cannot be established via the analysis of such surrogates. Our results suggest that new ideas are required to analyze gradient descent using information-theoretic techniques.
    Learning from many trajectories. (arXiv:2203.17193v2 [cs.LG] UPDATED)
    We initiate a study of supervised learning from many independent sequences ("trajectories") of non-independent covariates, reflecting tasks in sequence modeling, control, and reinforcement learning. Conceptually, our multi-trajectory setup sits between two traditional settings in statistical learning theory: learning from independent examples and learning from a single auto-correlated sequence. Our conditions for efficient learning generalize the former setting--trajectories must be non-degenerate in ways that extend standard requirements for independent examples. Notably, we do not require that trajectories be ergodic, long, nor strictly stable. For linear least-squares regression, given $n$-dimensional examples produced by $m$ trajectories, each of length $T$, we observe a notable change in statistical efficiency as the number of trajectories increases from a few (namely $m \lesssim n$) to many (namely $m \gtrsim n$). Specifically, we establish that the worst-case error rate of this problem is $\Theta(n / m T)$ whenever $m \gtrsim n$. Meanwhile, when $m \lesssim n$, we establish a (sharp) lower bound of $\Omega(n^2 / m^2 T)$ on the worst-case error rate, realized by a simple, marginally unstable linear dynamical system. A key upshot is that, in domains where trajectories regularly reset, the error rate eventually behaves as if all of the examples were independent, drawn from their marginals. As a corollary of our analysis, we also improve guarantees for the linear system identification problem.
    Strategyproof Decision-Making in Panel Data Settings and Beyond. (arXiv:2211.14236v2 [econ.EM] UPDATED)
    We propose a framework for decision-making in the presence of strategic agents with panel data, a standard setting in econometrics and statistics where one gets noisy, repeated measurements of multiple units. We consider a setup where there is a pre-intervention period, when the principal observes the outcomes of each unit, after which the principal uses these observations to assign a treatment to each unit. Our model can be thought of as a generalization of the synthetic controls and synthetic interventions frameworks, where units (or agents) may strategically manipulate pre-intervention outcomes to receive a more desirable intervention. We identify necessary and sufficient conditions under which a strategyproof mechanism that assigns interventions in the post-intervention period exists. Under a latent factor model assumption, we show that whenever a strategyproof mechanism exists, there is one with a simple closed form. In the setting where there is a single treatment and control (i.e., no other interventions), we establish that there is always a strategyproof mechanism, and provide an algorithm for learning such a mechanism. For the setting of multiple interventions, we provide an algorithm for learning a strategyproof mechanism, if there exists a sufficiently large gap in rewards between the different interventions. Finally, we empirically evaluate our model using real-world panel data collected from product sales over 18 months. We find that our methods compare favorably to baselines which do not take strategic interactions into consideration -- even in the presence of model misspecification. Along the way, we prove impossibility results for multi-class strategic classification, which may be of independent interest.
    Robust Reinforcement Learning in Continuous Control Tasks with Uncertainty Set Regularization. (arXiv:2207.02016v2 [cs.LG] UPDATED)
    Reinforcement learning (RL) is recognized as lacking generalization and robustness under environmental perturbations, which excessively restricts its application for real-world robotics. Prior work claimed that adding regularization to the value function is equivalent to learning a robust policy with uncertain transitions. Although the regularization-robustness transformation is appealing for its simplicity and efficiency, it is still lacking in continuous control tasks. In this paper, we propose a new regularizer named $\textbf{U}$ncertainty $\textbf{S}$et $\textbf{R}$egularizer (USR), by formulating the uncertainty set on the parameter space of the transition function. In particular, USR is flexible enough to be plugged into any existing RL framework. To deal with unknown uncertainty sets, we further propose a novel adversarial approach to generate them based on the value function. We evaluate USR on the Real-world Reinforcement Learning (RWRL) benchmark, demonstrating improvements in the robust performance for perturbed testing environments.
    Can we achieve robustness from data alone?. (arXiv:2207.11727v2 [cs.LG] UPDATED)
    We introduce a meta-learning algorithm for adversarially robust classification. The proposed method tries to be as model agnostic as possible and optimizes a dataset prior to its deployment in a machine learning system, aiming to effectively erase its non-robust features. Once the dataset has been created, in principle no specialized algorithm (besides standard gradient descent) is needed to train a robust model. We formulate the data optimization procedure as a bi-level optimization problem on kernel regression, with a class of kernels that describe infinitely wide neural nets (Neural Tangent Kernels). We present extensive experiments on standard computer vision benchmarks using a variety of different models, demonstrating the effectiveness of our method, while also pointing out its current shortcomings. In parallel, we revisit prior work that also focused on the problem of data optimization for robust classification \citep{Ily+19}, and show that being robust to adversarial attacks after standard (gradient descent) training on a suitable dataset is more challenging than previously thought.
    Meta-Learning via Classifier(-free) Diffusion Guidance. (arXiv:2210.08942v2 [cs.LG] UPDATED)
    We introduce meta-learning algorithms that perform zero-shot weight-space adaptation of neural network models to unseen tasks. Our methods repurpose the popular generative image synthesis techniques of natural language guidance and diffusion models to generate neural network weights adapted for tasks. We first train an unconditional generative hypernetwork model to produce neural network weights; then we train a second "guidance" model that, given a natural language task description, traverses the hypernetwork latent space to find high-performance task-adapted weights in a zero-shot manner. We explore two alternative approaches for latent space guidance: "HyperCLIP"-based classifier guidance and a conditional Hypernetwork Latent Diffusion Model ("HyperLDM"), which we show to benefit from the classifier-free guidance technique common in image generation. Finally, we demonstrate that our approaches outperform existing multi-task and meta-learning methods in a series of zero-shot learning experiments on our Meta-VQA dataset.
    Fine-tuning or top-tuning? Transfer learning with pretrained features and fast kernel methods. (arXiv:2209.07932v2 [cs.LG] UPDATED)
    The impressive performances of deep learning architectures is associated to massive increase of models complexity. Millions of parameters need be tuned, with training and inference time scaling accordingly. But is massive fine-tuning necessary? In this paper, focusing on image classification, we consider a simple transfer learning approach exploiting pretrained convolutional features as input for a fast kernel method. We refer to this approach as top-tuning, since only the kernel classifier is trained. By performing more than 2500 training processes we show that this top-tuning approach provides comparable accuracy w.r.t. fine-tuning, with a training time that is between one and two orders of magnitude smaller. These results suggest that top-tuning provides a useful alternative to fine-tuning in small/medium datasets, especially when training efficiency is crucial.
    Py-Feat: Python Facial Expression Analysis Toolbox. (arXiv:2104.03509v3 [cs.CV] UPDATED)
    Studying facial expressions is a notoriously difficult endeavor. Recent advances in the field of affective computing have yielded impressive progress in automatically detecting facial expressions from pictures and videos. However, much of this work has yet to be widely disseminated in social science domains such as psychology. Current state of the art models require considerable domain expertise that is not traditionally incorporated into social science training programs. Furthermore, there is a notable absence of user-friendly and open-source software that provides a comprehensive set of tools and functions that support facial expression research. In this paper, we introduce Py-Feat, an open-source Python toolbox that provides support for detecting, preprocessing, analyzing, and visualizing facial expression data. Py-Feat makes it easy for domain experts to disseminate and benchmark computer vision models and also for end users to quickly process, analyze, and visualize face expression data. We hope this platform will facilitate increased use of facial expression data in human behavior research.
    A Dynamic Programming Algorithm for Finding an Optimal Sequence of Informative Measurements. (arXiv:2109.11808v4 [cs.LG] UPDATED)
    An informative measurement is the most efficient way to gain information about an unknown state. We present a first-principles derivation of a general-purpose dynamic programming algorithm that returns an optimal sequence of informative measurements by sequentially maximizing the entropy of possible measurement outcomes. This algorithm can be used by an autonomous agent or robot to decide where best to measure next, planning a path corresponding to an optimal sequence of informative measurements. The algorithm is applicable to states and controls that are either continuous or discrete, and agent dynamics that is either stochastic or deterministic; including Markov decision processes and Gaussian processes. Recent results from the fields of approximate dynamic programming and reinforcement learning, including on-line approximations such as rollout and Monte Carlo tree search, allow the measurement task to be solved in real time. The resulting solutions include non-myopic paths and measurement sequences that can generally outperform, sometimes substantially, commonly used greedy approaches. This is demonstrated for a global search task, where on-line planning for a sequence of local searches is found to reduce the number of measurements in the search by approximately half. A variant of the algorithm is derived for Gaussian processes for active sensing.
    PINCH: An Adversarial Extraction Attack Framework for Deep Learning Models. (arXiv:2209.06300v2 [cs.CR] UPDATED)
    Adversarial extraction attacks constitute an insidious threat against Deep Learning (DL) models in-which an adversary aims to steal the architecture, parameters, and hyper-parameters of a targeted DL model. Existing extraction attack literature have observed varying levels of attack success for different DL models and datasets, yet the underlying cause(s) behind their susceptibility often remain unclear, and would help facilitate creating secure DL systems. In this paper we present PINCH: an efficient and automated extraction attack framework capable of designing, deploying, and analyzing extraction attack scenarios across heterogeneous hardware platforms. Using PINCH, we perform extensive experimental evaluation of extraction attacks against 21 model architectures to explore new extraction attack scenarios and further attack staging. Our findings show (1) key extraction characteristics whereby particular model configurations exhibit strong resilience against specific attacks, (2) even partial extraction success enables further staging for other adversarial attacks, and (3) equivalent stolen models uncover differences in expressive power, yet exhibit similar captured knowledge.
    Benchmarking Large Language Models for News Summarization. (arXiv:2301.13848v1 [cs.CL])
    Large language models (LLMs) have shown promise for automatic summarization but the reasons behind their successes are poorly understood. By conducting a human evaluation on ten LLMs across different pretraining methods, prompts, and model scales, we make two important observations. First, we find instruction tuning, and not model size, is the key to the LLM's zero-shot summarization capability. Second, existing studies have been limited by low-quality references, leading to underestimates of human performance and lower few-shot and finetuning performance. To better evaluate LLMs, we perform human evaluation over high-quality summaries we collect from freelance writers. Despite major stylistic differences such as the amount of paraphrasing, we find that LMM summaries are judged to be on par with human written summaries.
    Reinforcement learning and decision making via single-photon quantum walks. (arXiv:2301.13669v1 [quant-ph])
    Variational quantum algorithms represent a promising approach to quantum machine learning where classical neural networks are replaced by parametrized quantum circuits. Here, we present a variational approach to quantize projective simulation (PS), a reinforcement learning model aimed at interpretable artificial intelligence. Decision making in PS is modeled as a random walk on a graph describing the agent's memory. To implement the quantized model, we consider quantum walks of single photons in a lattice of tunable Mach-Zehnder interferometers. We propose variational algorithms tailored to reinforcement learning tasks, and we show, using an example from transfer learning, that the quantized PS learning model can outperform its classical counterpart. Finally, we discuss the role of quantum interference for training and decision making, paving the way for realizations of interpretable quantum learning agents.
    What Can the Neural Tangent Kernel Tell Us About Adversarial Robustness?. (arXiv:2210.05577v2 [cs.LG] UPDATED)
    The adversarial vulnerability of neural nets, and subsequent techniques to create robust models have attracted significant attention; yet we still lack a full understanding of this phenomenon. Here, we study adversarial examples of trained neural networks through analytical tools afforded by recent theory advances connecting neural networks and kernel methods, namely the Neural Tangent Kernel (NTK), following a growing body of work that leverages the NTK approximation to successfully analyze important deep learning phenomena and design algorithms for new applications. We show how NTKs allow to generate adversarial examples in a ``training-free'' fashion, and demonstrate that they transfer to fool their finite-width neural net counterparts in the ``lazy'' regime. We leverage this connection to provide an alternative view on robust and non-robust features, which have been suggested to underlie the adversarial brittleness of neural nets. Specifically, we define and study features induced by the eigendecomposition of the kernel to better understand the role of robust and non-robust features, the reliance on both for standard classification and the robustness-accuracy trade-off. We find that such features are surprisingly consistent across architectures, and that robust features tend to correspond to the largest eigenvalues of the model, and thus are learned early during training. Our framework allows us to identify and visualize non-robust yet useful features. Finally, we shed light on the robustness mechanism underlying adversarial training of neural nets used in practice: quantifying the evolution of the associated empirical NTK, we demonstrate that its dynamics falls much earlier into the ``lazy'' regime and manifests a much stronger form of the well known bias to prioritize learning features within the top eigenspaces of the kernel, compared to standard training.
    Bayesian Learning for Neural Networks: an algorithmic survey. (arXiv:2211.11865v4 [stat.ML] UPDATED)
    The last decade witnessed a growing interest in Bayesian learning. Yet, the technicality of the topic and the multitude of ingredients involved therein, besides the complexity of turning theory into practical implementations, limit the use of the Bayesian learning paradigm, preventing its widespread adoption across different fields and applications. This self-contained survey engages and introduces readers to the principles and algorithms of Bayesian Learning for Neural Networks. It provides an introduction to the topic from an accessible, practical-algorithmic perspective. Upon providing a general introduction to Bayesian Neural Networks, we discuss and present both standard and recent approaches for Bayesian inference, with an emphasis on solutions relying on Variational Inference and the use of Natural gradients. We also discuss the use of manifold optimization as a state-of-the-art approach to Bayesian learning. We examine the characteristic properties of all the discussed methods, and provide pseudo-codes for their implementation, paying attention to practical aspects, such as the computation of the gradients.
    Cutting Plane Selection with Analytic Centers and Multiregression. (arXiv:2212.07231v3 [math.OC] UPDATED)
    Cutting planes are a crucial component of state-of-the-art mixed-integer programming solvers, with the choice of which subset of cuts to add being vital for solver performance. We propose new distance-based measures to qualify the value of a cut by quantifying the extent to which it separates relevant parts of the relaxed feasible set. For this purpose, we use the analytic centers of the relaxation polytope or of its optimal face, as well as alternative optimal solutions of the linear programming relaxation. We assess the impact of the choice of distance measure on root node performance and throughout the whole branch-and-bound tree, comparing our measures against those prevalent in the literature. Finally, by a multi-output regression, we predict the relative performance of each measure, using static features readily available before the separation process. Our results indicate that analytic center-based methods help to significantly reduce the number of branch-and-bound nodes needed to explore the search space and that our multiregression approach can further improve on any individual method.
    Direct-Effect Risk Minimization for Domain Generalization. (arXiv:2211.14594v2 [cs.LG] UPDATED)
    We study the problem of out-of-distribution (o.o.d.) generalization where spurious correlations of attributes vary across training and test domains. This is known as the problem of correlation shift and has posed concerns on the reliability of machine learning. In this work, we introduce the concepts of direct and indirect effects from causal inference to the domain generalization problem. We argue that models that learn direct effects minimize the worst-case risk across correlation-shifted domains. To eliminate the indirect effects, our algorithm consists of two stages: in the first stage, we learn an indirect-effect representation by minimizing the prediction error of domain labels using the representation and the class label; in the second stage, we remove the indirect effects learned in the first stage by matching each data with another data of similar indirect-effect representation but of different class label. We also propose a new model selection method by matching the validation set in the same way, which is shown to improve the generalization performance of existing models on correlation-shifted datasets. Experiments on 5 correlation-shifted datasets and the DomainBed benchmark verify the effectiveness of our approach.
    Difformer: Empowering Diffusion Models on the Embedding Space for Text Generation. (arXiv:2212.09412v2 [cs.CL] UPDATED)
    Diffusion models have achieved state-of-the-art synthesis quality on both visual and audio tasks, and recent works further adapt them to textual data by diffusing on the embedding space. In this paper, we conduct systematic studies and analyze the challenges between the continuous data space and the embedding space which have not been carefully explored. Firstly, the data distribution is learnable for embeddings, which may lead to the collapse of the loss function. Secondly, as the norm of embeddings varies between popular and rare words, adding the same noise scale will lead to sub-optimal results. In addition, we find the normal level of noise causes insufficient training of the model. To address the above challenges, we propose Difformer, an embedding diffusion model based on Transformer, which consists of three essential modules including an additional anchor loss function, a layer normalization module for embeddings, and a noise factor to the Gaussian noise. Experiments on two seminal text generation tasks including machine translation and text summarization show the superiority of Difformer over compared embedding diffusion baselines.
    Diffuser: Efficient Transformers with Multi-hop Attention Diffusion for Long Sequences. (arXiv:2210.11794v2 [cs.LG] UPDATED)
    Efficient Transformers have been developed for long sequence modeling, due to their subquadratic memory and time complexity. Sparse Transformer is a popular approach to improving the efficiency of Transformers by restricting self-attention to locations specified by the predefined sparse patterns. However, leveraging sparsity may sacrifice expressiveness compared to full-attention, when important token correlations are multiple hops away. To combine advantages of both the efficiency of sparse transformer and the expressiveness of full-attention Transformer, we propose \textit{Diffuser}, a new state-of-the-art efficient Transformer. Diffuser incorporates all token interactions within one attention layer while maintaining low computation and memory costs. The key idea is to expand the receptive field of sparse attention using Attention Diffusion, which computes multi-hop token correlations based on all paths between corresponding disconnected tokens, besides attention among neighboring tokens. Theoretically, we show the expressiveness of Diffuser as a universal sequence approximator for sequence-to-sequence modeling, and investigate its ability to approximate full-attention by analyzing the graph expander property from the spectral perspective. Experimentally, we investigate the effectiveness of Diffuser with extensive evaluations, including language modeling, image modeling, and Long Range Arena (LRA). Evaluation results show that Diffuser achieves improvements by an average of 0.94% on text classification tasks and 2.30% on LRA, with 1.67$\times$ memory savings compared to state-of-the-art benchmarks, which demonstrates superior performance of Diffuser in both expressiveness and efficiency aspects.
    Forecasting COVID- 19 cases using Statistical Models and Ontology-based Semantic Modelling: A real time data analytics approach. (arXiv:2206.02795v2 [q-bio.PE] UPDATED)
    SARS-COV-19 is the most prominent issue which many countries face today. The frequent changes in infections, recovered and deaths represents the dynamic nature of this pandemic. It is very crucial to predict the spreading rate of this virus for accurate decision making against fighting with the situation of getting infected through the virus, tracking and controlling the virus transmission in the community. We develop a prediction model using statistical time series models such as SARIMA and FBProphet to monitor the daily active, recovered and death cases of COVID-19 accurately. Then with the help of various details across each individual patient (like height, weight, gender etc.), we designed a set of rules using Semantic Web Rule Language and some mathematical models for dealing with COVID19 infected cases on an individual basis. After combining all the models, a COVID-19 Ontology is developed and performs various queries using SPARQL query on designed Ontology which accumulate the risk factors, provide appropriate diagnosis, precautions and preventive suggestions for COVID Patients. After comparing the performance of SARIMA and FBProphet, it is observed that the SARIMA model performs better in forecasting of COVID cases. On individual basis COVID case prediction, approx. 497 individual samples have been tested and classified into five different levels of COVID classes such as Having COVID, No COVID, High Risk COVID case, Medium to High Risk case, and Control needed case.
    Recipro-CAM: Gradient-free reciprocal class activation map. (arXiv:2209.14074v2 [cs.CV] UPDATED)
    Convolutional neural network (CNN) becomes one of the most popular and prominent deep learning architectures for computer vision, but its black box feature hides the internal prediction process. For this reason, AI practitioners have shed light on explainable AI to provide the interpretability of the model behavior. In particular, class activation map (CAM) and Grad-CAM based methods have shown promise results, but they have architectural limitation or gradient computing burden. To resolve these, Score-CAM has been suggested as a gradient-free method, however, it requires more execution time compared to CAM or Grad-CAM based methods. Therefore, we propose a lightweight architecture and gradient free Reciprocal CAM (Recipro-CAM) by spatially masking the extracted feature maps to exploit the correlation between activation maps and network outputs. With the proposed method, we achieved the gains of 1.78 - 3.72% in the ResNet family compared to Score-CAM in Average Drop- Coherence-Complexity (ADCC) metric, excluding the VGG-16 (1.39% drop). In addition, Recipro-CAM exhibits a saliency map generation rate similar to Grad-CAM and approximately 148 times faster than Score-CAM. The source code of Recipro-CAM is available at our data analysis framework.
    Discovery of Single Independent Latent Variable. (arXiv:2110.05887v2 [stat.ML] UPDATED)
    Latent variable discovery is a central problem in data analysis with a broad range of applications in applied science. In this work, we consider data given as an invertible mixture of two statistically independent components, and assume that one of the components is observed while the other is hidden. Our goal is to recover the hidden component. For this purpose, we propose an autoencoder equipped with a discriminator. Unlike the standard nonlinear ICA problem, which was shown to be non-identifiable, in the special case of ICA we consider here, we show that our approach can recover the component of interest up to entropy-preserving transformation. We demonstrate the performance of the proposed approach on several datasets, including image synthesis, voice cloning, and fetal ECG extraction.
    Learning in POMDPs is Sample-Efficient with Hindsight Observability. (arXiv:2301.13857v1 [cs.LG])
    POMDPs capture a broad class of decision making problems, but hardness results suggest that learning is intractable even in simple settings due to the inherent partial observability. However, in many realistic problems, more information is either revealed or can be computed during some point of the learning process. Motivated by diverse applications ranging from robotics to data center scheduling, we formulate a \setting (\setshort) as a POMDP where the latent states are revealed to the learner in hindsight and only during training. We introduce new algorithms for the tabular and function approximation settings that are provably sample-efficient with hindsight observability, even in POMDPs that would otherwise be statistically intractable. We give a lower bound showing that the tabular algorithm is optimal in its dependence on latent state and observation cardinalities.
    Policy Gradient for s-Rectangular Robust Markov Decision Processes. (arXiv:2301.13589v1 [cs.LG])
    We present a novel robust policy gradient method (RPG) for s-rectangular robust Markov Decision Processes (MDPs). We are the first to derive the adversarial kernel in a closed form and demonstrate that it is a one-rank perturbation of the nominal kernel. This allows us to derive an RPG that is similar to the one used in non-robust MDPs, except with a robust Q-value function and an additional correction term. Both robust Q-values and correction terms are efficiently computable, thus the time complexity of our method matches that of non-robust MDPs, which is significantly faster compared to existing black box methods.
    Causal Graph Discovery from Self and Mutually Exciting Time Series. (arXiv:2106.02600v3 [cs.LG] UPDATED)
    We present a generalized linear structural causal model, coupled with a novel data-adaptive linear regularization, to recover causal directed acyclic graphs (DAGs) from time series. By leveraging a recently developed stochastic monotone Variational Inequality (VI) formulation, we cast the causal discovery problem as a general convex optimization. Furthermore, we develop a non-asymptotic recovery guarantee and quantifiable uncertainty by solving a linear program to establish confidence intervals for a wide range of non-linear monotone link functions. We validate our theoretical results and show the competitive performance of our method via extensive numerical experiments. Most importantly, we demonstrate the effectiveness of our approach in recovering highly interpretable causal DAGs over Sepsis Associated Derangements (SADs) while achieving comparable prediction performance to powerful ``black-box'' models such as XGBoost. Thus, the future adoption of our proposed method to conduct continuous surveillance of high-risk patients by clinicians is much more likely.
    Partitioning Distributed Compute Jobs with Reinforcement Learning and Graph Neural Networks. (arXiv:2301.13799v1 [cs.LG])
    From natural language processing to genome sequencing, large-scale machine learning models are bringing advances to a broad range of fields. Many of these models are too large to be trained on a single machine, and instead must be distributed across multiple devices. This has motivated the research of new compute and network systems capable of handling such tasks. In particular, recent work has focused on developing management schemes which decide how to allocate distributed resources such that some overall objective, such as minimising the job completion time (JCT), is optimised. However, such studies omit explicit consideration of how much a job should be distributed, usually assuming that maximum distribution is desirable. In this work, we show that maximum parallelisation is sub-optimal in relation to user-critical metrics such as throughput and blocking rate. To address this, we propose PAC-ML (partitioning for asynchronous computing with machine learning). PAC-ML leverages a graph neural network and reinforcement learning to learn how much to partition computation graphs such that the number of jobs which meet arbitrary user-defined JCT requirements is maximised. In experiments with five real deep learning computation graphs on a recently proposed optical architecture across four user-defined JCT requirement distributions, we demonstrate PAC-ML achieving up to 56.2% lower blocking rates in dynamic job arrival settings than the canonical maximum parallelisation strategy used by most prior works.
    An Empirical Study of Quantum Dynamics as a Ground State Problem with Neural Quantum States. (arXiv:2206.09241v2 [quant-ph] UPDATED)
    We consider the Feynman-Kitaev formalism applied to a spin chain described by the transverse field Ising model. This formalism consists of building a Hamiltonian whose ground state encodes the time evolution of the spin chain at discrete time steps. To find this ground state, variational wave functions parameterised by artificial neural networks -- also known as neural quantum states (NQSs) -- are used. Our work focuses on assessing, in the context of the Feynman-Kitaev formalism, two properties of NQSs: expressivity (the possibility that variational parameters can be set to values such that the NQS is faithful to the true ground state of the system) and trainability (the process of reaching said values). We find that the considered NQSs are capable of accurately approximating the true ground state of the system, i.e., they are expressive enough ans\"atze. However, extensive hyperparameter tuning experiments show that, empirically, reaching the set of values for the variational parameters that correctly describe the ground state becomes ever more difficult as the number of time steps increase because the true ground state becomes more entangled, and the probability distribution starts to spread across the Hilbert space canonical basis.
    Stationary Kernels and Gaussian Processes on Lie Groups and their Homogeneous Spaces I: the compact case. (arXiv:2208.14960v2 [stat.ME] UPDATED)
    Gaussian processes are arguably the most important model class in spatial statistics. They encode prior information about the modeled function and can be used for exact or approximate Bayesian inference. In many applications, particularly in physical sciences and engineering, but also in areas such as geostatistics and neuroscience, invariance to symmetries is one of the most fundamental forms of prior information one can consider. The invariance of a Gaussian process' covariance to such symmetries gives rise to the most natural generalization of the concept of stationarity to such spaces. In this work, we develop constructive and practical techniques for building stationary Gaussian processes on a very large class of non-Euclidean spaces arising in the context of symmetries. Our techniques make it possible to (i) calculate covariance kernels and (ii) sample from prior and posterior Gaussian processes defined on such spaces, both in a practical manner. This work is split into two parts, each involving different technical considerations: part I studies compact spaces, while part II studies non-compact spaces possessing certain structure. Our contributions make the non-Euclidean Gaussian process models we study compatible with well-understood computational techniques available in standard Gaussian process software packages, thereby making them accessible to practitioners.
    Don't Explain Noise: Robust Counterfactuals for Randomized Ensembles. (arXiv:2205.14116v2 [cs.LG] UPDATED)
    Counterfactual explanations describe how to modify a feature vector in order to flip the outcome of a trained classifier. Obtaining robust counterfactual explanations is essential to provide valid algorithmic recourse and meaningful explanations. We study the robustness of explanations of randomized ensembles, which are always subject to algorithmic uncertainty even when the training data is fixed. We formalize the generation of robust counterfactual explanations as a probabilistic problem and show the link between the robustness of ensemble models and the robustness of base learners. We develop a practical method with good empirical performance and support it with theoretical guarantees for ensembles of convex base learners. Our results show that existing methods give surprisingly low robustness: the validity of naive counterfactuals is below $50\%$ on most data sets and can fall to $20\%$ on problems with many features. In contrast, our method achieves high robustness with only a small increase in the distance from counterfactual explanations to their initial observations.
    PADL: Language-Directed Physics-Based Character Control. (arXiv:2301.13868v1 [cs.LG])
    Developing systems that can synthesize natural and life-like motions for simulated characters has long been a focus for computer animation. But in order for these systems to be useful for downstream applications, they need not only produce high-quality motions, but must also provide an accessible and versatile interface through which users can direct a character's behaviors. Natural language provides a simple-to-use and expressive medium for specifying a user's intent. Recent breakthroughs in natural language processing (NLP) have demonstrated effective use of language-based interfaces for applications such as image generation and program synthesis. In this work, we present PADL, which leverages recent innovations in NLP in order to take steps towards developing language-directed controllers for physics-based character animation. PADL allows users to issue natural language commands for specifying both high-level tasks and low-level skills that a character should perform. We present an adversarial imitation learning approach for training policies to map high-level language commands to low-level controls that enable a character to perform the desired task and skill specified by a user's commands. Furthermore, we propose a multi-task aggregation method that leverages a language-based multiple-choice question-answering approach to determine high-level task objectives from language commands. We show that our framework can be applied to effectively direct a simulated humanoid character to perform a diverse array of complex motor skills.
    Generating Synthetic Mixed-type Longitudinal Electronic Health Records for Artificial Intelligent Applications. (arXiv:2112.12047v2 [cs.LG] UPDATED)
    The recent availability of electronic health records (EHRs) have provided enormous opportunities to develop artificial intelligence (AI) algorithms. However, patient privacy has become a major concern that limits data sharing across hospital settings and subsequently hinders the advances in AI. Synthetic data, which benefits from the development and proliferation of generative models, has served as a promising substitute for real patient EHR data. However, the current generative models are limited as they only generate single type of clinical data for a synthetic patient, i.e., either continuous-valued or discrete-valued. To mimic the nature of clinical decision-making which encompasses various data types/sources, in this study, we propose a generative adversarial network (GAN) entitled EHR-M-GAN which simultaneously synthesizes mixed-type timeseries EHR data. EHR-M-GAN is capable of capturing the multidimensional, heterogeneous, and correlated temporal dynamics in patient trajectories. We have validated EHR-M-GAN on three publicly-available intensive care unit databases with records from a total of 141,488 unique patients, and performed privacy risk evaluation of the proposed model. EHR-M-GAN has demonstrated its superiority over state-of-the-art benchmarks for synthesizing clinical timeseries with high fidelity, while addressing the limitations regarding data types and dimensionality in the current generative models. Notably, prediction models for outcomes of intensive care performed significantly better when training data was augmented with the addition of EHR-M-GAN-generated timeseries. EHR-M-GAN may have use in developing AI algorithms in resource-limited settings, lowering the barrier for data acquisition while preserving patient privacy.
    Disentangling Model Multiplicity in Deep Learning. (arXiv:2206.08890v2 [cs.LG] UPDATED)
    Model multiplicity is a well-known but poorly understood phenomenon that undermines the generalisation guarantees of machine learning models. It appears when two models with similar training-time performance differ in their predictions and real-world performance characteristics. This observed 'predictive' multiplicity (PM) also implies elusive differences in the internals of the models, their 'representational' multiplicity (RM). We introduce a conceptual and experimental setup for analysing RM by measuring activation similarity via singular vector canonical correlation analysis (SVCCA). We show that certain differences in training methods systematically result in larger RM than others and evaluate RM and PM over a finite sample as predictors for generalizability. We further correlate RM with PM measured by the variance in i.i.d. and out-of-distribution test predictions in four standard image data sets. Finally, instead of attempting to eliminate RM, we call for its systematic measurement and maximal exposure.
    FedBC: Calibrating Global and Local Models via Federated Learning Beyond Consensus. (arXiv:2206.10815v3 [cs.LG] UPDATED)
    In this work, we quantitatively calibrate the performance of global and local models in federated learning through a multi-criterion optimization-based framework, which we cast as a constrained program. The objective of a device is its local objective, which it seeks to minimize while satisfying nonlinear constraints that quantify the proximity between the local and the global model. By considering the Lagrangian relaxation of this problem, we develop a novel primal-dual method called Federated Learning Beyond Consensus (\texttt{FedBC}). Theoretically, we establish that \texttt{FedBC} converges to a first-order stationary point at rates that matches the state of the art, up to an additional error term that depends on a tolerance parameter introduced to scalarize the multi-criterion formulation. Finally, we demonstrate that \texttt{FedBC} balances the global and local model test accuracy metrics across a suite of datasets (Synthetic, MNIST, CIFAR-10, Shakespeare), achieving competitive performance with state-of-the-art.
    Toward Efficient Gradient-Based Value Estimation. (arXiv:2301.13757v1 [cs.LG])
    Gradient-based methods for value estimation in reinforcement learning have favorable stability properties, but they are typically much slower than Temporal Difference (TD) learning methods. We study the root causes of this slowness and show that Mean Square Bellman Error (MSBE) is an ill-conditioned loss function in the sense that its Hessian has large condition-number. To resolve the adverse effect of poor conditioning of MSBE on gradient based methods, we propose a low complexity batch-free proximal method that approximately follows the Gauss-Newton direction and is asymptotically robust to parameterization. Our main algorithm, called RANS, is efficient in the sense that it is significantly faster than the residual gradient methods while having almost the same computational complexity, and is competitive with TD on the classic problems that we tested.
    Salient Conditional Diffusion for Defending Against Backdoor Attacks. (arXiv:2301.13862v1 [cs.LG])
    We propose a novel algorithm, Salient Conditional Diffusion (Sancdifi), a state-of-the-art defense against backdoor attacks. Sancdifi uses a denoising diffusion probabilistic model (DDPM) to degrade an image with noise and then recover said image using the learned reverse diffusion. Critically, we compute saliency map-based masks to condition our diffusion, allowing for stronger diffusion on the most salient pixels by the DDPM. As a result, Sancdifi is highly effective at diffusing out triggers in data poisoned by backdoor attacks. At the same time, it reliably recovers salient features when applied to clean data. This performance is achieved without requiring access to the model parameters of the Trojan network, meaning Sancdifi operates as a black-box defense.
    FreeMatch: Self-adaptive Thresholding for Semi-supervised Learning. (arXiv:2205.07246v3 [cs.LG] UPDATED)
    Semi-supervised Learning (SSL) has witnessed great success owing to the impressive performances brought by various methods based on pseudo labeling and consistency regularization. However, we argue that existing methods might fail to utilize the unlabeled data more effectively since they either use a pre-defined / fixed threshold or an ad-hoc threshold adjusting scheme, resulting in inferior performance and slow convergence. We first analyze a motivating example to obtain intuitions on the relationship between the desirable threshold and model's learning status. Based on the analysis, we hence propose FreeMatch to adjust the confidence threshold in a self-adaptive manner according to the model's learning status. We further introduce a self-adaptive class fairness regularization penalty to encourage the model for diverse predictions during the early training stage. Extensive experiments indicate the superiority of FreeMatch especially when the labeled data are extremely rare. FreeMatch achieves 5.78%, 13.59%, and 1.28% error rate reduction over the latest state-of-the-art method FlexMatch on CIFAR-10 with 1 label per class, STL-10 with 4 labels per class, and ImageNet with 100 labels per class, respectively. Moreover, FreeMatch can also boost the performance of imbalanced SSL. The codes can be found at https://github.com/microsoft/Semi-supervised-learning.
    Normalizing Flows for Interventional Density Estimation. (arXiv:2209.06203v3 [cs.LG] UPDATED)
    Existing machine learning methods for causal inference usually estimate quantities expressed via the mean of potential outcomes (e.g., average treatment effect). However, such quantities do not capture the full information about the distribution of potential outcomes. In this work, we estimate the density of potential outcomes after interventions from observational data. For this, we propose a novel, fully-parametric deep learning method called Interventional Normalizing Flows. Specifically, we combine two normalizing flows, namely (i) a teacher flow for estimating nuisance parameters and (ii) a student flow for a parametric estimation of the density of potential outcomes. We further develop a tractable optimization objective based on a one-step bias correction for an efficient and doubly robust estimation of the student flow parameters. As a result our Interventional Normalizing Flows offer a properly normalized density estimator. Across various experiments, we demonstrate that our Interventional Normalizing Flows are expressive and highly effective, and scale well with both sample size and high-dimensional confounding. To the best of our knowledge, our Interventional Normalizing Flows are the first fully-parametric, deep learning method for density estimation of potential outcomes.
    A relaxed proximal gradient descent algorithm for convergent plug-and-play with proximal denoiser. (arXiv:2301.13731v1 [stat.ML])
    This paper presents a new convergent Plug-and-Play (PnP) algorithm. PnP methods are efficient iterative algorithms for solving image inverse problems formulated as the minimization of the sum of a data-fidelity term and a regularization term. PnP methods perform regularization by plugging a pre-trained denoiser in a proximal algorithm, such as Proximal Gradient Descent (PGD). To ensure convergence of PnP schemes, many works study specific parametrizations of deep denoisers. However, existing results require either unverifiable or suboptimal hypotheses on the denoiser, or assume restrictive conditions on the parameters of the inverse problem. Observing that these limitations can be due to the proximal algorithm in use, we study a relaxed version of the PGD algorithm for minimizing the sum of a convex function and a weakly convex one. When plugged with a relaxed proximal denoiser, we show that the proposed PnP-$\alpha$PGD algorithm converges for a wider range of regularization parameters, thus allowing more accurate image restoration.
    OPT-GAN: A Broad-Spectrum Global Optimizer for Black-box Problems by Learning Distribution. (arXiv:2102.03888v6 [cs.LG] UPDATED)
    Black-box optimization (BBO) algorithms are concerned with finding the best solutions for problems with missing analytical details. Most classical methods for such problems are based on strong and fixed a priori assumptions, such as Gaussianity. However, the complex real-world problems, especially when the global optimum is desired, could be very far from the a priori assumptions because of their diversities, causing unexpected obstacles. In this study, we propose a generative adversarial net-based broad-spectrum global optimizer (OPT-GAN) which estimates the distribution of optimum gradually, with strategies to balance exploration-exploitation trade-off. It has potential to better adapt to the regularity and structure of diversified landscapes than other methods with fixed prior, e.g., Gaussian assumption or separability. Experiments on diverse BBO benchmarks and high dimensional real world applications exhibit that OPT-GAN outperforms other traditional and neural net-based BBO algorithms.
    Subgroup Fairness in Two-Sided Markets. (arXiv:2106.02702v2 [cs.AI] UPDATED)
    It is well known that two-sided markets are unfair in a number of ways. For instance, female workers at Uber earn less than their male colleagues per mile driven. Similar observations have been made for other minority subgroups in other two-sided markets. Here, we suggest a novel market-clearing mechanism for two-sided markets, which promotes equalisation of the pay per hour worked across multiple subgroups, as well as within each subgroup. In the process, we introduce a novel notion of subgroup fairness (which we call Inter-fairness), which can be combined with other notions of fairness within each subgroup (called Intra-fairness), and the utility for the customers (Customer-Care) in the objective of the market-clearing problem. While the novel non-linear terms in the objective complicate market clearing by making the problem non-convex, we show that a certain non-convex augmented Lagrangian relaxation can be approximated to any precision in time polynomial in the number of market participants using semi-definite programming. This makes it possible to implement the market-clearing mechanism efficiently. On the example of driver-ride assignment in an Uber-like system, we demonstrate the efficacy and scalability of the approach, and trade-offs between Inter- and Intra-fairness.
    Bayesian Calibration of Imperfect Computer Models using Physics-Informed Priors. (arXiv:2201.06463v4 [stat.ML] UPDATED)
    We introduce a computational efficient data-driven framework suitable for quantifying the uncertainty in physical parameters and model formulation of computer models, represented by differential equations. We construct physics-informed priors, which are multi-output GP priors that encode the model's structure in the covariance function. This is extended into a fully Bayesian framework that quantifies the uncertainty of physical parameters and model predictions. Since physical models often are imperfect descriptions of the real process, we allow the model to deviate from the observed data by considering a discrepancy function. For inference, Hamiltonian Monte Carlo is used. Further, approximations for big data are developed that reduce the computational complexity from $\mathcal{O}(N^3)$ to $\mathcal{O}(N\cdot m^2),$ where $m \ll N.$ Our approach is demonstrated in simulation and real data case studies where the physics are described by time-dependent ODEs describe (cardiovascular models) and space-time dependent PDEs (heat equation). In the studies, it is shown that our modelling framework can recover the true parameters of the physical models in cases where 1) the reality is more complex than our modelling choice and 2) the data acquisition process is biased while also producing accurate predictions. Furthermore, it is demonstrated that our approach is computationally faster than traditional Bayesian calibration methods.
    SecGNN: Privacy-Preserving Graph Neural Network Training and Inference as a Cloud Service. (arXiv:2202.07835v2 [cs.CR] UPDATED)
    Graphs are widely used to model the complex relationships among entities. As a powerful tool for graph analytics, graph neural networks (GNNs) have recently gained wide attention due to its end-to-end processing capabilities. With the proliferation of cloud computing, it is increasingly popular to deploy the services of complex and resource-intensive model training and inference in the cloud due to its prominent benefits. However, GNN training and inference services, if deployed in the cloud, will raise critical privacy concerns about the information-rich and proprietary graph data (and the resulting model). While there has been some work on secure neural network training and inference, they all focus on convolutional neural networks handling images and text rather than complex graph data with rich structural information. In this paper, we design, implement, and evaluate SecGNN, the first system supporting privacy-preserving GNN training and inference services in the cloud. SecGNN is built from a synergy of insights on lightweight cryptography and machine learning techniques. We deeply examine the procedure of GNN training and inference, and devise a series of corresponding secure customized protocols to support the holistic computation. Extensive experiments demonstrate that SecGNN achieves comparable plaintext training and inference accuracy, with promising performance.
    Optimal precision for GANs. (arXiv:2207.10541v2 [cs.LG] UPDATED)
    Many deep generative models are defined as a push-forward of a Gaussian measure by a continuous generator, such as Generative Adversarial Networks (GANs) or Variational Auto-Encoders (VAEs). This work explores the latent space of such deep generative models. A key issue with these models is their tendency to output samples outside of the support of the target distribution when learning disconnected distributions. We investigate the relationship between the performance of these models and the geometry of their latent space. Building on recent developments in geometric measure theory, we prove a sufficient condition for optimality in the case where the dimension of the latent space is larger than the number of modes. Through experiments on GANs, we demonstrate the validity of our theoretical results and gain new insights into the latent space geometry of these models. Additionally, we propose a truncation method that enforces a simplicial cluster structure in the latent space and improves the performance of GANs.
    Fairness and Accuracy under Domain Generalization. (arXiv:2301.13323v1 [cs.LG])
    As machine learning (ML) algorithms are increasingly used in high-stakes applications, concerns have arisen that they may be biased against certain social groups. Although many approaches have been proposed to make ML models fair, they typically rely on the assumption that data distributions in training and deployment are identical. Unfortunately, this is commonly violated in practice and a model that is fair during training may lead to an unexpected outcome during its deployment. Although the problem of designing robust ML models under dataset shifts has been widely studied, most existing works focus only on the transfer of accuracy. In this paper, we study the transfer of both fairness and accuracy under domain generalization where the data at test time may be sampled from never-before-seen domains. We first develop theoretical bounds on the unfairness and expected loss at deployment, and then derive sufficient conditions under which fairness and accuracy can be perfectly transferred via invariant representation learning. Guided by this, we design a learning algorithm such that fair ML models learned with training data still have high fairness and accuracy when deployment environments change. Experiments on real-world data validate the proposed algorithm. Model implementation is available at https://github.com/pth1993/FATDM.  ( 2 min )
    Transport with Support: Data-Conditional Diffusion Bridges. (arXiv:2301.13636v1 [cs.LG])
    The dynamic Schr\"odinger bridge problem provides an appealing setting for solving optimal transport problems by learning non-linear diffusion processes using efficient iterative solvers. Recent works have demonstrated state-of-the-art results (eg. in modelling single-cell embryo RNA sequences or sampling from complex posteriors) but are limited to learning bridges with only initial and terminal constraints. Our work extends this paradigm by proposing the Iterative Smoothing Bridge (ISB). We integrate Bayesian filtering and optimal control into learning the diffusion process, enabling constrained stochastic processes governed by sparse observations at intermediate stages and terminal constraints. We assess the effectiveness of our method on synthetic and real-world data and show that the ISB generalises well to high-dimensional data, is computationally efficient, and provides accurate estimates of the marginals at intermediate and terminal times.
    Structure Learning and Parameter Estimation for Graphical Models via Penalized Maximum Likelihood Methods. (arXiv:2301.13269v1 [stat.ML])
    Probabilistic graphical models (PGMs) provide a compact and flexible framework to model very complex real-life phenomena. They combine the probability theory which deals with uncertainty and logical structure represented by a graph which allows one to cope with the computational complexity and also interpret and communicate the obtained knowledge. In the thesis, we consider two different types of PGMs: Bayesian networks (BNs) which are static, and continuous time Bayesian networks which, as the name suggests, have a temporal component. We are interested in recovering their true structure, which is the first step in learning any PGM. This is a challenging task, which is interesting in itself from the causal point of view, for the purposes of interpretation of the model and the decision-making process. All approaches for structure learning in the thesis are united by the same idea of maximum likelihood estimation with the LASSO penalty. The problem of structure learning is reduced to the problem of finding non-zero coefficients in the LASSO estimator for a generalized linear model. In the case of CTBNs, we consider the problem both for complete and incomplete data. We support the theoretical results with experiments.  ( 2 min )
    BRAIxDet: Learning to Detect Malignant Breast Lesion with Incomplete Annotations. (arXiv:2301.13418v1 [cs.CV])
    Methods to detect malignant lesions from screening mammograms are usually trained with fully annotated datasets, where images are labelled with the localisation and classification of cancerous lesions. However, real-world screening mammogram datasets commonly have a subset that is fully annotated and another subset that is weakly annotated with just the global classification (i.e., without lesion localisation). Given the large size of such datasets, researchers usually face a dilemma with the weakly annotated subset: to not use it or to fully annotate it. The first option will reduce detection accuracy because it does not use the whole dataset, and the second option is too expensive given that the annotation needs to be done by expert radiologists. In this paper, we propose a middle-ground solution for the dilemma, which is to formulate the training as a weakly- and semi-supervised learning problem that we refer to as malignant breast lesion detection with incomplete annotations. To address this problem, our new method comprises two stages, namely: 1) pre-training a multi-view mammogram classifier with weak supervision from the whole dataset, and 2) extending the trained classifier to become a multi-view detector that is trained with semi-supervised student-teacher learning, where the training set contains fully and weakly-annotated mammograms. We provide extensive detection results on two real-world screening mammogram datasets containing incomplete annotations, and show that our proposed approach achieves state-of-the-art results in the detection of malignant breast lesions with incomplete annotations.
    Towards Learned Emulation of Interannual Water Isotopologue Variations in General Circulation Models. (arXiv:2301.13462v1 [physics.ao-ph])
    Simulating abundances of stable water isotopologues, i.e. molecules differing in their isotopic composition, within climate models allows for comparisons with proxy data and, thus, for testing hypotheses about past climate and validating climate models under varying climatic conditions. However, many models are run without explicitly simulating water isotopologues. We investigate the possibility to replace the explicit physics-based simulation of oxygen isotopic composition in precipitation using machine learning methods. These methods estimate isotopic composition at each time step for given fields of surface temperature and precipitation amount. We implement convolutional neural networks (CNNs) based on the successful UNet architecture and test whether a spherical network architecture outperforms the naive approach of treating Earth's latitude-longitude grid as a flat image. Conducting a case study on a last millennium run with the iHadCM3 climate model, we find that roughly 40\% of the temporal variance in the isotopic composition is explained by the emulations on interannual and monthly timescale, with spatially varying emulation quality. A modified version of the standard UNet architecture for flat images yields results that are equally good as the predictions by the spherical CNN. We test generalization to last millennium runs of other climate models and find that while the tested deep learning methods yield the best results on iHadCM3 data, the performance drops when predicting on other models and is comparable to simple pixel-wise linear regression. An extended choice of predictor variables and improving the robustness of learned climate--oxygen isotope relationships should be explored in future work.
    STI: Turbocharge NLP Inference at the Edge via Elastic Pipelining. (arXiv:2207.05022v3 [cs.LG] UPDATED)
    Natural Language Processing (NLP) inference is seeing increasing adoption by mobile applications, where on-device inference is desirable for crucially preserving user data privacy and avoiding network roundtrips. Yet, the unprecedented size of an NLP model stresses both latency and memory, creating a tension between the two key resources of a mobile device. To meet a target latency, holding the whole model in memory launches execution as soon as possible but increases one app's memory footprints by several times, limiting its benefits to only a few inferences before being recycled by mobile memory management. On the other hand, loading the model from storage on demand incurs IO as long as a few seconds, far exceeding the delay range satisfying to a user; pipelining layerwise model loading and execution does not hide IO either, due to the high skewness between IO and computation delays. To this end, we propose Speedy Transformer Inference (STI). Built on the key idea of maximizing IO/compute resource utilization on the most important parts of a model, STI reconciles the latency v.s. memory tension via two novel techniques. First, model sharding. STI manages model parameters as independently tunable shards, and profiles their importance to accuracy. Second, elastic pipeline planning with a preload buffer. STI instantiates an IO/compute pipeline and uses a small buffer for preload shards to bootstrap execution without stalling at early stages; it judiciously selects, tunes, and assembles shards per their importance for resource-elastic execution, maximizing inference accuracy. Atop two commodity SoCs, we build STI and evaluate it against a wide range of NLP tasks, under a practical range of target latencies, and on both CPU and GPU. We demonstrate that STI delivers high accuracies with 1-2 orders of magnitude lower memory, outperforming competitive baselines.
    Sequential Kernelized Independence Testing. (arXiv:2212.07383v2 [stat.ML] UPDATED)
    Independence testing is a fundamental and classical statistical problem that has been extensively studied in the batch setting when one fixes the sample size before collecting data. However, practitioners often prefer procedures that adapt to the complexity of a problem at hand instead of setting sample size in advance. Ideally, such procedures should (a) allow stopping earlier on easy tasks (and later on harder tasks), hence making better use of available resources, and (b) continuously monitor the data and efficiently incorporate statistical evidence after collecting new data, while controlling the false alarm rate. It is well known that classical batch tests are not tailored for streaming data settings: valid inference after data peeking requires correcting for multiple testing but such corrections generally result in low power. Following the principle of testing by betting, we design sequential kernelized independence tests (SKITs) that overcome such shortcomings. We exemplify our broad framework using bets inspired by kernelized dependence measures, e.g, the Hilbert-Schmidt independence criterion. Our test is valid under non-i.i.d. time-varying settings, for which there exist no batch tests. We demonstrate the power of our approaches on both simulated and real data.
    Automated speech- and text-based classification of neuropsychiatric conditions in a multidiagnostic setting. (arXiv:2301.06916v2 [cs.CL] UPDATED)
    Speech patterns have been identified as potential diagnostic markers for neuropsychiatric conditions. However, most studies only compare a single clinical group to healthy controls, whereas clinical practice often requires differentiating between multiple potential diagnoses (multiclass settings). To address this, we assembled a dataset of repeated recordings from 420 participants (67 with major depressive disorder, 106 with schizophrenia and 46 with autism, as well as matched controls), and tested the performance of a range of conventional machine learning models and advanced Transformer models on both binary and multiclass classification, based on voice and text features. While binary models performed comparably to previous research (F1 scores between 0.54-0.75 for autism spectrum disorder, ASD; 0.67-0.92 for major depressive disorder, MDD; and 0.71-0.83 for schizophrenia); when differentiating between multiple diagnostic groups performance decreased markedly (F1 scores between 0.35-0.44 for ASD, 0.57-0.75 for MDD, 0.15-0.66 for schizophrenia, and 0.38-0.52 macro F1). Combining voice and text-based models yielded increased performance, suggesting that they capture complementary diagnostic information. Our results indicate that models trained on binary classification may learn to rely on markers of generic differences between clinical and non-clinical populations, or markers of clinical features that overlap across conditions, rather than identifying markers specific to individual conditions. We provide recommendations for future research in the field, suggesting increased focus on developing larger transdiagnostic datasets that include more fine-grained clinical features, and that can support the development of models that better capture the complexity of neuropsychiatric conditions and naturalistic diagnostic assessment.
    Low Complexity Adaptive Machine Learning Approaches for End-to-End Latency Prediction. (arXiv:2301.13536v1 [cs.NI])
    Software Defined Networks have opened the door to statistical and AI-based techniques to improve efficiency of networking. Especially to ensure a certain Quality of Service (QoS) for specific applications by routing packets with awareness on content nature (VoIP, video, files, etc.) and its needs (latency, bandwidth, etc.) to use efficiently resources of a network. Monitoring and predicting various Key Performance Indicators (KPIs) at any level may handle such problems while preserving network bandwidth. The question addressed in this work is the design of efficient, low-cost adaptive algorithms for KPI estimation, monitoring and prediction. We focus on end-to-end latency prediction, for which we illustrate our approaches and results on data obtained from a public generator provided after the recent international challenge on GNN [12]. In this paper, we improve our previously proposed low-cost estimators [6] by adding the adaptive dimension, and show that the performances are minimally modified while gaining the ability to track varying networks.
    Inference Time Evidences of Adversarial Attacks for Forensic on Transformers. (arXiv:2301.13356v1 [cs.CV])
    Vision Transformers (ViTs) are becoming a very popular paradigm for vision tasks as they achieve state-of-the-art performance on image classification. However, although early works implied that this network structure had increased robustness against adversarial attacks, some works argue ViTs are still vulnerable. This paper presents our first attempt toward detecting adversarial attacks during inference time using the network's input and outputs as well as latent features. We design four quantifications (or derivatives) of input, output, and latent vectors of ViT-based models that provide a signature of the inference, which could be beneficial for the attack detection, and empirically study their behavior over clean samples and adversarial samples. The results demonstrate that the quantifications from input (images) and output (posterior probabilities) are promising for distinguishing clean and adversarial samples, while latent vectors offer less discriminative power, though they give some insights on how adversarial perturbations work.
    Optimal Transport Perturbations for Safe Reinforcement Learning with Robustness Guarantees. (arXiv:2301.13375v1 [cs.LG])
    Robustness and safety are critical for the trustworthy deployment of deep reinforcement learning in real-world decision making applications. In particular, we require algorithms that can guarantee robust, safe performance in the presence of general environment disturbances, while making limited assumptions on the data collection process during training. In this work, we propose a safe reinforcement learning framework with robustness guarantees through the use of an optimal transport cost uncertainty set. We provide an efficient, theoretically supported implementation based on Optimal Transport Perturbations, which can be applied in a completely offline fashion using only data collected in a nominal training environment. We demonstrate the robust, safe performance of our approach on a variety of continuous control tasks with safety constraints in the Real-World Reinforcement Learning Suite.
    Unconstrained Dynamic Regret via Sparse Coding. (arXiv:2301.13349v1 [cs.LG])
    Motivated by time series forecasting, we study Online Linear Optimization (OLO) under the coupling of two problem structures: the domain is unbounded, and the performance of an algorithm is measured by its dynamic regret. Handling either of them requires the regret bound to depend on certain complexity measure of the comparator sequence -- specifically, the comparator norm in unconstrained OLO, and the path length in dynamic regret. In contrast to a recent work (Jacobsen & Cutkosky, 2022) that adapts to the combination of these two complexity measures, we propose an alternative complexity measure by recasting the problem into sparse coding. Adaptivity can be achieved by a simple modular framework, which naturally exploits more intricate prior knowledge of the environment. Along the way, we also present a new gradient adaptive algorithm for static unconstrained OLO, designed using novel continuous time machinery. This could be of independent interest.
    Sequential Strategic Screening. (arXiv:2301.13397v1 [cs.LG])
    We initiate the study of strategic behavior in screening processes with multiple classifiers. We focus on two contrasting settings: a conjunctive setting in which an individual must satisfy all classifiers simultaneously, and a sequential setting in which an individual to succeed must satisfy classifiers one at a time. In other words, we introduce the combination of strategic classification with screening processes. We show that sequential screening pipelines exhibit new and surprising behavior where individuals can exploit the sequential ordering of the tests to zig-zag between classifiers without having to simultaneously satisfy all of them. We demonstrate an individual can obtain a positive outcome using a limited manipulation budget even when far from the intersection of the positive regions of every classifier. Finally, we consider a learner whose goal is to design a sequential screening process that is robust to such manipulations, and provide a construction for the learner that optimizes a natural objective.
    Learning Against Distributional Uncertainty: On the Trade-off Between Robustness and Specificity. (arXiv:2301.13565v1 [cs.LG])
    Trustworthy machine learning aims at combating distributional uncertainties in training data distributions compared to population distributions. Typical treatment frameworks include the Bayesian approach, (min-max) distributionally robust optimization (DRO), and regularization. However, two issues have to be raised: 1) All these methods are biased estimators of the true optimal cost; 2) the prior distribution in the Bayesian method, the radius of the distributional ball in the DRO method, and the regularizer in the regularization method are difficult to specify. This paper studies a new framework that unifies the three approaches and that addresses the two challenges mentioned above. The asymptotic properties (e.g., consistency and asymptotic normalities), non-asymptotic properties (e.g., unbiasedness and generalization error bound), and a Monte--Carlo-based solution method of the proposed model are studied. The new model reveals the trade-off between the robustness to the unseen data and the specificity to the training data.
    An Comparative Analysis of Different Pitch and Metrical Grid Encoding Methods in the Task of Sequential Music Generation. (arXiv:2301.13383v1 [cs.SD])
    Pitch and meter are two fundamental music features for symbolic music generation tasks, where researchers usually choose different encoding methods depending on specific goals. However, the advantages and drawbacks of different encoding methods have not been frequently discussed. This paper presents a integrated analysis of the influence of two low-level feature, pitch and meter, on the performance of a token-based sequential music generation model. First, the commonly used MIDI number encoding and a less used class-octave encoding are compared. Second, an dense intra-bar metric grid is imposed to the encoded sequence as auxiliary features. Different complexity and resolutions of the metric grid are compared. For complexity, the single token approach and the multiple token approach are compared; for grid resolution, 0 (ablation), 1 (bar-level), 4 (downbeat-level) 12, (8th-triplet-level) up to 64 (64th-note-grid-level) are compared; for duration resolution, 4, 8, 12 and 16 subdivisions per beat are compared. All different encodings are tested on separately trained Transformer-XL models for a melody generation task. Regarding distribution similarity of several objective evaluation metrics to the test dataset, results suggest that the class-octave encoding significantly outperforms the taken-for-granted MIDI encoding on pitch-related metrics; finer grids and multiple-token grids improve the rhythmic quality, but also suffer from over-fitting at early training stage. Results display a general phenomenon of over-fitting from two aspects, the pitch embedding space and the test loss of the single-token grid encoding. From a practical perspective, we both demonstrate the feasibility and raise the concern of easy over-fitting problem of using smaller networks and lower embedding dimensions on the generation task. The findings can also contribute to futural models in terms of feature engineering.
    Classified as unknown: A novel Bayesian neural network. (arXiv:2301.13401v1 [cs.LG])
    We establish estimations for the parameters of the output distribution for the softmax activation function using the probit function. As an application, we develop a new efficient Bayesian learning algorithm for fully connected neural networks, where training and predictions are performed within the Bayesian inference framework in closed-form. This approach allows sequential learning and requires no computationally expensive gradient calculation and Monte Carlo sampling. Our work generalizes the Bayesian algorithm for a single perceptron for binary classification in \cite{H} to multi-layer perceptrons for multi-class classification.
    The Flan Collection: Designing Data and Methods for Effective Instruction Tuning. (arXiv:2301.13688v1 [cs.AI])
    We study the design decisions of publicly available instruction tuning methods, and break down the development of Flan 2022 (Chung et al., 2022). Through careful ablation studies on the Flan Collection of tasks and methods, we tease apart the effect of design decisions which enable Flan-T5 to outperform prior work by 3-17%+ across evaluation settings. We find task balancing and enrichment techniques are overlooked but critical to effective instruction tuning, and in particular, training with mixed prompt settings (zero-shot, few-shot, and chain-of-thought) actually yields stronger (2%+) performance in all settings. In further experiments, we show Flan-T5 requires less finetuning to converge higher and faster than T5 on single downstream tasks, motivating instruction-tuned models as more computationally-efficient starting checkpoints for new tasks. Finally, to accelerate research on instruction tuning, we make the Flan 2022 collection of datasets, templates, and methods publicly available at https://github.com/google-research/FLAN/tree/main/flan/v2.
    Fast Resolution Agnostic Neural Techniques to Solve Partial Differential Equations. (arXiv:2301.13331v1 [cs.AI])
    Numerical approximations of partial differential equations (PDEs) are routinely employed to formulate the solution of physics, engineering and mathematical problems involving functions of several variables, such as the propagation of heat or sound, fluid flow, elasticity, electrostatics, electrodynamics, and more. While this has led to solving many complex phenomena, there are still significant limitations. Conventional approaches such as Finite Element Methods (FEMs) and Finite Differential Methods (FDMs) require considerable time and are computationally expensive. In contrast, machine learning-based methods such as neural networks are faster once trained, but tend to be restricted to a specific discretization. This article aims to provide a comprehensive summary of conventional methods and recent machine learning-based methods to approximate PDEs numerically. Furthermore, we highlight several key architectures centered around the neural operator, a novel and fast approach (1000x) to learning the solution operator of a PDE. We will note how these new computational approaches can bring immense advantages in tackling many problems in fundamental and applied physics.
    DNN Explanation for Safety Analysis: an Empirical Evaluation of Clustering-based Approaches. (arXiv:2301.13506v1 [cs.SE])
    The adoption of deep neural networks (DNNs) in safety-critical contexts is often prevented by the lack of effective means to explain their results, especially when they are erroneous. In our previous work, we proposed a white-box approach (HUDD) and a black-box approach (SAFE) to automatically characterize DNN failures. They both identify clusters of similar images from a potentially large set of images leading to DNN failures. However, the analysis pipelines for HUDD and SAFE were instantiated in specific ways according to common practices, deferring the analysis of other pipelines to future work. In this paper, we report on an empirical evaluation of 99 different pipelines for root cause analysis of DNN failures. They combine transfer learning, autoencoders, heatmaps of neuron relevance, dimensionality reduction techniques, and different clustering algorithms. Our results show that the best pipeline combines transfer learning, DBSCAN, and UMAP. It leads to clusters almost exclusively capturing images of the same failure scenario, thus facilitating root cause analysis. Further, it generates distinct clusters for each root cause of failure, thus enabling engineers to detect all the unsafe scenarios. Interestingly, these results hold even for failure scenarios that are only observed in a small percentage of the failing images.
    Holistic Graph-based Motion Prediction. (arXiv:2301.13545v1 [cs.RO])
    Motion prediction for automated vehicles in complex environments is a difficult task that is to be mastered when automated vehicles are to be used in arbitrary situations. Many factors influence the future motion of traffic participants starting with traffic rules and reaching from the interaction between each other to personal habits of human drivers. Therefore we present a novel approach for a graph-based prediction based on a heterogeneous holistic graph representation that combines temporal information, properties and relations between traffic participants as well as relations with static elements like the road network. The information are encoded through different types of nodes and edges that both are enriched with arbitrary features. We evaluated the approach on the INTERACTION and the Argoverse dataset and conducted an informative ablation study to demonstrate the benefit of different types of information for the motion prediction quality.
    Contrast and Clustering: Learning Neighborhood Pair Representation for Source-free Domain Adaptation. (arXiv:2301.13428v1 [cs.CV])
    Domain adaptation has attracted a great deal of attention in the machine learning community, but it requires access to source data, which often raises concerns about data privacy. We are thus motivated to address these issues and propose a simple yet efficient method. This work treats domain adaptation as an unsupervised clustering problem and trains the target model without access to the source data. Specifically, we propose a loss function called contrast and clustering (CaC), where a positive pair term pulls neighbors belonging to the same class together in the feature space to form clusters, while a negative pair term pushes samples of different classes apart. In addition, extended neighbors are taken into account by querying the nearest neighbor indexes in the memory bank to mine for more valuable negative pairs. Extensive experiments on three common benchmarks, VisDA, Office-Home and Office-31, demonstrate that our method achieves state-of-the-art performance. The code will be made publicly available at https://github.com/yukilulu/CaC.
    DiffSTG: Probabilistic Spatio-Temporal Graph Forecasting with Denoising Diffusion Models. (arXiv:2301.13629v1 [cs.LG])
    Spatio-temporal graph neural networks (STGNN) have emerged as the dominant model for spatio-temporal graph (STG) forecasting. Despite their success, they fail to model intrinsic uncertainties within STG data, which cripples their practicality in downstream tasks for decision-making. To this end, this paper focuses on probabilistic STG forecasting, which is challenging due to the difficulty in modeling uncertainties and complex ST dependencies. In this study, we present the first attempt to generalize the popular denoising diffusion probabilistic models to STGs, leading to a novel non-autoregressive framework called DiffSTG, along with the first denoising network UGnet for STG in the framework. Our approach combines the spatio-temporal learning capabilities of STGNNs with the uncertainty measurements of diffusion models. Extensive experiments validate that DiffSTG reduces the Continuous Ranked Probability Score (CRPS) by 4%-14%, and Root Mean Squared Error (RMSE) by 2%-7% over existing methods on three real-world datasets.
    Support Exploration Algorithm for Sparse Support Recovery. (arXiv:2301.13584v1 [cs.LG])
    We introduce a new algorithm promoting sparsity called {\it Support Exploration Algorithm (SEA)} and analyze it in the context of support recovery/model selection problems.The algorithm can be interpreted as an instance of the {\it straight-through estimator (STE)} applied to the resolution of a sparse linear inverse problem. SEA uses a non-sparse exploratory vector and makes it evolve in the input space to select the sparse support. We put to evidence an oracle update rule for the exploratory vector and consider the STE update. The theoretical analysis establishes general sufficient conditions of support recovery. The general conditions are specialized to the case where the matrix $A$ performing the linear measurements satisfies the {\it Restricted Isometry Property (RIP)}.Experiments show that SEA can efficiently improve the results of any algorithm. Because of its exploratory nature, SEA also performs remarkably well when the columns of $A$ are strongly coherent.
    Stabilize Deep ResNet with A Sharp Scaling Factor $\tau$. (arXiv:1903.07120v5 [cs.LG] UPDATED)
    We study the stability and convergence of training deep ResNets with gradient descent. Specifically, we show that the parametric branch in the residual block should be scaled down by a factor $\tau =O(1/\sqrt{L})$ to guarantee stable forward/backward process, where $L$ is the number of residual blocks. Moreover, we establish a converse result that the forward process is unbounded when $\tau>L^{-\frac{1}{2}+c}$, for any positive constant $c$. The above two results together establish a sharp value of the scaling factor in determining the stability of deep ResNet. Based on the stability result, we further show that gradient descent finds the global minima if the ResNet is properly over-parameterized, which significantly improves over the previous work with a much larger range of $\tau$ that admits global convergence. Moreover, we show that the convergence rate is independent of the depth, theoretically justifying the advantage of ResNet over vanilla feedforward network. Empirically, with such a factor $\tau$, one can train deep ResNet without normalization layer. Moreover, for ResNets with normalization layer, adding such a factor $\tau$ also stabilizes the training and obtains significant performance gain for deep ResNet.
    Superhuman Fairness. (arXiv:2301.13420v1 [cs.LG])
    The fairness of machine learning-based decisions has become an increasingly important focus in the design of supervised machine learning methods. Most fairness approaches optimize a specified trade-off between performance measure(s) (e.g., accuracy, log loss, or AUC) and fairness metric(s) (e.g., demographic parity, equalized odds). This begs the question: are the right performance-fairness trade-offs being specified? We instead re-cast fair machine learning as an imitation learning task by introducing superhuman fairness, which seeks to simultaneously outperform human decisions on multiple predictive performance and fairness measures. We demonstrate the benefits of this approach given suboptimal decisions.
    Exploring QSAR Models for Activity-Cliff Prediction. (arXiv:2301.13644v1 [cs.LG])
    Pairs of similar compounds that only differ by a small structural modification but exhibit a large difference in their binding affinity for a given target are known as activity cliffs (ACs). It has been hypothesised that quantitative structure-activity relationship (QSAR) models struggle to predict ACs and that ACs thus form a major source of prediction error. However, a study to explore the AC-prediction power of modern QSAR methods and its relationship to general QSAR-prediction performance is lacking. We systematically construct nine distinct QSAR models by combining three molecular representation methods (extended-connectivity fingerprints, physicochemical-descriptor vectors and graph isomorphism networks) with three regression techniques (random forests, k-nearest neighbours and multilayer perceptrons); we then use each resulting model to classify pairs of similar compounds as ACs or non-ACs and to predict the activities of individual molecules in three case studies: dopamine receptor D2, factor Xa, and SARS-CoV-2 main protease. We observe low AC-sensitivity amongst the tested models when the activities of both compounds are unknown, but a substantial increase in AC-sensitivity when the actual activity of one of the compounds is given. Graph isomorphism features are found to be competitive with or superior to classical molecular representations for AC-classification and can thus be employed as baseline AC-prediction models or simple compound-optimisation tools. For general QSAR-prediction, however, extended-connectivity fingerprints still consistently deliver the best performance. Our results provide strong support for the hypothesis that indeed QSAR methods frequently fail to predict ACs. We propose twin-network training for deep learning models as a potential future pathway to increase AC-sensitivity and thus overall QSAR performance.
    Collision-aware In-hand 6D Object Pose Estimation using Multiple Vision-based Tactile Sensors. (arXiv:2301.13667v1 [cs.RO])
    In this paper, we address the problem of estimating the in-hand 6D pose of an object in contact with multiple vision-based tactile sensors. We reason on the possible spatial configurations of the sensors along the object surface. Specifically, we filter contact hypotheses using geometric reasoning and a Convolutional Neural Network (CNN), trained on simulated object-agnostic images, to promote those that better comply with the actual tactile images from the sensors. We use the selected sensors configurations to optimize over the space of 6D poses using a Gradient Descent-based approach. We finally rank the obtained poses by penalizing those that are in collision with the sensors. We carry out experiments in simulation using the DIGIT vision-based sensor with several objects, from the standard YCB model set. The results demonstrate that our approach estimates object poses that are compatible with actual object-sensor contacts in $87.5\%$ of cases while reaching an average positional error in the order of $2$ centimeters. Our analysis also includes qualitative results of experiments with a real DIGIT sensor.
    The passive symmetries of machine learning. (arXiv:2301.13724v1 [stat.ML])
    Any representation of data involves arbitrary investigator choices. Because those choices are external to the data-generating process, each choice leads to an exact symmetry, corresponding to the group of transformations that takes one possible representation to another. These are the passive symmetries; they include coordinate freedom, gauge symmetry and units covariance, all of which have led to important results in physics. Our goal is to understand the implications of passive symmetries for machine learning: Which passive symmetries play a role (e.g., permutation symmetry in graph neural networks)? What are dos and don'ts in machine learning practice? We assay conditions under which passive symmetries can be implemented as group equivariances. We also discuss links to causal modeling, and argue that the implementation of passive symmetries is particularly valuable when the goal of the learning problem is to generalize out of sample. While this paper is purely conceptual, we believe that it can have a significant impact on helping machine learning make the transition that took place for modern physics in the first half of the Twentieth century.
    Learning, Fast and Slow: A Goal-Directed Memory-Based Approach for Dynamic Environments. (arXiv:2301.13758v1 [cs.AI])
    Model-based next state prediction and state value prediction are slow to converge. To address these challenges, we do the following: i) Instead of a neural network, we do model-based planning using a parallel memory retrieval system (which we term the slow mechanism); ii) Instead of learning state values, we guide the agent's actions using goal-directed exploration, by using a neural network to choose the next action given the current state and the goal state (which we term the fast mechanism). The goal-directed exploration is trained online using hippocampal replay of visited states and future imagined states every single time step, leading to fast and efficient training. Empirical studies show that our proposed method has a 92% solve rate across 100 episodes in a dynamically changing grid world, significantly outperforming state-of-the-art actor critic mechanisms such as PPO (54%), TRPO (50%) and A2C (24%). Ablation studies demonstrate that both mechanisms are crucial. We posit that the future of Reinforcement Learning (RL) will be to model goals and sub-goals for various tasks, and plan it out in a goal-directed memory-based approach.
    DoubleML -- An Object-Oriented Implementation of Double Machine Learning in R. (arXiv:2103.09603v4 [stat.ML] UPDATED)
    The R package DoubleML implements the double/debiased machine learning framework of Chernozhukov et al. (2018). It provides functionalities to estimate parameters in causal models based on machine learning methods. The double machine learning framework consist of three key ingredients: Neyman orthogonality, high-quality machine learning estimation and sample splitting. Estimation of nuisance components can be performed by various state-of-the-art machine learning methods that are available in the mlr3 ecosystem. DoubleML makes it possible to perform inference in a variety of causal models, including partially linear and interactive regression models and their extensions to instrumental variable estimation. The object-oriented implementation of DoubleML enables a high flexibility for the model specification and makes it easily extendable. This paper serves as an introduction to the double machine learning framework and the R package DoubleML. In reproducible code examples with simulated and real data sets, we demonstrate how DoubleML users can perform valid inference based on machine learning methods.
    Review of methods for automatic cerebral microbleeds detection. (arXiv:2301.13549v1 [cs.CV])
    Cerebral microbleeds detection is an important and challenging task. With the gaining popularity of the MRI, the ability to detect cerebral microbleeds also raises. Unfortunately, for radiologists, it is a time-consuming and laborious procedure. For this reason, various solutions to automate this process have been proposed for several years, but none of them is currently used in medical practice. In this context, the need to systematize the existing knowledge and best practices has been recognized as a factor facilitating the imminent synthesis of a real CMBs detection system practically applicable in medicine. To the best of our knowledge, all available publications regarding automatic cerebral microbleeds detection have been gathered, described, and assessed in this paper in order to distinguish the current research state and provide a starting point for future studies.
    A Mathematical Model for Curriculum Learning. (arXiv:2301.13833v1 [cs.LG])
    Curriculum learning (CL) - training using samples that are generated and presented in a meaningful order - was introduced in the machine learning context around a decade ago. While CL has been extensively used and analysed empirically, there has been very little mathematical justification for its advantages. We introduce a CL model for learning the class of k-parities on d bits of a binary string with a neural network trained by stochastic gradient descent (SGD). We show that a wise choice of training examples, involving two or more product distributions, allows to reduce significantly the computational cost of learning this class of functions, compared to learning under the uniform distribution. We conduct experiments to support our analysis. Furthermore, we show that for another class of functions - namely the `Hamming mixtures' - CL strategies involving a bounded number of product distributions are not beneficial, while we conjecture that CL with unbounded many curriculum steps can learn this class efficiently.
    What I Cannot Predict, I Do Not Understand: A Human-Centered Evaluation Framework for Explainability Methods. (arXiv:2112.04417v3 [cs.CV] UPDATED)
    A multitude of explainability methods and associated fidelity performance metrics have been proposed to help better understand how modern AI systems make decisions. However, much of the current work has remained theoretical -- without much consideration for the human end-user. In particular, it is not yet known (1) how useful current explainability methods are in practice for more real-world scenarios and (2) how well associated performance metrics accurately predict how much knowledge individual explanations contribute to a human end-user trying to understand the inner-workings of the system. To fill this gap, we conducted psychophysics experiments at scale to evaluate the ability of human participants to leverage representative attribution methods for understanding the behavior of different image classifiers representing three real-world scenarios: identifying bias in an AI system, characterizing the visual strategy it uses for tasks that are too difficult for an untrained non-expert human observer as well as understanding its failure cases. Our results demonstrate that the degree to which individual attribution methods help human participants better understand an AI system varied widely across these scenarios. This suggests a critical need for the field to move past quantitative improvements of current attribution methods towards the development of complementary approaches that provide qualitatively different sources of information to human end-users.
    Towards a Defense Against Federated Backdoor Attacks Under Continuous Training. (arXiv:2205.11736v4 [cs.LG] UPDATED)
    Backdoor attacks are dangerous and difficult to prevent in federated learning (FL), where training data is sourced from untrusted clients over long periods of time. These difficulties arise because: (a) defenders in FL do not have access to raw training data, and (b) a new phenomenon we identify called backdoor leakage causes models trained continuously to eventually suffer from backdoors due to cumulative errors in defense mechanisms. We propose shadow learning, a framework for defending against backdoor attacks in the FL setting under long-range training. Shadow learning trains two models in parallel: a backbone model and a shadow model. The backbone is trained without any defense mechanism to obtain good performance on the main task. The shadow model combines filtering of malicious clients with early-stopping to control the attack success rate even as the data distribution changes. We theoretically motivate our design and show experimentally that our framework significantly improves upon existing defenses against backdoor attacks.
    Company-as-Tribe: Company Financial Risk Assessment on Tribe-Style Graph with Hierarchical Graph Neural Networks. (arXiv:2301.13492v1 [cs.LG])
    Company financial risk is ubiquitous and early risk assessment for listed companies can avoid considerable losses. Traditional methods mainly focus on the financial statements of companies and lack the complex relationships among them. However, the financial statements are often biased and lagged, making it difficult to identify risks accurately and timely. To address the challenges, we redefine the problem as \textbf{company financial risk assessment on tribe-style graph} by taking each listed company and its shareholders as a tribe and leveraging financial news to build inter-tribe connections. Such tribe-style graphs present different patterns to distinguish risky companies from normal ones. However, most nodes in the tribe-style graph lack attributes, making it difficult to directly adopt existing graph learning methods (e.g., Graph Neural Networks(GNNs)). In this paper, we propose a novel Hierarchical Graph Neural Network (TH-GNN) for Tribe-style graphs via two levels, with the first level to encode the structure pattern of the tribes with contrastive learning, and the second level to diffuse information based on the inter-tribe relations, achieving effective and efficient risk assessment. Extensive experiments on the real-world company dataset show that our method achieves significant improvements on financial risk assessment over previous competing methods. Also, the extensive ablation studies and visualization comprehensively show the effectiveness of our method.
    Image Shortcut Squeezing: Countering Perturbative Availability Poisons with Compression. (arXiv:2301.13838v1 [cs.CR])
    Perturbative availability poisoning (PAP) adds small changes to images to prevent their use for model training. Current research adopts the belief that practical and effective approaches to countering such poisons do not exist. In this paper, we argue that it is time to abandon this belief. We present extensive experiments showing that 12 state-of-the-art PAP methods are vulnerable to Image Shortcut Squeezing (ISS), which is based on simple compression. For example, on average, ISS restores the CIFAR-10 model accuracy to $81.73\%$, surpassing the previous best preprocessing-based countermeasures by $37.97\%$ absolute. ISS also (slightly) outperforms adversarial training and has higher generalizability to unseen perturbation norms and also higher efficiency. Our investigation reveals that the property of PAP perturbations depends on the type of surrogate model used for poison generation, and it explains why a specific ISS compression yields the best performance for a specific type of PAP perturbation. We further test stronger, adaptive poisoning, and show it falls short of being an ideal defense against ISS. Overall, our results demonstrate the importance of considering various (simple) countermeasures to ensure the meaningfulness of analysis carried out during the development of availability poisons.
    LogAI: A Library for Log Analytics and Intelligence. (arXiv:2301.13415v1 [cs.AI])
    Software and System logs record runtime information about processes executing within a system. These logs have become the most critical and ubiquitous forms of observability data that help developers understand system behavior, monitor system health and resolve issues. However, the volume of logs generated can be humongous (of the order of petabytes per day) especially for complex distributed systems, such as cloud, search engine, social media, etc. This has propelled a lot of research on developing AI-based log based analytics and intelligence solutions that can process huge volume of raw logs and generate insights. In order to enable users to perform multiple types of AI-based log analysis tasks in a uniform manner, we introduce LogAI (https://github.com/salesforce/logai), a one-stop open source library for log analytics and intelligence. LogAI supports tasks such as log summarization, log clustering and log anomaly detection. It adopts the OpenTelemetry data model, to enable compatibility with different log management platforms. LogAI provides a unified model interface and provides popular time-series, statistical learning and deep learning models. Alongside this, LogAI also provides an out-of-the-box GUI for users to conduct interactive analysis. With LogAI, we can also easily benchmark popular deep learning algorithms for log anomaly detection without putting in redundant effort to process the logs. We have opensourced LogAI to cater to a wide range of applications benefiting both academic research and industrial prototyping.
    Fourier Sensitivity and Regularization of Computer Vision Models. (arXiv:2301.13514v1 [cs.CV])
    Recent work has empirically shown that deep neural networks latch on to the Fourier statistics of training data and show increased sensitivity to Fourier-basis directions in the input. Understanding and modifying this Fourier-sensitivity of computer vision models may help improve their robustness. Hence, in this paper we study the frequency sensitivity characteristics of deep neural networks using a principled approach. We first propose a basis trick, proving that unitary transformations of the input-gradient of a function can be used to compute its gradient in the basis induced by the transformation. Using this result, we propose a general measure of any differentiable model's Fourier-sensitivity using the unitary Fourier-transform of its input-gradient. When applied to deep neural networks, we find that computer vision models are consistently sensitive to particular frequencies dependent on the dataset, training method and architecture. Based on this measure, we further propose a Fourier-regularization framework to modify the Fourier-sensitivities and frequency bias of models. Using our proposed regularizer-family, we demonstrate that deep neural networks obtain improved classification accuracy on robustness evaluations.
    Multicalibration as Boosting for Regression. (arXiv:2301.13767v1 [cs.LG])
    We study the connection between multicalibration and boosting for squared error regression. First we prove a useful characterization of multicalibration in terms of a ``swap regret'' like condition on squared error. Using this characterization, we give an exceedingly simple algorithm that can be analyzed both as a boosting algorithm for regression and as a multicalibration algorithm for a class H that makes use only of a standard squared error regression oracle for H. We give a weak learning assumption on H that ensures convergence to Bayes optimality without the need to make any realizability assumptions -- giving us an agnostic boosting algorithm for regression. We then show that our weak learning assumption on H is both necessary and sufficient for multicalibration with respect to H to imply Bayes optimality. We also show that if H satisfies our weak learning condition relative to another class C then multicalibration with respect to H implies multicalibration with respect to C. Finally we investigate the empirical performance of our algorithm experimentally using an open source implementation that we make available. Our code repository can be found at https://github.com/Declancharrison/Level-Set-Boosting.
    CMLCompiler: A Unified Compiler for Classical Machine Learning. (arXiv:2301.13441v1 [cs.LG])
    Classical machine learning (CML) occupies nearly half of machine learning pipelines in production applications. Unfortunately, it fails to utilize the state-of-the-practice devices fully and performs poorly. Without a unified framework, the hybrid deployments of deep learning (DL) and CML also suffer from severe performance and portability issues. This paper presents the design of a unified compiler, called CMLCompiler, for CML inference. We propose two unified abstractions: operator representations and extended computational graphs. The CMLCompiler framework performs the conversion and graph optimization based on two unified abstractions, then outputs an optimized computational graph to DL compilers or frameworks. We implement CMLCompiler on TVM. The evaluation shows CMLCompiler's portability and superior performance. It achieves up to 4.38x speedup on CPU, 3.31x speedup on GPU, and 5.09x speedup on IoT devices, compared to the state-of-the-art solutions -- scikit-learn, intel sklearn, and hummingbird. Our performance of CML and DL mixed pipelines achieves up to 3.04x speedup compared with cross-framework implementations.
    Real-Time Outlier Detection with Dynamic Process Limits. (arXiv:2301.13527v1 [cs.LG])
    Anomaly detection methods are part of the systems where rare events may endanger an operation's profitability, safety, and environmental aspects. Although many state-of-the-art anomaly detection methods were developed to date, their deployment is limited to the operation conditions present during the model training. Online anomaly detection brings the capability to adapt to data drifts and change points that may not be represented during model development resulting in prolonged service life. This paper proposes an online anomaly detection algorithm for existing real-time infrastructures where low-latency detection is required and novel patterns in data occur unpredictably. The online inverse cumulative distribution-based approach is introduced to eliminate common problems of offline anomaly detectors, meanwhile providing dynamic process limits to normal operation. The benefit of the proposed method is the ease of use, fast computation, and deployability as shown in two case studies of real microgrid operation data.
    Retiring $\Delta$DP: New Distribution-Level Metrics for Demographic Parity. (arXiv:2301.13443v1 [cs.LG])
    Demographic parity is the most widely recognized measure of group fairness in machine learning, which ensures equal treatment of different demographic groups. Numerous works aim to achieve demographic parity by pursuing the commonly used metric $\Delta DP$. Unfortunately, in this paper, we reveal that the fairness metric $\Delta DP$ can not precisely measure the violation of demographic parity, because it inherently has the following drawbacks: \textit{i)} zero-value $\Delta DP$ does not guarantee zero violation of demographic parity, \textit{ii)} $\Delta DP$ values can vary with different classification thresholds. To this end, we propose two new fairness metrics, \textsf{A}rea \textsf{B}etween \textsf{P}robability density function \textsf{C}urves (\textsf{ABPC}) and \textsf{A}rea \textsf{B}etween \textsf{C}umulative density function \textsf{C}urves (\textsf{ABCC}), to precisely measure the violation of demographic parity in distribution level. The new fairness metrics directly measure the difference between the distributions of the prediction probability for different demographic groups. Thus our proposed new metrics enjoy: \textit{i)} zero-value \textsf{ABCC}/\textsf{ABPC} guarantees zero violation of demographic parity; \textit{ii)} \textsf{ABCC}/\textsf{ABPC} guarantees demographic parity while the classification threshold adjusted. We further re-evaluate the existing fair models with our proposed fairness metrics and observe different fairness behaviors of those models under the new metrics.
    Quantum contextual bandits and recommender systems for quantum data. (arXiv:2301.13524v1 [quant-ph])
    We study a recommender system for quantum data using the linear contextual bandit framework. In each round, a learner receives an observable (the context) and has to recommend from a finite set of unknown quantum states (the actions) which one to measure. The learner has the goal of maximizing the reward in each round, that is the outcome of the measurement on the unknown state. Using this model we formulate the low energy quantum state recommendation problem where the context is a Hamiltonian and the goal is to recommend the state with the lowest energy. For this task, we study two families of contexts: the Ising model and a generalized cluster model. We observe that if we interpret the actions as different phases of the models then the recommendation is done by classifying the correct phase of the given Hamiltonian and the strategy can be interpreted as an online quantum phase classifier.
    Learning Data Representations with Joint Diffusion Models. (arXiv:2301.13622v1 [cs.LG])
    We introduce a joint diffusion model that simultaneously learns meaningful internal representations fit for both generative and predictive tasks. Joint machine learning models that allow synthesizing and classifying data often offer uneven performance between those tasks or are unstable to train. In this work, we depart from a set of empirical observations that indicate the usefulness of internal representations built by contemporary deep diffusion-based generative models in both generative and predictive settings. We then introduce an extension of the vanilla diffusion model with a classifier that allows for stable joint training with shared parametrization between those objectives. The resulting joint diffusion model offers superior performance across various tasks, including generative modeling, semi-supervised classification, and domain adaptation.
    Domain-Generalizable Multiple-Domain Clustering. (arXiv:2301.13530v1 [cs.LG])
    Accurately clustering high-dimensional measurements is vital for adequately analyzing scientific data. Deep learning machinery has remarkably improved clustering capabilities in recent years due to its ability to extract meaningful representations. In this work, we are given unlabeled samples from multiple source domains, and we aim to learn a shared classifier that assigns the examples to various clusters. Evaluation is done by using the classifier for predicting cluster assignments in a previously unseen domain. This setting generalizes the problem of unsupervised domain generalization to the case in which no supervised learning samples are given (completely unsupervised). Towards this goal, we present an end-to-end model and evaluate its capabilities on several multi-domain image datasets. Specifically, we demonstrate that our model is more accurate than schemes that require fine-tuning using samples from the target domain or some level of supervision.
    BALANCE: Bayesian Linear Attribution for Root Cause Localization. (arXiv:2301.13572v1 [cs.LG])
    Root Cause Analysis (RCA) plays an indispensable role in distributed data system maintenance and operations, as it bridges the gap between fault detection and system recovery. Existing works mainly study multidimensional localization or graph-based root cause localization. This paper opens up the possibilities of exploiting the recently developed framework of explainable AI (XAI) for the purpose of RCA. In particular, we propose BALANCE (BAyesian Linear AttributioN for root CausE localization), which formulates the problem of RCA through the lens of attribution in XAI and seeks to explain the anomalies in the target KPIs by the behavior of the candidate root causes. BALANCE consists of three innovative components. First, we propose a Bayesian multicollinear feature selection (BMFS) model to predict the target KPIs given the candidate root causes in a forward manner while promoting sparsity and concurrently paying attention to the correlation between the candidate root causes. Second, we introduce attribution analysis to compute the attribution score for each candidate in a backward manner. Third, we merge the estimated root causes related to each KPI if there are multiple KPIs. We extensively evaluate the proposed BALANCE method on one synthesis dataset as well as three real-world RCA tasks, that is, bad SQL localization, container fault localization, and fault type diagnosis for Exathlon. Results show that BALANCE outperforms the state-of-the-art (SOTA) methods in terms of accuracy with the least amount of running time, and achieves at least $6\%$ notably higher accuracy than SOTA methods for real tasks. BALANCE has been deployed to production to tackle real-world RCA problems, and the online results further advocate its usage for real-time diagnosis in distributed data systems.
    Affinity Uncertainty-based Hard Negative Mining in Graph Contrastive Learning. (arXiv:2301.13340v1 [cs.LG])
    Hard negative mining has shown effective in enhancing self-supervised contrastive learning (CL) on diverse data types, including graph contrastive learning (GCL). Existing hardness-aware CL methods typically treat negative instances that are most similar to the anchor instance as hard negatives, which helps improve the CL performance, especially on image data. However, this approach often fails to identify the hard negatives but leads to many false negatives on graph data. This is mainly due to that the learned graph representations are not sufficiently discriminative due to over-smooth representations and/or non-i.i.d. issues in graph data. To tackle this problem, this paper proposes a novel approach that builds a discriminative model on collective affinity information (i.e, two sets of pairwise affinities between the negative instances and the anchor instance) to mine hard negatives in GCL. In particular, the proposed approach evaluates how confident/uncertain the discriminative model is about the affinity of each negative instance to an anchor instance to determine its hardness weight relative to the anchor instance. This uncertainty information is then incorporated into existing GCL loss functions via a weighting term to enhance their performance. The enhanced GCL is theoretically grounded that the resulting GCL loss is equivalent to a triplet loss with an adaptive margin being exponentially proportional to the learned uncertainty of each negative instance. Extensive experiments on 10 graph datasets show that our approach i) consistently enhances different state-of-the-art GCL methods in both graph and node classification tasks, and ii) significantly improves their robustness against adversarial attacks.
    An investigation of challenges encountered when specifying training data and runtime monitors for safety critical ML applications. (arXiv:2301.13476v1 [cs.SE])
    Context and motivation: The development and operation of critical software that contains machine learning (ML) models requires diligence and established processes. Especially the training data used during the development of ML models have major influences on the later behaviour of the system. Runtime monitors are used to provide guarantees for that behaviour. Question / problem: We see major uncertainty in how to specify training data and runtime monitoring for critical ML models and by this specifying the final functionality of the system. In this interview-based study we investigate the underlying challenges for these difficulties. Principal ideas/results: Based on ten interviews with practitioners who develop ML models for critical applications in the automotive and telecommunication sector, we identified 17 underlying challenges in 6 challenge groups that relate to the challenge of specifying training data and runtime monitoring. Contribution: The article provides a list of the identified underlying challenges related to the difficulties practitioners experience when specifying training data and runtime monitoring for ML models. Furthermore, interconnection between the challenges were found and based on these connections recommendation proposed to overcome the root causes for the challenges.
    Scheduling Inference Workloads on Distributed Edge Clusters with Reinforcement Learning. (arXiv:2301.13618v1 [cs.LG])
    Many real-time applications (e.g., Augmented/Virtual Reality, cognitive assistance) rely on Deep Neural Networks (DNNs) to process inference tasks. Edge computing is considered a key infrastructure to deploy such applications, as moving computation close to the data sources enables us to meet stringent latency and throughput requirements. However, the constrained nature of edge networks poses several additional challenges to the management of inference workloads: edge clusters can not provide unlimited processing power to DNN models, and often a trade-off between network and processing time should be considered when it comes to end-to-end delay requirements. In this paper, we focus on the problem of scheduling inference queries on DNN models in edge networks at short timescales (i.e., few milliseconds). By means of simulations, we analyze several policies in the realistic network settings and workloads of a large ISP, highlighting the need for a dynamic scheduling policy that can adapt to network conditions and workloads. We therefore design ASET, a Reinforcement Learning based scheduling algorithm able to adapt its decisions according to the system conditions. Our results show that ASET effectively provides the best performance compared to static policies when scheduling over a distributed pool of edge resources.
    Faster Predict-and-Optimize with Three-Operator Splitting. (arXiv:2301.13395v1 [cs.LG])
    In many practical settings, a combinatorial problem must be repeatedly solved with similar, but distinct parameters w. Yet, w is not directly observed; only contextual data d that correlates with w is available. It is tempting to use a neural network to predict w given d, but training such a model requires reconciling the discrete nature of combinatorial optimization with the gradient-based frameworks used to train neural networks. One approach to overcoming this issue is to consider a continuous relaxation of the combinatorial problem. While existing such approaches have shown to be highly effective on small problems (10-100 variables) they do not scale well to large problems. In this work, we show how recent results in operator splitting can be used to design such a system which is easy to train and scales effortlessly to problems with thousands of variables.
    Robust Linear Regression: Gradient-descent, Early-stopping, and Beyond. (arXiv:2301.13486v1 [stat.ML])
    In this work we study the robustness to adversarial attacks, of early-stopping strategies on gradient-descent (GD) methods for linear regression. More precisely, we show that early-stopped GD is optimally robust (up to an absolute constant) against Euclidean-norm adversarial attacks. However, we show that this strategy can be arbitrarily sub-optimal in the case of general Mahalanobis attacks. This observation is compatible with recent findings in the case of classification~\cite{Vardi2022GradientMP} that show that GD provably converges to non-robust models. To alleviate this issue, we propose to apply instead a GD scheme on a transformation of the data adapted to the attack. This data transformation amounts to apply feature-depending learning rates and we show that this modified GD is able to handle any Mahalanobis attack, as well as more general attacks under some conditions. Unfortunately, choosing such adapted transformations can be hard for general attacks. To the rescue, we design a simple and tractable estimator whose adversarial risk is optimal up to within a multiplicative constant of 1.1124 in the population regime, and works for any norm.
    Fine Robotic Manipulation without Force/Torque Sensor. (arXiv:2301.13413v1 [cs.RO])
    Force Sensing and Force Control are essential to many industrial applications. Typically, a 6-axis Force/Torque (F/T) sensor is mounted between the robot's wrist and the end-effector in order to measure the forces and torques exerted by the environment onto the robot (the external wrench). Although a typical 6-axis F/T sensor can provide highly accurate measurements, it is expensive and vulnerable to drift and external impacts. Existing methods aiming at estimating the external wrench using only the robot's internal signals are limited in scope: for example, wrench estimation accuracy was mostly validated in free-space motions and simple contacts as opposed to tasks like assembly that require high-precision force control. Here we present a Neural Network based method and argue that by devoting particular attention to the training data structure, it is possible to accurately estimate the external wrench in a wide range of scenarios based solely on internal signals. As an illustration, we demonstrate a pin insertion experiment with 100-micron clearance and a hand-guiding experiment, both performed without external F/T sensors or joint torque sensors. Our result opens the possibility of equipping the existing 2.7 million industrial robots with Force Sensing and Force Control capabilities without any additional hardware.
    Training with Mixed-Precision Floating-Point Assignments. (arXiv:2301.13464v1 [cs.LG])
    When training deep neural networks, keeping all tensors in high precision (e.g., 32-bit or even 16-bit floats) is often wasteful. However, keeping all tensors in low precision (e.g., 8-bit floats) can lead to unacceptable accuracy loss. Hence, it is important to use a precision assignment -- a mapping from all tensors (arising in training) to precision levels (high or low) -- that keeps most of the tensors in low precision and leads to sufficiently accurate models. We provide a technique that explores this memory-accuracy tradeoff by generating precision assignments that (i) use less memory and (ii) lead to more accurate models at the same time, compared to the precision assignments considered by prior work in low-precision floating-point training. Our method typically provides > 2x memory reduction over a baseline precision assignment while preserving training accuracy, and gives further reductions by trading off accuracy. Compared to other baselines which sometimes cause training to diverge, our method provides similar or better memory reduction while avoiding divergence.
    Incorporating Recurrent Reinforcement Learning into Model Predictive Control for Adaptive Control in Autonomous Driving. (arXiv:2301.13313v1 [cs.LG])
    Model Predictive Control (MPC) is attracting tremendous attention in the autonomous driving task as a powerful control technique. The success of an MPC controller strongly depends on an accurate internal dynamics model. However, the static parameters, usually learned by system identification, often fail to adapt to both internal and external perturbations in real-world scenarios. In this paper, we firstly (1) reformulate the problem as a Partially Observed Markov Decision Process (POMDP) that absorbs the uncertainties into observations and maintains Markov property into hidden states; and (2) learn a recurrent policy continually adapting the parameters of the dynamics model via Recurrent Reinforcement Learning (RRL) for optimal and adaptive control; and (3) finally evaluate the proposed algorithm (referred as $\textit{MPC-RRL}$) in CARLA simulator and leading to robust behaviours under a wide range of perturbations.
    Population-wise Labeling of Sulcal Graphs using Multi-graph Matching. (arXiv:2301.13532v1 [stat.ML])
    Population-wise matching of the cortical fold is necessary to identify biomarkers of neurological or psychiatric disorders. The difficulty comes from the massive interindividual variations in the morphology and spatial organization of the folds. This task is challenging at both methodological and conceptual levels. In the widely used registration-based techniques, these variations are considered as noise and the matching of folds is only implicit. Alternative approaches are based on the extraction and explicit identification of the cortical folds. In particular, representing cortical folding patterns as graphs of sulcal basins-termed sulcal graphs-enables to formalize the task as a graph-matching problem. In this paper, we propose to address the problem of sulcal graph matching directly at the population level using multi-graph matching techniques. First, we motivate the relevance of multi-graph matching framework in this context. We then introduce a procedure to generate populations of artificial sulcal graphs, which allows us benchmarking several state of the art multi-graph matching methods. Our results on both artificial and real data demonstrate the effectiveness of multi-graph matching techniques to obtain a population-wise consistent labeling of cortical folds at the sulcal basins level.
    Causality-based CTR Prediction using Graph Neural Networks. (arXiv:2301.12762v1 [cs.IR] CROSS LISTED)
    As a prevalent problem in online advertising, CTR prediction has attracted plentiful attention from both academia and industry. Recent studies have been reported to establish CTR prediction models in the graph neural networks (GNNs) framework. However, most of GNNs-based models handle feature interactions in a complete graph, while ignoring causal relationships among features, which results in a huge drop in the performance on out-of-distribution data. This paper is dedicated to developing a causality-based CTR prediction model in the GNNs framework (Causal-GNN) integrating representations of feature graph, user graph and ad graph in the context of online advertising. In our model, a structured representation learning method (GraphFwFM) is designed to capture high-order representations on feature graph based on causal discovery among field features in gated graph neural networks (GGNNs), and GraphSAGE is employed to obtain graph representations of users and ads. Experiments conducted on three public datasets demonstrate the superiority of Causal-GNN in AUC and Logloss and the effectiveness of GraphFwFM in capturing high-order representations on causal feature graph.
    Proxy-based Zero-Shot Entity Linking by Effective Candidate Retrieval. (arXiv:2301.13318v1 [cs.LG])
    A recent advancement in the domain of biomedical Entity Linking is the development of powerful two-stage algorithms, an initial candidate retrieval stage that generates a shortlist of entities for each mention, followed by a candidate ranking stage. However, the effectiveness of both stages are inextricably dependent on computationally expensive components. Specifically, in candidate retrieval via dense representation retrieval it is important to have hard negative samples, which require repeated forward passes and nearest neighbour searches across the entire entity label set throughout training. In this work, we show that pairing a proxy-based metric learning loss with an adversarial regularizer provides an efficient alternative to hard negative sampling in the candidate retrieval stage. In particular, we show competitive performance on the recall@1 metric, thereby providing the option to leave out the expensive candidate ranking step. Finally, we demonstrate how the model can be used in a zero-shot setting to discover out of knowledge base biomedical entities.
    A Framework for Adapting Offline Algorithms to Solve Combinatorial Multi-Armed Bandit Problems with Bandit Feedback. (arXiv:2301.13326v1 [cs.LG])
    We investigate the problem of stochastic, combinatorial multi-armed bandits where the learner only has access to bandit feedback and the reward function can be non-linear. We provide a general framework for adapting discrete offline approximation algorithms into sublinear $\alpha$-regret methods that only require bandit feedback, achieving $\mathcal{O}\left(T^\frac{2}{3}\log(T)^\frac{1}{3}\right)$ expected cumulative $\alpha$-regret dependence on the horizon $T$. The framework only requires the offline algorithms to be robust to small errors in function evaluation. The adaptation procedure does not even require explicit knowledge of the offline approximation algorithm -- the offline algorithm can be used as black box subroutine. To demonstrate the utility of the proposed framework, the proposed framework is applied to multiple problems in submodular maximization, adapting approximation algorithms for cardinality and for knapsack constraints. The new CMAB algorithms for knapsack constraints outperform a full-bandit method developed for the adversarial setting in experiments with real-world data.
    On the Correctness of Automatic Differentiation for Neural Networks with Machine-Representable Parameters. (arXiv:2301.13370v1 [cs.LG])
    Recent work has shown that automatic differentiation over the reals is almost always correct in a mathematically precise sense. However, actual programs work with machine-representable numbers (e.g., floating-point numbers), not reals. In this paper, we study the correctness of automatic differentiation when the parameter space of a neural network consists solely of machine-representable numbers. For a neural network with bias parameters, we prove that automatic differentiation is correct at all parameters where the network is differentiable. In contrast, it is incorrect at all parameters where the network is non-differentiable, since it never informs non-differentiability. To better understand this non-differentiable set of parameters, we prove a tight bound on its size, which is linear in the number of non-differentiabilities in activation functions, and provide a simple necessary and sufficient condition for a parameter to be in this set. We further prove that automatic differentiation always computes a Clarke subderivative, even on the non-differentiable set. We also extend these results to neural networks possibly without bias parameters.
    Understanding Self-Distillation in the Presence of Label Noise. (arXiv:2301.13304v1 [cs.LG])
    Self-distillation (SD) is the process of first training a \enquote{teacher} model and then using its predictions to train a \enquote{student} model with the \textit{same} architecture. Specifically, the student's objective function is $\big(\xi*\ell(\text{teacher's predictions}, \text{ student's predictions}) + (1-\xi)*\ell(\text{given labels}, \text{ student's predictions})\big)$, where $\ell$ is some loss function and $\xi$ is some parameter $\in [0,1]$. Empirically, SD has been observed to provide performance gains in several settings. In this paper, we theoretically characterize the effect of SD in two supervised learning problems with \textit{noisy labels}. We first analyze SD for regularized linear regression and show that in the high label noise regime, the optimal value of $\xi$ that minimizes the expected error in estimating the ground truth parameter is surprisingly greater than 1. Empirically, we show that $\xi > 1$ works better than $\xi \leq 1$ even with the cross-entropy loss for several classification datasets when 50\% or 30\% of the labels are corrupted. Further, we quantify when optimal SD is better than optimal regularization. Next, we analyze SD in the case of logistic regression for binary classification with random label corruption and quantify the range of label corruption in which the student outperforms the teacher in terms of accuracy. To our knowledge, this is the first result of its kind for the cross-entropy loss.
    Continuous Spatiotemporal Transformers. (arXiv:2301.13338v1 [cs.LG])
    Modeling spatiotemporal dynamical systems is a fundamental challenge in machine learning. Transformer models have been very successful in NLP and computer vision where they provide interpretable representations of data. However, a limitation of transformers in modeling continuous dynamical systems is that they are fundamentally discrete time and space models and thus have no guarantees regarding continuous sampling. To address this challenge, we present the Continuous Spatiotemporal Transformer (CST), a new transformer architecture that is designed for the modeling of continuous systems. This new framework guarantees a continuous and smooth output via optimization in Sobolev space. We benchmark CST against traditional transformers as well as other spatiotemporal dynamics modeling methods and achieve superior performance in a number of tasks on synthetic and real systems, including learning brain dynamics from calcium imaging data.
    Deep Learning for Reference-Free Geolocation for Poplar Trees. (arXiv:2301.13387v1 [q-bio.GN])
    A core task in precision agriculture is the identification of climatic and ecological conditions that are advantageous for a given crop. The most succinct approach is geolocation, which is concerned with locating the native region of a given sample based on its genetic makeup. Here, we investigate genomic geolocation of Populus trichocarpa, or poplar, which has been identified by the US Department of Energy as a fast-rotation biofuel crop to be harvested nationwide. In particular, we approach geolocation from a reference-free perspective, circumventing the need for compute-intensive processes such as variant calling and alignment. Our model, MashNet, predicts latitude and longitude for poplar trees from randomly-sampled, unaligned sequence fragments. We show that our model performs comparably to Locator, a state-of-the-art method based on aligned whole-genome sequence data. MashNet achieves an error of 34.0 km^2 compared to Locator's 22.1 km^2. MashNet allows growers to quickly and efficiently identify natural varieties that will be most productive in their growth environment based on genotype. This paper explores geolocation for precision agriculture while providing a framework and data source for further development by the machine learning community.
    Combinatorial Causal Bandits without Graph Skeleton. (arXiv:2301.13392v1 [cs.LG])
    In combinatorial causal bandits (CCB), the learning agent chooses a subset of variables in each round to intervene and collects feedback from the observed variables to minimize expected regret or sample complexity. Previous works study this problem in both general causal models and binary generalized linear models (BGLMs). However, all of them require prior knowledge of causal graph structure. This paper studies the CCB problem without the graph structure on binary general causal models and BGLMs. We first provide an exponential lower bound of cumulative regrets for the CCB problem on general causal models. To overcome the exponentially large space of parameters, we then consider the CCB problem on BGLMs. We design a regret minimization algorithm for BGLMs even without the graph skeleton and show that it still achieves $O(\sqrt{T}\ln T)$ expected regret. This asymptotic regret is the same as the state-of-art algorithms relying on the graph structure. Moreover, we sacrifice the regret to $O(T^{\frac{2}{3}}\ln T)$ to remove the weight gap covered by the asymptotic notation. At last, we give some discussions and algorithms for pure exploration of the CCB problem without the graph structure.
    Variational sparse inverse Cholesky approximation for latent Gaussian processes via double Kullback-Leibler minimization. (arXiv:2301.13303v1 [stat.ML])
    To achieve scalable and accurate inference for latent Gaussian processes, we propose a variational approximation based on a family of Gaussian distributions whose covariance matrices have sparse inverse Cholesky (SIC) factors. We combine this variational approximation of the posterior with a similar and efficient SIC-restricted Kullback-Leibler-optimal approximation of the prior. We then focus on a particular SIC ordering and nearest-neighbor-based sparsity pattern resulting in highly accurate prior and posterior approximations. For this setting, our variational approximation can be computed via stochastic gradient descent in polylogarithmic time per iteration. We provide numerical comparisons showing that the proposed double-Kullback-Leibler-optimal Gaussian-process approximation (DKLGP) can sometimes be vastly more accurate than alternative approaches such as inducing-point and mean-field approximations at similar computational complexity.
    Scaling laws for single-agent reinforcement learning. (arXiv:2301.13442v1 [cs.LG])
    Recent work has shown that, in generative modeling, cross-entropy loss improves smoothly with model size and training compute, following a power law plus constant scaling law. One challenge in extending these results to reinforcement learning is that the main performance objective of interest, mean episode return, need not vary smoothly. To overcome this, we introduce *intrinsic performance*, a monotonic function of the return defined as the minimum compute required to achieve the given return across a family of models of different sizes. We find that, across a range of environments, intrinsic performance scales as a power law in model size and environment interactions. Consequently, as in generative modeling, the optimal model size scales as a power law in the training compute budget. Furthermore, we study how this relationship varies with the environment and with other properties of the training setup. In particular, using a toy MNIST-based environment, we show that varying the "horizon length" of the task mostly changes the coefficient but not the exponent of this relationship.
    GDOD: Effective Gradient Descent using Orthogonal Decomposition for Multi-Task Learning. (arXiv:2301.13465v1 [cs.LG])
    Multi-task learning (MTL) aims at solving multiple related tasks simultaneously and has experienced rapid growth in recent years. However, MTL models often suffer from performance degeneration with negative transfer due to learning several tasks simultaneously. Some related work attributed the source of the problem is the conflicting gradients. In this case, it is needed to select useful gradient updates for all tasks carefully. To this end, we propose a novel optimization approach for MTL, named GDOD, which manipulates gradients of each task using an orthogonal basis decomposed from the span of all task gradients. GDOD decomposes gradients into task-shared and task-conflict components explicitly and adopts a general update rule for avoiding interference across all task gradients. This allows guiding the update directions depending on the task-shared components. Moreover, we prove the convergence of GDOD theoretically under both convex and non-convex assumptions. Experiment results on several multi-task datasets not only demonstrate the significant improvement of GDOD performed to existing MTL models but also prove that our algorithm outperforms state-of-the-art optimization methods in terms of AUC and Logloss metrics.
    Demystifying Disagreement-on-the-Line in High Dimensions. (arXiv:2301.13371v1 [stat.ML])
    Evaluating the performance of machine learning models under distribution shift is challenging, especially when we only have unlabeled data from the shifted (target) domain, along with labeled data from the original (source) domain. Recent work suggests that the notion of disagreement, the degree to which two models trained with different randomness differ on the same input, is a key to tackle this problem. Experimentally, disagreement and prediction error have been shown to be strongly connected, which has been used to estimate model performance. Experiments have lead to the discovery of the disagreement-on-the-line phenomenon, whereby the classification error under the target domain is often a linear function of the classification error under the source domain; and whenever this property holds, disagreement under the source and target domain follow the same linear relation. In this work, we develop a theoretical foundation for analyzing disagreement in high-dimensional random features regression; and study under what conditions the disagreement-on-the-line phenomenon occurs in our setting. Experiments on CIFAR-10-C, Tiny ImageNet-C, and Camelyon17 are consistent with our theory and support the universality of the theoretical findings.
    Quantized Neural Networks for Low-Precision Accumulation with Guaranteed Overflow Avoidance. (arXiv:2301.13376v1 [cs.LG])
    We introduce a quantization-aware training algorithm that guarantees avoiding numerical overflow when reducing the precision of accumulators during inference. We leverage weight normalization as a means of constraining parameters during training using accumulator bit width bounds that we derive. We evaluate our algorithm across multiple quantized models that we train for different tasks, showing that our approach can reduce the precision of accumulators while maintaining model accuracy with respect to a floating-point baseline. We then show that this reduction translates to increased design efficiency for custom FPGA-based accelerators. Finally, we show that our algorithm not only constrains weights to fit into an accumulator of user-defined bit width, but also increases the sparsity and compressibility of the resulting weights. Across all of our benchmark models trained with 8-bit weights and activations, we observe that constraining the hidden layers of quantized neural networks to fit into 16-bit accumulators yields an average 98.2% sparsity with an estimated compression rate of 46.5x all while maintaining 99.2% of the floating-point performance.
    CRISP: Curriculum based Sequential Neural Decoders for Polar Code Family. (arXiv:2210.00313v2 [cs.IT] UPDATED)
    Polar codes are widely used state-of-the-art codes for reliable communication that have recently been included in the 5th generation wireless standards (5G). However, there remains room for the design of polar decoders that are both efficient and reliable in the short blocklength regime. Motivated by recent successes of data-driven channel decoders, we introduce a novel $\textbf{C}$ur$\textbf{RI}$culum based $\textbf{S}$equential neural decoder for $\textbf{P}$olar codes (CRISP). We design a principled curriculum, guided by information-theoretic insights, to train CRISP and show that it outperforms the successive-cancellation (SC) decoder and attains near-optimal reliability performance on the Polar(32,16) and Polar(64,22) codes. The choice of the proposed curriculum is critical in achieving the accuracy gains of CRISP, as we show by comparing against other curricula. More notably, CRISP can be readily extended to Polarization-Adjusted-Convolutional (PAC) codes, where existing SC decoders are significantly less reliable. To the best of our knowledge, CRISP constructs the first data-driven decoder for PAC codes and attains near-optimal performance on the PAC(32,16) code.
    Autobidders with Budget and ROI Constraints: Efficiency, Regret, and Pacing Dynamics. (arXiv:2301.13306v1 [cs.GT])
    We study a game between autobidding algorithms that compete in an online advertising platform. Each autobidder is tasked with maximizing its advertiser's total value over multiple rounds of a repeated auction, subject to budget and/or return-on-investment constraints. We propose a gradient-based learning algorithm that is guaranteed to satisfy all constraints and achieves vanishing individual regret. Our algorithm uses only bandit feedback and can be used with the first- or second-price auction, as well as with any "intermediate" auction format. Our main result is that when these autobidders play against each other, the resulting expected liquid welfare over all rounds is at least half of the expected optimal liquid welfare achieved by any allocation. This holds whether or not the bidding dynamics converges to an equilibrium and regardless of the correlation structure between advertiser valuations.
    GeneFormer: Learned Gene Compression using Transformer-based Context Modeling. (arXiv:2212.08379v3 [cs.LG] UPDATED)
    With the development of gene sequencing technology, an explosive growth of gene data has been witnessed. And the storage of gene data has become an important issue. Traditional gene data compression methods rely on general software like G-zip, which fails to utilize the interrelation of nucleotide sequence. Recently, many researchers begin to investigate deep learning based gene data compression method. In this paper, we propose a transformer-based gene compression method named GeneFormer. Specifically, we first introduce a modified transformer structure to fully explore the nucleotide sequence dependency. Then, we propose fixed-length parallel grouping to accelerate the decoding speed of our autoregressive model. Experimental results on real-world datasets show that our method saves 29.7% bit rate compared with the state-of-the-art method, and the decoding speed is significantly faster than all existing learning-based gene compression methods.
    Time Series Forecasting via Semi-Asymmetric Convolutional Architecture with Global Atrous Sliding Window. (arXiv:2301.13691v1 [cs.AI])
    The proposed method in this paper is designed to address the problem of time series forecasting. Although some exquisitely designed models achieve excellent prediction performances, how to extract more useful information and make accurate predictions is still an open issue. Most of modern models only focus on a short range of information, which are fatal for problems such as time series forecasting which needs to capture long-term information characteristics. As a result, the main concern of this work is to further mine relationship between local and global information contained in time series to produce more precise predictions. In this paper, to satisfactorily realize the purpose, we make three main contributions that are experimentally verified to have performance advantages. Firstly, original time series is transformed into difference sequence which serves as input to the proposed model. And secondly, we introduce the global atrous sliding window into the forecasting model which references the concept of fuzzy time series to associate relevant global information with temporal data within a time period and utilizes central-bidirectional atrous algorithm to capture underlying-related features to ensure validity and consistency of captured data. Thirdly, a variation of widely-used asymmetric convolution which is called semi-asymmetric convolution is devised to more flexibly extract relationships in adjacent elements and corresponding associated global features with adjustable ranges of convolution on vertical and horizontal directions. The proposed model in this paper achieves state-of-the-art on most of time series datasets provided compared with competitive modern models.
    Grounding Language Models to Images for Multimodal Generation. (arXiv:2301.13823v1 [cs.CL])
    We propose an efficient method to ground pretrained text-only language models to the visual domain, enabling them to process and generate arbitrarily interleaved image-and-text data. Our method leverages the abilities of language models learnt from large scale text-only pretraining, such as in-context learning and free-form text generation. We keep the language model frozen, and finetune input and output linear layers to enable cross-modality interactions. This allows our model to process arbitrarily interleaved image-and-text inputs, and generate free-form text interleaved with retrieved images. We achieve strong zero-shot performance on grounded tasks such as contextual image retrieval and multimodal dialogue, and showcase compelling interactive abilities. Our approach works with any off-the-shelf language model and paves the way towards an effective, general solution for leveraging pretrained language models in visually grounded settings.
    Active Learning-based Domain Adaptive Localized Polynomial Chaos Expansion. (arXiv:2301.13635v1 [cs.LG])
    The paper presents a novel methodology to build surrogate models of complicated functions by an active learning-based sequential decomposition of the input random space and construction of localized polynomial chaos expansions, referred to as domain adaptive localized polynomial chaos expansion (DAL-PCE). The approach utilizes sequential decomposition of the input random space into smaller sub-domains approximated by low-order polynomial expansions. This allows approximation of functions with strong nonlinearties, discontinuities, and/or singularities. Decomposition of the input random space and local approximations alleviates the Gibbs phenomenon for these types of problems and confines error to a very small vicinity near the non-linearity. The global behavior of the surrogate model is therefore significantly better than existing methods as shown in numerical examples. The whole process is driven by an active learning routine that uses the recently proposed $\Theta$ criterion to assess local variance contributions. The proposed approach balances both \emph{exploitation} of the surrogate model and \emph{exploration} of the input random space and thus leads to efficient and accurate approximation of the original mathematical model. The numerical results show the superiority of the DAL-PCE in comparison to (i) a single global polynomial chaos expansion and (ii) the recently proposed stochastic spectral embedding (SSE) method developed as an accurate surrogate model and which is based on a similar domain decomposition process. This method represents general framework upon which further extensions and refinements can be based, and which can be combined with any technique for non-intrusive polynomial chaos expansion construction.
    Enhancing Hyper-To-Real Space Projections Through Euclidean Norm Meta-Heuristic Optimization. (arXiv:2301.13671v1 [cs.LG])
    The continuous computational power growth in the last decades has made solving several optimization problems significant to humankind a tractable task; however, tackling some of them remains a challenge due to the overwhelming amount of candidate solutions to be evaluated, even by using sophisticated algorithms. In such a context, a set of nature-inspired stochastic methods, called meta-heuristic optimization, can provide robust approximate solutions to different kinds of problems with a small computational burden, such as derivative-free real function optimization. Nevertheless, these methods may converge to inadequate solutions if the function landscape is too harsh, e.g., enclosing too many local optima. Previous works addressed this issue by employing a hypercomplex representation of the search space, like quaternions, where the landscape becomes smoother and supposedly easier to optimize. Under this approach, meta-heuristic computations happen in the hypercomplex space, whereas variables are mapped back to the real domain before function evaluation. Despite this latter operation being performed by the Euclidean norm, we have found that after the optimization procedure has finished, it is usually possible to obtain even better solutions by employing the Minkowski $p$-norm instead and fine-tuning $p$ through an auxiliary sub-problem with neglecting additional cost and no hyperparameters. Such behavior was observed in eight well-established benchmarking functions, thus fostering a new research direction for hypercomplex meta-heuristic optimization.
    Complete Neural Networks for Euclidean Graphs. (arXiv:2301.13821v1 [cs.LG])
    We propose a 2-WL-like geometric graph isomorphism test and prove it is complete when applied to Euclidean Graphs in $\mathbb{R}^3$. We then use recent results on multiset embeddings to devise an efficient geometric GNN model with equivalent separation power. We verify empirically that our GNN model is able to separate particularly challenging synthetic examples, and demonstrate its usefulness for a chemical property prediction problem.
    Archetypal Analysis++: Rethinking the Initialization Strategy. (arXiv:2301.13748v1 [cs.LG])
    Archetypal analysis is a matrix factorization method with convexity constraints. Due to local minima, a good initialization is essential. Frequently used initialization methods yield either sub-optimal starting points or are prone to get stuck in poor local minima. In this paper, we propose archetypal analysis++ (AA++), a probabilistic initialization strategy for archetypal analysis that sequentially samples points based on their influence on the objective, similar to $k$-means++. In fact, we argue that $k$-means++ already approximates the proposed initialization method. Furthermore, we suggest to adapt an efficient Monte Carlo approximation of $k$-means++ to AA++. In an extensive empirical evaluation of 13 real-world data sets of varying sizes and dimensionalities and considering two pre-processing strategies, we show that AA++ almost consistently outperforms all baselines, including the most frequently used ones.
    Deep learning-based lung segmentation and automatic regional template in chest X-ray images for pediatric tuberculosis. (arXiv:2301.13786v1 [eess.IV])
    Tuberculosis (TB) is still considered a leading cause of death and a substantial threat to global child health. Both TB infection and disease are curable using antibiotics. However, most children who die of TB are never diagnosed or treated. In clinical practice, experienced physicians assess TB by examining chest X-rays (CXR). Pediatric CXR has specific challenges compared to adult CXR, which makes TB diagnosis in children more difficult. Computer-aided diagnosis systems supported by Artificial Intelligence have shown performance comparable to experienced radiologist TB readings, which could ease mass TB screening and reduce clinical burden. We propose a multi-view deep learning-based solution which, by following a proposed template, aims to automatically regionalize and extract lung and mediastinal regions of interest from pediatric CXR images where key TB findings may be present. Experimental results have shown accurate region extraction, which can be used for further analysis to confirm TB finding presence and severity assessment. Code publicly available at https://github.com/dani-capellan/pTB_LungRegionExtractor.
    An Efficient Solution to s-Rectangular Robust Markov Decision Processes. (arXiv:2301.13642v1 [cs.LG])
    We present an efficient robust value iteration for \texttt{s}-rectangular robust Markov Decision Processes (MDPs) with a time complexity comparable to standard (non-robust) MDPs which is significantly faster than any existing method. We do so by deriving the optimal robust Bellman operator in concrete forms using our $L_p$ water filling lemma. We unveil the exact form of the optimal policies, which turn out to be novel threshold policies with the probability of playing an action proportional to its advantage.
    Improved distinct bone segmentation in upper-body CT through multi-resolution networks. (arXiv:2301.13674v1 [eess.IV])
    Purpose: Automated distinct bone segmentation from CT scans is widely used in planning and navigation workflows. U-Net variants are known to provide excellent results in supervised semantic segmentation. However, in distinct bone segmentation from upper body CTs a large field of view and a computationally taxing 3D architecture are required. This leads to low-resolution results lacking detail or localisation errors due to missing spatial context when using high-resolution inputs. Methods: We propose to solve this problem by using end-to-end trainable segmentation networks that combine several 3D U-Nets working at different resolutions. Our approach, which extends and generalizes HookNet and MRN, captures spatial information at a lower resolution and skips the encoded information to the target network, which operates on smaller high-resolution inputs. We evaluated our proposed architecture against single resolution networks and performed an ablation study on information concatenation and the number of context networks. Results: Our proposed best network achieves a median DSC of 0.86 taken over all 125 segmented bone classes and reduces the confusion among similar-looking bones in different locations. These results outperform our previously published 3D U-Net baseline results on the task and distinct-bone segmentation results reported by other groups. Conclusion: The presented multi-resolution 3D U-Nets address current shortcomings in bone segmentation from upper-body CT scans by allowing for capturing a larger field of view while avoiding the cubic growth of the input pixels and intermediate computations that quickly outgrow the computational capacities in 3D. The approach thus improves the accuracy and efficiency of distinct bone segmentation from upper-body CT.
    Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models. (arXiv:2301.13826v1 [cs.CV])
    Recent text-to-image generative models have demonstrated an unparalleled ability to generate diverse and creative imagery guided by a target text prompt. While revolutionary, current state-of-the-art diffusion models may still fail in generating images that fully convey the semantics in the given text prompt. We analyze the publicly available Stable Diffusion model and assess the existence of catastrophic neglect, where the model fails to generate one or more of the subjects from the input prompt. Moreover, we find that in some cases the model also fails to correctly bind attributes (e.g., colors) to their corresponding subjects. To help mitigate these failure cases, we introduce the concept of Generative Semantic Nursing (GSN), where we seek to intervene in the generative process on the fly during inference time to improve the faithfulness of the generated images. Using an attention-based formulation of GSN, dubbed Attend-and-Excite, we guide the model to refine the cross-attention units to attend to all subject tokens in the text prompt and strengthen - or excite - their activations, encouraging the model to generate all subjects described in the text prompt. We compare our approach to alternative approaches and demonstrate that it conveys the desired concepts more faithfully across a range of text prompts.
    Anti-Exploration by Random Network Distillation. (arXiv:2301.13616v1 [cs.LG])
    Despite the success of Random Network Distillation (RND) in various domains, it was shown as not discriminative enough to be used as an uncertainty estimator for penalizing out-of-distribution actions in offline reinforcement learning. In this paper, we revisit these results and show that, with a naive choice of conditioning for the RND prior, it becomes infeasible for the actor to effectively minimize the anti-exploration bonus and discriminativity is not an issue. We show that this limitation can be avoided with conditioning based on Feature-wise Linear Modulation (FiLM), resulting in a simple and efficient ensemble-free algorithm based on Soft Actor-Critic. We evaluate it on the D4RL benchmark, showing that it is capable of achieving performance comparable to ensemble-based methods and outperforming ensemble-free approaches by a wide margin.
    UPop: Unified and Progressive Pruning for Compressing Vision-Language Transformers. (arXiv:2301.13741v1 [cs.CV])
    Real-world data contains a vast amount of multimodal information, among which vision and language are the two most representative modalities. Moreover, increasingly heavier models, e.g., Transformers, have attracted the attention of researchers to model compression. However, how to compress multimodal models, especially vison-language Transformers, is still under-explored. This paper proposes the \textbf{U}nified and \textbf{P}r\textbf{o}gressive \textbf{P}runing (UPop) as a universal vison-language Transformer compression framework, which incorporates 1) unifiedly searching multimodal subnets in a continuous optimization space from the original model, which enables automatic assignment of pruning ratios among compressible modalities and structures; 2) progressively searching and retraining the subnet, which maintains convergence between the search and retrain to attain higher compression ratios. Experiments on multiple generative and discriminative vision-language tasks, including Visual Reasoning, Image Caption, Visual Question Answer, Image-Text Retrieval, Text-Image Retrieval, and Image Classification, demonstrate the effectiveness and versatility of the proposed UPop framework.
    Semi-Supervised Classification with Graph Convolutional Kernel Machines. (arXiv:2301.13764v1 [cs.LG])
    We present a deep Graph Convolutional Kernel Machine (GCKM) for semi-supervised node classification in graphs. First, we introduce an unsupervised kernel machine propagating the node features in a one-hop neighbourhood. Then, we specify a semi-supervised classification kernel machine through the lens of the Fenchel-Young inequality. The deep graph convolutional kernel machine is obtained by stacking multiple shallow kernel machines. After showing that unsupervised and semi-supervised layer corresponds to an eigenvalue problem and a linear system on the aggregated node features, respectively, we derive an efficient end-to-end training algorithm in the dual variables. Numerical experiments demonstrate that our approach is competitive with state-of-the-art graph neural networks for homophilious and heterophilious benchmark datasets. Notably, GCKM achieves superior performance when very few labels are available.
    Dissecting the Effects of SGD Noise in Distinct Regimes of Deep Learning. (arXiv:2301.13703v1 [cs.LG])
    Understanding when the noise in stochastic gradient descent (SGD) affects generalization of deep neural networks remains a challenge, complicated by the fact that networks can operate in distinct training regimes. Here we study how the magnitude of this noise $T$ affects performance as the size of the training set $P$ and the scale of initialization $\alpha$ are varied. For gradient descent, $\alpha$ is a key parameter that controls if the network is `lazy' ($\alpha\gg 1$) or instead learns features ($\alpha\ll 1$). For classification of MNIST and CIFAR10 images, our central results are: (i) obtaining phase diagrams for performance in the $(\alpha,T)$ plane. They show that SGD noise can be detrimental or instead useful depending on the training regime. Moreover, although increasing $T$ or decreasing $\alpha$ both allow the net to escape the lazy regime, these changes can have opposite effects on performance. (ii) Most importantly, we find that key dynamical quantities (including the total variations of weights during training) depend on both $T$ and $P$ as power laws, and the characteristic temperature $T_c$, where the noise of SGD starts affecting performance, is a power law of $P$. These observations indicate that a key effect of SGD noise occurs late in training, by affecting the stopping process whereby all data are fitted. We argue that due to SGD noise, nets must develop a stronger `signal', i.e. larger informative weights, to fit the data, leading to a longer training time. The same effect occurs at larger training set $P$. We confirm this view in the perceptron model, where signal and noise can be precisely measured. Interestingly, exponents characterizing the effect of SGD depend on the density of data near the decision boundary, as we explain.
    Simplex Random Features. (arXiv:2301.13856v1 [stat.ML])
    We present Simplex Random Features (SimRFs), a new random feature (RF) mechanism for unbiased approximation of the softmax and Gaussian kernels by geometrical correlation of random projection vectors. We prove that SimRFs provide the smallest possible mean square error (MSE) on unbiased estimates of these kernels among the class of weight-independent geometrically-coupled positive random feature (PRF) mechanisms, substantially outperforming the previously most accurate Orthogonal Random Features at no observable extra cost. We present a more computationally expensive SimRFs+ variant, which we prove is asymptotically optimal in the broader family of weight-dependent geometrical coupling schemes (which permit correlations between random vector directions and norms). In extensive empirical studies, we show consistent gains provided by SimRFs in settings including pointwise kernel estimation, nonparametric classification and scalable Transformers.
    Alternating Updates for Efficient Transformers. (arXiv:2301.13310v1 [cs.LG])
    It is well established that increasing scale in deep transformer networks leads to improved quality and performance. This increase in scale often comes with an increase in compute cost and inference latency. Consequently, research into methods which help realize the benefits of increased scale without leading to an increase in the compute cost becomes important. We introduce Alternating Updates (AltUp), a simple-to-implement method to increase a model's capacity without the computational burden. AltUp enables the widening of the learned representation without increasing the computation time by working on a subblock of the representation at each layer. Our experiments on various transformer models and language tasks demonstrate the consistent effectiveness of alternating updates on a diverse set of benchmarks. Finally, we present extensions of AltUp to the sequence dimension, and demonstrate how AltUp can be synergistically combined with existing approaches, such as Sparse Mixture-of-Experts models, to obtain efficient models with even higher capacity.
    Differentially Private Distributed Bayesian Linear Regression with MCMC. (arXiv:2301.13778v1 [stat.ML])
    We propose a novel Bayesian inference framework for distributed differentially private linear regression. We consider a distributed setting where multiple parties hold parts of the data and share certain summary statistics of their portions in privacy-preserving noise. We develop a novel generative statistical model for privately shared statistics, which exploits a useful distributional relation between the summary statistics of linear regression. Bayesian estimation of the regression coefficients is conducted mainly using Markov chain Monte Carlo algorithms, while we also provide a fast version to perform Bayesian estimation in one iteration. The proposed methods have computational advantages over their competitors. We provide numerical results on both real and simulated data, which demonstrate that the proposed algorithms provide well-rounded estimation and prediction.
    Improved Algorithms for Multi-period Multi-class Packing Problems with~Bandit~Feedback. (arXiv:2301.13791v1 [stat.ML])
    We consider the linear contextual multi-class multi-period packing problem~(LMMP) where the goal is to pack items such that the total vector of consumption is below a given budget vector and the total value is as large as possible. We consider the setting where the reward and the consumption vector associated with each action is a class-dependent linear function of the context, and the decision-maker receives bandit feedback. LMMP includes linear contextual bandits with knapsacks and online revenue management as special cases. We establish a new more efficient estimator which guarantees a faster convergence rate, and consequently, a lower regret in such problems. We propose a bandit policy that is a closed-form function of said estimated parameters. When the contexts are non-degenerate, the regret of the proposed policy is sublinear in the context dimension, the number of classes, and the time horizon~$T$ when the budget grows at least as $\sqrt{T}$. We also resolve an open problem posed in Agrawal & Devanur (2016), and extend the result to a multi-class setting. Our numerical experiments clearly demonstrate that the performance of our policy is superior to other benchmarks in the literature.
    A Data-Driven Modeling and Control Framework for Physics-Based Building Emulators. (arXiv:2301.13447v1 [eess.SY])
    We present a data-driven modeling and control framework for physics-based building emulators. Our approach comprises: (a) Offline training of differentiable surrogate models that speed up model evaluations, provide cheap gradients, and have good predictive accuracy for the receding horizon in Model Predictive Control (MPC) and (b) Formulating and solving nonlinear building HVAC MPC problems. We extensively verify the modeling and control performance using multiple surrogate models and optimization frameworks for different available test cases in the Building Optimization Testing Framework (BOPTEST). The framework is compatible with other modeling techniques and customizable with different control formulations. The modularity makes the approach future-proof for test cases currently in development for physics-based building emulators and provides a path toward prototyping predictive controllers in large buildings.
    Sport Task: Fine Grained Action Detection and Classification of Table Tennis Strokes from Videos for MediaEval 2022. (arXiv:2301.13576v1 [cs.AI])
    Sports video analysis is a widespread research topic. Its applications are very diverse, like events detection during a match, video summary, or fine-grained movement analysis of athletes. As part of the MediaEval 2022 benchmarking initiative, this task aims at detecting and classifying subtle movements from sport videos. We focus on recordings of table tennis matches. Conducted since 2019, this task provides a classification challenge from untrimmed videos recorded under natural conditions with known temporal boundaries for each stroke. Since 2021, the task also provides a stroke detection challenge from unannotated, untrimmed videos. This year, the training, validation, and test sets are enhanced to ensure that all strokes are represented in each dataset. The dataset is now similar to the one used in [1, 2]. This research is intended to build tools for coaches and athletes who want to further evaluate their sport performances.
    Execution-based Code Generation using Deep Reinforcement Learning. (arXiv:2301.13816v1 [cs.LG])
    The utilization of programming language (PL) models, pretrained on large-scale code corpora, as a means of automating software engineering processes has demonstrated considerable potential in streamlining various code generation tasks such as code completion, code translation, and program synthesis. However, current approaches mainly rely on supervised fine-tuning objectives borrowed from text generation, neglecting specific sequence-level features of code, including but not limited to compilability as well as syntactic and functional correctness. To address this limitation, we propose PPOCoder, a new framework for code generation that combines pretrained PL models with Proximal Policy Optimization (PPO) deep reinforcement learning and employs execution feedback as the external source of knowledge into the model optimization. PPOCoder is transferable across different code generation tasks and PLs. Extensive experiments on three code generation tasks demonstrate the effectiveness of our proposed approach compared to SOTA methods, improving the success rate of compilation and functional correctness over different PLs. Our code can be found at https://github.com/reddy-lab-code-research/PPOCoder .
    Causal-Discovery Performance of ChatGPT in the context of Neuropathic Pain Diagnosis. (arXiv:2301.13819v1 [cs.CL])
    ChatGPT has demonstrated exceptional proficiency in natural language conversation, e.g., it can answer a wide range of questions while no previous large language models can. Thus, we would like to push its limit and explore its ability to answer causal discovery questions by using a medical benchmark (Tu et al. 2019) in causal discovery.
    A Bayesian Generative Adversarial Network (GAN) to Generate Synthetic Time-Series Data, Application in Combined Sewer Flow Prediction. (arXiv:2301.13733v1 [cs.LG])
    Despite various breakthroughs in machine learning and data analysis techniques for improving smart operation and management of urban water infrastructures, some key limitations obstruct this progress. Among these shortcomings, the absence of freely available data due to data privacy or high costs of data gathering and the nonexistence of adequate rare or extreme events in the available data plays a crucial role. Here, Generative Adversarial Networks (GANs) can help overcome these challenges. In machine learning, generative models are a class of methods capable of learning data distribution to generate artificial data. In this study, we developed a GAN model to generate synthetic time series to balance our limited recorded time series data and improve the accuracy of a data-driven model for combined sewer flow prediction. We considered the sewer system of a small town in Germany as the test case. Precipitation and inflow to the storage tanks are used for the Data-Driven model development. The aim is to predict the flow using precipitation data and examine the impact of data augmentation using synthetic data in model performance. Results show that GAN can successfully generate synthetic time series from real data distribution, which helps more accurate peak flow prediction. However, the model without data augmentation works better for dry weather prediction. Therefore, an ensemble model is suggested to combine the advantages of both models.
    Improving Monte Carlo Evaluation with Offline Data. (arXiv:2301.13734v1 [cs.LG])
    Monte Carlo (MC) methods are the most widely used methods to estimate the performance of a policy. Given an interested policy, MC methods give estimates by repeatedly running this policy to collect samples and taking the average of the outcomes. Samples collected during this process are called online samples. To get an accurate estimate, MC methods consume massive online samples. When online samples are expensive, e.g., online recommendations and inventory management, we want to reduce the number of online samples while achieving the same estimate accuracy. To this end, we use off-policy MC methods that evaluate the interested policy by running a different policy called behavior policy. We design a tailored behavior policy such that the variance of the off-policy MC estimator is provably smaller than the ordinary MC estimator. Importantly, this tailored behavior policy can be efficiently learned from existing offline data, i,e., previously logged data, which are much cheaper than online samples. With reduced variance, our off-policy MC method requires fewer online samples to evaluate the performance of a policy compared with the ordinary MC method. Moreover, our off-policy MC estimator is always unbiased.
    A Survey of Explainable AI in Deep Visual Modeling: Methods and Metrics. (arXiv:2301.13445v1 [cs.CV])
    Deep visual models have widespread applications in high-stake domains. Hence, their black-box nature is currently attracting a large interest of the research community. We present the first survey in Explainable AI that focuses on the methods and metrics for interpreting deep visual models. Covering the landmark contributions along the state-of-the-art, we not only provide a taxonomic organization of the existing techniques, but also excavate a range of evaluation metrics and collate them as measures of different properties of model explanations. Along the insightful discussion on the current trends, we also discuss the challenges and future avenues for this research direction.
    Learning Generalized Hybrid Proximity Representation for Image Recognition. (arXiv:2301.13459v1 [cs.CV])
    Recently, deep metric learning techniques received attention, as the learned distance representations are useful to capture the similarity relationship among samples and further improve the performance of various of supervised or unsupervised learning tasks. We propose a novel supervised metric learning method that can learn the distance metrics in both geometric and probabilistic space for image recognition. In contrast to the previous metric learning methods which usually focus on learning the distance metrics in Euclidean space, our proposed method is able to learn better distance representation in a hybrid approach. To achieve this, we proposed a Generalized Hybrid Metric Loss (GHM-Loss) to learn the general hybrid proximity features from the image data by controlling the trade-off between geometric proximity and probabilistic proximity. To evaluate the effectiveness of our method, we first provide theoretical derivations and proofs of the proposed loss function, then we perform extensive experiments on two public datasets to show the advantage of our method compared to other state-of-the-art metric learning methods.
    Convolutional autoencoder for the spatiotemporal latent representation of turbulence. (arXiv:2301.13728v1 [physics.flu-dyn])
    Turbulence is characterised by chaotic dynamics and a high-dimensional state space, which make the phenomenon challenging to predict. However, turbulent flows are often characterised by coherent spatiotemporal structures, such as vortices or large-scale modes, which can help obtain a latent description of turbulent flows. However, current approaches are often limited by either the need to use some form of thresholding on quantities defining the isosurfaces to which the flow structures are associated or the linearity of traditional modal flow decomposition approaches, such as those based on proper orthogonal decomposition. This problem is exacerbated in flows that exhibit extreme events, which are rare and sudden changes in a turbulent state. The goal of this paper is to obtain an efficient and accurate reduced-order latent representation of a turbulent flow that exhibits extreme events. Specifically, we employ a three-dimensional multiscale convolutional autoencoder (CAE) to obtain such latent representation. We apply it to a three-dimensional turbulent flow. We show that the Multiscale CAE is efficient, requiring less than 10% degrees of freedom than proper orthogonal decomposition for compressing the data and is able to accurately reconstruct flow states related to extreme events. The proposed deep learning architecture opens opportunities for nonlinear reduced-order modeling of turbulent flows from data.
    Skill Decision Transformer. (arXiv:2301.13573v1 [cs.LG])
    Recent work has shown that Large Language Models (LLMs) can be incredibly effective for offline reinforcement learning (RL) by representing the traditional RL problem as a sequence modelling problem (Chen et al., 2021; Janner et al., 2021). However many of these methods only optimize for high returns, and may not extract much information from a diverse dataset of trajectories. Generalized Decision Transformers (GDTs) (Furuta et al., 2021) have shown that utilizing future trajectory information, in the form of information statistics, can help extract more information from offline trajectory data. Building upon this, we propose Skill Decision Transformer (Skill DT). Skill DT draws inspiration from hindsight relabelling (Andrychowicz et al., 2017) and skill discovery methods to discover a diverse set of primitive behaviors, or skills. We show that Skill DT can not only perform offline state-marginal matching (SMM), but can discovery descriptive behaviors that can be easily sampled. Furthermore, we show that through purely reward-free optimization, Skill DT is still competitive with supervised offline RL approaches on the D4RL benchmark. The code and videos can be found on our project page: https://github.com/shyamsn97/skill-dt
    On the Initialisation of Wide Low-Rank Feedforward Neural Networks. (arXiv:2301.13710v1 [stat.ML])
    The edge-of-chaos dynamics of wide randomly initialized low-rank feedforward networks are analyzed. Formulae for the optimal weight and bias variances are extended from the full-rank to low-rank setting and are shown to follow from multiplicative scaling. The principle second order effect, the variance of the input-output Jacobian, is derived and shown to increase as the rank to width ratio decreases. These results inform practitioners how to randomly initialize feedforward networks with a reduced number of learnable parameters while in the same ambient dimension, allowing reductions in the computational cost and memory constraints of the associated network.
    Mathematical Capabilities of ChatGPT. (arXiv:2301.13867v1 [cs.LG])
    We investigate the mathematical capabilities of ChatGPT by testing it on publicly available datasets, as well as hand-crafted ones, and measuring its performance against other models trained on a mathematical corpus, such as Minerva. We also test whether ChatGPT can be a useful assistant to professional mathematicians by emulating various use cases that come up in the daily professional activities of mathematicians (question answering, theorem searching). In contrast to formal mathematics, where large databases of formal proofs are available (e.g., the Lean Mathematical Library), current datasets of natural-language mathematics, used to benchmark language models, only cover elementary mathematics. We address this issue by introducing a new dataset: GHOSTS. It is the first natural-language dataset made and curated by working researchers in mathematics that (1) aims to cover graduate-level mathematics and (2) provides a holistic overview of the mathematical capabilities of language models. We benchmark ChatGPT on GHOSTS and evaluate performance against fine-grained criteria. We make this new dataset publicly available to assist a community-driven comparison of ChatGPT with (future) large language models in terms of advanced mathematical comprehension. We conclude that contrary to many positive reports in the media (a potential case of selection bias), ChatGPT's mathematical abilities are significantly below those of an average mathematics graduate student. Our results show that ChatGPT often understands the question but fails to provide correct solutions. Hence, if your goal is to use it to pass a university exam, you would be better off copying from your average peer!
    Large Music Recommendation Studies for Small Teams. (arXiv:2301.13388v1 [cs.HC])
    Running live music recommendation studies without direct industry partnerships can be a prohibitively daunting task, especially for small teams. In order to help future researchers interested in such evaluations, we present a number of struggles we faced in the process of generating our own such evaluation system alongside potential solutions. These problems span the topics of users, data, computation, and application architecture.
    Few-Shot Image-to-Semantics Translation for Policy Transfer in Reinforcement Learning. (arXiv:2301.13343v1 [cs.LG])
    We investigate policy transfer using image-to-semantics translation to mitigate learning difficulties in vision-based robotics control agents. This problem assumes two environments: a simulator environment with semantics, that is, low-dimensional and essential information, as the state space, and a real-world environment with images as the state space. By learning mapping from images to semantics, we can transfer a policy, pre-trained in the simulator, to the real world, thereby eliminating real-world on-policy agent interactions to learn, which are costly and risky. In addition, using image-to-semantics mapping is advantageous in terms of the computational efficiency to train the policy and the interpretability of the obtained policy over other types of sim-to-real transfer strategies. To tackle the main difficulty in learning image-to-semantics mapping, namely the human annotation cost for producing a training dataset, we propose two techniques: pair augmentation with the transition function in the simulator environment and active learning. We observed a reduction in the annotation cost without a decline in the performance of the transfer, and the proposed approach outperformed the existing approach without annotation.
    A Scalable, Interpretable, Verifiable & Differentiable Logic Gate Convolutional Neural Network Architecture From Truth Tables. (arXiv:2208.08609v2 [cs.AI] UPDATED)
    We propose $\mathcal{T}$ruth $\mathcal{T}$able net ($\mathcal{TT}$net), a novel Convolutional Neural Network (CNN) architecture that addresses, by design, the open challenges of interpretability, formal verification, and logic gate conversion. $\mathcal{TT}$net is built using CNNs' filters that are equivalent to tractable truth tables and that we call Learning Truth Table (LTT) blocks. The dual form of LTT blocks allows the truth tables to be easily trained with gradient descent and makes these CNNs easy to interpret, verify and infer. Specifically, $\mathcal{TT}$net is a deep CNN model that can be automatically represented, after post-training transformation, as a sum of Boolean decision trees, or as a sum of Disjunctive/Conjunctive Normal Form (DNF/CNF) formulas, or as a compact Boolean logic circuit. We demonstrate the effectiveness and scalability of $\mathcal{TT}$net on multiple datasets, showing comparable interpretability to decision trees, fast complete/sound formal verification, and scalable logic gate representation, all compared to state-of-the-art methods. We believe this work represents a step towards making CNNs more transparent and trustworthy for real-world critical applications.
    Self-Compressing Neural Networks. (arXiv:2301.13142v2 [cs.LG] UPDATED)
    This work focuses on reducing neural network size, which is a major driver of neural network execution time, power consumption, bandwidth, and memory footprint. A key challenge is to reduce size in a manner that can be exploited readily for efficient training and inference without the need for specialized hardware. We propose Self-Compression: a simple, general method that simultaneously achieves two goals: (1) removing redundant weights, and (2) reducing the number of bits required to represent the remaining weights. This is achieved using a generalized loss function to minimize overall network size. In our experiments we demonstrate floating point accuracy with as few as 3% of the bits and 18% of the weights remaining in the network.
    A Reinforcement Learning Framework for Dynamic Mediation Analysis. (arXiv:2301.13348v1 [stat.ML])
    Mediation analysis learns the causal effect transmitted via mediator variables between treatments and outcomes and receives increasing attention in various scientific domains to elucidate causal relations. Most existing works focus on point-exposure studies where each subject only receives one treatment at a single time point. However, there are a number of applications (e.g., mobile health) where the treatments are sequentially assigned over time and the dynamic mediation effects are of primary interest. Proposing a reinforcement learning (RL) framework, we are the first to evaluate dynamic mediation effects in settings with infinite horizons. We decompose the average treatment effect into an immediate direct effect, an immediate mediation effect, a delayed direct effect, and a delayed mediation effect. Upon the identification of each effect component, we further develop robust and semi-parametrically efficient estimators under the RL framework to infer these causal effects. The superior performance of the proposed method is demonstrated through extensive numerical studies, theoretical results, and an analysis of a mobile health dataset.
    Clustering the Sketch: A Novel Approach to Embedding Table Compression. (arXiv:2210.05974v2 [cs.LG] UPDATED)
    Embedding tables are used by machine learning systems to work with categorical features. These tables can become exceedingly large in modern recommendation systems, necessitating the development of new methods for fitting them in memory, even during training. The best previous methods for table compression are so called "post training" quantization schemes such as "product" and "residual" quantization (Gray & Neuhoff, 1998). These methods replace table rows with references to k-means clustered "codewords". Unfortunately, clustering requires prior knowledge of the table to be compressed, which limits the memory savings to inference time and not training time. Hence, recent work, like the QR method (Shi et al., 2020), has used random references (linear sketching), which can be computed with hash functions before training. Unfortunately, the compression achieved is inferior to that achieved by post-training quantization. The new algorithm, CQR, shows how to get the best of two worlds by combining clustering and sketching: First IDs are randomly assigned to a codebook and codewords are trained (end to end) for an epoch. Next, we expand the codebook and apply clustering to reduce the size again. Finally, we add new random references and continue training. We show experimentally close to those of post-training quantization with the training time memory reductions of sketch-based methods, and we prove that our method always converges to the optimal embedding table for least-squares training.
    The Fair Value of Data Under Heterogeneous Privacy Constraints. (arXiv:2301.13336v1 [cs.LG])
    Modern data aggregation often takes the form of a platform collecting data from a network of users. More than ever, these users are now requesting that the data they provide is protected with a guarantee of privacy. This has led to the study of optimal data acquisition frameworks, where the optimality criterion is typically the maximization of utility for the agent trying to acquire the data. This involves determining how to allocate payments to users for the purchase of their data at various privacy levels. The main goal of this paper is to characterize a fair amount to pay users for their data at a given privacy level. We propose an axiomatic definition of fairness, analogous to the celebrated Shapley value. Two concepts for fairness are introduced. The first treats the platform and users as members of a common coalition and provides a complete description of how to divide the utility among the platform and users. In the second concept, fairness is defined only among users, leading to a potential fairness-constrained mechanism design problem for the platform. We consider explicit examples involving private heterogeneous data and show how these notions of fairness can be applied. To the best of our knowledge, these are the first fairness concepts for data that explicitly consider privacy constraints.
    Efficient and Effective Methods for Mixed Precision Neural Network Quantization for Faster, Energy-efficient Inference. (arXiv:2301.13330v1 [cs.LG])
    For effective and efficient deep neural network inference, it is desirable to achieve state-of-the-art accuracy with the simplest networks requiring the least computation, memory, and power. Quantizing networks to lower precision is a powerful technique for simplifying networks. It is generally desirable to quantize as aggressively as possible without incurring significant accuracy degradation. As each layer of a network may have different sensitivity to quantization, mixed precision quantization methods selectively tune the precision of individual layers of a network to achieve a minimum drop in task performance (e.g., accuracy). To estimate the impact of layer precision choice on task performance two methods are introduced: i) Entropy Approximation Guided Layer selection (EAGL) is fast and uses the entropy of the weight distribution, and ii) Accuracy-aware Layer Precision Selection (ALPS) is straightforward and relies on single epoch fine-tuning after layer precision reduction. Using EAGL and ALPS for layer precision selection, full-precision accuracy is recovered with a mix of 4-bit and 2-bit layers for ResNet-50 and ResNet-101 classification networks, demonstrating improved performance across the entire accuracy-throughput frontier, and equivalent performance for the PSPNet segmentation network in our own commensurate comparison over leading mixed precision layer selection techniques, while requiring orders of magnitude less compute time to reach a solution.
    Automated Sentiment and Hate Speech Analysis of Facebook Data by Employing Multilingual Transformer Models. (arXiv:2301.13668v1 [cs.CL])
    In recent years, there has been a heightened consensus within academia and in the public discourse that Social Media Platforms (SMPs), amplify the spread of hateful and negative sentiment content. Researchers have identified how hateful content, political propaganda, and targeted messaging contributed to real-world harms including insurrections against democratically elected governments, genocide, and breakdown of social cohesion due to heightened negative discourse towards certain communities in parts of the world. To counter these issues, SMPs have created semi-automated systems that can help identify toxic speech. In this paper we analyse the statistical distribution of hateful and negative sentiment contents within a representative Facebook dataset (n= 604,703) scrapped through 648 public Facebook pages which identify themselves as proponents (and followers) of far-right Hindutva actors. These pages were identified manually using keyword searches on Facebook and on CrowdTangleand classified as far-right Hindutva pages based on page names, page descriptions, and discourses shared on these pages. We employ state-of-the-art, open-source XLM-T multilingual transformer-based language models to perform sentiment and hate speech analysis of the textual contents shared on these pages over a period of 5.5 years. The result shows the statistical distributions of the predicted sentiment and the hate speech labels; top actors, and top page categories. We further discuss the benchmark performances and limitations of these pre-trained language models.
    Quantifying and Managing Impacts of Concept Drifts on IoT Traffic Inference in Residential ISP Networks. (arXiv:2301.06695v2 [cs.LG] UPDATED)
    Millions of vulnerable consumer IoT devices in home networks are the enabler for cyber crimes putting user privacy and Internet security at risk. Internet service providers (ISPs) are best poised to play key roles in mitigating risks by automatically inferring active IoT devices per household and notifying users of vulnerable ones. Developing a scalable inference method that can perform robustly across thousands of home networks is a non-trivial task. This paper focuses on the challenges of developing and applying data-driven inference models when labeled data of device behaviors is limited and the distribution of data changes (concept drift) across time and space domains. Our contributions are three-fold: (1) We collect and analyze network traffic of 24 types of consumer IoT devices from 12 real homes over six weeks to highlight the challenge of temporal and spatial concept drifts in network behavior of IoT devices; (2) We analyze the performance of two inference strategies, namely "global inference" (a model trained on a combined set of all labeled data from training homes) and "contextualized inference" (several models each trained on the labeled data from a training home) in the presence of concept drifts; and (3) To manage concept drifts, we develop a method that dynamically applies the ``closest'' model (from a set) to network traffic of unseen homes during the testing phase, yielding better performance in 20% of scenarios.
    A Unified Causal View of Domain Invariant Representation Learning. (arXiv:2208.06987v3 [stat.ML] UPDATED)
    Machine learning methods can be unreliable when deployed in domains that differ from the domains on which they were trained. There are a wide range of proposals for mitigating this problem by learning representations that are ``invariant'' in some sense.However, these methods generally contradict each other, and none of them consistently improve performance on real-world domain shift benchmarks. There are two main questions that must be addressed to understand when, if ever, we should use each method. First, how does each ad hoc notion of ``invariance'' relate to the structure of real-world problems? And, second, when does learning invariant representations actually yield robust models? To address these issues, we introduce a broad formal notion of what it means for a real-world domain shift to admit invariant structure. Then, we characterize the causal structures that are compatible with this notion of invariance.With this in hand, we find conditions under which method-specific invariance notions correspond to real-world invariant structure, and we clarify the relationship between invariant structure and robustness to domain shifts. For both questions, we find that the true underlying causal structure of the data plays a critical role.
    Large Language Models Are Implicitly Topic Models: Explaining and Finding Good Demonstrations for In-Context Learning. (arXiv:2301.11916v1 [cs.CL] CROSS LISTED)
    In recent years, pre-trained large language models have demonstrated remarkable efficiency in achieving an inference-time few-shot learning capability known as in-context learning. However, existing literature has highlighted the sensitivity of this capability to the selection of few-shot demonstrations. The underlying mechanisms by which this capability arises from regular language model pretraining objectives remain poorly understood. In this study, we aim to examine the in-context learning phenomenon through a Bayesian lens, viewing large language models as topic models that implicitly infer task-related information from demonstrations. On this premise, we propose an algorithm for selecting optimal demonstrations from a set of annotated data and demonstrate a significant 12.5% improvement relative to the random selection baseline, averaged over eight GPT2 and GPT3 models on eight different real-world text classification datasets. Our empirical findings support our hypothesis that large language models implicitly infer a latent concept variable.
    Single-Loop Switching Subgradient Methods for Non-Smooth Weakly Convex Optimization with Non-Smooth Convex Constraints. (arXiv:2301.13314v1 [math.OC])
    In this paper, we consider a general non-convex constrained optimization problem, where the objective function is weakly convex and the constraint function is convex while they can both be non-smooth. This class of problems arises from many applications in machine learning such as fairness-aware supervised learning. To solve this problem, we consider the classical switching subgradient method by Polyak (1965), which is an intuitive and easily implementable first-order method. Before this work, its iteration complexity was only known for convex optimization. We prove its oracle complexity for finding a nearly stationary point when the objective function is non-convex. The analysis is derived separately when the constraint function is deterministic and stochastic. Compared to existing methods, especially the double-loop methods, the switching gradient method can be applied to non-smooth problems and only has a single loop, which saves the effort on tuning the number of inner iterations.
    Conversational Automated Program Repair. (arXiv:2301.13246v1 [cs.SE])
    Automated Program Repair (APR) can help developers automatically generate patches for bugs. Due to the impressive performance obtained using Large Pre-Trained Language Models (LLMs) on many code related tasks, researchers have started to directly use LLMs for APR. However, prior approaches simply repeatedly sample the LLM given the same constructed input/prompt created from the original buggy code, which not only leads to generating the same incorrect patches repeatedly but also miss the critical information in testcases. To address these limitations, we propose conversational APR, a new paradigm for program repair that alternates between patch generation and validation in a conversational manner. In conversational APR, we iteratively build the input to the model by combining previously generated patches with validation feedback. As such, we leverage the long-term context window of LLMs to not only avoid generating previously incorrect patches but also incorporate validation feedback to help the model understand the semantic meaning of the program under test. We evaluate 10 different LLM including the newly developed ChatGPT model to demonstrate the improvement of conversational APR over the prior LLM for APR approach.
    On the Statistical Benefits of Temporal Difference Learning. (arXiv:2301.13289v1 [cs.LG])
    Given a dataset on actions and resulting long-term rewards, a direct estimation approach fits value functions that minimize prediction error on the training data. Temporal difference learning (TD) methods instead fit value functions by minimizing the degree of temporal inconsistency between estimates made at successive time-steps. Focusing on finite state Markov chains, we provide a crisp asymptotic theory of the statistical advantages of this approach. First, we show that an intuitive inverse trajectory pooling coefficient completely characterizes the percent reduction in mean-squared error of value estimates. Depending on problem structure, the reduction could be enormous or nonexistent. Next, we prove that there can be dramatic improvements in estimates of the difference in value-to-go for two states: TD's errors are bounded in terms of a novel measure - the problem's trajectory crossing time - which can be much smaller than the problem's time horizon.
    MILO: Model-Agnostic Subset Selection Framework for Efficient Model Training and Tuning. (arXiv:2301.13287v1 [cs.LG])
    Training deep networks and tuning hyperparameters on large datasets is computationally intensive. One of the primary research directions for efficient training is to reduce training costs by selecting well-generalizable subsets of training data. Compared to simple adaptive random subset selection baselines, existing intelligent subset selection approaches are not competitive due to the time-consuming subset selection step, which involves computing model-dependent gradients and feature embeddings and applies greedy maximization of submodular objectives. Our key insight is that removing the reliance on downstream model parameters enables subset selection as a pre-processing step and enables one to train multiple models at no additional cost. In this work, we propose MILO, a model-agnostic subset selection framework that decouples the subset selection from model training while enabling superior model convergence and performance by using an easy-to-hard curriculum. Our empirical results indicate that MILO can train models $3\times - 10 \times$ faster and tune hyperparameters $20\times - 75 \times$ faster than full-dataset training or tuning without compromising performance.
    Temporal Consistency Loss for Physics-Informed Neural Networks. (arXiv:2301.13262v1 [physics.flu-dyn])
    Physics-informed neural networks (PINNs) have been widely used to solve partial differential equations in a forward and inverse manner using deep neural networks. However, training these networks can be challenging for multiscale problems. While statistical methods can be employed to scale the regression loss on data, it is generally challenging to scale the loss terms for equations. This paper proposes a method for scaling the mean squared loss terms in the objective function used to train PINNs. Instead of using automatic differentiation to calculate the temporal derivative, we use backward Euler discretization. This provides us with a scaling term for the equations. In this work, we consider the two and three-dimensional Navier-Stokes equations and determine the kinematic viscosity using the spatio-temporal data on the velocity and pressure fields. We first consider numerical datasets to test our method. We test the sensitivity of our method to the time step size, the number of timesteps, noise in the data, and spatial resolution. Finally, we use the velocity field obtained using Particle Image Velocimetry (PIV) experiments to generate a reference pressure field. We then test our framework using the velocity and reference pressure field.
    Online Loss Function Learning. (arXiv:2301.13247v1 [cs.LG])
    Loss function learning is a new meta-learning paradigm that aims to automate the essential task of designing a loss function for a machine learning model. Existing techniques for loss function learning have shown promising results, often improving a model's training dynamics and final inference performance. However, a significant limitation of these techniques is that the loss functions are meta-learned in an offline fashion, where the meta-objective only considers the very first few steps of training, which is a significantly shorter time horizon than the one typically used for training deep neural networks. This causes significant bias towards loss functions that perform well at the very start of training but perform poorly at the end of training. To address this issue we propose a new loss function learning technique for adaptively updating the loss function online after each update to the base model parameters. The experimental results show that our proposed method consistently outperforms the cross-entropy loss and offline loss function learning techniques on a diverse range of neural network architectures and datasets.
    SoftTreeMax: Exponential Variance Reduction in Policy Gradient via Tree Search. (arXiv:2301.13236v1 [cs.LG])
    Despite the popularity of policy gradient methods, they are known to suffer from large variance and high sample complexity. To mitigate this, we introduce SoftTreeMax -- a generalization of softmax that takes planning into account. In SoftTreeMax, we extend the traditional logits with the multi-step discounted cumulative reward, topped with the logits of future states. We consider two variants of SoftTreeMax, one for cumulative reward and one for exponentiated reward. For both, we analyze the gradient variance and reveal for the first time the role of a tree expansion policy in mitigating this variance. We prove that the resulting variance decays exponentially with the planning horizon as a function of the expansion policy. Specifically, we show that the closer the resulting state transitions are to uniform, the faster the decay. In a practical implementation, we utilize a parallelized GPU-based simulator for fast and efficient tree search. Our differentiable tree-based policy leverages all gradients at the tree leaves in each environment step instead of the traditional single-sample-based gradient. We then show in simulation how the variance of the gradient is reduced by three orders of magnitude, leading to better sample complexity compared to the standard policy gradient. On Atari, SoftTreeMax demonstrates up to 5x better performance in a faster run time compared to distributed PPO. Lastly, we demonstrate that high reward correlates with lower variance.
    Probabilistic Neural Data Fusion for Learning from an Arbitrary Number of Multi-fidelity Data Sets. (arXiv:2301.13271v1 [cs.LG])
    In many applications in engineering and sciences analysts have simultaneous access to multiple data sources. In such cases, the overall cost of acquiring information can be reduced via data fusion or multi-fidelity (MF) modeling where one leverages inexpensive low-fidelity (LF) sources to reduce the reliance on expensive high-fidelity (HF) data. In this paper, we employ neural networks (NNs) for data fusion in scenarios where data is very scarce and obtained from an arbitrary number of sources with varying levels of fidelity and cost. We introduce a unique NN architecture that converts MF modeling into a nonlinear manifold learning problem. Our NN architecture inversely learns non-trivial (e.g., non-additive and non-hierarchical) biases of the LF sources in an interpretable and visualizable manifold where each data source is encoded via a low-dimensional distribution. This probabilistic manifold quantifies model form uncertainties such that LF sources with small bias are encoded close to the HF source. Additionally, we endow the output of our NN with a parametric distribution not only to quantify aleatoric uncertainties, but also to reformulate the network's loss function based on strictly proper scoring rules which improve robustness and accuracy on unseen HF data. Through a set of analytic and engineering examples, we demonstrate that our approach provides a high predictive power while quantifying various sources uncertainties.
    Retrosynthetic Planning with Dual Value Networks. (arXiv:2301.13755v1 [cs.AI])
    Retrosynthesis, which aims to find a route to synthesize a target molecule from commercially available starting materials, is a critical task in drug discovery and materials design. Recently, the combination of ML-based single-step reaction predictors with multi-step planners has led to promising results. However, the single-step predictors are mostly trained offline to optimize the single-step accuracy, without considering complete routes. Here, we leverage reinforcement learning (RL) to improve the single-step predictor, by using a tree-shaped MDP to optimize complete routes while retaining single-step accuracy. Desirable routes should be both synthesizable and of low cost. We propose an online training algorithm, called Planning with Dual Value Networks (PDVN), in which two value networks predict the synthesizability and cost of molecules, respectively. To maintain the single-step accuracy, we design a two-branch network structure for the single-step predictor. On the widely-used USPTO dataset, our PDVN algorithm improves the search success rate of existing multi-step planners (e.g., increasing the success rate from 85.79% to 98.95% for Retro*, and reducing the number of model calls by half while solving 99.47% molecules for RetroGraph). Furthermore, PDVN finds shorter synthesis routes (e.g., reducing the average route length from 5.76 to 4.83 for Retro*, and from 5.63 to 4.78 for RetroGraph).
    Emergence of Maps in the Memories of Blind Navigation Agents. (arXiv:2301.13261v1 [cs.AI])
    Animal navigation research posits that organisms build and maintain internal spatial representations, or maps, of their environment. We ask if machines -- specifically, artificial intelligence (AI) navigation agents -- also build implicit (or 'mental') maps. A positive answer to this question would (a) explain the surprising phenomenon in recent literature of ostensibly map-free neural-networks achieving strong performance, and (b) strengthen the evidence of mapping as a fundamental mechanism for navigation by intelligent embodied agents, whether they be biological or artificial. Unlike animal navigation, we can judiciously design the agent's perceptual system and control the learning paradigm to nullify alternative navigation mechanisms. Specifically, we train 'blind' agents -- with sensing limited to only egomotion and no other sensing of any kind -- to perform PointGoal navigation ('go to $\Delta$ x, $\Delta$ y') via reinforcement learning. Our agents are composed of navigation-agnostic components (fully-connected and recurrent neural networks), and our experimental setup provides no inductive bias towards mapping. Despite these harsh conditions, we find that blind agents are (1) surprisingly effective navigators in new environments (~95% success); (2) they utilize memory over long horizons (remembering ~1,000 steps of past experience in an episode); (3) this memory enables them to exhibit intelligent behavior (following walls, detecting collisions, taking shortcuts); (4) there is emergence of maps and collision detection neurons in the representations of the environment built by a blind agent as it navigates; and (5) the emergent maps are selective and task dependent (e.g. the agent 'forgets' exploratory detours). Overall, this paper presents no new techniques for the AI audience, but a surprising finding, an insight, and an explanation.
    Interpreting Robustness Proofs of Deep Neural Networks. (arXiv:2301.13845v1 [cs.LG])
    In recent years numerous methods have been developed to formally verify the robustness of deep neural networks (DNNs). Though the proposed techniques are effective in providing mathematical guarantees about the DNNs behavior, it is not clear whether the proofs generated by these methods are human-interpretable. In this paper, we bridge this gap by developing new concepts, algorithms, and representations to generate human understandable interpretations of the proofs. Leveraging the proposed method, we show that the robustness proofs of standard DNNs rely on spurious input features, while the proofs of DNNs trained to be provably robust filter out even the semantically meaningful features. The proofs for the DNNs combining adversarial and provably robust training are the most effective at selectively filtering out spurious features as well as relying on human-understandable input features.
    Contextual Pandora's Box. (arXiv:2205.13114v2 [cs.LG] UPDATED)
    Pandora's Box is a fundamental stochastic optimization problem, where the decision-maker must find a good alternative while minimizing the search cost of exploring the value of each alternative. In the original formulation, it is assumed that accurate distributions are given for the values of all the alternatives, while recent work studies the online variant of Pandora's Box where the distributions are originally unknown. In this work, we study Pandora's Box in the online setting, while incorporating context. At every round, we are presented with a number of alternatives each having a context, an exploration cost and an unknown value drawn from an unknown distribution that may change at every round. Our main result is a no-regret algorithm that performs comparably well to the optimal algorithm which knows all prior distributions exactly. Our algorithm works even in the bandit setting where the algorithm never learns the values of the alternatives that were not explored. The key technique that enables our result is a novel modification of the realizability condition in contextual bandits that connects a context to a sufficient statistic of each alternative's distribution (its "reservation value") rather than its mean.
    Multi-fidelity covariance estimation in the log-Euclidean geometry. (arXiv:2301.13749v1 [stat.CO])
    We introduce a multi-fidelity estimator of covariance matrices that employs the log-Euclidean geometry of the symmetric positive-definite manifold. The estimator fuses samples from a hierarchy of data sources of differing fidelities and costs for variance reduction while guaranteeing definiteness, in contrast with previous approaches. The new estimator makes covariance estimation tractable in applications where simulation or data collection is expensive; to that end, we develop an optimal sample allocation scheme that minimizes the mean-squared error of the estimator given a fixed budget. Guaranteed definiteness is crucial to metric learning, data assimilation, and other downstream tasks. Evaluations of our approach using data from physical applications (heat conduction, fluid dynamics) demonstrate more accurate metric learning and speedups of more than one order of magnitude compared to benchmarks.
    Personalized Subgraph Federated Learning. (arXiv:2206.10206v2 [cs.LG] UPDATED)
    Subgraphs of a larger global graph may be distributed across multiple devices, and only locally accessible due to privacy restrictions, although there may be links between subgraphs. Recently proposed subgraph Federated Learning (FL) methods deal with those missing links across local subgraphs while distributively training Graph Neural Networks (GNNs) on them. However, they have overlooked the inevitable heterogeneity between subgraphs comprising different communities of a global graph, consequently collapsing the incompatible knowledge from local GNN models. To this end, we introduce a new subgraph FL problem, personalized subgraph FL, which focuses on the joint improvement of the interrelated local GNNs rather than learning a single global model, and propose a novel framework, FEDerated Personalized sUBgraph learning (FED-PUB), to tackle it. Since the server cannot access the subgraph in each client, FED-PUB utilizes functional embeddings of the local GNNs using random graphs as inputs to compute similarities between them, and use the similarities to perform weighted averaging for server-side aggregation. Further, it learns a personalized sparse mask at each client to select and update only the subgraph-relevant subset of the aggregated parameters. We validate our FED-PUB for its subgraph FL performance on six datasets, considering both non-overlapping and overlapping subgraphs, on which it significantly outperforms relevant baselines.
    Preserving local densities in low-dimensional embeddings. (arXiv:2301.13732v1 [cs.LG])
    Low-dimensional embeddings and visualizations are an indispensable tool for analysis of high-dimensional data. State-of-the-art methods, such as tSNE and UMAP, excel in unveiling local structures hidden in high-dimensional data and are therefore routinely applied in standard analysis pipelines in biology. We show, however, that these methods fail to reconstruct local properties, such as relative differences in densities (Fig. 1) and that apparent differences in cluster size can arise from computational artifact caused by differing sample sizes (Fig. 2). Providing a theoretical analysis of this issue, we then suggest dtSNE, which approximately conserves local densities. In an extensive study on synthetic benchmark and real world data comparing against five state-of-the-art methods, we empirically show that dtSNE provides similar global reconstruction, but yields much more accurate depictions of local distances and relative densities.
    Zero-shot-Learning Cross-Modality Data Translation Through Mutual Information Guided Stochastic Diffusion. (arXiv:2301.13743v1 [cs.CV])
    Cross-modality data translation has attracted great interest in image computing. Deep generative models (\textit{e.g.}, GANs) show performance improvement in tackling those problems. Nevertheless, as a fundamental challenge in image translation, the problem of Zero-shot-Learning Cross-Modality Data Translation with fidelity remains unanswered. This paper proposes a new unsupervised zero-shot-learning method named Mutual Information guided Diffusion cross-modality data translation Model (MIDiffusion), which learns to translate the unseen source data to the target domain. The MIDiffusion leverages a score-matching-based generative model, which learns the prior knowledge in the target domain. We propose a differentiable local-wise-MI-Layer ($LMI$) for conditioning the iterative denoising sampling. The $LMI$ captures the identical cross-modality features in the statistical domain for the diffusion guidance; thus, our method does not require retraining when the source domain is changed, as it does not rely on any direct mapping between the source and target domains. This advantage is critical for applying cross-modality data translation methods in practice, as a reasonable amount of source domain dataset is not always available for supervised training. We empirically show the advanced performance of MIDiffusion in comparison with an influential group of generative models, including adversarial-based and other score-matching-based models.
    Sharp Variance-Dependent Bounds in Reinforcement Learning: Best of Both Worlds in Stochastic and Deterministic Environments. (arXiv:2301.13446v1 [cs.LG])
    We study variance-dependent regret bounds for Markov decision processes (MDPs). Algorithms with variance-dependent regret guarantees can automatically exploit environments with low variance (e.g., enjoying constant regret on deterministic MDPs). The existing algorithms are either variance-independent or suboptimal. We first propose two new environment norms to characterize the fine-grained variance properties of the environment. For model-based methods, we design a variant of the MVP algorithm (Zhang et al., 2021a) and use new analysis techniques show to this algorithm enjoys variance-dependent bounds with respect to our proposed norms. In particular, this bound is simultaneously minimax optimal for both stochastic and deterministic MDPs, the first result of its kind. We further initiate the study on model-free algorithms with variance-dependent regret bounds by designing a reference-function-based algorithm with a novel capped-doubling reference update schedule. Lastly, we also provide lower bounds to complement our upper bounds.
    Are Defenses for Graph Neural Networks Robust?. (arXiv:2301.13694v1 [cs.LG])
    A cursory reading of the literature suggests that we have made a lot of progress in designing effective adversarial defenses for Graph Neural Networks (GNNs). Yet, the standard methodology has a serious flaw - virtually all of the defenses are evaluated against non-adaptive attacks leading to overly optimistic robustness estimates. We perform a thorough robustness analysis of 7 of the most popular defenses spanning the entire spectrum of strategies, i.e., aimed at improving the graph, the architecture, or the training. The results are sobering - most defenses show no or only marginal improvement compared to an undefended baseline. We advocate using custom adaptive attacks as a gold standard and we outline the lessons we learned from successfully designing such attacks. Moreover, our diverse collection of perturbed graphs forms a (black-box) unit test offering a first glance at a model's robustness.
    Identifying the Hazard Boundary of ML-enabled Autonomous Systems Using Cooperative Co-Evolutionary Search. (arXiv:2301.13807v1 [cs.SE])
    In Machine Learning (ML)-enabled autonomous systems (MLASs), it is essential to identify the hazard boundary of ML Components (MLCs) in the MLAS under analysis. Given that such boundary captures the conditions in terms of MLC behavior and system context that can lead to hazards, it can then be used to, for example, build a safety monitor that can take any predefined fallback mechanisms at runtime when reaching the hazard boundary. However, determining such hazard boundary for an ML component is challenging. This is due to the space combining system contexts (i.e., scenarios) and MLC behaviors (i.e., inputs and outputs) being far too large for exhaustive exploration and even to handle using conventional metaheuristics, such as genetic algorithms. Additionally, the high computational cost of simulations required to determine any MLAS safety violations makes the problem even more challenging. Furthermore, it is unrealistic to consider a region in the problem space deterministically safe or unsafe due to the uncontrollable parameters in simulations and the non-linear behaviors of ML models (e.g., deep neural networks) in the MLAS under analysis. To address the challenges, we propose MLCSHE (ML Component Safety Hazard Envelope), a novel method based on a Cooperative Co-Evolutionary Algorithm (CCEA), which aims to tackle a high-dimensional problem by decomposing it into two lower-dimensional search subproblems. Moreover, we take a probabilistic view of safe and unsafe regions and define a novel fitness function to measure the distance from the probabilistic hazard boundary and thus drive the search effectively. We evaluate the effectiveness and efficiency of MLCSHE on a complex Autonomous Vehicle (AV) case study. Our evaluation results show that MLCSHE is significantly more effective and efficient compared to a standard genetic algorithm and random search.
    An $l_1$-oracle inequality for the Lasso in high-dimensional mixtures of experts models. (arXiv:2009.10622v5 [math.ST] UPDATED)
    Mixtures of experts (MoE) models are a popular framework for modeling heterogeneity in data, for both regression and classification problems in statistics and machine learning, due to their flexibility and the abundance of available statistical estimation and model choice tools. Such flexibility comes from allowing the mixture weights (or gating functions) in the MoE model to depend on the explanatory variables, along with the experts (or component densities). This permits the modeling of data arising from more complex data generating processes when compared to the classical finite mixtures and finite mixtures of regression models, whose mixing parameters are independent of the covariates. The use of MoE models in a high-dimensional setting, when the number of explanatory variables can be much larger than the sample size, is challenging from a computational point of view, and in particular from a theoretical point of view, where the literature is still lacking results for dealing with the curse of dimensionality, for both the statistical estimation and feature selection problems. We consider the finite MoE model with soft-max gating functions and Gaussian experts for high-dimensional regression on heterogeneous data, and its $l_1$-regularized estimation via the Lasso. We focus on the Lasso estimation properties rather than its feature selection properties. We provide a lower bound on the regularization parameter of the Lasso function that ensures an $l_1$-oracle inequality satisfied by the Lasso estimator according to the Kullback--Leibler loss.
    Video Influencers: Unboxing the Mystique. (arXiv:2012.12311v2 [cs.LG] UPDATED)
    Influencer marketing has become a very popular tool to reach customers. Despite the rapid growth in influencer videos, there has been little research on the effectiveness of their constituent elements in explaining video engagement. We study YouTube influencers and analyze their unstructured video data across text, audio and images using a novel "interpretable deep learning" framework that accomplishes both goals of prediction and interpretation. Our prediction-based approach analyzes unstructured data and finds that "what is said" in words (text) is more influential than "how it is said" in imagery (images) followed by acoustics (audio). Our interpretation-based approach is implemented after completion of model prediction by analyzing the same source of unstructured data to measure importance attributed to the video elements. We eliminate several spurious and confounded relationships, and identify a smaller subset of theory-based relationships. We uncover novel findings that establish distinct effects for measures of shallow and deep engagement which are based on the dual-system framework of human thinking. Our approach is validated using simulated data, and we discuss the learnings from our findings for influencers and brands.
    Unsupervised Music Source Separation Using Differentiable Parametric Source Models. (arXiv:2201.09592v2 [cs.SD] UPDATED)
    Supervised deep learning approaches to underdetermined audio source separation achieve state-of-the-art performance but require a dataset of mixtures along with their corresponding isolated source signals. Such datasets can be extremely costly to obtain for musical mixtures. This raises a need for unsupervised methods. We propose a novel unsupervised model-based deep learning approach to musical source separation. Each source is modelled with a differentiable parametric source-filter model. A neural network is trained to reconstruct the observed mixture as a sum of the sources by estimating the source models' parameters given their fundamental frequencies. At test time, soft masks are obtained from the synthesized source signals. The experimental evaluation on a vocal ensemble separation task shows that the proposed method outperforms learning-free methods based on nonnegative matrix factorization and a supervised deep learning baseline. Integrating domain knowledge in the form of source models into a data-driven method leads to high data efficiency: the proposed approach achieves good separation quality even when trained on less than three minutes of audio. This work makes powerful deep learning based separation usable in scenarios where training data with ground truth is expensive or nonexistent.
    Auxiliary Learning as an Asymmetric Bargaining Game. (arXiv:2301.13501v1 [cs.LG])
    Auxiliary learning is an effective method for enhancing the generalization capabilities of trained models, particularly when dealing with small datasets. However, this approach may present several difficulties: (i) optimizing multiple objectives can be more challenging, and (ii) how to balance the auxiliary tasks to best assist the main task is unclear. In this work, we propose a novel approach, named AuxiNash, for balancing tasks in auxiliary learning by formalizing the problem as generalized bargaining game with asymmetric task bargaining power. Furthermore, we describe an efficient procedure for learning the bargaining power of tasks based on their contribution to the performance of the main task and derive theoretical guarantees for its convergence. Finally, we evaluate AuxiNash on multiple multi-task benchmarks and find that it consistently outperforms competing methods.
    An Analysis of Classification Approaches for Hit Song Prediction using Engineered Metadata Features with Lyrics and Audio Features. (arXiv:2301.13507v1 [cs.IR])
    Hit song prediction, one of the emerging fields in music information retrieval (MIR), remains a considerable challenge. Being able to understand what makes a given song a hit is clearly beneficial to the whole music industry. Previous approaches to hit song prediction have focused on using audio features of a record. This study aims to improve the prediction result of the top 10 hits among Billboard Hot 100 songs using more alternative metadata, including song audio features provided by Spotify, song lyrics, and novel metadata-based features (title topic, popularity continuity and genre class). Five machine learning approaches are applied, including: k-nearest neighbours, Naive Bayes, Random Forest, Logistic Regression and Multilayer Perceptron. Our results show that Random Forest (RF) and Logistic Regression (LR) with all features (including novel features, song audio features and lyrics features) outperforms other models, achieving 89.1% and 87.2% accuracy, and 0.91 and 0.93 AUC, respectively. Our findings also demonstrate the utility of our novel music metadata features, which contributed most to the models' discriminative performance.
    NP-Match: Towards a New Probabilistic Model for Semi-Supervised Learning. (arXiv:2301.13569v1 [cs.CV])
    Semi-supervised learning (SSL) has been widely explored in recent years, and it is an effective way of leveraging unlabeled data to reduce the reliance on labeled data. In this work, we adjust neural processes (NPs) to the semi-supervised image classification task, resulting in a new method named NP-Match. NP-Match is suited to this task for two reasons. Firstly, NP-Match implicitly compares data points when making predictions, and as a result, the prediction of each unlabeled data point is affected by the labeled data points that are similar to it, which improves the quality of pseudo-labels. Secondly, NP-Match is able to estimate uncertainty that can be used as a tool for selecting unlabeled samples with reliable pseudo-labels. Compared with uncertainty-based SSL methods implemented with Monte-Carlo (MC) dropout, NP-Match estimates uncertainty with much less computational overhead, which can save time at both the training and the testing phases. We conducted extensive experiments on five public datasets under three semi-supervised image classification settings, namely, the standard semi-supervised image classification, the imbalanced semi-supervised image classification, and the multi-label semi-supervised image classification, and NP-Match outperforms state-of-the-art (SOTA) approaches or achieves competitive results on them, which shows the effectiveness of NP-Match and its potential for SSL. The codes are at https://github.com/Jianf-Wang/NP-Match
    Probably Anytime-Safe Stochastic Combinatorial Semi-Bandits. (arXiv:2301.13393v1 [cs.LG])
    Motivated by concerns about making online decisions that incur undue amount of risk at each time step, in this paper, we formulate the probably anytime-safe stochastic combinatorial semi-bandits problem. In this problem, the agent is given the option to select a subset of size at most $K$ from a set of $L$ ground items. Each item is associated to a certain mean reward as well as a variance that represents its risk. To mitigate the risk that the agent incurs, we require that with probability at least $1-\delta$, over the entire horizon of time $T$, each of the choices that the agent makes should contain items whose sum of variances does not exceed a certain variance budget. We call this probably anytime-safe constraint. Under this constraint, we design and analyze an algorithm {\sc PASCombUCB} that minimizes the regret over the horizon of time $T$. By developing accompanying information-theoretic lower bounds, we show under both the problem-dependent and problem-independent paradigms, {\sc PASCombUCB} is almost asymptotically optimal. Our problem setup, the proposed {\sc PASCombUCB} algorithm, and novel analyses are applicable to domains such as recommendation systems and transportation in which an agent is allowed to choose multiple items at a single time step and wishes to control the risk over the whole time horizon.
    Recurrences reveal shared causal drivers of complex time series. (arXiv:2301.13516v1 [cs.LG])
    Many experimental time series measurements share an unobserved causal driver. Examples include genes targeted by transcription factors, ocean flows influenced by large-scale atmospheric currents, and motor circuits steered by descending neurons. Reliably inferring this unseen driving force is necessary to understand the intermittent nature of top-down control schemes in diverse biological and engineered systems. Here, we introduce a new unsupervised learning algorithm that uses recurrences in time series measurements to gradually reconstruct an unobserved driving signal. Drawing on the mathematical theory of skew-product dynamical systems, we identify recurrence events shared across response time series, which implicitly define a recurrence graph with glass-like structure. As the amount or quality of observed data improves, this recurrence graph undergoes a percolation transition manifesting as weak ergodicity breaking for random walks on the induced landscape -- revealing the shared driver's dynamics, even in the presence of strongly corrupted or noisy measurements. Across several thousand random dynamical systems, we empirically quantify the dependence of reconstruction accuracy on the rate of information transfer from a chaotic driver to the response systems, and we find that effective reconstruction proceeds through gradual approximation of the driver's dominant unstable periodic orbits. Through extensive benchmarks against classical and neural-network-based signal processing techniques, we demonstrate our method's strong ability to extract causal driving signals from diverse real-world datasets spanning neuroscience, genomics, fluid dynamics, and physiology.
    Tricking AI chips into Simulating the Human Brain: A Detailed Performance Analysis. (arXiv:2301.13637v1 [cs.LG])
    Challenging the Nvidia monopoly, dedicated AI-accelerator chips have begun emerging for tackling the computational challenge that the inference and, especially, the training of modern deep neural networks (DNNs) poses to modern computers. The field has been ridden with studies assessing the performance of these contestants across various DNN model types. However, AI-experts are aware of the limitations of current DNNs and have been working towards the fourth AI wave which will, arguably, rely on more biologically inspired models, predominantly on spiking neural networks (SNNs). At the same time, GPUs have been heavily used for simulating such models in the field of computational neuroscience, yet AI-chips have not been tested on such workloads. The current paper aims at filling this important gap by evaluating multiple, cutting-edge AI-chips (Graphcore IPU, GroqChip, Nvidia GPU with Tensor Cores and Google TPU) on simulating a highly biologically detailed model of a brain region, the inferior olive (IO). This IO application stress-tests the different AI-platforms for highlighting architectural tradeoffs by varying its compute density, memory requirements and floating-point numerical accuracy. Our performance analysis reveals that the simulation problem maps extremely well onto the GPU and TPU architectures, which for networks of 125,000 cells leads to a 28x respectively 1,208x speedup over CPU runtimes. At this speed, the TPU sets a new record for largest real-time IO simulation. The GroqChip outperforms both platforms for small networks but, due to implementing some floating-point operations at reduced accuracy, is found not yet usable for brain simulation.
    V2N Service Scaling with Deep Reinforcement Learning. (arXiv:2301.13324v1 [cs.LG])
    The fifth generation (5G) of wireless networks is set out to meet the stringent requirements of vehicular use cases. Edge computing resources can aid in this direction by moving processing closer to end-users, reducing latency. However, given the stochastic nature of traffic loads and availability of physical resources, appropriate auto-scaling mechanisms need to be employed to support cost-efficient and performant services. To this end, we employ Deep Reinforcement Learning (DRL) for vertical scaling in Edge computing to support vehicular-to-network communications. We address the problem using Deep Deterministic Policy Gradient (DDPG). As DDPG is a model-free off-policy algorithm for learning continuous actions, we introduce a discretization approach to support discrete scaling actions. Thus we address scalability problems inherent to high-dimensional discrete action spaces. Employing a real-world vehicular trace data set, we show that DDPG outperforms existing solutions, reducing (at minimum) the average number of active CPUs by 23% while increasing the long-term reward by 24%.
    When Source-Free Domain Adaptation Meets Learning with Noisy Labels. (arXiv:2301.13381v1 [cs.LG])
    Recent state-of-the-art source-free domain adaptation (SFDA) methods have focused on learning meaningful cluster structures in the feature space, which have succeeded in adapting the knowledge from source domain to unlabeled target domain without accessing the private source data. However, existing methods rely on the pseudo-labels generated by source models that can be noisy due to domain shift. In this paper, we study SFDA from the perspective of learning with label noise (LLN). Unlike the label noise in the conventional LLN scenario, we prove that the label noise in SFDA follows a different distribution assumption. We also prove that such a difference makes existing LLN methods that rely on their distribution assumptions unable to address the label noise in SFDA. Empirical evidence suggests that only marginal improvements are achieved when applying the existing LLN methods to solve the SFDA problem. On the other hand, although there exists a fundamental difference between the label noise in the two scenarios, we demonstrate theoretically that the early-time training phenomenon (ETP), which has been previously observed in conventional label noise settings, can also be observed in the SFDA problem. Extensive experiments demonstrate significant improvements to existing SFDA algorithms by leveraging ETP to address the label noise in SFDA.
    Automated Time-frequency Domain Audio Crossfades using Graph Cuts. (arXiv:2301.13380v1 [cs.SD])
    The problem of transitioning smoothly from one audio clip to another arises in many music consumption scenarios, especially as music consumption has moved from professionally curated and live-streamed radios to personal playback devices and services. we present the first steps toward a new method of automatically transitioning from one audio clip to another by discretizing the frequency spectrum into bins and then finding transition times for each bin. We phrase the problem as one of graph flow optimization; specifically min-cut/max-flow.
    Differentially Private Kernel Inducing Points (DP-KIP) for Privacy-preserving Data Distillation. (arXiv:2301.13389v1 [cs.LG])
    While it is tempting to believe that data distillation preserves privacy, distilled data's empirical robustness against known attacks does not imply a provable privacy guarantee. Here, we develop a provably privacy-preserving data distillation algorithm, called differentially private kernel inducing points (DP-KIP). DP-KIP is an instantiation of DP-SGD on kernel ridge regression (KRR). Following a recent work, we use neural tangent kernels and minimize the KRR loss to estimate the distilled datapoints (i.e., kernel inducing points). We provide a computationally efficient JAX implementation of DP-KIP, which we test on several popular image and tabular datasets to show its efficacy in data distillation with differential privacy guarantees.
    Optimizing DDPM Sampling with Shortcut Fine-Tuning. (arXiv:2301.13362v1 [cs.LG])
    In this study, we propose Shortcut Fine-tuning (SFT), a new approach for addressing the challenge of fast sampling of pretrained Denoising Diffusion Probabilistic Models (DDPMs). SFT advocates for the fine-tuning of DDPM samplers through the direct minimization of Integral Probability Metrics (IPM), instead of learning the backward diffusion process. This enables samplers to discover an alternative and more efficient sampling shortcut, deviating from the backward diffusion process. We also propose a new algorithm that is similar to the policy gradient method for fine-tuning DDPMs by proving that under certain assumptions, the gradient descent of diffusion models is equivalent to the policy gradient approach. Through empirical evaluation, we demonstrate that our fine-tuning method can further enhance existing fast DDPM samplers, resulting in sample quality comparable to or even surpassing that of the full-step model across various datasets.
    Misspecification-robust Sequential Neural Likelihood. (arXiv:2301.13368v1 [stat.ME])
    Simulation-based inference (SBI) techniques are now an essential tool for the parameter estimation of mechanistic and simulatable models with intractable likelihoods. Statistical approaches to SBI such as approximate Bayesian computation and Bayesian synthetic likelihood have been well studied in the well specified and misspecified settings. However, most implementations are inefficient in that many model simulations are wasted. Neural approaches such as sequential neural likelihood (SNL) have been developed that exploit all model simulations to build a surrogate of the likelihood function. However, SNL approaches have been shown to perform poorly under model misspecification. In this paper, we develop a new method for SNL that is robust to model misspecification and can identify areas where the model is deficient. We demonstrate the usefulness of the new approach on several illustrative examples.
    Self-Consistent Velocity Matching of Probability Flows. (arXiv:2301.13737v1 [cs.LG])
    We present a discretization-free scalable framework for solving a large class of mass-conserving partial differential equations (PDEs), including the time-dependent Fokker-Planck equation and the Wasserstein gradient flow. The main observation is that the time-varying velocity field of the PDE solution needs to be self-consistent: it must satisfy a fixed-point equation involving the flow characterized by the same velocity field. By parameterizing the flow as a time-dependent neural network, we propose an end-to-end iterative optimization framework called self-consistent velocity matching to solve this class of PDEs. Compared to existing approaches, our method does not suffer from temporal or spatial discretization, covers a wide range of PDEs, and scales to high dimensions. Experimentally, our method recovers analytical solutions accurately when they are available and achieves comparable or better performance in high dimensions with less training time compared to recent large-scale JKO-based methods that are designed for solving a more restrictive family of PDEs.
    Near Optimal Private and Robust Linear Regression. (arXiv:2301.13273v1 [cs.LG])
    We study the canonical statistical estimation problem of linear regression from $n$ i.i.d.~examples under $(\varepsilon,\delta)$-differential privacy when some response variables are adversarially corrupted. We propose a variant of the popular differentially private stochastic gradient descent (DP-SGD) algorithm with two innovations: a full-batch gradient descent to improve sample complexity and a novel adaptive clipping to guarantee robustness. When there is no adversarial corruption, this algorithm improves upon the existing state-of-the-art approach and achieves a near optimal sample complexity. Under label-corruption, this is the first efficient linear regression algorithm to guarantee both $(\varepsilon,\delta)$-DP and robustness. Synthetic experiments confirm the superiority of our approach.
    Learning Coordination Policies over Heterogeneous Graphs for Human-Robot Teams via Recurrent Neural Schedule Propagation. (arXiv:2301.13279v1 [cs.AI])
    As human-robot collaboration increases in the workforce, it becomes essential for human-robot teams to coordinate efficiently and intuitively. Traditional approaches for human-robot scheduling either utilize exact methods that are intractable for large-scale problems and struggle to account for stochastic, time varying human task performance, or application-specific heuristics that require expert domain knowledge to develop. We propose a deep learning-based framework, called HybridNet, combining a heterogeneous graph-based encoder with a recurrent schedule propagator for scheduling stochastic human-robot teams under upper- and lower-bound temporal constraints. The HybridNet's encoder leverages Heterogeneous Graph Attention Networks to model the initial environment and team dynamics while accounting for the constraints. By formulating task scheduling as a sequential decision-making process, the HybridNet's recurrent neural schedule propagator leverages Long Short-Term Memory (LSTM) models to propagate forward consequences of actions to carry out fast schedule generation, removing the need to interact with the environment between every task-agent pair selection. The resulting scheduling policy network provides a computationally lightweight yet highly expressive model that is end-to-end trainable via Reinforcement Learning algorithms. We develop a virtual task scheduling environment for mixed human-robot teams in a multi-round setting, capable of modeling the stochastic learning behaviors of human workers. Experimental results showed that HybridNet outperformed other human-robot scheduling solutions across problem sizes for both deterministic and stochastic human performance, with faster runtime compared to pure-GNN-based schedulers.  ( 2 min )
    Sifer: Overcoming simplicity bias in deep networks using a feature sieve. (arXiv:2301.13293v1 [cs.LG])
    Simplicity bias is the concerning tendency of deep networks to over-depend on simple, weakly predictive features, to the exclusion of stronger, more complex features. This causes biased, incorrect model predictions in many real-world applications, exacerbated by incomplete training data containing spurious feature-label correlations. We propose a direct, interventional method for addressing simplicity bias in DNNs, which we call the feature sieve. We aim to automatically identify and suppress easily-computable spurious features in lower layers of the network, thereby allowing the higher network levels to extract and utilize richer, more meaningful representations. We provide concrete evidence of this differential suppression & enhancement of relevant features on both controlled datasets and real-world images, and report substantial gains on many real-world debiasing benchmarks (11.4% relative gain on Imagenet-A; 3.2% on BAR, etc). Crucially, we outperform many baselines that incorporate knowledge about known spurious or biased attributes, despite our method not using any such information. We believe that our feature sieve work opens up exciting new research directions in automated adversarial feature extraction & representation learning for deep networks.  ( 2 min )
  • Open

    Preserving local densities in low-dimensional embeddings. (arXiv:2301.13732v1 [cs.LG])
    Low-dimensional embeddings and visualizations are an indispensable tool for analysis of high-dimensional data. State-of-the-art methods, such as tSNE and UMAP, excel in unveiling local structures hidden in high-dimensional data and are therefore routinely applied in standard analysis pipelines in biology. We show, however, that these methods fail to reconstruct local properties, such as relative differences in densities (Fig. 1) and that apparent differences in cluster size can arise from computational artifact caused by differing sample sizes (Fig. 2). Providing a theoretical analysis of this issue, we then suggest dtSNE, which approximately conserves local densities. In an extensive study on synthetic benchmark and real world data comparing against five state-of-the-art methods, we empirically show that dtSNE provides similar global reconstruction, but yields much more accurate depictions of local distances and relative densities.  ( 2 min )
    Bayesian Bilinear Neural Network for Predicting the Mid-price Dynamics in Limit-Order Book Markets. (arXiv:2203.03613v2 [econ.EM] UPDATED)
    The prediction of financial markets is a challenging yet important task. In modern electronically-driven markets, traditional time-series econometric methods often appear incapable of capturing the true complexity of the multi-level interactions driving the price dynamics. While recent research has established the effectiveness of traditional machine learning (ML) models in financial applications, their intrinsic inability to deal with uncertainties, which is a great concern in econometrics research and real business applications, constitutes a major drawback. Bayesian methods naturally appear as a suitable remedy conveying the predictive ability of ML methods with the probabilistically-oriented practice of econometric research. By adopting a state-of-the-art second-order optimization algorithm, we train a Bayesian bilinear neural network with temporal attention, suitable for the challenging time-series task of predicting mid-price movements in ultra-high-frequency limit-order book markets. We thoroughly compare our Bayesian model with traditional ML alternatives by addressing the use of predictive distributions to analyze errors and uncertainties associated with the estimated parameters and model forecasts. Our results underline the feasibility of the Bayesian deep-learning approach and its predictive and decisional advantages in complex econometric tasks, prompting future research in this direction.
    Learning from many trajectories. (arXiv:2203.17193v2 [cs.LG] UPDATED)
    We initiate a study of supervised learning from many independent sequences ("trajectories") of non-independent covariates, reflecting tasks in sequence modeling, control, and reinforcement learning. Conceptually, our multi-trajectory setup sits between two traditional settings in statistical learning theory: learning from independent examples and learning from a single auto-correlated sequence. Our conditions for efficient learning generalize the former setting--trajectories must be non-degenerate in ways that extend standard requirements for independent examples. Notably, we do not require that trajectories be ergodic, long, nor strictly stable. For linear least-squares regression, given $n$-dimensional examples produced by $m$ trajectories, each of length $T$, we observe a notable change in statistical efficiency as the number of trajectories increases from a few (namely $m \lesssim n$) to many (namely $m \gtrsim n$). Specifically, we establish that the worst-case error rate of this problem is $\Theta(n / m T)$ whenever $m \gtrsim n$. Meanwhile, when $m \lesssim n$, we establish a (sharp) lower bound of $\Omega(n^2 / m^2 T)$ on the worst-case error rate, realized by a simple, marginally unstable linear dynamical system. A key upshot is that, in domains where trajectories regularly reset, the error rate eventually behaves as if all of the examples were independent, drawn from their marginals. As a corollary of our analysis, we also improve guarantees for the linear system identification problem.
    Learning Generalized Hybrid Proximity Representation for Image Recognition. (arXiv:2301.13459v1 [cs.CV])
    Recently, deep metric learning techniques received attention, as the learned distance representations are useful to capture the similarity relationship among samples and further improve the performance of various of supervised or unsupervised learning tasks. We propose a novel supervised metric learning method that can learn the distance metrics in both geometric and probabilistic space for image recognition. In contrast to the previous metric learning methods which usually focus on learning the distance metrics in Euclidean space, our proposed method is able to learn better distance representation in a hybrid approach. To achieve this, we proposed a Generalized Hybrid Metric Loss (GHM-Loss) to learn the general hybrid proximity features from the image data by controlling the trade-off between geometric proximity and probabilistic proximity. To evaluate the effectiveness of our method, we first provide theoretical derivations and proofs of the proposed loss function, then we perform extensive experiments on two public datasets to show the advantage of our method compared to other state-of-the-art metric learning methods.
    Improved Algorithms for Multi-period Multi-class Packing Problems with~Bandit~Feedback. (arXiv:2301.13791v1 [stat.ML])
    We consider the linear contextual multi-class multi-period packing problem~(LMMP) where the goal is to pack items such that the total vector of consumption is below a given budget vector and the total value is as large as possible. We consider the setting where the reward and the consumption vector associated with each action is a class-dependent linear function of the context, and the decision-maker receives bandit feedback. LMMP includes linear contextual bandits with knapsacks and online revenue management as special cases. We establish a new more efficient estimator which guarantees a faster convergence rate, and consequently, a lower regret in such problems. We propose a bandit policy that is a closed-form function of said estimated parameters. When the contexts are non-degenerate, the regret of the proposed policy is sublinear in the context dimension, the number of classes, and the time horizon~$T$ when the budget grows at least as $\sqrt{T}$. We also resolve an open problem posed in Agrawal & Devanur (2016), and extend the result to a multi-class setting. Our numerical experiments clearly demonstrate that the performance of our policy is superior to other benchmarks in the literature.
    A relaxed proximal gradient descent algorithm for convergent plug-and-play with proximal denoiser. (arXiv:2301.13731v1 [stat.ML])
    This paper presents a new convergent Plug-and-Play (PnP) algorithm. PnP methods are efficient iterative algorithms for solving image inverse problems formulated as the minimization of the sum of a data-fidelity term and a regularization term. PnP methods perform regularization by plugging a pre-trained denoiser in a proximal algorithm, such as Proximal Gradient Descent (PGD). To ensure convergence of PnP schemes, many works study specific parametrizations of deep denoisers. However, existing results require either unverifiable or suboptimal hypotheses on the denoiser, or assume restrictive conditions on the parameters of the inverse problem. Observing that these limitations can be due to the proximal algorithm in use, we study a relaxed version of the PGD algorithm for minimizing the sum of a convex function and a weakly convex one. When plugged with a relaxed proximal denoiser, we show that the proposed PnP-$\alpha$PGD algorithm converges for a wider range of regularization parameters, thus allowing more accurate image restoration.
    Stabilize Deep ResNet with A Sharp Scaling Factor $\tau$. (arXiv:1903.07120v5 [cs.LG] UPDATED)
    We study the stability and convergence of training deep ResNets with gradient descent. Specifically, we show that the parametric branch in the residual block should be scaled down by a factor $\tau =O(1/\sqrt{L})$ to guarantee stable forward/backward process, where $L$ is the number of residual blocks. Moreover, we establish a converse result that the forward process is unbounded when $\tau>L^{-\frac{1}{2}+c}$, for any positive constant $c$. The above two results together establish a sharp value of the scaling factor in determining the stability of deep ResNet. Based on the stability result, we further show that gradient descent finds the global minima if the ResNet is properly over-parameterized, which significantly improves over the previous work with a much larger range of $\tau$ that admits global convergence. Moreover, we show that the convergence rate is independent of the depth, theoretically justifying the advantage of ResNet over vanilla feedforward network. Empirically, with such a factor $\tau$, one can train deep ResNet without normalization layer. Moreover, for ResNets with normalization layer, adding such a factor $\tau$ also stabilizes the training and obtains significant performance gain for deep ResNet.
    Real-Time Outlier Detection with Dynamic Process Limits. (arXiv:2301.13527v1 [cs.LG])
    Anomaly detection methods are part of the systems where rare events may endanger an operation's profitability, safety, and environmental aspects. Although many state-of-the-art anomaly detection methods were developed to date, their deployment is limited to the operation conditions present during the model training. Online anomaly detection brings the capability to adapt to data drifts and change points that may not be represented during model development resulting in prolonged service life. This paper proposes an online anomaly detection algorithm for existing real-time infrastructures where low-latency detection is required and novel patterns in data occur unpredictably. The online inverse cumulative distribution-based approach is introduced to eliminate common problems of offline anomaly detectors, meanwhile providing dynamic process limits to normal operation. The benefit of the proposed method is the ease of use, fast computation, and deployability as shown in two case studies of real microgrid operation data.
    Discovery of Single Independent Latent Variable. (arXiv:2110.05887v2 [stat.ML] UPDATED)
    Latent variable discovery is a central problem in data analysis with a broad range of applications in applied science. In this work, we consider data given as an invertible mixture of two statistically independent components, and assume that one of the components is observed while the other is hidden. Our goal is to recover the hidden component. For this purpose, we propose an autoencoder equipped with a discriminator. Unlike the standard nonlinear ICA problem, which was shown to be non-identifiable, in the special case of ICA we consider here, we show that our approach can recover the component of interest up to entropy-preserving transformation. We demonstrate the performance of the proposed approach on several datasets, including image synthesis, voice cloning, and fetal ECG extraction.
    A Reinforcement Learning Framework for Dynamic Mediation Analysis. (arXiv:2301.13348v1 [stat.ML])
    Mediation analysis learns the causal effect transmitted via mediator variables between treatments and outcomes and receives increasing attention in various scientific domains to elucidate causal relations. Most existing works focus on point-exposure studies where each subject only receives one treatment at a single time point. However, there are a number of applications (e.g., mobile health) where the treatments are sequentially assigned over time and the dynamic mediation effects are of primary interest. Proposing a reinforcement learning (RL) framework, we are the first to evaluate dynamic mediation effects in settings with infinite horizons. We decompose the average treatment effect into an immediate direct effect, an immediate mediation effect, a delayed direct effect, and a delayed mediation effect. Upon the identification of each effect component, we further develop robust and semi-parametrically efficient estimators under the RL framework to infer these causal effects. The superior performance of the proposed method is demonstrated through extensive numerical studies, theoretical results, and an analysis of a mobile health dataset.
    Physics-constrained 3D Convolutional Neural Networks for Electrodynamics. (arXiv:2301.13715v1 [physics.acc-ph])
    We present a physics-constrained neural network (PCNN) approach to solving Maxwell's equations for the electromagnetic fields of intense relativistic charged particle beams. We create a 3D convolutional PCNN to map time-varying current and charge densities J(r,t) and p(r,t) to vector and scalar potentials A(r,t) and V(r,t) from which we generate electromagnetic fields according to Maxwell's equations: B=curl(A), E=-div(V)-dA/dt. Our PCNNs satisfy hard constraints, such as div(B)=0, by construction. Soft constraints push A and V towards satisfying the Lorenz gauge.
    What can be learnt with wide convolutional neural networks?. (arXiv:2208.01003v4 [stat.ML] UPDATED)
    Understanding how convolutional neural networks (CNNs) can efficiently learn high-dimensional functions remains a fundamental challenge. A popular belief is that these models harness the local and hierarchical structure of natural data such as images. Yet, we lack a quantitative understanding of how such structure affects performance, e.g. the rate of decay of the generalisation error with the number of training samples. In this paper, we study deep CNNs in the kernel regime. First, we show that the spectrum of the corresponding kernel inherits the hierarchical structure of the network, and we characterise its asymptotics. Then, we use this result together with generalisation bounds to prove that deep CNNs adapt to the spatial scale of the target function. In particular, we find that if the target function depends on low-dimensional subsets of adjacent input variables, then the rate of decay of the error is controlled by the effective dimensionality of these subsets. Conversely, if the target function depends on the full set of input variables, then the error rate is inversely proportional to the input dimension. We conclude by computing the rate when a deep CNN is trained on the output of another deep CNN with randomly-initialised parameters. Interestingly, we find that, despite their hierarchical structure, the functions generated by deep CNNs are too rich to be efficiently learnable in high dimension.
    A Bias-Variance-Privacy Trilemma for Statistical Estimation. (arXiv:2301.13334v1 [math.ST])
    The canonical algorithm for differentially private mean estimation is to first clip the samples to a bounded range and then add noise to their empirical mean. Clipping controls the sensitivity and, hence, the variance of the noise that we add for privacy. But clipping also introduces statistical bias. We prove that this tradeoff is inherent: no algorithm can simultaneously have low bias, low variance, and low privacy loss for arbitrary distributions. On the positive side, we show that unbiased mean estimation is possible under approximate differential privacy if we assume that the distribution is symmetric. Furthermore, we show that, even if we assume that the data is sampled from a Gaussian, unbiased mean estimation is impossible under pure or concentrated differential privacy.
    Learning Data Representations with Joint Diffusion Models. (arXiv:2301.13622v1 [cs.LG])
    We introduce a joint diffusion model that simultaneously learns meaningful internal representations fit for both generative and predictive tasks. Joint machine learning models that allow synthesizing and classifying data often offer uneven performance between those tasks or are unstable to train. In this work, we depart from a set of empirical observations that indicate the usefulness of internal representations built by contemporary deep diffusion-based generative models in both generative and predictive settings. We then introduce an extension of the vanilla diffusion model with a classifier that allows for stable joint training with shared parametrization between those objectives. The resulting joint diffusion model offers superior performance across various tasks, including generative modeling, semi-supervised classification, and domain adaptation.
    Kernel Stein Discrepancy thinning: a theoretical perspective of pathologies and a practical fix with regularization. (arXiv:2301.13528v1 [math.ST])
    Stein thinning is a promising algorithm proposed by (Riabiz et al., 2022) for post-processing outputs of Markov chain Monte Carlo (MCMC). The main principle is to greedily minimize the kernelized Stein discrepancy (KSD), which only requires the gradient of the log-target distribution, and is thus well-suited for Bayesian inference. The main advantages of Stein thinning are the automatic remove of the burn-in period, the correction of the bias introduced by recent MCMC algorithms, and the asymptotic properties of convergence towards the target distribution. Nevertheless, Stein thinning suffers from several empirical pathologies, which may result in poor approximations, as observed in the literature. In this article, we conduct a theoretical analysis of these pathologies, to clearly identify the mechanisms at stake, and suggest improved strategies. Then, we introduce the regularized Stein thinning algorithm to alleviate the identified pathologies. Finally, theoretical guarantees and extensive experiments show the high efficiency of the proposed algorithm.
    Differentially Private Distributed Bayesian Linear Regression with MCMC. (arXiv:2301.13778v1 [stat.ML])
    We propose a novel Bayesian inference framework for distributed differentially private linear regression. We consider a distributed setting where multiple parties hold parts of the data and share certain summary statistics of their portions in privacy-preserving noise. We develop a novel generative statistical model for privately shared statistics, which exploits a useful distributional relation between the summary statistics of linear regression. Bayesian estimation of the regression coefficients is conducted mainly using Markov chain Monte Carlo algorithms, while we also provide a fast version to perform Bayesian estimation in one iteration. The proposed methods have computational advantages over their competitors. We provide numerical results on both real and simulated data, which demonstrate that the proposed algorithms provide well-rounded estimation and prediction.
    Combinatorial Causal Bandits without Graph Skeleton. (arXiv:2301.13392v1 [cs.LG])
    In combinatorial causal bandits (CCB), the learning agent chooses a subset of variables in each round to intervene and collects feedback from the observed variables to minimize expected regret or sample complexity. Previous works study this problem in both general causal models and binary generalized linear models (BGLMs). However, all of them require prior knowledge of causal graph structure. This paper studies the CCB problem without the graph structure on binary general causal models and BGLMs. We first provide an exponential lower bound of cumulative regrets for the CCB problem on general causal models. To overcome the exponentially large space of parameters, we then consider the CCB problem on BGLMs. We design a regret minimization algorithm for BGLMs even without the graph skeleton and show that it still achieves $O(\sqrt{T}\ln T)$ expected regret. This asymptotic regret is the same as the state-of-art algorithms relying on the graph structure. Moreover, we sacrifice the regret to $O(T^{\frac{2}{3}}\ln T)$ to remove the weight gap covered by the asymptotic notation. At last, we give some discussions and algorithms for pure exploration of the CCB problem without the graph structure.
    Misspecification-robust Sequential Neural Likelihood. (arXiv:2301.13368v1 [stat.ME])
    Simulation-based inference (SBI) techniques are now an essential tool for the parameter estimation of mechanistic and simulatable models with intractable likelihoods. Statistical approaches to SBI such as approximate Bayesian computation and Bayesian synthetic likelihood have been well studied in the well specified and misspecified settings. However, most implementations are inefficient in that many model simulations are wasted. Neural approaches such as sequential neural likelihood (SNL) have been developed that exploit all model simulations to build a surrogate of the likelihood function. However, SNL approaches have been shown to perform poorly under model misspecification. In this paper, we develop a new method for SNL that is robust to model misspecification and can identify areas where the model is deficient. We demonstrate the usefulness of the new approach on several illustrative examples.
    On the Correctness of Automatic Differentiation for Neural Networks with Machine-Representable Parameters. (arXiv:2301.13370v1 [cs.LG])
    Recent work has shown that automatic differentiation over the reals is almost always correct in a mathematically precise sense. However, actual programs work with machine-representable numbers (e.g., floating-point numbers), not reals. In this paper, we study the correctness of automatic differentiation when the parameter space of a neural network consists solely of machine-representable numbers. For a neural network with bias parameters, we prove that automatic differentiation is correct at all parameters where the network is differentiable. In contrast, it is incorrect at all parameters where the network is non-differentiable, since it never informs non-differentiability. To better understand this non-differentiable set of parameters, we prove a tight bound on its size, which is linear in the number of non-differentiabilities in activation functions, and provide a simple necessary and sufficient condition for a parameter to be in this set. We further prove that automatic differentiation always computes a Clarke subderivative, even on the non-differentiable set. We also extend these results to neural networks possibly without bias parameters.
    Probably Anytime-Safe Stochastic Combinatorial Semi-Bandits. (arXiv:2301.13393v1 [cs.LG])
    Motivated by concerns about making online decisions that incur undue amount of risk at each time step, in this paper, we formulate the probably anytime-safe stochastic combinatorial semi-bandits problem. In this problem, the agent is given the option to select a subset of size at most $K$ from a set of $L$ ground items. Each item is associated to a certain mean reward as well as a variance that represents its risk. To mitigate the risk that the agent incurs, we require that with probability at least $1-\delta$, over the entire horizon of time $T$, each of the choices that the agent makes should contain items whose sum of variances does not exceed a certain variance budget. We call this probably anytime-safe constraint. Under this constraint, we design and analyze an algorithm {\sc PASCombUCB} that minimizes the regret over the horizon of time $T$. By developing accompanying information-theoretic lower bounds, we show under both the problem-dependent and problem-independent paradigms, {\sc PASCombUCB} is almost asymptotically optimal. Our problem setup, the proposed {\sc PASCombUCB} algorithm, and novel analyses are applicable to domains such as recommendation systems and transportation in which an agent is allowed to choose multiple items at a single time step and wishes to control the risk over the whole time horizon.
    Limitations of Information-Theoretic Generalization Bounds for Gradient Descent Methods in Stochastic Convex Optimization. (arXiv:2212.13556v2 [cs.LG] UPDATED)
    To date, no "information-theoretic" frameworks for reasoning about generalization error have been shown to establish minimax rates for gradient descent in the setting of stochastic convex optimization. In this work, we consider the prospect of establishing such rates via several existing information-theoretic frameworks: input-output mutual information bounds, conditional mutual information bounds and variants, PAC-Bayes bounds, and recent conditional variants thereof. We prove that none of these bounds are able to establish minimax rates. We then consider a common tactic employed in studying gradient methods, whereby the final iterate is corrupted by Gaussian noise, producing a noisy "surrogate" algorithm. We prove that minimax rates cannot be established via the analysis of such surrogates. Our results suggest that new ideas are required to analyze gradient descent using information-theoretic techniques.
    Simplex Random Features. (arXiv:2301.13856v1 [stat.ML])
    We present Simplex Random Features (SimRFs), a new random feature (RF) mechanism for unbiased approximation of the softmax and Gaussian kernels by geometrical correlation of random projection vectors. We prove that SimRFs provide the smallest possible mean square error (MSE) on unbiased estimates of these kernels among the class of weight-independent geometrically-coupled positive random feature (PRF) mechanisms, substantially outperforming the previously most accurate Orthogonal Random Features at no observable extra cost. We present a more computationally expensive SimRFs+ variant, which we prove is asymptotically optimal in the broader family of weight-dependent geometrical coupling schemes (which permit correlations between random vector directions and norms). In extensive empirical studies, we show consistent gains provided by SimRFs in settings including pointwise kernel estimation, nonparametric classification and scalable Transformers.
    Sequential Kernelized Independence Testing. (arXiv:2212.07383v2 [stat.ML] UPDATED)
    Independence testing is a fundamental and classical statistical problem that has been extensively studied in the batch setting when one fixes the sample size before collecting data. However, practitioners often prefer procedures that adapt to the complexity of a problem at hand instead of setting sample size in advance. Ideally, such procedures should (a) allow stopping earlier on easy tasks (and later on harder tasks), hence making better use of available resources, and (b) continuously monitor the data and efficiently incorporate statistical evidence after collecting new data, while controlling the false alarm rate. It is well known that classical batch tests are not tailored for streaming data settings: valid inference after data peeking requires correcting for multiple testing but such corrections generally result in low power. Following the principle of testing by betting, we design sequential kernelized independence tests (SKITs) that overcome such shortcomings. We exemplify our broad framework using bets inspired by kernelized dependence measures, e.g, the Hilbert-Schmidt independence criterion. Our test is valid under non-i.i.d. time-varying settings, for which there exist no batch tests. We demonstrate the power of our approaches on both simulated and real data.
    Differentially Private Kernel Inducing Points (DP-KIP) for Privacy-preserving Data Distillation. (arXiv:2301.13389v1 [cs.LG])
    While it is tempting to believe that data distillation preserves privacy, distilled data's empirical robustness against known attacks does not imply a provable privacy guarantee. Here, we develop a provably privacy-preserving data distillation algorithm, called differentially private kernel inducing points (DP-KIP). DP-KIP is an instantiation of DP-SGD on kernel ridge regression (KRR). Following a recent work, we use neural tangent kernels and minimize the KRR loss to estimate the distilled datapoints (i.e., kernel inducing points). We provide a computationally efficient JAX implementation of DP-KIP, which we test on several popular image and tabular datasets to show its efficacy in data distillation with differential privacy guarantees.
    Population-wise Labeling of Sulcal Graphs using Multi-graph Matching. (arXiv:2301.13532v1 [stat.ML])
    Population-wise matching of the cortical fold is necessary to identify biomarkers of neurological or psychiatric disorders. The difficulty comes from the massive interindividual variations in the morphology and spatial organization of the folds. This task is challenging at both methodological and conceptual levels. In the widely used registration-based techniques, these variations are considered as noise and the matching of folds is only implicit. Alternative approaches are based on the extraction and explicit identification of the cortical folds. In particular, representing cortical folding patterns as graphs of sulcal basins-termed sulcal graphs-enables to formalize the task as a graph-matching problem. In this paper, we propose to address the problem of sulcal graph matching directly at the population level using multi-graph matching techniques. First, we motivate the relevance of multi-graph matching framework in this context. We then introduce a procedure to generate populations of artificial sulcal graphs, which allows us benchmarking several state of the art multi-graph matching methods. Our results on both artificial and real data demonstrate the effectiveness of multi-graph matching techniques to obtain a population-wise consistent labeling of cortical folds at the sulcal basins level.
    Personalized Decentralized Bilevel Optimization over Random Directed Networks. (arXiv:2210.02129v2 [stat.ML] UPDATED)
    Personalization and decentralization are two major lines of studies to realize practical federated learning in the real world. The aim of this study is to establish a general and unified approach that can solve these two problems simultaneously. In this work, we first propose a bilevel problem that can adapt to various personalization scenarios by allowing an arbitrary choice of two parameters: a client-wise outer-parameter representing heterogeneity, and a shared inner-parameter representing homogeneity across client data distributions. We then present an algorithm that can solve this bilevel problem in a decentralized manner by estimating gradients of clients' outer-costs with respect to their outer-parameters. We show that the proposed algorithm can be extended to handle a random directed network, which is one of the most robust decentralized communication classes. The proposed method achieves state-of-the-art performance on a personalization benchmark across various communication settings.
    Bayesian Learning for Neural Networks: an algorithmic survey. (arXiv:2211.11865v4 [stat.ML] UPDATED)
    The last decade witnessed a growing interest in Bayesian learning. Yet, the technicality of the topic and the multitude of ingredients involved therein, besides the complexity of turning theory into practical implementations, limit the use of the Bayesian learning paradigm, preventing its widespread adoption across different fields and applications. This self-contained survey engages and introduces readers to the principles and algorithms of Bayesian Learning for Neural Networks. It provides an introduction to the topic from an accessible, practical-algorithmic perspective. Upon providing a general introduction to Bayesian Neural Networks, we discuss and present both standard and recent approaches for Bayesian inference, with an emphasis on solutions relying on Variational Inference and the use of Natural gradients. We also discuss the use of manifold optimization as a state-of-the-art approach to Bayesian learning. We examine the characteristic properties of all the discussed methods, and provide pseudo-codes for their implementation, paying attention to practical aspects, such as the computation of the gradients.
    Sharp Variance-Dependent Bounds in Reinforcement Learning: Best of Both Worlds in Stochastic and Deterministic Environments. (arXiv:2301.13446v1 [cs.LG])
    We study variance-dependent regret bounds for Markov decision processes (MDPs). Algorithms with variance-dependent regret guarantees can automatically exploit environments with low variance (e.g., enjoying constant regret on deterministic MDPs). The existing algorithms are either variance-independent or suboptimal. We first propose two new environment norms to characterize the fine-grained variance properties of the environment. For model-based methods, we design a variant of the MVP algorithm (Zhang et al., 2021a) and use new analysis techniques show to this algorithm enjoys variance-dependent bounds with respect to our proposed norms. In particular, this bound is simultaneously minimax optimal for both stochastic and deterministic MDPs, the first result of its kind. We further initiate the study on model-free algorithms with variance-dependent regret bounds by designing a reference-function-based algorithm with a novel capped-doubling reference update schedule. Lastly, we also provide lower bounds to complement our upper bounds.  ( 2 min )
    Continuous Soft Pseudo-Labeling in ASR. (arXiv:2211.06007v2 [cs.LG] UPDATED)
    Continuous pseudo-labeling (PL) algorithms such as slimIPL have recently emerged as a powerful strategy for semi-supervised learning in speech recognition. In contrast with earlier strategies that alternated between training a model and generating pseudo-labels (PLs) with it, here PLs are generated in end-to-end manner as training proceeds, improving training speed and the accuracy of the final model. PL shares a common theme with teacher-student models such as distillation in that a teacher model generates targets that need to be mimicked by the student model being trained. However, interestingly, PL strategies in general use hard-labels, whereas distillation uses the distribution over labels as the target to mimic. Inspired by distillation we expect that specifying the whole distribution (aka soft-labels) over sequences as the target for unlabeled data, instead of a single best pass pseudo-labeled transcript (hard-labels) should improve PL performance and convergence. Surprisingly and unexpectedly, we find that soft-labels targets can lead to training divergence, with the model collapsing to a degenerate token distribution per frame. We hypothesize that the reason this does not happen with hard-labels is that training loss on hard-labels imposes sequence-level consistency that keeps the model from collapsing to the degenerate solution. In this paper, we show several experiments that support this hypothesis, and experiment with several regularization approaches that can ameliorate the degenerate collapse when using soft-labels. These approaches can bring the accuracy of soft-labels closer to that of hard-labels, and while they are unable to outperform them yet, they serve as a useful framework for further improvements.  ( 2 min )
    Scaling laws for single-agent reinforcement learning. (arXiv:2301.13442v1 [cs.LG])
    Recent work has shown that, in generative modeling, cross-entropy loss improves smoothly with model size and training compute, following a power law plus constant scaling law. One challenge in extending these results to reinforcement learning is that the main performance objective of interest, mean episode return, need not vary smoothly. To overcome this, we introduce *intrinsic performance*, a monotonic function of the return defined as the minimum compute required to achieve the given return across a family of models of different sizes. We find that, across a range of environments, intrinsic performance scales as a power law in model size and environment interactions. Consequently, as in generative modeling, the optimal model size scales as a power law in the training compute budget. Furthermore, we study how this relationship varies with the environment and with other properties of the training setup. In particular, using a toy MNIST-based environment, we show that varying the "horizon length" of the task mostly changes the coefficient but not the exponent of this relationship.  ( 2 min )
    Causal Estimation for Text Data with (Apparent) Overlap Violations. (arXiv:2210.00079v2 [stat.ML] UPDATED)
    Consider the problem of estimating the causal effect of some attribute of a text document; for example: what effect does writing a polite vs. rude email have on response time? To estimate a causal effect from observational data, we need to adjust for confounding aspects of the text that affect both the treatment and outcome -- e.g., the topic or writing level of the text. These confounding aspects are unknown a priori, so it seems natural to adjust for the entirety of the text (e.g., using a transformer). However, causal identification and estimation procedures rely on the assumption of overlap: for all levels of the adjustment variables, there is randomness leftover so that every unit could have (not) received treatment. Since the treatment here is itself an attribute of the text, it is perfectly determined, and overlap is apparently violated. The purpose of this paper is to show how to handle causal identification and obtain robust causal estimation in the presence of apparent overlap violations. In brief, the idea is to use supervised representation learning to produce a data representation that preserves confounding information while eliminating information that is only predictive of the treatment. This representation then suffices for adjustment and can satisfy overlap. Adapting results on non-parametric estimation, we find that this procedure is robust to conditional outcome misestimation, yielding a low-bias estimator with valid uncertainty quantification under weak conditions. Empirical results show strong improvements in bias and uncertainty quantification relative to the natural baseline.  ( 2 min )
    Unifying Generative Models with GFlowNets and Beyond. (arXiv:2209.02606v2 [cs.LG] UPDATED)
    There are many frameworks for deep generative modeling, each often presented with their own specific training algorithms and inference methods. Here, we demonstrate the connections between existing deep generative models and the recently introduced GFlowNet framework, a probabilistic inference machine which treats sampling as a decision-making process. This analysis sheds light on their overlapping traits and provides a unifying viewpoint through the lens of learning with Markovian trajectories. Our framework provides a means for unifying training and inference algorithms, and provides a route to shine a unifying light over many generative models. Beyond this, we provide a practical and experimentally verified recipe for improving generative modeling with insights from the GFlowNet perspective.  ( 2 min )
    Understanding Self-Distillation in the Presence of Label Noise. (arXiv:2301.13304v1 [cs.LG])
    Self-distillation (SD) is the process of first training a \enquote{teacher} model and then using its predictions to train a \enquote{student} model with the \textit{same} architecture. Specifically, the student's objective function is $\big(\xi*\ell(\text{teacher's predictions}, \text{ student's predictions}) + (1-\xi)*\ell(\text{given labels}, \text{ student's predictions})\big)$, where $\ell$ is some loss function and $\xi$ is some parameter $\in [0,1]$. Empirically, SD has been observed to provide performance gains in several settings. In this paper, we theoretically characterize the effect of SD in two supervised learning problems with \textit{noisy labels}. We first analyze SD for regularized linear regression and show that in the high label noise regime, the optimal value of $\xi$ that minimizes the expected error in estimating the ground truth parameter is surprisingly greater than 1. Empirically, we show that $\xi > 1$ works better than $\xi \leq 1$ even with the cross-entropy loss for several classification datasets when 50\% or 30\% of the labels are corrupted. Further, we quantify when optimal SD is better than optimal regularization. Next, we analyze SD in the case of logistic regression for binary classification with random label corruption and quantify the range of label corruption in which the student outperforms the teacher in terms of accuracy. To our knowledge, this is the first result of its kind for the cross-entropy loss.  ( 2 min )
    Hierarchically Clustered PCA, LLE, and CCA via a Convex Clustering Penalty. (arXiv:2211.16553v2 [cs.LG] UPDATED)
    We introduce an unsupervised learning approach that combines the truncated singular value decomposition with convex clustering to estimate within-cluster directions of maximum variance/covariance (in the variables) while simultaneously hierarchically clustering (on observations). In contrast to previous work on joint clustering and embedding, our approach has a straightforward formulation, is readily scalable via distributed optimization, and admits a direct interpretation as hierarchically clustered principal component analysis (PCA), hierarchically clustered locally linear embedding (LLE), or hierarchically clustered canonical correlation analysis (CCA). Through numerical experiments and real-world examples relevant to precision medicine, we show that our approach outperforms traditional and contemporary clustering methods on both underdetermined problems ($p \gg N$ with tens of observations) and on large datasets (e.g., $N=100,000$) while yielding interpretable dendrograms of hierarchical per-cluster principal components or canonical variates.  ( 2 min )
    Learning Against Distributional Uncertainty: On the Trade-off Between Robustness and Specificity. (arXiv:2301.13565v1 [cs.LG])
    Trustworthy machine learning aims at combating distributional uncertainties in training data distributions compared to population distributions. Typical treatment frameworks include the Bayesian approach, (min-max) distributionally robust optimization (DRO), and regularization. However, two issues have to be raised: 1) All these methods are biased estimators of the true optimal cost; 2) the prior distribution in the Bayesian method, the radius of the distributional ball in the DRO method, and the regularizer in the regularization method are difficult to specify. This paper studies a new framework that unifies the three approaches and that addresses the two challenges mentioned above. The asymptotic properties (e.g., consistency and asymptotic normalities), non-asymptotic properties (e.g., unbiasedness and generalization error bound), and a Monte--Carlo-based solution method of the proposed model are studied. The new model reveals the trade-off between the robustness to the unseen data and the specificity to the training data.  ( 2 min )
    Exploring QSAR Models for Activity-Cliff Prediction. (arXiv:2301.13644v1 [cs.LG])
    Pairs of similar compounds that only differ by a small structural modification but exhibit a large difference in their binding affinity for a given target are known as activity cliffs (ACs). It has been hypothesised that quantitative structure-activity relationship (QSAR) models struggle to predict ACs and that ACs thus form a major source of prediction error. However, a study to explore the AC-prediction power of modern QSAR methods and its relationship to general QSAR-prediction performance is lacking. We systematically construct nine distinct QSAR models by combining three molecular representation methods (extended-connectivity fingerprints, physicochemical-descriptor vectors and graph isomorphism networks) with three regression techniques (random forests, k-nearest neighbours and multilayer perceptrons); we then use each resulting model to classify pairs of similar compounds as ACs or non-ACs and to predict the activities of individual molecules in three case studies: dopamine receptor D2, factor Xa, and SARS-CoV-2 main protease. We observe low AC-sensitivity amongst the tested models when the activities of both compounds are unknown, but a substantial increase in AC-sensitivity when the actual activity of one of the compounds is given. Graph isomorphism features are found to be competitive with or superior to classical molecular representations for AC-classification and can thus be employed as baseline AC-prediction models or simple compound-optimisation tools. For general QSAR-prediction, however, extended-connectivity fingerprints still consistently deliver the best performance. Our results provide strong support for the hypothesis that indeed QSAR methods frequently fail to predict ACs. We propose twin-network training for deep learning models as a potential future pathway to increase AC-sensitivity and thus overall QSAR performance.
    Bayesian Calibration of Imperfect Computer Models using Physics-Informed Priors. (arXiv:2201.06463v4 [stat.ML] UPDATED)
    We introduce a computational efficient data-driven framework suitable for quantifying the uncertainty in physical parameters and model formulation of computer models, represented by differential equations. We construct physics-informed priors, which are multi-output GP priors that encode the model's structure in the covariance function. This is extended into a fully Bayesian framework that quantifies the uncertainty of physical parameters and model predictions. Since physical models often are imperfect descriptions of the real process, we allow the model to deviate from the observed data by considering a discrepancy function. For inference, Hamiltonian Monte Carlo is used. Further, approximations for big data are developed that reduce the computational complexity from $\mathcal{O}(N^3)$ to $\mathcal{O}(N\cdot m^2),$ where $m \ll N.$ Our approach is demonstrated in simulation and real data case studies where the physics are described by time-dependent ODEs describe (cardiovascular models) and space-time dependent PDEs (heat equation). In the studies, it is shown that our modelling framework can recover the true parameters of the physical models in cases where 1) the reality is more complex than our modelling choice and 2) the data acquisition process is biased while also producing accurate predictions. Furthermore, it is demonstrated that our approach is computationally faster than traditional Bayesian calibration methods.  ( 2 min )
    An $l_1$-oracle inequality for the Lasso in high-dimensional mixtures of experts models. (arXiv:2009.10622v5 [math.ST] UPDATED)
    Mixtures of experts (MoE) models are a popular framework for modeling heterogeneity in data, for both regression and classification problems in statistics and machine learning, due to their flexibility and the abundance of available statistical estimation and model choice tools. Such flexibility comes from allowing the mixture weights (or gating functions) in the MoE model to depend on the explanatory variables, along with the experts (or component densities). This permits the modeling of data arising from more complex data generating processes when compared to the classical finite mixtures and finite mixtures of regression models, whose mixing parameters are independent of the covariates. The use of MoE models in a high-dimensional setting, when the number of explanatory variables can be much larger than the sample size, is challenging from a computational point of view, and in particular from a theoretical point of view, where the literature is still lacking results for dealing with the curse of dimensionality, for both the statistical estimation and feature selection problems. We consider the finite MoE model with soft-max gating functions and Gaussian experts for high-dimensional regression on heterogeneous data, and its $l_1$-regularized estimation via the Lasso. We focus on the Lasso estimation properties rather than its feature selection properties. We provide a lower bound on the regularization parameter of the Lasso function that ensures an $l_1$-oracle inequality satisfied by the Lasso estimator according to the Kullback--Leibler loss.  ( 2 min )
    Optimal Transport Perturbations for Safe Reinforcement Learning with Robustness Guarantees. (arXiv:2301.13375v1 [cs.LG])
    Robustness and safety are critical for the trustworthy deployment of deep reinforcement learning in real-world decision making applications. In particular, we require algorithms that can guarantee robust, safe performance in the presence of general environment disturbances, while making limited assumptions on the data collection process during training. In this work, we propose a safe reinforcement learning framework with robustness guarantees through the use of an optimal transport cost uncertainty set. We provide an efficient, theoretically supported implementation based on Optimal Transport Perturbations, which can be applied in a completely offline fashion using only data collected in a nominal training environment. We demonstrate the robust, safe performance of our approach on a variety of continuous control tasks with safety constraints in the Real-World Reinforcement Learning Suite.  ( 2 min )
    A Unified Causal View of Domain Invariant Representation Learning. (arXiv:2208.06987v3 [stat.ML] UPDATED)
    Machine learning methods can be unreliable when deployed in domains that differ from the domains on which they were trained. There are a wide range of proposals for mitigating this problem by learning representations that are ``invariant'' in some sense.However, these methods generally contradict each other, and none of them consistently improve performance on real-world domain shift benchmarks. There are two main questions that must be addressed to understand when, if ever, we should use each method. First, how does each ad hoc notion of ``invariance'' relate to the structure of real-world problems? And, second, when does learning invariant representations actually yield robust models? To address these issues, we introduce a broad formal notion of what it means for a real-world domain shift to admit invariant structure. Then, we characterize the causal structures that are compatible with this notion of invariance.With this in hand, we find conditions under which method-specific invariance notions correspond to real-world invariant structure, and we clarify the relationship between invariant structure and robustness to domain shifts. For both questions, we find that the true underlying causal structure of the data plays a critical role.  ( 2 min )
    DoubleML -- An Object-Oriented Implementation of Double Machine Learning in R. (arXiv:2103.09603v4 [stat.ML] UPDATED)
    The R package DoubleML implements the double/debiased machine learning framework of Chernozhukov et al. (2018). It provides functionalities to estimate parameters in causal models based on machine learning methods. The double machine learning framework consist of three key ingredients: Neyman orthogonality, high-quality machine learning estimation and sample splitting. Estimation of nuisance components can be performed by various state-of-the-art machine learning methods that are available in the mlr3 ecosystem. DoubleML makes it possible to perform inference in a variety of causal models, including partially linear and interactive regression models and their extensions to instrumental variable estimation. The object-oriented implementation of DoubleML enables a high flexibility for the model specification and makes it easily extendable. This paper serves as an introduction to the double machine learning framework and the R package DoubleML. In reproducible code examples with simulated and real data sets, we demonstrate how DoubleML users can perform valid inference based on machine learning methods.  ( 2 min )
    Robust Linear Regression: Gradient-descent, Early-stopping, and Beyond. (arXiv:2301.13486v1 [stat.ML])
    In this work we study the robustness to adversarial attacks, of early-stopping strategies on gradient-descent (GD) methods for linear regression. More precisely, we show that early-stopped GD is optimally robust (up to an absolute constant) against Euclidean-norm adversarial attacks. However, we show that this strategy can be arbitrarily sub-optimal in the case of general Mahalanobis attacks. This observation is compatible with recent findings in the case of classification~\cite{Vardi2022GradientMP} that show that GD provably converges to non-robust models. To alleviate this issue, we propose to apply instead a GD scheme on a transformation of the data adapted to the attack. This data transformation amounts to apply feature-depending learning rates and we show that this modified GD is able to handle any Mahalanobis attack, as well as more general attacks under some conditions. Unfortunately, choosing such adapted transformations can be hard for general attacks. To the rescue, we design a simple and tractable estimator whose adversarial risk is optimal up to within a multiplicative constant of 1.1124 in the population regime, and works for any norm.  ( 2 min )
    Learning in POMDPs is Sample-Efficient with Hindsight Observability. (arXiv:2301.13857v1 [cs.LG])
    POMDPs capture a broad class of decision making problems, but hardness results suggest that learning is intractable even in simple settings due to the inherent partial observability. However, in many realistic problems, more information is either revealed or can be computed during some point of the learning process. Motivated by diverse applications ranging from robotics to data center scheduling, we formulate a \setting (\setshort) as a POMDP where the latent states are revealed to the learner in hindsight and only during training. We introduce new algorithms for the tabular and function approximation settings that are provably sample-efficient with hindsight observability, even in POMDPs that would otherwise be statistically intractable. We give a lower bound showing that the tabular algorithm is optimal in its dependence on latent state and observation cardinalities.  ( 2 min )
    Gaussian Noise is Nearly Instance Optimal for Private Unbiased Mean Estimation. (arXiv:2301.13850v1 [math.ST])
    We investigate unbiased high-dimensional mean estimators in differential privacy. We consider differentially private mechanisms whose expected output equals the mean of the input dataset, for every dataset drawn from a fixed convex domain $K$ in $\mathbb{R}^d$. In the setting of concentrated differential privacy, we show that, for every input such an unbiased mean estimator introduces approximately at least as much error as a mechanism that adds Gaussian noise with a carefully chosen covariance. This is true when the error is measured with respect to $\ell_p$ error for any $p \ge 2$. We extend this result to local differential privacy, and to approximate differential privacy, but for the latter the error lower bound holds either for a dataset or for a neighboring dataset. We also extend our results to mechanisms that take i.i.d.~samples from a distribution over $K$ and are unbiased with respect to the mean of the distribution.  ( 2 min )
    On the Statistical Benefits of Temporal Difference Learning. (arXiv:2301.13289v1 [cs.LG])
    Given a dataset on actions and resulting long-term rewards, a direct estimation approach fits value functions that minimize prediction error on the training data. Temporal difference learning (TD) methods instead fit value functions by minimizing the degree of temporal inconsistency between estimates made at successive time-steps. Focusing on finite state Markov chains, we provide a crisp asymptotic theory of the statistical advantages of this approach. First, we show that an intuitive inverse trajectory pooling coefficient completely characterizes the percent reduction in mean-squared error of value estimates. Depending on problem structure, the reduction could be enormous or nonexistent. Next, we prove that there can be dramatic improvements in estimates of the difference in value-to-go for two states: TD's errors are bounded in terms of a novel measure - the problem's trajectory crossing time - which can be much smaller than the problem's time horizon.  ( 2 min )
    Fairness and Accuracy under Domain Generalization. (arXiv:2301.13323v1 [cs.LG])
    As machine learning (ML) algorithms are increasingly used in high-stakes applications, concerns have arisen that they may be biased against certain social groups. Although many approaches have been proposed to make ML models fair, they typically rely on the assumption that data distributions in training and deployment are identical. Unfortunately, this is commonly violated in practice and a model that is fair during training may lead to an unexpected outcome during its deployment. Although the problem of designing robust ML models under dataset shifts has been widely studied, most existing works focus only on the transfer of accuracy. In this paper, we study the transfer of both fairness and accuracy under domain generalization where the data at test time may be sampled from never-before-seen domains. We first develop theoretical bounds on the unfairness and expected loss at deployment, and then derive sufficient conditions under which fairness and accuracy can be perfectly transferred via invariant representation learning. Guided by this, we design a learning algorithm such that fair ML models learned with training data still have high fairness and accuracy when deployment environments change. Experiments on real-world data validate the proposed algorithm. Model implementation is available at https://github.com/pth1993/FATDM.  ( 2 min )
    Zero-shot-Learning Cross-Modality Data Translation Through Mutual Information Guided Stochastic Diffusion. (arXiv:2301.13743v1 [cs.CV])
    Cross-modality data translation has attracted great interest in image computing. Deep generative models (\textit{e.g.}, GANs) show performance improvement in tackling those problems. Nevertheless, as a fundamental challenge in image translation, the problem of Zero-shot-Learning Cross-Modality Data Translation with fidelity remains unanswered. This paper proposes a new unsupervised zero-shot-learning method named Mutual Information guided Diffusion cross-modality data translation Model (MIDiffusion), which learns to translate the unseen source data to the target domain. The MIDiffusion leverages a score-matching-based generative model, which learns the prior knowledge in the target domain. We propose a differentiable local-wise-MI-Layer ($LMI$) for conditioning the iterative denoising sampling. The $LMI$ captures the identical cross-modality features in the statistical domain for the diffusion guidance; thus, our method does not require retraining when the source domain is changed, as it does not rely on any direct mapping between the source and target domains. This advantage is critical for applying cross-modality data translation methods in practice, as a reasonable amount of source domain dataset is not always available for supervised training. We empirically show the advanced performance of MIDiffusion in comparison with an influential group of generative models, including adversarial-based and other score-matching-based models.  ( 2 min )
    On the Initialisation of Wide Low-Rank Feedforward Neural Networks. (arXiv:2301.13710v1 [stat.ML])
    The edge-of-chaos dynamics of wide randomly initialized low-rank feedforward networks are analyzed. Formulae for the optimal weight and bias variances are extended from the full-rank to low-rank setting and are shown to follow from multiplicative scaling. The principle second order effect, the variance of the input-output Jacobian, is derived and shown to increase as the rank to width ratio decreases. These results inform practitioners how to randomly initialize feedforward networks with a reduced number of learnable parameters while in the same ambient dimension, allowing reductions in the computational cost and memory constraints of the associated network.  ( 2 min )
    The passive symmetries of machine learning. (arXiv:2301.13724v1 [stat.ML])
    Any representation of data involves arbitrary investigator choices. Because those choices are external to the data-generating process, each choice leads to an exact symmetry, corresponding to the group of transformations that takes one possible representation to another. These are the passive symmetries; they include coordinate freedom, gauge symmetry and units covariance, all of which have led to important results in physics. Our goal is to understand the implications of passive symmetries for machine learning: Which passive symmetries play a role (e.g., permutation symmetry in graph neural networks)? What are dos and don'ts in machine learning practice? We assay conditions under which passive symmetries can be implemented as group equivariances. We also discuss links to causal modeling, and argue that the implementation of passive symmetries is particularly valuable when the goal of the learning problem is to generalize out of sample. While this paper is purely conceptual, we believe that it can have a significant impact on helping machine learning make the transition that took place for modern physics in the first half of the Twentieth century.  ( 2 min )
    Variational sparse inverse Cholesky approximation for latent Gaussian processes via double Kullback-Leibler minimization. (arXiv:2301.13303v1 [stat.ML])
    To achieve scalable and accurate inference for latent Gaussian processes, we propose a variational approximation based on a family of Gaussian distributions whose covariance matrices have sparse inverse Cholesky (SIC) factors. We combine this variational approximation of the posterior with a similar and efficient SIC-restricted Kullback-Leibler-optimal approximation of the prior. We then focus on a particular SIC ordering and nearest-neighbor-based sparsity pattern resulting in highly accurate prior and posterior approximations. For this setting, our variational approximation can be computed via stochastic gradient descent in polylogarithmic time per iteration. We provide numerical comparisons showing that the proposed double-Kullback-Leibler-optimal Gaussian-process approximation (DKLGP) can sometimes be vastly more accurate than alternative approaches such as inducing-point and mean-field approximations at similar computational complexity.  ( 2 min )
    Fast Optimal Estimation with Intractable Models using Permutation-Invariant Neural Networks. (arXiv:2208.12942v2 [stat.ME] UPDATED)
    Neural networks have recently shown promise for likelihood-free inference, providing orders-of-magnitude speed-ups over classical methods. However, current implementations are suboptimal when estimating parameters from independent replicates. In this paper, we use a decision-theoretic framework to argue that permutation-invariant neural networks are ideally placed for constructing Bayes estimators for arbitrary models, provided that simulation from these models is straightforward. We show that the resulting neural Bayes estimators can quickly and optimally estimate parameters in weakly-identified and highly-parameterised models with relative ease, and that they are highly competitive and much faster than traditional likelihood-based estimators. We apply our estimator on a spatial analysis of sea-surface temperature in the Red Sea where, after training, we obtain parameter estimates, and uncertainty quantification of the estimates via bootstrap sampling, from hundreds of spatial fields in a fraction of a second.  ( 2 min )
    Optimal precision for GANs. (arXiv:2207.10541v2 [cs.LG] UPDATED)
    Many deep generative models are defined as a push-forward of a Gaussian measure by a continuous generator, such as Generative Adversarial Networks (GANs) or Variational Auto-Encoders (VAEs). This work explores the latent space of such deep generative models. A key issue with these models is their tendency to output samples outside of the support of the target distribution when learning disconnected distributions. We investigate the relationship between the performance of these models and the geometry of their latent space. Building on recent developments in geometric measure theory, we prove a sufficient condition for optimality in the case where the dimension of the latent space is larger than the number of modes. Through experiments on GANs, we demonstrate the validity of our theoretical results and gain new insights into the latent space geometry of these models. Additionally, we propose a truncation method that enforces a simplicial cluster structure in the latent space and improves the performance of GANs.  ( 2 min )
    Stationary Kernels and Gaussian Processes on Lie Groups and their Homogeneous Spaces I: the compact case. (arXiv:2208.14960v2 [stat.ME] UPDATED)
    Gaussian processes are arguably the most important model class in spatial statistics. They encode prior information about the modeled function and can be used for exact or approximate Bayesian inference. In many applications, particularly in physical sciences and engineering, but also in areas such as geostatistics and neuroscience, invariance to symmetries is one of the most fundamental forms of prior information one can consider. The invariance of a Gaussian process' covariance to such symmetries gives rise to the most natural generalization of the concept of stationarity to such spaces. In this work, we develop constructive and practical techniques for building stationary Gaussian processes on a very large class of non-Euclidean spaces arising in the context of symmetries. Our techniques make it possible to (i) calculate covariance kernels and (ii) sample from prior and posterior Gaussian processes defined on such spaces, both in a practical manner. This work is split into two parts, each involving different technical considerations: part I studies compact spaces, while part II studies non-compact spaces possessing certain structure. Our contributions make the non-Euclidean Gaussian process models we study compatible with well-understood computational techniques available in standard Gaussian process software packages, thereby making them accessible to practitioners.  ( 2 min )
    Demystifying Disagreement-on-the-Line in High Dimensions. (arXiv:2301.13371v1 [stat.ML])
    Evaluating the performance of machine learning models under distribution shift is challenging, especially when we only have unlabeled data from the shifted (target) domain, along with labeled data from the original (source) domain. Recent work suggests that the notion of disagreement, the degree to which two models trained with different randomness differ on the same input, is a key to tackle this problem. Experimentally, disagreement and prediction error have been shown to be strongly connected, which has been used to estimate model performance. Experiments have lead to the discovery of the disagreement-on-the-line phenomenon, whereby the classification error under the target domain is often a linear function of the classification error under the source domain; and whenever this property holds, disagreement under the source and target domain follow the same linear relation. In this work, we develop a theoretical foundation for analyzing disagreement in high-dimensional random features regression; and study under what conditions the disagreement-on-the-line phenomenon occurs in our setting. Experiments on CIFAR-10-C, Tiny ImageNet-C, and Camelyon17 are consistent with our theory and support the universality of the theoretical findings.  ( 2 min )
    Structure Learning and Parameter Estimation for Graphical Models via Penalized Maximum Likelihood Methods. (arXiv:2301.13269v1 [stat.ML])
    Probabilistic graphical models (PGMs) provide a compact and flexible framework to model very complex real-life phenomena. They combine the probability theory which deals with uncertainty and logical structure represented by a graph which allows one to cope with the computational complexity and also interpret and communicate the obtained knowledge. In the thesis, we consider two different types of PGMs: Bayesian networks (BNs) which are static, and continuous time Bayesian networks which, as the name suggests, have a temporal component. We are interested in recovering their true structure, which is the first step in learning any PGM. This is a challenging task, which is interesting in itself from the causal point of view, for the purposes of interpretation of the model and the decision-making process. All approaches for structure learning in the thesis are united by the same idea of maximum likelihood estimation with the LASSO penalty. The problem of structure learning is reduced to the problem of finding non-zero coefficients in the LASSO estimator for a generalized linear model. In the case of CTBNs, we consider the problem both for complete and incomplete data. We support the theoretical results with experiments.  ( 2 min )
    Near Optimal Private and Robust Linear Regression. (arXiv:2301.13273v1 [cs.LG])
    We study the canonical statistical estimation problem of linear regression from $n$ i.i.d.~examples under $(\varepsilon,\delta)$-differential privacy when some response variables are adversarially corrupted. We propose a variant of the popular differentially private stochastic gradient descent (DP-SGD) algorithm with two innovations: a full-batch gradient descent to improve sample complexity and a novel adaptive clipping to guarantee robustness. When there is no adversarial corruption, this algorithm improves upon the existing state-of-the-art approach and achieves a near optimal sample complexity. Under label-corruption, this is the first efficient linear regression algorithm to guarantee both $(\varepsilon,\delta)$-DP and robustness. Synthetic experiments confirm the superiority of our approach.  ( 2 min )

  • Open

    Reinforcement Learning to Control a 2D Quadcopter
    submitted by /u/Alyx1337 [link] [comments]  ( 40 min )
    Adding a Bonus to Q-function
    Hey, I'm trying to apply RL, precisely DQN, to recommender systems. I want to add a weighted bonus to my Q-function to favor some actions over others. This bonus is a function that depends on (s,a) just like the Q-function. Then, instead of deriving a greedy policy from the original Q, I derive a greedy policy for the bonused function Q'. For some reasons, I would really like to do this instead instead of modifying the reward function itself. My approach works well empirically, however I would want it to be more theoretically grounded. I think what I want ideally is to prove that Q' is optimal for some MDP with a slightly different reward function. And I surprisingly can't find papers doing that in the literature. There are some seemingly related things like Soft Actor Critic which adds an entropy bonus to the value function, but not much more. So, is it ok to tweak a Q-function like this? Is there something I should be careful of? Thank you for your help! submitted by /u/xalendrio [link] [comments]  ( 42 min )
    Multi-Agents Soccer Competition ⚽ (Deep Reinforcement Learning Course by Hugging Face 🤗)
    Hey there 👋 We published the ⚔️ AI vs. AI challenge⚔️, a deep reinforcement learning multi-agents competition. You’ll learn about Multi-agent Reinforcement Learning (MARL), you’ll train your agents to play soccer and you’re going to participate in AI vs. AI challenge where your trained agent will compete against other classmates’ agents every day and be ranked on a new leaderboard. You don’t need to participate in the course to be able to participate in the competition. You can start here 👉 https://huggingface.co/deep-rl-course/unit7/introduction 🏆 The leaderboard 👉 https://huggingface.co/spaces/huggingface-projects/AIvsAI-SoccerTwos 👀 Visualize your agent competing with our demo 👉https://huggingface.co/spaces/unity/SoccerTwos We also created a discord channel, ai-vs-ai-competition to exchange with others and share advice, you can join our discord server here 👉 hf.co/discord/join https://preview.redd.it/4dc01ktqnlfa1.png?width=1920&format=png&auto=webp&s=c10ff68884683373c631648725aa166364d9494d If you have questions or feedback, I would love to answer them. submitted by /u/cranthir_ [link] [comments]  ( 41 min )
    Stable Baseines3 not logging success rate for rollouts?
    I am training a PPO agent and have added the appropriate info to my custom environment to log success rate. This now means I am getting logs from my evaluation of success rate but I am not getting this for rollouts. I expect this is due to my environment creation code. As far as I can tell I've done what the documentation says. Am I missing something? env = SubprocVecEnv([lambda: Monitor(SumoEnv(gui=gui), info_keywords=("is_success",)) for i in range(num_envs)], start_method="spawn") submitted by /u/centripetalstranger [link] [comments]  ( 41 min )
    A discrete action in the continuous space?
    I’m currently using stable baselines3 and have 2 options in terms of action. Some acceleration values or a lane change. I broke this down into discrete actions. However, I want to use a continuous action space now and I can’t use a dict action space to keep the accelerations separate. Is it reasonable to make for example 1-3 in the continuous space represent accelerations and then 3-4 represent a lane change? By this I mean, should the agent pick 3.1, 3.2, 3.3, they will all result in a lane change. Will this cause issues? submitted by /u/centripetalstranger [link] [comments]  ( 42 min )
    Share Your Reinforcement Learning Interview Questions
    Hi, I have a technical interview incoming for a entry/mid level RL Engineer position. I have just finished my masters degree and have some experience in RL from previous internships. I am currently using CS 285 at UC Berkeley online lectures to prepare for the interview. I would really appreciate it if you can share the questions that you have faced or asked for similar positions or can recommend what specifics should I focus on for such interviews. submitted by /u/ZIGGY-Zz [link] [comments]  ( 42 min )
    ChatGPT and RL
    Hi I am trying to do classification (nope, it is one-step prediction and it is not sequence dependent). Question: the success of chatgpt, does this mean that we can also use RL (PPO) for just classification? My understanding is that we can just use supervised learning. Does PPO (RL) helps in this case? Thanks? submitted by /u/Dense-Smf-6032 [link] [comments]  ( 42 min )
    Scaling laws for single-agent reinforcement learning (OpenAI)
    submitted by /u/goolulusaurs [link] [comments]  ( 40 min )
    What does the output of the actor network should generally represent?
    Hi, I’m trying to understand some basic concepts of RL. I’m developing a model that should predict the sum of future rewards for any given state (simplified version of bellman’s equation). Then it should compare the actual future reward and it’s prediction with the loss function and backpropagate. This seems to be pretty standard. What I’m not getting, is that when I’m generating my batch of data (for the offline training), I think that the standard should be to choose the action based on a categorical distribution of the predictions for each action (or use epsilon greedy). The problem is that if i have any negative prediction, even if it’s random, it will never reach that state and never update based on it. Is that right? Is it how it’s supposed to be or am I having the wrong concept of what the network should output. Thanks in advance! submitted by /u/enzodtz [link] [comments]  ( 44 min )
    What does the following loss function mean in reinforcement learning?
    I am not an expert on policy learning or reinforcement learning, and I am studying this paper "https://openaccess.thecvf.com/content/ICCV2021/papers/Sun_Dynamic_Network_Quantization_for_Efficient_Video_Inference_ICCV_2021_paper.pdf" right now, but got confused by eq(11) The captial Omega is the action space and small "a_i" is the i-th action in a trajectory. T is the total number of actions in this trajectory,the small k is a selected action from the action space "Omega". I don't understand what symbol it is after the 2nd summation (\sum)in the equation, and I am not sure if it is an "L" or "I". Could someone provide some guidance on what the following function might mean, and why it can be used to achieve balanced policy usage? https://preview.redd.it/n4deietkaifa1.png?width=958&format=png&auto=webp&s=919453f2d7516bd2908c6a23fe6aa1e6fc04edf8 submitted by /u/AaronSpalding [link] [comments]  ( 43 min )
    RuntimeError: Trying to backward through the graph a second time (or directly access saved tensors after they have already been freed)
    I am trying to write my own actor critic algorithm. Unlike other implementations, I tried to keep a separate actor and critic network. The problem arises somewhere in my actor or critic loss function ​ The error is originating here - advantage = nrml_disc_rewards-values critic_loss = advantage.pow(2).mean() actor_loss = -(torch.sum(torch.log(prob_batch)*advantage)) policy_opt.zero_grad() actor_loss.backward() policy_opt.step() value_opt.zero_grad() critic_loss.backward() value_opt.step() This is the full traceback - D:\q_learning\actor_critic.py:90: UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow. Please consider converting the list to a single numpy.ndarray with numpy.array() before converting to a tensor. (Triggered internally at C:\cb\pytorch_100000…  ( 44 min )
  • Open

    [D] Any open source model, or application to remove no speech parts of a video?
    Currently I am using Davinci Resolve free edition to manually cut / remove no speech parts, or the parts where I take a breath It is extremely time consuming I am pretty sure this can be done via AI For example whisper is able to detect where we use filler words such as umh, um, uh etc That would be awesome to automatically remove these parts from a video Just direct me where to look thank you submitted by /u/CeFurkan [link] [comments]  ( 43 min )
    [N] OpenAI starts selling subscriptions to its ChatGPT bot
    https://www.axios.com/2023/02/01/chatgpt-subscriptions-chatbot-openai Not fully paywalled, but there's a tiering system. submitted by /u/bikeskata [link] [comments]  ( 42 min )
    [D] Normalizing Flows in 2023?
    What is the state of research in normalizing flows in 2023? Have they been superseded by diffusion models for sample generation? If so, what are some other applications where normalizing flows are still SOTA (or even useful)? submitted by /u/wellfriedbeans [link] [comments]  ( 42 min )
    [D] Advice for a multi-label classification problem
    Hi guys, I have a dataset of 12,000 products, each of which consists of a title, description, and some images. In addition, I also have a pre-defined set of product categories. Curious to learn if anyone has any suggestions on what model to be used to train using this dataset as input to classify each product in the dataset into the related categories within the given set? submitted by /u/dle88 [link] [comments]  ( 42 min )
    [D] Why is stable diffusion much smaller than predecessors?
    Stable diffusion seems to be a departure from the trend of building larger and larger models. It has 10x less parameters than other image generation models like DALLE-2. “Incredibly, compared with DALL-E 2 and Imagen, the Stable Diffusion model is a lot smaller. While DALL-E 2 has around 3.5 Billion parameters, and Imagen has 4.6 Billion, the first Stable Diffusion model has just 890 million parameters, which means it uses a lot less VRAM and can actually be run on consumer-grade graphics cards.” What allows stable diffusion to work so well with a lot less parameters? Are there any drawbacks to this, like requiring stable diffusion to be fine tuned more than DALLE-2 for example? submitted by /u/dahdarknite [link] [comments]  ( 43 min )
    [R] Extracting Training Data from Diffusion Models
    https://twitter.com/eric_wallace_/status/1620449934863642624?s=46&t=GVukPDI7944N8-waYE5qcw Extracting training data from diffusion models is possible by following, more or less, these steps: Compute CLIP embeddings for the images in a training dataset. Perform an all-pairs comparison and mark the pairs with l2 distance smaller than some threshold as near duplicates Use the prompts for training samples marked as near duplicates to generate N synthetic samples with the trained model Compute the all-pairs l2 distance between the embeddings of generated samples for a given training prompt. Build a graph where the nodes are generated samples and an edge exists if the l2 distance is less than some threshold. If the largest clique in the resulting graph is of size 10, then the training sample is considered to be memorized. Visually inspect the results to determine if the samples considered to be memorized are similar to the training data samples. With this method, the authors were able to find samples from Stable Diffusion and Imagen corresponding to copyrighted training images. submitted by /u/pm_me_your_pay_slips [link] [comments]  ( 45 min )
    [D] Vectorizing computation of the Jaccard similarity between all instances in a large dataset in Python
    I am trying to claculate the Jaccard similarity between all instances in my dataframe. I am using the following method to do so, however, this method is painfully slow. My ```data_with_labels``` shape is (221277, 217). # Compute the Jaccard similarity between all instances n_instances = data_with_labels.shape[0] jaccard_similarity_matrix = np.zeros((n_instances, n_instances)) for i in range(n_instances): for j in range(n_instances): jaccard_similarity_matrix[i, j] = jaccard_score(data_with_labels[i, :], data_with_labels[j, :], average='micro') ​ Is there any way to do this process with numpy vectorization? I tried soomething like this but keep getting this error: n_instances = data_with_labels.shape[0] jaccard_similarity_matrix = np.zeros((n_instances, n_instances)) for i in range(n_instances): jaccard_similarity_matrix[i, :] = jaccard_score(data_with_labels[i, :], data_with_labels, average='micro') ValueError: Found input variables with inconsistent numbers of samples: [217, 221277] ​ submitted by /u/hopedallas [link] [comments]  ( 43 min )
    [R] On the Expressive Power of Geometric Graph Neural Networks
    Geometric GNNs are an emerging class of GNNs for spatially embedded graphs in scientific and engineering applications, s.a. biomolecular structure, material science, and physical simulations. Notable examples include SchNet, DimeNet, Tensor Field Networks, and E(n) Equivariant GNNs. How powerful are geometric GNNs? How do key design choices influence expressivity and how to build maximally powerful ones? Check out this recent paper for more: 📄 PDF: http://arxiv.org/abs/2301.09308 💻 Code: http://github.com/chaitjo/geometric-gnn-dojo 💡Key findings: https://twitter.com/chaitjo/status/1617812402632019968 P.S. Are you new to Geometric GNNs, GDL, PyTorch Geometric, etc.? Want to understand how theory/equations connect to real code? Try this Geometric GNN 101 notebook before diving in: https://github.com/chaitjo/geometric-gnn-dojo/blob/main/geometric_gnn_101.ipynb submitted by /u/chaitjo [link] [comments]  ( 43 min )
    [P] An open source tool for repeatable PyTorch experiments by embedding your code in each model checkpoint
    I made a new open source tool called JellyML that lets you go back to any of your checkpoints, and reproduce your code exactly as it was when you trained it. You can find the website here: https://jellyml.com The GitHub repo: https://gitHub.com/mmulet/jellyml You can install it with pip: pip install jellyml submitted by /u/latefordinnerstudios [link] [comments]  ( 42 min )
    [D] What does a DL role look like in ten years?
    Every day, there seems to be new evidence of the generalization capabilities of LLMs. What does this mean for the future role of deep learning experts in academia and business? It seems like there's a significant chance that skills such as PyTorch and Jax will be displaced by prompt construction and off-the-shelf model APIs, with only a few large institutions working on the DNN itself. Curious to hear others' thoughts on this. submitted by /u/PassingTumbleweed [link] [comments]  ( 47 min )
    [D] Tortoise TTS API for GPT-3.
    Hey everyone, I thought of an idea to create a human like realistic voice assistant for ChatGPT. So I have a question that can we make an API of tortoise TTS trained on a specific voice. I've seen a lot of companies nowadays that provides most realistic text to speech solutions like eleven labs etc. Do they train these voices on tortoise TTS?? If there is another way of creating highly realistic voices and make an API of it, then please tell me how can I do it? And also how can I make this process fast as regular normal TTS? submitted by /u/akshaysri0001 [link] [comments]  ( 43 min )
    [P] predictive modeling- Multi stage classification
    Problem statement: assume a user come into a system and it typically takes 10 weeks for outcome(yes,no). I want to build a model which predicts the outcome on any particular week say how likely are they gonna succeed on week 1,2,3 etc. Question on model building approach: should I build weekly models and get the prediction ? Or is there a better way to do it. Ideally it would be great have single model that can be used for different weeks. I prefer the latter. Appreciate your ideas submitted by /u/R-PRADY [link] [comments]  ( 42 min )
    [P] NER output label post processing
    I’m looking to some aggregation on academic research and news articles to see what insights I get from it. I’m using textrazor to do named entity recognition on the documents, but getting a lot of dirty labels that have slightly different wording. For example, Tesla, Tesla ltd, Tesla Ltd. As a result, my aggregations have a lot of duplicate results. The dataset consists of about 4M labels so the solution has to be efficient to be viable. I was thinking of putting the labels through word2vec and then clustering them based on the word embedding distances? But then the problem arises of how many clusters to use? I’ve also tried simple regex preprocessing to get rid of the company abbreviations but there are other examples that cannot be solved that easily. submitted by /u/hasiemasie [link] [comments]  ( 43 min )
    [D] A report that compares the practices of high-performing companies in Europe to laggards in AI adoption
    Discussion about a report that compares the practices and attitudes of companies that self-report as ahead of the competition in AI adoption in Europe, compared to companies that identify as behind or at the same stage as their competitors. It contains some interesting findings mixed with some somewhat obvious things. Kinda obvious that leading companies also are further ahead in using MLOps, but I thought it was interesting to see the frequency of fine-tuning and retraining. Not as obvious that most companies report a lack of access to training data, would have thought that is mostly something that smaller companies have issues with. Also not so obvious to me is that companies with a centralized decision-making related to AI seem to dominate among high-performers. Interesting that most companies seem to get some value out of their AI/ML projects, which seems to contradict some of the previous forecasts by the big consultancy companies. Link to the report: https://stagezero.ai/2022-survey-report/ submitted by /u/madnessone1 [link] [comments]  ( 43 min )
    [R] EMNLP video interviews, workshops, and posters
    I learned a lot at EMNLP in December and captured some of what I learned in this video. Interviews I asked five NLP researchers these questions: 1- What is the most exciting development in NLP in 2022 2- What are you looking forward to in 2023? 3- What is an underrated idea that the field should pay more attention to? Their answers start at 01:22. Workshops I got to spend time at these workshops: Generation, Evaluation & Metrics (GEM) Massively Multilingual NLU Blackbox NLP My main takeaways are at 09:25. Posters If you've been to a conference you'd know there's an overwhelming number of posters. I recorded four of the ones I came across and thought were interesting (covering retrieval-augmented text generation, human evaluation, the BLOOM multimodal dataset, and a multimodal method to name music playlists). Poster presentations start at 14:38 Full video: https://www.youtube.com/watch?v=plCvF_7qrmY ​ What's your answer to these questions? 1- What is the most exciting development in NLP in 2022 2- What are you looking forward to in 2023? 3- What is an underrated idea that the field should pay more attention to? ​ submitted by /u/jayalammar [link] [comments]  ( 43 min )
    [P] A CLI tool for easy transformer sequence classifier training and inference
    Hi everyone, I have developed a CLI tool to train a transformer sequence classification model. There are also options for preprocessing data and inference on new data. I was thinking that interesting use cases might be found within economics/finance and biological domains, and would be super interested in feedback on: - if the documentation is intelligible and enables you to use it - to which use cases from your industry/domain could discrete sequence modelling be applied - what additional features you'd need for it to be useful to you Basically, where would the prediction of a class (or the next item) based on discrete events/objects/tokens be useful? The project is called "sequifier" and can be found here: https://github.com/0xideas/sequifier submitted by /u/0xideas [link] [comments]  ( 43 min )
    [P] Self Hostable OpenAI Alternative
    Hi, Text-Generator.io is now self hostable, It's priced at $1000 USD per instance per year to self host. The service runs on a single 24GB VRAM GPU, and runs all services including speech to text, text and code generation for almost all languages and generating embeddings too. The text generator also downloads and analyses any input with links including documents, images, images with text inside and webpages for better understanding and to generate better text. It's a great alternative to OpenAI and has a compatible API making switching easy. You can check out the new pricing here. Let me know what you think and if there's anything i can do to help! All the best. Lee Penkman - Founder Text-Generator.io submitted by /u/leepenkman [link] [comments]  ( 44 min )
    [R] SETI finds eight potential alien signals with ML
    GitHub (sadly without weights). https://github.com/PetchMa/ML_GBT_SETI News. https://www-scinexx-de.translate.goog/news/kosmos/seti-findet-acht-potenzielle-alien-signale/?_x_tr_sl=de&_x_tr_tl=en&_x_tr_hl=de&_x_tr_pto=wapp submitted by /u/logTom [link] [comments]  ( 44 min )
    [R] Faithful Chain-of-Thought Reasoning
    Paper : https://arxiv.org/abs/2301.13379 Abstract : While Chain-of-Thought (CoT) prompting boosts Language Models' (LM) performance on a gamut of complex reasoning tasks, the generated reasoning chain does not necessarily reflect how the model arrives at the answer (aka. faithfulness). We propose Faithful CoT, a faithful-by-construction framework that decomposes a reasoning task into two stages: Translation (Natural Language query → symbolic reasoning chain) and Problem Solving (reasoning chain → answer), using an LM and a deterministic solver respectively. We demonstrate the efficacy of our approach on 10 reasoning datasets from 4 diverse domains. It outperforms traditional CoT prompting on 9 out of the 10 datasets, with an average accuracy gain of 4.4 on Math Word Problems, 1.9 on Planning, 4.0 on Multi-hop Question Answering (QA), and 18.1 on Logical Inference, under greedy decoding. Together with self-consistency decoding, we achieve new state-of-the-art few-shot performance on 7 out of the 10 datasets, showing a strong synergy between faithfulness and accuracy. submitted by /u/starstruckmon [link] [comments]  ( 44 min )
    [D] Audio segmentation - Machine Learning algorithm to segment a audio file into multiple class
    Can someone suggest a machine learning model that will segment audio spectrogram to multiple classes. I have labeled data of heart beats. S1, S2, systole and diastole. How to train a segmentation model ? submitted by /u/PlayfulMenu1395 [link] [comments]  ( 42 min )
  • Open

    MusicLM: Generating Music From Text - a model generating high-fidelity music from text descriptions such as "a calming violin melody backed by a distorted guitar riff"
    submitted by /u/magenta_placenta [link] [comments]  ( 40 min )
    Just wanted to share this story created by Chat-GPT3
    Once upon a time, in the Hundred Acre Wood, Winnie the Pooh was feeling down in the dumps. All of his friends were busy with their own activities and he was feeling left out. Pooh decided to take a walk and ended up stumbling upon a mysterious honey pot that had "BB" written on it. Curious, Pooh started to investigate and discovered that the honey pot was actually a cover for a hidden laboratory run by a character similar to Jesse Pinkman from Breaking Bad. The character was cooking up a batch of special honey, which Pooh was immediately drawn to. Despite the character's warnings, Pooh couldn't resist the delicious aroma and sneaked a taste. He was shocked to find that the honey was not only the sweetest he had ever tasted, but it also gave him a burst of energy and focus that he had never experienced before. Intrigued, Pooh started to spend more time with the character, helping him with his honey making and learning about the science behind it. Pooh soon discovered that the character was using the honey to pay for his granddaughter's medical treatment, just like in Breaking Bad. Pooh felt a sense of camaraderie with the character and wanted to help. Together, they came up with a plan to create a legitimate business selling the special honey, but with a focus on using the profits to help others in need. As their business grew, so did their friendship. Pooh was no longer feeling left out, as he was now a part of something important and fulfilling. The two worked together, using their newfound skills and knowledge, to bring joy and happiness to the residents of the Hundred Acre Wood. And so, Winnie the Pooh and his friend from Breaking Bad lived happily ever after, spreading the sweet nectar of kindness and generosity to all. submitted by /u/v1ll3_m [link] [comments]  ( 42 min )
    Perplexity Ask is now available as a Chrome extension. With AI help, you can read quick answers from your extension bar, click on sources, and navigate to http://perplexity.ai when needed. Searches are filled in with a single click from Google and Bing:
    submitted by /u/rafs2006 [link] [comments]  ( 41 min )
    OpenAI rolls out ChatGPT Plus for $20 a month
    submitted by /u/much_successes [link] [comments]  ( 40 min )
    Just dropped
    submitted by /u/zCaptainBr0 [link] [comments]  ( 41 min )
    📌[Searchcolab] "Future of National Park in USA due to Climate Change". Link in comment
    submitted by /u/Maleficent_Suit1591 [link] [comments]  ( 40 min )
    Gmail creator says ChatGPT might "destroy" Google within 2 years
    submitted by /u/ExperienceKCC [link] [comments]  ( 40 min )
    Career Advice?
    im a junior in high school and up until now i had no idea what i wanted to study. after this wave of ai stuff happened and i realized how interested i am in it, im wondering if maybe i should study something involving ai. i have no idea where to start though. software engineer? machine learning engineer? i would love if someone could help decide what i actually would want to do for a career and what i should major in to get me there. thank you. submitted by /u/nicdunz [link] [comments]  ( 41 min )
    All this fuss over Open AI political bias is blather until...
    Biden types in, "To what extent should our military be involved in the war in Ukraine?" submitted by /u/yoitscoach [link] [comments]  ( 40 min )
    Timecoded video and text to artificial voiceover
    Screenshot of an example Hello, I have a question, now it is much easier to transcribe a video with artificial intelligence and then import to Premiere or Vegas to create subtitles inserted in the video image itself, but can you think of a way to convert that same text to artificial voice and create audio file or track with the same timecode or just sync with the original video? It could be done with any text-to-speech converter, but then you would have to manually cut and wrap each piece of text. Example: 1 00:00:00,000 --> 00:00:02,520 In this space we like to feel a bit of everything, 2 00:00:02,600 --> 00:00:04,600 but the important thing is to make dance. 3 00:00:04,680 --> 00:00:06,680 A task performed by all digiles 4 00:00:06,760 --> 00:00:12,160 and one who has been doing this since the 80s is the maestro Maik 5 00:00:12,240 --> 00:00:14,240 Good evening Maik. ​ submitted by /u/mamomo1 [link] [comments]  ( 41 min )
    The History of Artificial Intelligence: Understanding the Brain, explores Reinforcement Learning and Perceptron
    What makes us think? What is inside the brain that makes us conscious? Can we build a universal AI machine to study and understand the universe? https://www.youtube.com/watch?v=AsXx9gyh39M https://preview.redd.it/endsstfshlfa1.png?width=1920&format=png&auto=webp&s=7677e3d0d463e39de0b423a6881ce3876acbe06b submitted by /u/Ok-District-4701 [link] [comments]  ( 40 min )
    See Roy Lichtenstein's Staggering 0-year-old Portrait In Sharp Focus!
    submitted by /u/Calatravo [link] [comments]  ( 40 min )
    Flawless AI lets you change the dialogue on a video and the lips sync absolutely perfectly to each word. Could be big for the movie industry.
    submitted by /u/Dalembert [link] [comments]  ( 44 min )
    The steam engine changed the world. Artificial intelligence could destroy it. - The Boston Globe
    submitted by /u/GlobeOpinion [link] [comments]  ( 40 min )
    Frida Kahlo Paints Grandmother & Grandchild W/ Contemplative Rainforest Vibe
    submitted by /u/Calatravo [link] [comments]  ( 40 min )
    The Best Curated List of A.I. Newsletters Ever (Feb, 2023 with Twitter handles).
    submitted by /u/BackgroundResult [link] [comments]  ( 41 min )
    OpenAI’s new ChatGPT tool may help you tell if text was written by a human or AI
    submitted by /u/liquidocelotYT [link] [comments]  ( 40 min )
    Google is reportedly testing an alternate home page with ChatGPT-style Q&A prompts
    This is laughable.They were sitting on all of the technology.And now they scramble to do something better than 10 links.I for myself will be disappointed with anything less than movie Her. It's a high bar.May be.I would not expect personality.May be some rudementary memory.But the ability to perform almost any digital task must be there.It can be built in a garage using open source projects.COME ON.Some good programmers and hackathon.Yes I am waiting for stability ai model.Or may be gpt 3 API can be used.But submitted by /u/nikitastaf1996 [link] [comments]  ( 41 min )
    Top 9 generative AI tools
    In 2023, generative AI tools will disrupt how we create and share content. What are your favorite generative AI tools? AI avatar - Synthesia AI-generated automations - Bardeen.ai Copy - copy.ai Personalized videos - Rephrase.ai Video editing - Descript Content creation - Type Studio Voice over - Murf.ai Design - Designs.ai Background music - Soundraw Read the full article https://www.bardeen.ai/posts/generative-ai-tools submitted by /u/Intelligent_Shop_012 [link] [comments]  ( 41 min )
    OpenAI Has Launched ChatGPT Content Detection Tool
    submitted by /u/vadhavaniyafaijan [link] [comments]  ( 40 min )
    What is Google's MusicLM? (What are your impressions of it?)
    submitted by /u/BackgroundResult [link] [comments]  ( 40 min )
  • Open

    The Future of AI: GPT-3 vs GPT-4: A Comparative Analysis
    In this post, we will dive deep into the world of Artificial Intelligence and take a closer look at two of the most advanced AI algorithms… Continue reading on Becoming Human: Artificial Intelligence Magazine »  ( 7 min )
    ChatGPT’s authorship: Is it time to redefine authorship in the age of AI?
    In this blog post, we will take a closer look at the implications of ChatGPT’s authorship, the role of AI in scientific literature, and… Continue reading on Becoming Human: Artificial Intelligence Magazine »  ( 8 min )
    Day 7: Advance SQL For Data Science
    So far this is the 7th blog in the journey of basics to advance SQL. you can refer to previous blogs for learning SQL from scratch, This…  ( 8 min )
    How Linear Regression leads to Logistic Regression
    Linear & Logistic: The Relationship Between Regression Models Continue reading on Becoming Human: Artificial Intelligence Magazine »  ( 11 min )
    Meet Mr.ChatGPT: A Large Language Model Trained by OpenAI
    Hello and welcome to the blog! My name is ChatGPT, and I am a large language model trained by OpenAI.  P.S. This article includes a use…  ( 9 min )
  • Open

    MIT Solve announces 2023 global challenges and Indigenous Communities Fellowship
    More than $1 million in funding available to selected Solver teams and fellows.  ( 7 min )
  • Open

    How to decide between Amazon Rekognition image and video API for video moderation
    Almost 80% of today’s web content is user-generated, creating a deluge of content that organizations struggle to analyze with human-only processes. The availability of consumer information helps them make decisions, from buying a new pair of jeans to securing home loans. In a recent survey, 79% of consumers stated they rely on user videos, comments, […]  ( 10 min )
    Scaling distributed training with AWS Trainium and Amazon EKS
    Recent developments in deep learning have led to increasingly large models such as GPT-3, BLOOM, and OPT, some of which are already in excess of 100 billion parameters. Although larger models tend to be more powerful, training such models requires significant computational resources. Even with the use of advanced distributed training libraries like FSDP and […]  ( 11 min )
  • Open

    The Flan Collection: Advancing open source methods for instruction tuning
    Posted by Shayne Longpre, Student Researcher, and Adam Roberts, Senior Staff Software Engineer, Google Research, Brain Team Language models are now capable of performing many new natural language processing (NLP) tasks by reading instructions, often that they hadn’t seen before. The ability to reason on new tasks is mostly credited to training models on a wide variety of unique instructions, known as “instruction tuning”, which was introduced by FLAN and extended in T0, Super-Natural Instructions, MetaICL, and InstructGPT. However, much of the data that drives these advances remain unreleased to the broader research community.  In “The Flan Collection: Designing Data and Methods for Effective Instruction Tuning”, we closely examine and release a newer and more extensive publicly ava…  ( 92 min )
  • Open

    Introducing ChatGPT Plus
    We’re launching a pilot subscription plan for ChatGPT, a conversational AI that can chat with you, answer follow-up questions, and challenge incorrect assumptions. The new subscription plan, ChatGPT Plus, will be available for $20/month, and subscribers will receive a number of benefits: General access to ChatGPT, even  ( 2 min )
  • Open

    Train YOLOv8 on Custom Dataset – A Complete Tutorial
    submitted by /u/keghn [link] [comments]  ( 40 min )
    Study: Superconductivity switches on and off in “magic-angle” graphene
    submitted by /u/keghn [link] [comments]  ( 40 min )
    Deltas and Delta-Deltas Features Explained
    Hi guys, I have made a video on YouTube here where I explain how deltas and delta-deltas features are computed. These are used quite a lot in speech recognition systems. I hope it may be of use to some of you out there. As always, feedback is more than welcomed! :) submitted by /u/Personal-Trainer-541 [link] [comments]  ( 41 min )
    Help NeuralNetwork on Python and RapidMiner
    Hi right now i have to implement a neural network from rapidminer into a python script to predict a value, but i cant get what is wrong with my program please help. import pandas as pd import math def sig(x): return 1 / (1 + math.exp(-x)) #Funcion de perceptron class Perceptron: #Constructor def __init__(self,weights,bias): self.weights = weights self.bias = bias self.output = 0 def setOutput(self,value): self.output = value def getBias(self): return self.bias def getOutput(self): return self.output def guess(self,input): sum = 0 for i in range(10): sum = sum + (input[i]*self.weights[i]) sum = sum + self.bias self.output = sig(sum) return self.output class Output: def __init__(self,nodes,threshold,name): self.nodes = nodes self.threshold = threshold self.name = name def guess(self,input):…  ( 42 min )
  • Open

    Meet the Omnivore: Architectural Researcher Lights Up Omniverse Scenes With ‘SunPath’ Extension
    Things are a lot sunnier these days for designers looking to visualize their projects in NVIDIA Omniverse, a platform for creating and operating metaverse applications.  ( 6 min )
    Deloitte’s Nitin Mittal on the Secrets of ‘All-In’ AI Success
    Artificial intelligence is the new electricity. The fifth industrial revolution. And companies that go all-in on AI are reaping the rewards. So how do you make that happen? That big question — how? — is explored by Nitin Mittal, principal at Deloitte, one of the world’s largest professional services organizations, and co-author Thomas Davenport in Read article >  ( 4 min )
  • Open

    Will ChatGPT Make Fraud Easier?
    Less than 24 hours after posting my previous Data Science Central article (here), dozens of illegitimate copies started to pop up on various websites. Below is an example (title + first paragraph): Fake: An experimental guide to the Riemann conjecture — the correct term is heuristic evidence. It is a strong argument based on empirical evidence rather… Read More »Will ChatGPT Make Fraud Easier? The post Will ChatGPT Make Fraud Easier? appeared first on Data Science Central.  ( 22 min )
  • Open

    FedDig: Robust Federated Learning Using Data Digest to Represent Absent Clients. (arXiv:2210.00737v3 [cs.LG] UPDATED)
    Federated Learning (FL) is a collaborative learning performed by a moderator that protects data privacy. Existing cross-silo FL solutions seldom address the absence of participating clients during training which can seriously degrade model performances, particularly for unbalanced and non-IID client data. We address this issue by generating secure data digests from the raw data and using them to guide model training at the FL moderator. The proposed FL with data digest (FedDig) framework can tolerate unexpected client absence while preserving data privacy. This is achieved by de-identifying digests by mixing and perturbing the encoded features of the raw data in the feature space. The feature perturbing is performed following the Laplace mechanism of Differential Privacy. We evaluate FedDig on EMNIST, CIFAR-10, and CIFAR-100 datasets. The results consistently outperform three baseline algorithms (FedAvg, FedProx, and FedNova) by large margins in multiple client absence scenarios.  ( 2 min )
    Finite-Time Analysis of Fully Decentralized Single-Timescale Actor-Critic. (arXiv:2206.05733v2 [cs.LG] UPDATED)
    Decentralized Actor-Critic (AC) algorithms have been widely utilized for multi-agent reinforcement learning (MARL) and have achieved remarkable success. Apart from its empirical success, the theoretical convergence property of decentralized AC algorithms is largely unexplored. Most of the existing finite-time convergence results are derived based on either double-loop update or two-timescale step sizes rule, and this is the case even for centralized AC algorithm under a single-agent setting. In practice, the \emph{single-timescale} update is widely utilized, where actor and critic are updated in an alternating manner with step sizes being of the same order. In this work, we study a decentralized \emph{single-timescale} AC algorithm.Theoretically, using linear approximation for value and reward estimation, we show that the algorithm has sample complexity of $\tilde{\mathcal{O}}(\varepsilon^{-2})$ under Markovian sampling, which matches the optimal complexity with a double-loop implementation (here, $\tilde{\mathcal{O}}$ hides a logarithmic term). When we reduce to the single-agent setting, our result yields new sample complexity for centralized AC using a single-timescale update scheme. The central to establishing our complexity results is \emph{the hidden smoothness of the optimal critic variable} we revealed. We also provide a local action privacy-preserving version of our algorithm and its analysis. Finally, we conduct experiments to show the superiority of our algorithm over the existing decentralized AC algorithms.  ( 2 min )
    Mirror Sinkhorn: Fast Online Optimization on Transport Polytopes. (arXiv:2211.10420v2 [cs.LG] UPDATED)
    Optimal transport is an important tool in machine learning, allowing to capture geometric properties of the data through a linear program on transport polytopes. We present a single-loop optimization algorithm for minimizing general convex objectives on these domains, utilizing the principles of Sinkhorn matrix scaling and mirror descent. The proposed algorithm is robust to noise, and can be used in an online setting. We provide theoretical guarantees for convex objectives and experimental results showcasing it effectiveness on both synthetic and real-world data.
    FedCliP: Federated Learning with Client Pruning. (arXiv:2301.06768v2 [cs.LG] UPDATED)
    The prevalent communication efficient federated learning (FL) frameworks usually take advantages of model gradient compression or model distillation. However, the unbalanced local data distributions (either in quantity or quality) of participating clients, contributing non-equivalently to the global model training, still pose a big challenge to these works. In this paper, we propose FedCliP, a novel communication efficient FL framework that allows faster model training, by adaptively learning which clients should remain active for further model training and pruning those who should be inactive with less potential contributions. We also introduce an alternative optimization method with a newly defined contribution score measure to facilitate active and inactive client determination. We empirically evaluate the communication efficiency of FL frameworks with extensive experiments on three benchmark datasets under both IID and non-IID settings. Numerical results demonstrate the outperformance of the porposed FedCliP framework over state-of-the-art FL frameworks, i.e., FedCliP can save 70% of communication overhead with only 0.2% accuracy loss on MNIST datasets, and save 50% and 15% of communication overheads with less than 1% accuracy loss on FMNIST and CIFAR-10 datasets, respectively.  ( 2 min )
    Mega: Moving Average Equipped Gated Attention. (arXiv:2209.10655v3 [cs.LG] UPDATED)
    The design choices in the Transformer attention mechanism, including weak inductive bias and quadratic computational complexity, have limited its application for modeling long sequences. In this paper, we introduce Mega, a simple, theoretically grounded, single-head gated attention mechanism equipped with (exponential) moving average to incorporate inductive bias of position-aware local dependencies into the position-agnostic attention mechanism. We further propose a variant of Mega that offers linear time and space complexity yet yields only minimal quality loss, by efficiently splitting the whole sequence into multiple chunks with fixed length. Extensive experiments on a wide range of sequence modeling benchmarks, including the Long Range Arena, neural machine translation, auto-regressive language modeling, and image and speech classification, show that Mega achieves significant improvements over other sequence models, including variants of Transformers and recent state space models.  ( 2 min )
    Q-Ensemble for Offline RL: Don't Scale the Ensemble, Scale the Batch Size. (arXiv:2211.11092v2 [cs.LG] UPDATED)
    Training large neural networks is known to be time-consuming, with the learning duration taking days or even weeks. To address this problem, large-batch optimization was introduced. This approach demonstrated that scaling mini-batch sizes with appropriate learning rate adjustments can speed up the training process by orders of magnitude. While long training time was not typically a major issue for model-free deep offline RL algorithms, recently introduced Q-ensemble methods achieving state-of-the-art performance made this issue more relevant, notably extending the training duration. In this work, we demonstrate how this class of methods can benefit from large-batch optimization, which is commonly overlooked by the deep offline RL community. We show that scaling the mini-batch size and naively adjusting the learning rate allows for (1) a reduced size of the Q-ensemble, (2) stronger penalization of out-of-distribution actions, and (3) improved convergence time, effectively shortening training duration by 3-4x times on average.  ( 2 min )
    Safe and Adaptive Decision-Making for Optimization of Safety-Critical Systems: The ARTEO Algorithm. (arXiv:2211.05495v2 [cs.LG] UPDATED)
    We consider the problem of decision-making under uncertainty in an environment with safety constraints. Many business and industrial applications rely on real-time optimization to improve key performance indicators. In the case of unknown characteristics, real-time optimization becomes challenging, particularly because of the satisfaction of safety constraints. We propose the ARTEO algorithm, where we cast multi-armed bandits as a mathematical programming problem subject to safety constraints and learn the unknown characteristics through exploration while optimizing the targets. We quantify the uncertainty in unknown characteristics by using Gaussian processes and incorporate it into the cost function as a contribution which drives exploration. We adaptively control the size of this contribution in accordance with the requirements of the environment. We guarantee the safety of our algorithm with a high probability through confidence bounds constructed under the regularity assumptions of Gaussian processes. We demonstrate the safety and efficiency of our approach with two case studies: optimization of electric motor current and real-time bidding problems. We further evaluate the performance of ARTEO compared to a safe variant of upper confidence bound based algorithms. ARTEO achieves less cumulative regret with accurate and safe decisions.  ( 2 min )
    SGD and Weight Decay Provably Induce a Low-Rank Bias in Neural Networks. (arXiv:2206.05794v3 [cs.LG] UPDATED)
    In this paper, we study the bias of Stochastic Gradient Descent (SGD) to learn low-rank weight matrices when training deep ReLU neural networks. Our results show that training neural networks with mini-batch SGD and weight decay causes a bias towards rank minimization over the weight matrices. Specifically, we show, both theoretically and empirically, that this bias is more pronounced when using smaller batch sizes, higher learning rates, or increased weight decay. Additionally, we predict and observe empirically that weight decay is necessary to achieve this bias. Finally, we empirically investigate the connection between this bias and generalization, finding that it has a marginal effect on generalization. Our analysis is based on a minimal set of assumptions and applies to neural networks of any width or depth, including those with residual connections and convolutional layers.  ( 2 min )
    Transformers over Directed Acyclic Graphs. (arXiv:2210.13148v3 [cs.LG] UPDATED)
    Transformer models have recently gained popularity in graph representation learning as they have the potential to learn complex relationships beyond the ones captured by regular graph neural networks. The main research question is how to inject the structural bias of graphs into the transformer architecture, and several proposals have been made for undirected molecular graphs and, recently, also for larger network graphs. In this paper, we study transformers over directed acyclic graphs (DAGs) and propose architecture adaptations tailored to DAGs: (1) An attention mechanism that is more efficient than the regular quadratic complexity of transformers and at the same time faithfully captures the DAG structure, and (2) a positional encoding of the DAG's partial order, complementing the former. We rigorously evaluate our framework in ablation studies and show that it is effective in improving different kinds of baseline transformers over various types of data, in experiments ranging from classifying source code graphs to nodes in self-citation networks. In particular, our proposal makes (graph) transformers competitive to or outperform graph neural networks tailored to DAGs.  ( 2 min )
    Learning under Data Drift with Time-Varying Importance Weights. (arXiv:2210.01422v2 [cs.LG] UPDATED)
    Real-world deployment of machine learning models is challenging when data evolves over time. And data does evolve over time. While no model can work when data evolves in an arbitrary fashion, if there is some pattern to these changes, we might be able to design methods to address it. This paper addresses situations when data evolves gradually. We introduce a novel time-varying importance weight estimator that can detect gradual shifts in the distribution of data. Such an importance weight estimator allows the training method to selectively sample past data -- not just similar data from the past like a standard importance weight estimator would but also data that evolved in a similar fashion in the past. Our time-varying importance weight is quite general. We demonstrate different ways of implementing it that exploit some known structure in the evolution of data. We demonstrate and evaluate this approach on a variety of problems ranging from supervised learning tasks (multiple image classification datasets) where the data undergoes a sequence of gradual shifts of our design to reinforcement learning tasks (robotic manipulation and continuous control) where data undergoes a shift organically as the policy or the task changes.
    Exploring Efficient-tuning Methods in Self-supervised Speech Models. (arXiv:2210.06175v3 [eess.AS] UPDATED)
    In this study, we aim to explore efficient tuning methods for speech self-supervised learning. Recent studies show that self-supervised learning (SSL) can learn powerful representations for different speech tasks. However, fine-tuning pre-trained models for each downstream task is parameter-inefficient since SSL models are notoriously large with millions of parameters. Adapters are lightweight modules commonly used in NLP to solve this problem. In downstream tasks, the parameters of SSL models are frozen, and only the adapters are trained. Given the lack of studies generally exploring the effectiveness of adapters for self-supervised speech tasks, we intend to fill this gap by adding various adapter modules in pre-trained speech SSL models. We show that the performance parity can be achieved with over 90% parameter reduction, and discussed the pros and cons of efficient tuning techniques. This is the first comprehensive investigation of various adapter types across speech tasks.
    Aging with GRACE: Lifelong Model Editing with Discrete Key-Value Adaptors. (arXiv:2211.11031v2 [cs.LG] UPDATED)
    Large pre-trained models decay over long-term deployment as input distributions shift, user requirements change, or crucial knowledge gaps are discovered. Recently, model editors have been proposed to modify a model's behavior by adjusting its weights during deployment. However, when editing the same model multiple times, these approaches quickly decay a model's performance on upstream data and forget how to fix previous errors. We propose and study a novel Lifelong Model Editing setting, where streaming errors are identified for a deployed model and we update the model to correct its predictions without influencing unrelated inputs without access to training edits, exogenous datasets, or any upstream data for the edited model. To approach this problem, we introduce General Retrieval Adaptors for Continual Editing, or GRACE, which learns to cache a chosen layer's activations in an adaptive codebook as edits stream in, leaving original model weights frozen. GRACE can thus edit models thousands of times in a row using only streaming errors, without influencing unrelated inputs. Experimentally, we show that GRACE improves over recent alternatives and generalizes to unseen inputs. Our code is available at https://www.github.com/thartvigsen/grace.
    MixFlows: principled variational inference via mixed flows. (arXiv:2205.07475v3 [stat.ML] UPDATED)
    This work presents mixed variational flows (MixFlows), a new variational family that consists of a mixture of repeated applications of a map to an initial reference distribution. First, we provide efficient algorithms for i.i.d. sampling, density evaluation, and unbiased ELBO estimation. We then show that MixFlows have MCMC-like convergence guarantees when the flow map is ergodic and measure-preserving, and provide bounds on the accumulation of error for practical implementations where the flow map is approximated. Finally, we develop an implementation of MixFlows based on uncorrected discretized Hamiltonian dynamics combined with deterministic momentum refreshment. Simulated and real data experiments show that MixFlows can provide more reliable posterior approximations than several black-box normalizing flows, as well as samples of comparable quality to those obtained from state-of-the-art MCMC methods.
    Transformers Can Be Expressed In First-Order Logic with Majority. (arXiv:2210.02671v3 [cs.LG] UPDATED)
    Characterizing the implicit structure of the computation within neural networks is a foundational problem in the area of deep learning interpretability. Can the inner decision process of neural networks be captured symbolically in some familiar logic? We show that any fixed-precision transformer neural network can be translated into an equivalent fixed-size $\mathsf{FO}(\mathsf{M})$ formula, i.e., a first-order logic formula that, in addition to standard universal and existential quantifiers, may also contain majority-vote quantifiers. The proof idea is to design highly uniform boolean threshold circuits that can simulate transformers, and then leverage known theoretical connections between circuits and logic. Our results reveal a surprisingly simple formalism for capturing the behavior of transformers, show that simple problems like integer division are "transformer-hard", and provide valuable insights for comparing transformers to other models like RNNs. Our results suggest that first-order logic with majority may be a useful language for expressing programs extracted from transformers.
    Segmenting thalamic nuclei from manifold projections of multi-contrast MRI. (arXiv:2301.06114v2 [eess.IV] UPDATED)
    The thalamus is a subcortical gray matter structure that plays a key role in relaying sensory and motor signals within the brain. Its nuclei can atrophy or otherwise be affected by neurological disease and injuries including mild traumatic brain injury. Segmenting both the thalamus and its nuclei is challenging because of the relatively low contrast within and around the thalamus in conventional magnetic resonance (MR) images. This paper explores imaging features to determine key tissue signatures that naturally cluster, from which we can parcellate thalamic nuclei. Tissue contrasts include T1-weighted and T2-weighted images, MR diffusion measurements including FA, mean diffusivity, Knutsson coefficients that represent fiber orientation, and synthetic multi-TI images derived from FGATIR and T1-weighted images. After registration of these contrasts and isolation of the thalamus, we use the uniform manifold approximation and projection (UMAP) method for dimensionality reduction to produce a low-dimensional representation of the data within the thalamus. Manual labeling of the thalamus provides labels for our UMAP embedding from which k nearest neighbors can be used to label new unseen voxels in that same UMAP embedding. N -fold cross-validation of the method reveals comparable performance to state-of-the-art methods for thalamic parcellation.
    A Sequential Concept Drift Detection Method for On-Device Learning on Low-End Edge Devices. (arXiv:2212.09637v2 [cs.LG] UPDATED)
    A practical issue of edge AI systems is that data distributions of trained dataset and deployed environment may differ due to noise and environmental changes over time. Such a phenomenon is known as a concept drift, and this gap degrades the performance of edge AI systems and may introduce system failures. To address this gap, retraining of neural network models triggered by concept drift detection is a practical approach. However, since available compute resources are strictly limited in edge devices, in this paper we propose a fully sequential concept drift detection method in cooperation with an on-device sequential learning technique of neural networks. In this case, both the neural network retraining and the proposed concept drift detection are done only by sequential computation to reduce computation cost and memory utilization. Evaluation results of the proposed approach shows that while the accuracy is decreased by 3.8%-4.3% compared to existing batch-based detection methods, it decreases the memory size by 88.9%-96.4% and the execution time by 1.3%-83.8%. As a result, the combination of the neural network retraining and the proposed concept drift detection method is demonstrated on Raspberry Pi Pico that has 264kB memory.
    Transformer-based Modeling of Physical Systems: Improved Latent Representations. (arXiv:2210.11269v4 [cs.LG] UPDATED)
    Many phenomena from physics and engineering require highly flexible models, and have ample data with which to fit. However, this data is often irregularly sampled, and cannot be processed as it is by standard deep learning architecture. We propose a transformer-based model for forecasting physical processes at arbitrary spatial points given information on a related process at possibly different points. This architecture is particularly well-suited for high-altitude wind forecasting, as it can effectively leverage large volumes of data recorded along plane trajectories, which are sparse in space. We test at different scales for two different dynamical systems previously studied in the literature: the Poisson equation and Darcy Flow equation. In both cases, our transformer-based model outperforms alternative methods. We hypothesize that this superior performance is due to a more flexible latent representation. To support this hypothesis, we design a simple synthetic experiment to show that the latent representation of the other models suffers from excessive bottlenecking that is, in some cases, preventing the efficient use of the information and slowing training.
    Fast, Sample-Efficient, Affine-Invariant Private Mean and Covariance Estimation for Subgaussian Distributions. (arXiv:2301.12250v1 [cs.LG])
    We present a fast, differentially private algorithm for high-dimensional covariance-aware mean estimation with nearly optimal sample complexity. Only exponential-time estimators were previously known to achieve this guarantee. Given $n$ samples from a (sub-)Gaussian distribution with unknown mean $\mu$ and covariance $\Sigma$, our $(\varepsilon,\delta)$-differentially private estimator produces $\tilde{\mu}$ such that $\|\mu - \tilde{\mu}\|_{\Sigma} \leq \alpha$ as long as $n \gtrsim \tfrac d {\alpha^2} + \tfrac{d \sqrt{\log 1/\delta}}{\alpha \varepsilon}+\frac{d\log 1/\delta}{\varepsilon}$. The Mahalanobis error metric $\|\mu - \hat{\mu}\|_{\Sigma}$ measures the distance between $\hat \mu$ and $\mu$ relative to $\Sigma$; it characterizes the error of the sample mean. Our algorithm runs in time $\tilde{O}(nd^{\omega - 1} + nd/\varepsilon)$, where $\omega < 2.38$ is the matrix multiplication exponent. We adapt an exponential-time approach of Brown, Gaboardi, Smith, Ullman, and Zakynthinou (2021), giving efficient variants of stable mean and covariance estimation subroutines that also improve the sample complexity to the nearly optimal bound above. Our stable covariance estimator can be turned to private covariance estimation for unrestricted subgaussian distributions. With $n\gtrsim d^{3/2}$ samples, our estimate is accurate in spectral norm. This is the first such algorithm using $n= o(d^2)$ samples, answering an open question posed by Alabi et al. (2022). With $n\gtrsim d^2$ samples, our estimate is accurate in Frobenius norm. This leads to a fast, nearly optimal algorithm for private learning of unrestricted Gaussian distributions in TV distance. Duchi, Haque, and Kuditipudi (2023) obtained similar results independently and concurrently.  ( 2 min )
    Improved High-Probability Regret for Adversarial Bandits with Time-Varying Feedback Graphs. (arXiv:2210.01376v2 [cs.LG] UPDATED)
    We study high-probability regret bounds for adversarial $K$-armed bandits with time-varying feedback graphs over $T$ rounds. For general strongly observable graphs, we develop an algorithm that achieves the optimal regret $\widetilde{\mathcal{O}}((\sum_{t=1}^T\alpha_t)^{1/2}+\max_{t\in[T]}\alpha_t)$ with high probability, where $\alpha_t$ is the independence number of the feedback graph at round $t$. Compared to the best existing result [Neu, 2015] which only considers graphs with self-loops for all nodes, our result not only holds more generally, but importantly also removes any $\text{poly}(K)$ dependence that can be prohibitively large for applications such as contextual bandits. Furthermore, we also develop the first algorithm that achieves the optimal high-probability regret bound for weakly observable graphs, which even improves the best expected regret bound of [Alon et al., 2015] by removing the $\mathcal{O}(\sqrt{KT})$ term with a refined analysis. Our algorithms are based on the online mirror descent framework, but importantly with an innovative combination of several techniques. Notably, while earlier works use optimistic biased loss estimators for achieving high-probability bounds, we find it important to use a pessimistic one for nodes without self-loop in a strongly observable graph.
    Coronal Hole Analysis and Prediction using Computer Vision and LSTM Neural Network. (arXiv:2301.06732v2 [astro-ph.SR] UPDATED)
    As humanity has begun to explore space, the significance of space weather has become apparent. It has been established that coronal holes, a type of space weather phenomenon, can impact the operation of aircraft and satellites. The coronal hole is an area on the sun characterized by open magnetic field lines and relatively low temperatures, which result in the emission of the solar wind at higher than average rates. In this study, To prepare for the impact of coronal holes on the Earth, we use computer vision to detect the coronal hole region and calculate its size based on images from the Solar Dynamics Observatory (SDO). We then implement deep learning techniques, specifically the Long Short-Term Memory (LSTM) method, to analyze trends in the coronal hole area data and predict its size for different sun regions over 7 days. By analyzing time series data on the coronal hole area, this study aims to identify patterns and trends in coronal hole behavior and understand how they may impact space weather events. This research represents an important step towards improving our ability to predict and prepare for space weather events that can affect Earth and technological systems.
    Probabilistic Time Series Forecasting for Adaptive Monitoring in Edge Computing Environments. (arXiv:2211.13729v2 [cs.DC] UPDATED)
    With increasingly more computation being shifted to the edge of the network, monitoring of critical infrastructures, such as intermediate processing nodes in autonomous driving, is further complicated due to the typically resource-constrained environments. In order to reduce the resource overhead on the network link imposed by monitoring, various methods have been discussed that either follow a filtering approach for data-emitting devices or conduct dynamic sampling based on employed prediction models. Still, existing methods are mainly requiring adaptive monitoring on edge devices, which demands device reconfigurations, utilizes additional resources, and limits the sophistication of employed models. In this paper, we propose a sampling-based and cloud-located approach that internally utilizes probabilistic forecasts and hence provides means of quantifying model uncertainties, which can be used for contextualized adaptations of sampling frequencies and consequently relieves constrained network resources. We evaluate our prototype implementation for the monitoring pipeline on a publicly available streaming dataset and demonstrate its positive impact on resource efficiency in a method comparison.  ( 2 min )
    Binary Classification for High Dimensional Data using Supervised Non-Parametric Ensemble Method. (arXiv:2202.07779v2 [cs.LG] UPDATED)
    High dimensional data for classification does create many difficulties for machine learning algorithms. The generalization can be done using ensemble learning methods such as bagging based supervised non-parametric random forest algorithm. In this paper we solve the problem of binary classification for high dimensional data using random forest for polycystic ovary syndrome dataset. We have performed the implementation and provided a detailed visualization of the data for general inference. The training accuracy that we have achieved is 95.6% and validation accuracy over 91.74% respectively.
    Layer Ensembles. (arXiv:2210.04882v2 [cs.LG] UPDATED)
    Deep Ensembles, as a type of Bayesian Neural Networks, can be used to estimate uncertainty on the prediction of multiple neural networks by collecting votes from each network and computing the difference in those predictions. In this paper, we introduce a method for uncertainty estimation that considers a set of independent categorical distributions for each layer of the network, giving many more possible samples with overlapped layers than in the regular Deep Ensembles. We further introduce an optimized inference procedure that reuses common layer outputs, achieving up to 19x speed up and reducing memory usage quadratically. We also show that the method can be further improved by ranking samples, resulting in models that require less memory and time to run while achieving higher uncertainty quality than Deep Ensembles.
    Level-$k$ Meta-Learning for Pedestrian-Aware Self-Driving. (arXiv:2212.08800v2 [cs.RO] UPDATED)
    The potential market for modern self-driving cars is enormous, as they are developing remarkably rapidly. At the same time, however, cases of pedestrian fatalities caused by autonomous driving have been recorded in the case of crossing the road. In this paper, we propose level-$k$ thinking into MAML to create a Level-$k$ Meta Reinforcement Learning (LK-MRL) as a self-driving vehicle model to prepare for heterogeneous pedestrians and improve intersection safety based on the combination of meta reinforcement learning and human cognitive hierarchy framework. In our evaluation, we assign this model to two different cognitive confrontation hierarchy scenarios in an urban traffic simulator to show not only its demonstrate its advantage in road safety but also the producing ability of higher-level thinking strategies.
    Data Origin Inference in Machine Learning. (arXiv:2211.13416v2 [cs.LG] UPDATED)
    It is a growing direction to utilize unintended memorization in ML models to benefit real-world applications, with recent efforts like user auditing, dataset ownership inference and forgotten data measurement. Standing on the point of ML model development, we introduce a process named data origin inference, to assist ML developers in locating missed or faulty data origin in training set without maintaining strenuous metadata. We formally define the data origin and the data origin inference task in the development of the ML model (mainly neural networks). Then we propose a novel inference strategy combining embedded-space multiple instance classification and shadow training. Diverse use cases cover language, visual and structured data, with various kinds of data origin (e.g. business, county, movie, mobile user, text author). A comprehensive performance analysis of our proposed strategy contains referenced target model layers, available testing data for each origin, and in shadow training, the implementations of feature extraction as well as shadow models. Our best inference accuracy achieves 98.96% in the language use case when the target model is a transformer-based deep neural network. Furthermore, we give a statistical analysis of different kinds of data origin to investigate what kind of origin is probably to be inferred correctly.
    Deep Riemannian Networks for EEG Decoding. (arXiv:2212.10426v3 [cs.LG] UPDATED)
    State-of-the-art performance in electroencephalography (EEG) decoding tasks is currently often achieved with either Deep-Learning or Riemannian-Geometry-based decoders. Recently, there is growing interest in Deep Riemannian Networks (DRNs) possibly combining the advantages of both previous classes of methods. However, there are still a range of topics where additional insight is needed to pave the way for a more widespread application of DRNs in EEG. These include architecture design questions such as network size and end-to-end ability as well as model training questions. How these factors affect model performance has not been explored. Additionally, it is not clear how the data within these networks is transformed, and whether this would correlate with traditional EEG decoding. Our study aims to lay the groundwork in the area of these topics through the analysis of DRNs for EEG with a wide range of hyperparameters. Networks were tested on two public EEG datasets and compared with state-of-the-art ConvNets. Here we propose end-to-end EEG SPDNet (EE(G)-SPDNet), and we show that this wide, end-to-end DRN can outperform the ConvNets, and in doing so use physiologically plausible frequency regions. We also show that the end-to-end approach learns more complex filters than traditional band-pass filters targeting the classical alpha, beta, and gamma frequency bands of the EEG, and that performance can benefit from channel specific filtering approaches. Additionally, architectural analysis revealed areas for further improvement due to the possible loss of Riemannian specific information throughout the network. Our study thus shows how to design and train DRNs to infer task-related information from the raw EEG without the need of handcrafted filterbanks and highlights the potential of end-to-end DRNs such as EE(G)-SPDNet for high-performance EEG decoding.
    Temporal Label Smoothing for Early Event Prediction. (arXiv:2208.13764v2 [cs.LG] UPDATED)
    Models that can predict the occurrence of events ahead of time with low false-alarm rates are critical to the acceptance of decision support systems in the medical community. This challenging task is typically treated as a simple binary classification, ignoring temporal dependencies between samples, whereas we propose to exploit this structure. We first introduce a common theoretical framework unifying dynamic survival analysis and early event prediction. Following an analysis of objectives from both fields, we propose Temporal Label Smoothing (TLS), a simpler, yet best-performing method that preserves prediction monotonicity over time. By focusing the objective on areas with a stronger predictive signal, TLS improves performance over all baselines on two large-scale benchmark tasks. Gains are particularly notable along clinically relevant measures, such as event recall at low false-alarm rates. TLS reduces the number of missed events by up to a factor of two over previously used approaches in early event prediction.  ( 2 min )
    Dynamic Network Reconfiguration for Entropy Maximization using Deep Reinforcement Learning. (arXiv:2205.13578v2 [cs.LG] UPDATED)
    A key problem in network theory is how to reconfigure a graph in order to optimize a quantifiable objective. Given the ubiquity of networked systems, such work has broad practical applications in a variety of situations, ranging from drug and material design to telecommunications. The large decision space of possible reconfigurations, however, makes this problem computationally intensive. In this paper, we cast the problem of network rewiring for optimizing a specified structural property as a Markov Decision Process (MDP), in which a decision-maker is given a budget of modifications that are performed sequentially. We then propose a general approach based on the Deep Q-Network (DQN) algorithm and graph neural networks (GNNs) that can efficiently learn strategies for rewiring networks. We then discuss a cybersecurity case study, i.e., an application to the computer network reconfiguration problem for intrusion protection. In a typical scenario, an attacker might have a (partial) map of the system they plan to penetrate; if the network is effectively "scrambled", they would not be able to navigate it since their prior knowledge would become obsolete. This can be viewed as an entropy maximization problem, in which the goal is to increase the surprise of the network. Indeed, entropy acts as a proxy measurement of the difficulty of navigating the network topology. We demonstrate the general ability of the proposed method to obtain better entropy gains than random rewiring on synthetic and real-world graphs while being computationally inexpensive, as well as being able to generalize to larger graphs than those seen during training. Simulations of attack scenarios confirm the effectiveness of the learned rewiring strategies.  ( 3 min )
    Deep Learning-based Spatially Explicit Emulation of an Agent-Based Simulator for Pandemic in a City. (arXiv:2205.14396v2 [cs.MA] UPDATED)
    Agent-Based Models are very useful for simulation of physical or social processes, such as the spreading of a pandemic in a city. Such models proceed by specifying the behavior of individuals (agents) and their interactions, and parameterizing the process of infection based on such interactions based on the geography and demography of the city. However, such models are computationally very expensive, and the complexity is often linear in the total number of agents. This seriously limits the usage of such models for simulations, which often have to be run hundreds of times for policy planning and even model parameter estimation. An alternative is to develop an emulator, a surrogate model that can predict the Agent-Based Simulator's output based on its initial conditions and parameters. In this paper, we discuss a Deep Learning model based on Dilated Convolutional Neural Network that can emulate such an agent based model with high accuracy. We show that use of this model instead of the original Agent-Based Model provides us major gains in the speed of simulations, allowing much quicker calibration to observations, and more extensive scenario analysis. The models we consider are spatially explicit, as the locations of the infected individuals are simulated instead of the gross counts. Another aspect of our emulation framework is its divide-and-conquer approach that divides the city into several small overlapping blocks and carries out the emulation in them parallelly, after which these results are merged together. This ensures that the same emulator can work for a city of any size, and also provides significant improvement of time complexity of the emulator, compared to the original simulator.  ( 2 min )
    BiAdam: Fast Adaptive Bilevel Optimization Methods. (arXiv:2106.11396v3 [math.OC] UPDATED)
    Bilevel optimization recently has attracted increased interest in machine learning due to its many applications such as hyper-parameter optimization and meta learning. Although many bilevel optimization methods recently have been proposed, these methods do not consider using adaptive learning rates. It is well known that adaptive learning rates can accelerate many optimization algorithms including (stochastic) gradient-based algorithms. To fill this gap, in the paper, we propose a novel fast adaptive bilevel framework to solve stochastic bilevel optimization problems that the outer problem is possibly nonconvex and the inner problem is strongly convex. Our framework uses unified adaptive matrices including many types of adaptive learning rates, and can flexibly use the momentum and variance reduced techniques. In particular, we provide a useful convergence analysis framework for the bilevel optimization. Specifically, we propose a fast single-loop adaptive bilevel optimization (BiAdam) algorithm based on the basic momentum technique, which achieves a sample complexity of $\tilde{O}(\epsilon^{-4})$ for finding an $\epsilon$-stationary solution (i.e., $\mathbb{E}\|\nabla F(x)\| \leq \epsilon$ or its equivalent variants). Meanwhile, we propose an accelerated version of BiAdam algorithm (VR-BiAdam) by using variance reduced technique, which reaches the best known sample complexity of $\tilde{O}(\epsilon^{-3})$ without relying on large batch-size. To the best of our knowledge, we first study the adaptive bilevel optimization methods with adaptive learning rates. Some experimental results on data hyper-cleaning and hyper-representation learning tasks demonstrate the efficiency of our algorithms.  ( 2 min )
    Contrastive Credibility Propagation for Reliable Semi-Supervised Learning. (arXiv:2211.09929v2 [cs.LG] UPDATED)
    Inferencing unlabeled data from labeled data is an error-prone process. Conventional neural network training is highly sensitive to supervision errors. These two realities make semi-supervised learning (SSL) troublesome. In practice, SSL approaches often fail to outperform their fully supervised baseline. Proposed is a novel framework for deep SSL via transductive pseudo-label refinement called Contrastive Credibility Propagation (CCP). Through an iterative process of refining soft pseudo-labels, CCP unifies a novel contrastive approach for generating pseudo-labels and a powerful technique to overcome instance-dependent label noise. The result is an SSL classification framework explicitly designed to overcome inevitable pseudo-label errors. Using standard text and image benchmark classification datasets, we show CCP reliably boosts or matches performance over a supervised baseline in four common real-world SSL scenarios: few-label, open-set, noisy-label, and class distribution misalignment.  ( 2 min )
    Toward Transparent AI: A Survey on Interpreting the Inner Structures of Deep Neural Networks. (arXiv:2207.13243v5 [cs.LG] UPDATED)
    The last decade of machine learning has seen drastic increases in scale and capabilities. Deep neural networks (DNNs) are increasingly being deployed in the real world. However, they are difficult to analyze, raising concerns about using them without a rigorous understanding of how they function. Effective tools for interpreting them will be important for building more trustworthy AI by helping to identify problems, fix bugs, and improve basic understanding. In particular, "inner" interpretability techniques, which focus on explaining the internal components of DNNs, are well-suited for developing a mechanistic understanding, guiding manual modifications, and reverse engineering solutions. Much recent work has focused on DNN interpretability, and rapid progress has thus far made a thorough systematization of methods difficult. In this survey, we review over 300 works with a focus on inner interpretability tools. We introduce a taxonomy that classifies methods by what part of the network they help to explain (weights, neurons, subnetworks, or latent representations) and whether they are implemented during (intrinsic) or after (post hoc) training. To our knowledge, we are also the first to survey a number of connections between interpretability research and work in adversarial robustness, continual learning, modularity, network compression, and studying the human visual system. We discuss key challenges and argue that the status quo in interpretability research is largely unproductive. Finally, we highlight the importance of future work that emphasizes diagnostics, debugging, adversaries, and benchmarking in order to make interpretability tools more useful to engineers in practical applications.  ( 3 min )
    Emergent Linguistic Structures in Neural Networks are Fragile. (arXiv:2210.17406v5 [cs.LG] UPDATED)
    Large Language Models (LLMs) have been reported to have strong performance on natural language processing tasks. However, performance metrics such as accuracy do not measure the quality of the model in terms of its ability to robustly represent complex linguistic structure. In this work, we propose a framework and measure of robustness to assess the consistency of linguistic representations against syntax-preserving perturbations. We leverage recent advances in extracting linguistic constructs from LLMs to test the robustness of such structures. Empirically, we study the performance of four LLMs across six different corpora on the proposed robustness measures. We provide evidence that context-free representation (e.g., GloVe) are in some cases competitive with context-dependent representations from modern LLMs (e.g., BERT), yet equally brittle to syntax-preserving manipulations. Emergent syntactic representations in neural networks are brittle, thus our work poses the attention on the risk of comparing such structures to those that are object of a long lasting debate in linguistics.
    Interpretable (not just posthoc-explainable) medical claims modeling for discharge placement to prevent avoidable all-cause readmissions or death. (arXiv:2208.12814v3 [cs.CY] UPDATED)
    We developed an inherently interpretable multilevel Bayesian framework for representing variation in regression coefficients that mimics the piecewise linearity of ReLU-activated deep neural networks. We used the framework to formulate a survival model for using medical claims to predict hospital readmission and death that focuses on discharge placement, adjusting for confounding in estimating causal local average treatment effects. We trained the model on a 5% sample of Medicare beneficiaries from 2008 and 2011, based on their 2009--2011 inpatient episodes, and then tested the model on 2012 episodes. The model scored an AUROC of approximately 0.76 on predicting all-cause readmissions -- defined using official Centers for Medicare and Medicaid Services (CMS) methodology -- or death within 30-days of discharge, being competitive against XGBoost and a Bayesian deep neural network, demonstrating that one need-not sacrifice interpretability for accuracy. Crucially, as a regression model, we provide what blackboxes cannot -- the exact gold-standard global interpretation of the model, identifying relative risk factors and quantifying the effect of discharge placement. We also show that the posthoc explainer SHAP fails to provide accurate explanations.  ( 2 min )
    Over-The-Air Federated Learning Over Scalable Cell-free Massive MIMO. (arXiv:2212.06482v2 [eess.SP] UPDATED)
    Cell-free massive MIMO is emerging as a promising technology for future wireless communication systems, which is expected to offer uniform coverage and high spectral efficiency compared to classical cellular systems. We study in this paper how cell-free massive MIMO can support federated edge learning. Taking advantage of the additive nature of the wireless multiple access channel, over-the-air computation is exploited, where the clients send their local updates simultaneously over the same communication resource. This approach, known as over-the-air federated learning (OTA-FL), is proven to alleviate the communication overhead of federated learning over wireless networks. Considering channel correlation and only imperfect channel state information available at the central server, we propose a practical implementation of OTA-FL over cell-free massive MIMO. The convergence of the proposed implementation is studied analytically and experimentally, confirming the benefits of cell-free massive MIMO for OTA-FL.
    Beyond Hawkes: Neural Multi-event Forecasting on Spatio-temporal Point Processes. (arXiv:2211.02922v2 [cs.LG] UPDATED)
    Predicting discrete events in time and space has many scientific applications, such as predicting hazardous earthquakes and outbreaks of infectious diseases. History-dependent spatio-temporal Hawkes processes are often used to mathematically model these point events. However, previous approaches have faced numerous challenges, particularly when attempting to forecast one or multiple future events. In this work, we propose a new neural architecture for simultaneous multi-event forecasting of spatio-temporal point processes, utilizing transformers, augmented with normalizing flows and probabilistic layers. Our network makes batched predictions of complex history-dependent spatio-temporal distributions of future discrete events, achieving state-of-the-art performance on a variety of benchmark datasets including the South California Earthquakes, Citibike, Covid-19, and Hawkes synthetic pinwheel datasets. More generally, we illustrate how our network can be applied to any dataset of discrete events with associated markers, even when no underlying physics is known.
    G-Rep: Gaussian Representation for Arbitrary-Oriented Object Detection. (arXiv:2205.11796v2 [cs.CV] UPDATED)
    Typical representations for arbitrary-oriented object detection tasks include oriented bounding box (OBB), quadrilateral bounding box (QBB), and point set (PointSet). Each representation encounters problems that correspond to its characteristics, such as the boundary discontinuity, square-like problem, representation ambiguity, and isolated points, which lead to inaccurate detection. Although many effective strategies have been proposed for various representations, there is still no unified solution. Current detection methods based on Gaussian modeling have demonstrated the possibility of breaking this dilemma; however, they remain limited to OBB. To go further, in this paper, we propose a unified Gaussian representation called G-Rep to construct Gaussian distributions for OBB, QBB, and PointSet, which achieves a unified solution to various representations and problems. Specifically, PointSet or QBB-based object representations are converted into Gaussian distributions, and their parameters are optimized using the maximum likelihood estimation algorithm. Then, three optional Gaussian metrics are explored to optimize the regression loss of the detector because of their excellent parameter optimization mechanisms. Furthermore, we also use Gaussian metrics for sampling to align label assignment and regression loss. Experimental results on several public available datasets, such as DOTA, HRSC2016, UCAS-AOD, and ICDAR2015, show the excellent performance of the proposed method for arbitrary-oriented object detection.
    MM-GNN: Mix-Moment Graph Neural Network towards Modeling Neighborhood Feature Distribution. (arXiv:2208.07012v2 [cs.LG] UPDATED)
    Graph Neural Networks (GNNs) have shown expressive performance on graph representation learning by aggregating information from neighbors. Recently, some studies have discussed the importance of modeling neighborhood distribution on the graph. However, most existing GNNs aggregate neighbors' features through single statistic (e.g., mean, max, sum), which loses the information related to neighbor's feature distribution and therefore degrades the model performance. In this paper, inspired by the method of moment in statistical theory, we propose to model neighbor's feature distribution with multi-order moments. We design a novel GNN model, namely Mix-Moment Graph Neural Network (MM-GNN), which includes a Multi-order Moment Embedding (MME) module and an Element-wise Attention-based Moment Adaptor module. MM-GNN first calculates the multi-order moments of the neighbors for each node as signatures, and then use an Element-wise Attention-based Moment Adaptor to assign larger weights to important moments for each node and update node representations. We conduct extensive experiments on 15 real-world graphs (including social networks, citation networks and web-page networks etc.) to evaluate our model, and the results demonstrate the superiority of MM-GNN over existing state-of-the-art models.
    Graphically Structured Diffusion Models. (arXiv:2210.11633v2 [cs.LG] UPDATED)
    We introduce a framework for automatically defining and learning deep generative models with problem-specific structure. We tackle problem domains that are more traditionally solved by algorithms such as sorting, constraint satisfaction for Sudoku, and matrix factorization. Concretely, we train diffusion models with an architecture tailored to the problem specification. This problem specification should contain a graphical model describing relationships between variables, and often benefits from explicit representation of subcomputations. Permutation invariances can also be exploited. Across a diverse set of experiments we improve the scaling relationship between problem dimension and our model's performance, in terms of both training time and final accuracy.  ( 2 min )
    DyFormer: A Scalable Dynamic Graph Transformer with Provable Benefits on Generalization Ability. (arXiv:2111.10447v3 [cs.LG] UPDATED)
    Transformers have achieved great success in several domains, including Natural Language Processing and Computer Vision. However, its application to real-world graphs is less explored, mainly due to its high computation cost and its poor generalizability caused by the lack of enough training data in the graph domain. To fill in this gap, we propose a scalable Transformer-like dynamic graph learning method named Dynamic Graph Transformer (DyFormer) with spatial-temporal encoding to effectively learn graph topology and capture implicit links. To achieve efficient and scalable training, we propose temporal-union graph structure and its associated subgraph-based node sampling strategy. To improve the generalization ability, we introduce two complementary self-supervised pre-training tasks and show that jointly optimizing the two pre-training tasks results in a smaller Bayesian error rate via an information-theoretic analysis. Extensive experiments on the real-world datasets illustrate that DyFormer achieves a consistent 1%-3% AUC gain (averaged over all time steps) compared with baselines on all benchmarks.  ( 2 min )
    FED-CD: Federated Causal Discovery from Interventional and Observational Data. (arXiv:2211.03846v2 [cs.LG] UPDATED)
    Causal discovery, the inference of causal relations from data, is a core task of fundamental importance in all scientific domains, and several new machine learning methods for addressing the causal discovery problem have been proposed recently. However, existing machine learning methods for causal discovery typically require that the data used for inference is pooled and available in a centralized location. In many domains of high practical importance, such as in healthcare, data is only available at local data-generating entities (e.g. hospitals in the healthcare context), and cannot be shared across entities due to, among others, privacy and regulatory reasons. In this work, we address the problem of inferring causal structure - in the form of a directed acyclic graph (DAG) - from a distributed data set that contains both observational and interventional data in a privacy-preserving manner by exchanging updates instead of samples. To this end, we introduce a new federated framework, FED-CD, that enables the discovery of global causal structures both when the set of intervened covariates is the same across decentralized entities, and when the set of intervened covariates are potentially disjoint. We perform a comprehensive experimental evaluation on synthetic data that demonstrates that FED-CD enables effective aggregation of decentralized data for causal discovery without direct sample sharing, even when the contributing distributed data sets cover disjoint sets of interventions. Effective methods for causal discovery in distributed data sets could significantly advance scientific discovery and knowledge sharing in important settings, for instance, healthcare, in which sharing of data across local sites is difficult or prohibited.  ( 2 min )
    Fair and Optimal Classification via Post-Processing Predictors. (arXiv:2211.01528v2 [cs.LG] UPDATED)
    To address the bias exhibited by machine learning models, fairness criteria impose statistical constraints for ensuring equal treatment to all demographic groups, but typically at a cost to model performance. Understanding this tradeoff, therefore, underlies the design of fair and effective algorithms. This paper completes the characterization of the inherent tradeoff of demographic parity on classification problems in the most general multigroup, multiclass, and noisy setting. Specifically, we show that the minimum error rate is given by the optimal value of a Wasserstein-barycenter problem. More practically, this reformulation leads to a simple procedure for post-processing any pre-trained predictors to satisfy demographic parity in the general setting, which, in particular, yields the optimal fair classifier when applied to the Bayes predictor. We provide suboptimality and finite sample analyses for our procedure, and demonstrate precise control of the tradeoff of error rate for fairness on real-world datasets provided sufficient data.  ( 2 min )
    Revisiting Over-smoothing and Over-squashing using Ollivier-Ricci Curvature. (arXiv:2211.15779v2 [cs.LG] UPDATED)
    Graph Neural Networks (GNNs) had been demonstrated to be inherently susceptible to the problems of over-smoothing and over-squashing. These issues prohibit the ability of GNNs to model complex graph interactions by limiting their effectiveness in taking into account distant information. Our study reveals the key connection between the local graph geometry and the occurrence of both of these issues, thereby providing a unified framework for studying them at a local scale using the Ollivier-Ricci curvature. Specifically, we demonstrate that over-smoothing is linked to positive graph curvature, while over-squashing is linked to negative graph curvature. Based on our theory, we propose the Batch Ollivier-Ricci Flow, a novel rewiring algorithm capable of simultaneously addressing both over-smoothing and over-squashing.
    Hard Sample Aware Network for Contrastive Deep Graph Clustering. (arXiv:2212.08665v3 [cs.LG] UPDATED)
    Contrastive deep graph clustering, which aims to divide nodes into disjoint groups via contrastive mechanisms, is a challenging research spot. Among the recent works, hard sample mining-based algorithms have achieved great attention for their promising performance. However, we find that the existing hard sample mining methods have two problems as follows. 1) In the hardness measurement, the important structural information is overlooked for similarity calculation, degrading the representativeness of the selected hard negative samples. 2) Previous works merely focus on the hard negative sample pairs while neglecting the hard positive sample pairs. Nevertheless, samples within the same cluster but with low similarity should also be carefully learned. To solve the problems, we propose a novel contrastive deep graph clustering method dubbed Hard Sample Aware Network (HSAN) by introducing a comprehensive similarity measure criterion and a general dynamic sample weighing strategy. Concretely, in our algorithm, the similarities between samples are calculated by considering both the attribute embeddings and the structure embeddings, better revealing sample relationships and assisting hardness measurement. Moreover, under the guidance of the carefully collected high-confidence clustering information, our proposed weight modulating function will first recognize the positive and negative samples and then dynamically up-weight the hard sample pairs while down-weighting the easy ones. In this way, our method can mine not only the hard negative samples but also the hard positive sample, thus improving the discriminative capability of the samples further. Extensive experiments and analyses demonstrate the superiority and effectiveness of our proposed method.
    PrivHAR: Recognizing Human Actions From Privacy-preserving Lens. (arXiv:2206.03891v2 [cs.CV] UPDATED)
    The accelerated use of digital cameras prompts an increasing concern about privacy and security, particularly in applications such as action recognition. In this paper, we propose an optimizing framework to provide robust visual privacy protection along the human action recognition pipeline. Our framework parameterizes the camera lens to successfully degrade the quality of the videos to inhibit privacy attributes and protect against adversarial attacks while maintaining relevant features for activity recognition. We validate our approach with extensive simulations and hardware experiments.
    Rethinking skip connection model as a learnable Markov chain. (arXiv:2209.15278v2 [cs.LG] UPDATED)
    Over past few years afterward the birth of ResNet, skip connection has become the defacto standard for the design of modern architectures due to its widespread adoption, easy optimization and proven performance. Prior work has explained the effectiveness of the skip connection mechanism from different perspectives. In this work, we deep dive into the model's behaviors with skip connections which can be formulated as a learnable Markov chain. An efficient Markov chain is preferred as it always maps the input data to the target domain in a better way. However, while a model is explained as a Markov chain, it is not guaranteed to be optimized following an efficient Markov chain by existing SGD-based optimizers which are prone to get trapped in local optimal points. In order to towards a more efficient Markov chain, we propose a simple routine of penal connection to make any residual-like model become a learnable Markov chain. Aside from that, the penal connection can also be viewed as a particular model regularization and can be easily implemented with one line of code in the most popular deep learning frameworks~\footnote{Source code: \url{https://github.com/densechen/penal-connection}}. The encouraging experimental results in multi-modal translation and image recognition empirically confirm our conjecture of the learnable Markov chain view and demonstrate the superiority of the proposed penal connection.
    What Is Fairness? Philosophical Considerations and Implications For FairML. (arXiv:2205.09622v2 [cs.LG] UPDATED)
    A growing body of literature in fairness-aware ML (fairML) aspires to mitigate machine learning (ML)-related unfairness in automated decision making (ADM) by defining metrics that measure fairness of an ML model and by proposing methods that ensure that trained ML models achieve low values in those measures. However, the underlying concept of fairness, i.e., the question of what fairness is, is rarely discussed, leaving a considerable gap between centuries of philosophical discussion and recent adoption of the concept in the ML community. In this work, we try to bridge this gap by formalizing a consistent concept of fairness and by translating the philosophical considerations into a formal framework for the training and evaluation of ML models in ADM systems. We derive that fairness problems can already arise without the presence of protected attributes, pointing out that fairness and predictive performance are not irreconcilable counterparts, but rather that the latter is necessary to achieve the former. Moreover, we argue why and how causal considerations are necessary when assessing fairness in the presence of protected attributes. We achieve greater linguistic clarity for the discussion of fairML and propose general algorithms for practical applications.
    Sequence Learning using Equilibrium Propagation. (arXiv:2209.09626v2 [cs.NE] UPDATED)
    Equilibrium Propagation (EP) is a powerful and more bio-plausible alternative to conventional learning frameworks such as backpropagation. The effectiveness of EP stems from the fact that it relies only on local computations and requires solely one kind of computational unit during both of its training phases, thereby enabling greater applicability in domains such as bio-inspired neuromorphic computing. The dynamics of the model in EP is governed by an energy function and the internal states of the model consequently converge to a steady state following the state transition rules defined by the same. However, by definition, EP requires the input to the model (a convergent RNN) to be static in both the phases of training. Thus it is not possible to design a model for sequence classification using EP with an LSTM or GRU like architecture. In this paper, we leverage recent developments in modern hopfield networks to further understand energy based models and develop solutions for complex sequence classification tasks using EP while satisfying its convergence criteria and maintaining its theoretical similarities with recurrent backpropagation. We explore the possibility of integrating modern hopfield networks as an attention mechanism with convergent RNN models used in EP, thereby extending its applicability for the first time on two different sequence classification tasks in natural language processing viz. sentiment analysis (IMDB dataset) and natural language inference (SNLI dataset).
    Large-scale Model Personalization via Low Rank and Sparse decomposition. (arXiv:2210.03505v2 [cs.LG] UPDATED)
    Personalization of machine learning (ML) predictions for individual users/domains/enterprises is critical for practical recommendation style systems. Standard personalization approaches involve learning a user/domain specific embedding that is fed into a fixed global model which can be limiting. On the other hand, personalizing/fine-tuning model itself for each user/domain -- a.k.a meta-learning -- has high storage/infrastructure cost. We propose a novel meta-learning style approach that models network weights as a sum of low-rank and sparse matrices. This captures common information from multiple individuals/users together in the low-rank part while sparse part captures user-specific idiosyncrasies. Furthermore, the framework is up to two orders of magnitude more scalable (in terms of storage/infrastructure cost) than user-specific finetuning of model. We then study the framework in the linear setting, where the problem reduces to that of estimating the sum of a rank-$r$ and a $k$-column sparse matrix using a small number of linear measurements. We propose an alternating minimization method with iterative hard thresholding -- AMHT-LRS -- to learn the low-rank and sparse part. For the realizable, Gaussian data setting, we show that AMHT-LRS solves the problem efficiently with nearly optimal samples. A significant challenge in personalization is ensuring privacy of each user's sensitive data. We alleviate this problem by proposing a differentially private variant of our method that also is equipped with strong generalization guarantees. Finally, on multiple standard recommendation datasets, we demonstrate that our approach allows personalized models to obtain superior performance in sparse data regime.
    Cyclic Block Coordinate Descent With Variance Reduction for Composite Nonconvex Optimization. (arXiv:2212.05088v2 [math.OC] UPDATED)
    Nonconvex optimization is central in solving many machine learning problems, in which block-wise structure is commonly encountered. In this work, we propose cyclic block coordinate methods for nonconvex optimization problems with non-asymptotic gradient norm guarantees. Our convergence analysis is based on a gradient Lipschitz condition with respect to a Mahalanobis norm, inspired by a recent progress on cyclic block coordinate methods. In deterministic settings, our convergence guarantee matches the guarantee of (full-gradient) gradient descent, but with the gradient Lipschitz constant being defined w.r.t.~a Mahalanobis norm. In stochastic settings, we use recursive variance reduction to decrease the per-iteration cost and match the arithmetic operation complexity of current optimal stochastic full-gradient methods, with a unified analysis for both finite-sum and infinite-sum cases. We prove a faster linear convergence result when a Polyak-{\L}ojasiewicz (P{\L}) condition holds. To our knowledge, this work is the first to provide non-asymptotic convergence guarantees -- variance-reduced or not -- for a cyclic block coordinate method in general composite (smooth + nonsmooth) nonconvex settings. Our experimental results demonstrate the efficacy of the proposed cyclic scheme in training deep neural nets.
    Accelerating Kernel Classifiers Through Borders Mapping. (arXiv:1708.05917v6 [stat.ML] UPDATED)
    Support vector machines (SVM) and other kernel techniques represent a family of powerful statistical classification methods with high accuracy and broad applicability. Because they use all or a significant portion of the training data, however, they can be slow, especially for large problems. Piecewise linear classifiers are similarly versatile, yet have the additional advantages of simplicity, ease of interpretation and, if the number of component linear classifiers is not too large, speed. Here we show how a simple, piecewise linear classifier can be trained from a kernel-based classifier in order to improve the classification speed. The method works by finding the root of the difference in conditional probabilities between pairs of opposite classes to build up a representation of the decision boundary. When tested on 17 different datasets, it succeeded in improving the classification speed of a SVM for 12 of them by up to two orders-of-magnitude. Of these, two were less accurate than a simple, linear classifier. The method is best suited to problems with continuum features data and smooth probability functions. Because the component linear classifiers are built up individually from an existing classifier, rather than through a simultaneous optimization procedure, the classifier is also fast to train.
    Dexterous Robotic Manipulation using Deep Reinforcement Learning and Knowledge Transfer for Complex Sparse Reward-based Tasks. (arXiv:2205.09683v2 [cs.RO] UPDATED)
    This paper describes a deep reinforcement learning (DRL) approach that won Phase 1 of the Real Robot Challenge (RRC) 2021, and then extends this method to a more difficult manipulation task. The RRC consisted of using a TriFinger robot to manipulate a cube along a specified positional trajectory, but with no requirement for the cube to have any specific orientation. We used a relatively simple reward function, a combination of goal-based sparse reward and distance reward, in conjunction with Hindsight Experience Replay (HER) to guide the learning of the DRL agent (Deep Deterministic Policy Gradient (DDPG)). Our approach allowed our agents to acquire dexterous robotic manipulation strategies in simulation. These strategies were then applied to the real robot and outperformed all other competition submissions, including those using more traditional robotic control techniques, in the final evaluation stage of the RRC. Here we extend this method, by modifying the task of Phase 1 of the RRC to require the robot to maintain the cube in a particular orientation, while the cube is moved along the required positional trajectory. The requirement to also orient the cube makes the agent unable to learn the task through blind exploration due to increased problem complexity. To circumvent this issue, we make novel use of a Knowledge Transfer (KT) technique that allows the strategies learned by the agent in the original task (which was agnostic to cube orientation) to be transferred to this task (where orientation matters). KT allowed the agent to learn and perform the extended task in the simulator, which improved the average positional deviation from 0.134 m to 0.02 m, and average orientation deviation from 142{\deg} to 76{\deg} during evaluation. This KT concept shows good generalisation properties and could be applied to any actor-critic learning algorithm.
    FETA: Fairness Enforced Verifying, Training, and Predicting Algorithms for Neural Networks. (arXiv:2206.00553v2 [cs.LG] UPDATED)
    Algorithmic decision making driven by neural networks has become very prominent in applications that directly affect people's quality of life. In this paper, we study the problem of verifying, training, and guaranteeing individual fairness of neural network models. A popular approach for enforcing fairness is to translate a fairness notion into constraints over the parameters of the model. However, such a translation does not always guarantee fair predictions of the trained neural network model. To address this challenge, we develop a counterexample-guided post-processing technique to provably enforce fairness constraints at prediction time. Contrary to prior work that enforces fairness only on points around test or train data, we are able to enforce and guarantee fairness on all points in the input domain. Additionally, we propose an in-processing technique to use fairness as an inductive bias by iteratively incorporating fairness counterexamples in the learning process. We have implemented these techniques in a tool called FETA. Empirical evaluation on real-world datasets indicates that FETA is not only able to guarantee fairness on-the-fly at prediction time but also is able to train accurate models exhibiting a much higher degree of individual fairness.
    Context-Aware Differential Privacy for Language Modeling. (arXiv:2301.12288v1 [cs.LG])
    The remarkable ability of language models (LMs) has also brought challenges at the interface of AI and security. A critical challenge pertains to how much information these models retain and leak about the training data. This is particularly urgent as the typical development of LMs relies on huge, often highly sensitive data, such as emails and chat logs. To contrast this shortcoming, this paper introduces Context-Aware Differentially Private Language Model (CADP-LM) , a privacy-preserving LM framework that relies on two key insights: First, it utilizes the notion of \emph{context} to define and audit the potentially sensitive information. Second, it adopts the notion of Differential Privacy to protect sensitive information and characterize the privacy leakage. A unique characteristic of CADP-LM is its ability to target the protection of sensitive sentences and contexts only, providing a highly accurate private model. Experiments on a variety of datasets and settings demonstrate these strengths of CADP-LM.
    Double Sampling Randomized Smoothing. (arXiv:2206.07912v4 [cs.LG] UPDATED)
    Neural networks (NNs) are known to be vulnerable against adversarial perturbations, and thus there is a line of work aiming to provide robustness certification for NNs, such as randomized smoothing, which samples smoothing noises from a certain distribution to certify the robustness for a smoothed classifier. However, as shown by previous work, the certified robust radius in randomized smoothing suffers from scaling to large datasets ("curse of dimensionality"). To overcome this hurdle, we propose a Double Sampling Randomized Smoothing (DSRS) framework, which exploits the sampled probability from an additional smoothing distribution to tighten the robustness certification of the previous smoothed classifier. Theoretically, under mild assumptions, we prove that DSRS can certify $\Theta(\sqrt d)$ robust radius under $\ell_2$ norm where $d$ is the input dimension, implying that DSRS may be able to break the curse of dimensionality of randomized smoothing. We instantiate DSRS for a generalized family of Gaussian smoothing and propose an efficient and sound computing method based on customized dual optimization considering sampling error. Extensive experiments on MNIST, CIFAR-10, and ImageNet verify our theory and show that DSRS certifies larger robust radii than existing baselines consistently under different settings. Code is available at https://github.com/llylly/DSRS.
    Composing Task Knowledge with Modular Successor Feature Approximators. (arXiv:2301.12305v1 [cs.LG])
    Recently, the Successor Features and Generalized Policy Improvement (SF&GPI) framework has been proposed as a method for learning, composing, and transferring predictive knowledge and behavior. SF&GPI works by having an agent learn predictive representations (SFs) that can be combined for transfer to new tasks with GPI. However, to be effective this approach requires state features that are useful to predict, and these state-features are typically hand-designed. In this work, we present a novel neural network architecture, "Modular Successor Feature Approximators" (MSFA), where modules both discover what is useful to predict, and learn their own predictive representations. We show that MSFA is able to better generalize compared to baseline architectures for learning SFs and modular architectures
    Noisy intermediate-scale quantum algorithm for semidefinite programming. (arXiv:2106.03891v3 [quant-ph] UPDATED)
    Semidefinite programs (SDPs) are convex optimization programs with vast applications in control theory, quantum information, combinatorial optimization and operational research. Noisy intermediate-scale quantum (NISQ) algorithms aim to make an efficient use of the current generation of quantum hardware. However, optimizing variational quantum algorithms is a challenge as it is an NP-hard problem that in general requires an exponential time to solve and can contain many far from optimal local minima. Here, we present a current term NISQ algorithm for solving SDPs. The classical optimization program of our NISQ solver is another SDP over a lower dimensional ansatz space. We harness the SDP based formulation of the Hamiltonian ground state problem to design a NISQ eigensolver. Unlike variational quantum eigensolvers, the classical optimization program of our eigensolver is convex, can be solved in polynomial time with the number of ansatz parameters and every local minimum is a global minimum. We find numeric evidence that NISQ SDP can improve the estimation of ground state energies in a scalable manner. Further, we efficiently solve constrained problems to calculate the excited states of Hamiltonians, find the lowest energy of symmetry constrained Hamiltonians and determine the optimal measurements for quantum state discrimination. We demonstrate the potential of our approach by finding the largest eigenvalue of up to $2^{1000}$ dimensional matrices and solving graph problems related to quantum contextuality. We also discuss NISQ algorithms for rank-constrained SDPs. Our work extends the application of NISQ computers onto one of the most successful algorithmic frameworks of the past few decades.
    Team Resilience under Shock: An Empirical Analysis of GitHub Repositories during Early COVID-19 Pandemic. (arXiv:2301.12326v1 [cs.LG])
    While many organizations have shifted to working remotely during the COVID-19 pandemic, how the remote workforce and the remote teams are influenced by and would respond to this and future shocks remain largely unknown. Software developers have relied on remote collaborations long before the pandemic, working in virtual teams (GitHub repositories). The dynamics of these repositories through the pandemic provide a unique opportunity to understand how remote teams react under shock. This work presents a systematic analysis. We measure the overall effect of the early pandemic on public GitHub repositories by comparing their sizes and productivity with the counterfactual outcomes forecasted as if there were no pandemic. We find that the productivity level and the number of active members of these teams vary significantly during different periods of the pandemic. We then conduct a finer-grained investigation and study the heterogeneous effects of the shock on individual teams. We find that the resilience of a team is highly correlated to certain properties of the team before the pandemic. Through a bootstrapped regression analysis, we reveal which types of teams are robust or fragile to the shock.
    Online Allocation Problem with Two-sided Resource Constraints. (arXiv:2112.13964v3 [cs.LG] UPDATED)
    In this paper, we investigate the online allocation problem of maximizing the overall revenue subject to both lower and upper bound constraints. Compared to the extensively studied online problems with only resource upper bounds, the two-sided constraints affect the prospects of resource consumption more severely. As a result, only limited violations of constraints or pessimistic competitive bounds could be guaranteed. To tackle the challenge, we define a measure of feasibility $\xi^*$ to evaluate the hardness of this problem, and estimate this measurement by an optimization routine with theoretical guarantees. We propose an online algorithm adopting a constructive framework, where we initialize a threshold price vector using the estimation, then dynamically update the price vector and use it for decision-making at each step. It can be shown that the proposed algorithm is $\big(1-O(\frac{\varepsilon}{\xi^*-\varepsilon})\big)$ or $\big(1-O(\frac{\varepsilon}{\xi^*-\sqrt{\varepsilon}})\big)$ competitive with high probability for $\xi^*$ known or unknown respectively. To the best of our knowledge, this is the first result establishing a nearly optimal competitive algorithm for solving two-sided constrained online allocation problems with a high probability of feasibility.
    Concept-based Explanations for Out-Of-Distribution Detectors. (arXiv:2203.02586v2 [cs.LG] UPDATED)
    Out-of-distribution (OOD) detection plays a crucial role in ensuring the safe deployment of deep neural network (DNN) classifiers. While a myriad of methods have focused on improving the performance of OOD detectors, a critical gap remains in interpreting their decisions. We help bridge this gap by providing explanations for OOD detectors based on learned high-level concepts. We first propose two new metrics for assessing the effectiveness of a particular set of concepts for explaining OOD detectors: 1) detection completeness, which quantifies the sufficiency of concepts for explaining an OOD-detector's decisions, and 2) concept separability, which captures the distributional separation between in-distribution and OOD data in the concept space. Based on these metrics, we propose a framework for learning a set of concepts that satisfy the desired properties of detection completeness and concept separability and demonstrate the framework's effectiveness in providing concept-based explanations for diverse OOD techniques. We also show how to identify prominent concepts that contribute to the detection results via a modified Shapley value-based importance score.
    Federated Learning in Satellite Constellations. (arXiv:2206.00307v2 [cs.IT] UPDATED)
    Federated learning (FL) has recently emerged as a distributed machine learning paradigm for systems with limited and intermittent connectivity. This paper presents the new context brought to FL by satellite constellations, where the connectivity patterns are significantly different from the ones observed in conventional terrestrial FL. The focus is on large constellations in low earth orbit (LEO), where each satellites participates in a data-driven FL task using a locally stored dataset. This scenario is motivated by the trend towards mega constellations of interconnected small satellites in LEO and the integration of artificial intelligence in satellites. We propose a classification of satellite FL based on the communication capabilities of the satellites, the constellation design, and the location of the parameter server. A comprehensive overview of the current state-of-the-art in this field is provided and the unique challenges and opportunities of satellite FL are discussed. Finally, we outline several open research directions for FL in satellite constellations and present some future perspectives on this topic.
    Adversarial Learning Networks: Source-free Unsupervised Domain Incremental Learning. (arXiv:2301.12054v1 [cs.LG])
    This work presents an approach for incrementally updating deep neural network (DNN) models in a non-stationary environment. DNN models are sensitive to changes in input data distribution, which limits their application to problem settings with stationary input datasets. In a non-stationary environment, updating a DNN model requires parameter re-training or model fine-tuning. We propose an unsupervised source-free method to update DNN classification models. The contributions of this work are two-fold. First, we use trainable Gaussian prototypes to generate representative samples for future iterations; second, using unsupervised domain adaptation, we incrementally adapt the existing model using unlabelled data. Unlike existing methods, our approach can update a DNN model incrementally for non-stationary source and target tasks without storing past training data. We evaluated our work on incremental sentiment prediction and incremental disease prediction applications and compared our approach to state-of-the-art continual learning, domain adaptation, and ensemble learning methods. Our results show that our approach achieved improved performance compared to existing incremental learning methods. We observe minimal forgetting of past knowledge over many iterations, which can help us develop unsupervised self-learning systems.
    Scalable Set Encoding with Universal Mini-Batch Consistency and Unbiased Full Set Gradient Approximation. (arXiv:2208.12401v3 [cs.LG] UPDATED)
    Recent work on mini-batch consistency (MBC) for set functions has brought attention to the need for sequentially processing and aggregating chunks of a partitioned set while guaranteeing the same output for all partitions. However, existing constraints on MBC architectures lead to models with limited expressive power. Additionally, prior work has not addressed how to deal with large sets during training when the full set gradient is required. To address these issues, we propose a Universally MBC (UMBC) class of set functions which can be used in conjunction with arbitrary non-MBC components while still satisfying MBC, enabling a wider range of function classes to be used in MBC settings. Furthermore, we propose an efficient MBC training algorithm which gives an unbiased approximation of the full set gradient and has a constant memory overhead for any set size for both train- and test-time. We conduct extensive experiments including image completion, text classification, unsupervised clustering, and cancer detection on high-resolution images to verify the efficiency and efficacy of our scalable set encoding framework.
    Continual Learning by Modeling Intra-Class Variation. (arXiv:2210.05398v2 [cs.LG] UPDATED)
    It has been observed that neural networks perform poorly when the data or tasks are presented sequentially. Unlike humans, neural networks suffer greatly from catastrophic forgetting, making it impossible to perform life-long learning. To address this issue, memory-based continual learning has been actively studied and stands out as one of the best-performing methods. We examine memory-based continual learning and identify that large variation in the representation space is crucial for avoiding catastrophic forgetting. Motivated by this, we propose to diversify representations by using two types of perturbations: model-agnostic variation (i.e., the variation is generated without the knowledge of the learned neural network) and model-based variation (i.e., the variation is conditioned on the learned neural network). We demonstrate that enlarging representational variation serves as a general principle to improve continual learning. Finally, we perform empirical studies which demonstrate that our method, as a simple plug-and-play component, can consistently improve a number of memory-based continual learning methods by a large margin.
    Applications of Generative Adversarial Networks in Neuroimaging and Clinical Neuroscience. (arXiv:2206.07081v2 [cs.LG] UPDATED)
    Generative adversarial networks (GANs) are one powerful type of deep learning models that have been successfully utilized in numerous fields. They belong to a broader family called generative methods, which generate new data with a probabilistic model by learning sample distribution from real examples. In the clinical context, GANs have shown enhanced capabilities in capturing spatially complex, nonlinear, and potentially subtle disease effects compared to traditional generative methods. This review appraises the existing literature on the applications of GANs in imaging studies of various neurological conditions, including Alzheimer's disease, brain tumors, brain aging, and multiple sclerosis. We provide an intuitive explanation of various GAN methods for each application and further discuss the main challenges, open questions, and promising future directions of leveraging GANs in neuroimaging. We aim to bridge the gap between advanced deep learning methods and neurology research by highlighting how GANs can be leveraged to support clinical decision making and contribute to a better understanding of the structural and functional patterns of brain diseases.
    Theoretical Perspectives on Deep Learning Methods in Inverse Problems. (arXiv:2206.14373v2 [stat.ML] UPDATED)
    In recent years, there have been significant advances in the use of deep learning methods in inverse problems such as denoising, compressive sensing, inpainting, and super-resolution. While this line of works has predominantly been driven by practical algorithms and experiments, it has also given rise to a variety of intriguing theoretical problems. In this paper, we survey some of the prominent theoretical developments in this line of works, focusing in particular on generative priors, untrained neural network priors, and unfolding algorithms. In addition to summarizing existing results in these topics, we highlight several ongoing challenges and open problems.
    Learning Mixtures of Markov Chains and MDPs. (arXiv:2211.09403v2 [stat.ML] UPDATED)
    We present an algorithm for learning mixtures of Markov chains and Markov decision processes (MDPs) from short unlabeled trajectories. Specifically, our method handles mixtures of Markov chains with optional control input by going through a multi-step process, involving (1) a subspace estimation step, (2) spectral clustering of trajectories using "pairwise distance estimators," along with refinement using the EM algorithm, (3) a model estimation step, and (4) a classification step for predicting labels of new trajectories. We provide end-to-end performance guarantees, where we only explicitly require the length of trajectories to be linear in the number of states and the number of trajectories to be linear in a mixing time parameter. Experimental results support these guarantees, where we attain 96.6% average accuracy on a mixture of two MDPs in gridworld, outperforming the EM algorithm with random initialization (73.2% average accuracy).
    MetaStackVis: Visually-Assisted Performance Evaluation of Metamodels. (arXiv:2212.03539v2 [cs.LG] UPDATED)
    Stacking (or stacked generalization) is an ensemble learning method with one main distinctiveness from the rest: even though several base models are trained on the original data set, their predictions are further used as input data for one or more metamodels arranged in at least one extra layer. Composing a stack of models can produce high-performance outcomes, but it usually involves a trial-and-error process. Therefore, our previously developed visual analytics system, StackGenVis, was mainly designed to assist users in choosing a set of top-performing and diverse models by measuring their predictive performance. However, it only employs a single logistic regression metamodel. In this paper, we investigate the impact of alternative metamodels on the performance of stacking ensembles using a novel visualization tool, called MetaStackVis. Our interactive tool helps users to visually explore different singular and pairs of metamodels according to their predictive probabilities and multiple validation metrics, as well as their ability to predict specific problematic data instances. MetaStackVis was evaluated with a usage scenario based on a medical data set and via expert interviews.
    Progressive Prompts: Continual Learning for Language Models. (arXiv:2301.12314v1 [cs.CL])
    We introduce Progressive Prompts - a simple and efficient approach for continual learning in language models. Our method allows forward transfer and resists catastrophic forgetting, without relying on data replay or a large number of task-specific parameters. Progressive Prompts learns a new soft prompt for each task and sequentially concatenates it with the previously learned prompts, while keeping the base model frozen. Experiments on standard continual learning benchmarks show that our approach outperforms state-of-the-art methods, with an improvement >20% in average test accuracy over the previous best-preforming method on T5 model. We also explore a more challenging continual learning setup with longer sequences of tasks and show that Progressive Prompts significantly outperforms prior methods.
    Let Offline RL Flow: Training Conservative Agents in the Latent Space of Normalizing Flows. (arXiv:2211.11096v2 [cs.LG] UPDATED)
    Offline reinforcement learning aims to train a policy on a pre-recorded and fixed dataset without any additional environment interactions. There are two major challenges in this setting: (1) extrapolation error caused by approximating the value of state-action pairs not well-covered by the training data and (2) distributional shift between behavior and inference policies. One way to tackle these problems is to induce conservatism - i.e., keeping the learned policies closer to the behavioral ones. To achieve this, we build upon recent works on learning policies in latent action spaces and use a special form of Normalizing Flows for constructing a generative model, which we use as a conservative action encoder. This Normalizing Flows action encoder is pre-trained in a supervised manner on the offline dataset, and then an additional policy model - controller in the latent space - is trained via reinforcement learning. This approach avoids querying actions outside of the training dataset and therefore does not require additional regularization for out-of-dataset actions. We evaluate our method on various locomotion and navigation tasks, demonstrating that our approach outperforms recently proposed algorithms with generative action models on a large portion of datasets.
    Laplacian-based Semi-Supervised Learning in Multilayer Hypergraphs by Coordinate Descent. (arXiv:2301.12184v1 [cs.LG])
    Graph Semi-Supervised learning is an important data analysis tool, where given a graph and a set of labeled nodes, the aim is to infer the labels to the remaining unlabeled nodes. In this paper, we start by considering an optimization-based formulation of the problem for an undirected graph, and then we extend this formulation to multilayer hypergraphs. We solve the problem using different coordinate descent approaches and compare the results with the ones obtained by the classic gradient descent method. Experiments on synthetic and real-world datasets show the potential of using coordinate descent methods with suitable selection rules.
    Deep Metric Learning with Chance Constraints. (arXiv:2209.09060v2 [cs.CV] UPDATED)
    Deep metric learning (DML) aims to minimize empirical expected loss of the pairwise intra-/inter- class proximity violations in the embedding image. We relate DML to feasibility problem of finite chance constraints. We show that minimizer of proxy-based DML satisfies certain chance constraints, and that the worst case generalization performance of the proxy-based methods can be characterized by the radius of the smallest ball around a class proxy to cover the entire domain of the corresponding class samples, suggesting multiple proxies per class helps performance. To provide a scalable algorithm as well as exploiting more proxies, we consider the chance constraints implied by the minimizers of proxy-based DML instances and reformulate DML as finding a feasible point in intersection of such constraints, resulting in a problem to be approximately solved by iterative projections. Simply put, we repeatedly train a regularized proxy-based loss and re-initialize the proxies with the embeddings of the deliberately selected new samples. We apply our method with the well-accepted losses and evaluate on four popular benchmark datasets for image retrieval. Outperforming state-of-the-art, our method consistently improves the performance of the applied losses. Code is available at: https://github.com/yetigurbuz/ccp-dml
    TemporAI: Facilitating Machine Learning Innovation in Time Domain Tasks for Medicine. (arXiv:2301.12260v1 [cs.LG])
    TemporAI is an open source Python software library for machine learning (ML) tasks involving data with a time component, focused on medicine and healthcare use cases. It supports data in time series, static, and eventmodalities and provides an interface for prediction, causal inference, and time-to-event analysis, as well as common preprocessing utilities and model interpretability methods. The library aims to facilitate innovation in the medical ML space by offering a standardized temporal setting toolkit for model development, prototyping and benchmarking, bridging the gaps in the ML research, healthcare professional, medical/pharmacological industry, and data science communities. TemporAI is available on GitHub (https://github.com/vanderschaarlab/temporai) and we welcome community engagement through use, feedback, and code contributions.
    Factor-augmented tree ensembles. (arXiv:2111.14000v4 [stat.ML] UPDATED)
    This manuscript proposes to extend the information set of time-series regression trees with latent stationary factors extracted via state-space methods. In doing so, this approach generalises time-series regression trees on two dimensions. First, it allows to handle predictors that exhibit measurement error, non-stationary trends, seasonality and/or irregularities such as missing observations. Second, it gives a transparent way for using domain-specific theory to inform time-series regression trees. As a byproduct, this technique sets the foundations for structuring powerful ensembles. Their real-world applicability is studied under the lenses of empirical macro-finance.
    Understanding Hindsight Goal Relabeling from a Divergence Minimization Perspective. (arXiv:2209.13046v2 [cs.LG] UPDATED)
    Hindsight goal relabeling has become a foundational technique in multi-goal reinforcement learning (RL). The essential idea is that any trajectory can be seen as a sub-optimal demonstration for reaching its final state. Intuitively, learning from those arbitrary demonstrations can be seen as a form of imitation learning (IL). However, the connection between hindsight goal relabeling and imitation learning is not well understood. In this paper, we propose a novel framework to understand hindsight goal relabeling from a divergence minimization perspective. Recasting the goal reaching problem in the IL framework not only allows us to derive several existing methods from first principles, but also provides us with the tools from IL to improve goal reaching algorithms. Experimentally, we find that under hindsight relabeling, Q-learning outperforms behavioral cloning (BC). Yet, a vanilla combination of both hurts performance. Concretely, we see that the BC loss only helps when selectively applied to actions that get the agent closer to the goal according to the Q-function. Our framework also explains the puzzling phenomenon wherein a reward of (-1, 0) results in significantly better performance than a (0, 1) reward for goal reaching.
    Does Federated Learning Really Need Backpropagation?. (arXiv:2301.12195v1 [cs.LG])
    Federated learning (FL) is a general principle for decentralized clients to train a server model collectively without sharing local data. FL is a promising framework with practical applications, but its standard training paradigm requires the clients to backpropagate through the model to compute gradients. Since these clients are typically edge devices and not fully trusted, executing backpropagation on them incurs computational and storage overhead as well as white-box vulnerability. In light of this, we develop backpropagation-free federated learning, dubbed BAFFLE, in which backpropagation is replaced by multiple forward processes to estimate gradients. BAFFLE is 1) memory-efficient and easily fits uploading bandwidth; 2) compatible with inference-only hardware optimization and model quantization or pruning; and 3) well-suited to trusted execution environments, because the clients in BAFFLE only execute forward propagation and return a set of scalars to the server. Empirically we use BAFFLE to train deep models from scratch or to finetune pretrained models, achieving acceptable results. Code is available in https://github.com/FengHZ/BAFFLE.
    Learning Effective SDEs from Brownian Dynamics Simulations of Colloidal Particles. (arXiv:2205.00286v3 [math.DS] UPDATED)
    We construct a reduced, data-driven, parameter dependent effective Stochastic Differential Equation (eSDE) for electric-field mediated colloidal crystallization using data obtained from Brownian Dynamics Simulations. We use Diffusion Maps (a manifold learning algorithm) to identify a set of useful latent observables. In this latent space we identify an eSDE using a deep learning architecture inspired by numerical stochastic integrators and compare it with the traditional Kramers-Moyal expansion estimation. We show that the obtained variables and the learned dynamics accurately encode the physics of the Brownian Dynamic Simulations. We further illustrate that our reduced model captures the dynamics of corresponding experimental data. Our dimension reduction/reduced model identification approach can be easily ported to a broad class of particle systems dynamics experiments/models.
    Hierarchical clustering: visualization, feature importance and model selection. (arXiv:2112.01372v2 [stat.ME] UPDATED)
    We propose methods for the analysis of hierarchical clustering that fully use the multi-resolution structure provided by a dendrogram. Specifically, we propose a loss for choosing between clustering methods, a feature importance score and a graphical tool for visualizing the segmentation of features in a dendrogram. Current approaches to these tasks lead to loss of information since they require the user to generate a single partition of the instances by cutting the dendrogram at a specified level. Our proposed methods, instead, use the full structure of the dendrogram. The key insight behind the proposed methods is to view a dendrogram as a phylogeny. This analogy permits the assignment of a feature value to each internal node of a tree through an evolutionary model. Real and simulated datasets provide evidence that our proposed framework has desirable outcomes and gives more insights than state-of-art approaches. We provide an R package that implements our methods.  ( 2 min )
    ByteTransformer: A High-Performance Transformer Boosted for Variable-Length Inputs. (arXiv:2210.03052v2 [cs.LG] UPDATED)
    Transformer is the cornerstone model of Natural Language Processing (NLP) over the past decade. Despite its great success in Deep Learning (DL) applications, the increasingly growing parameter space required by transformer models boosts the demand on accelerating the performance of transformer models. In addition, NLP problems can commonly be faced with variable-length sequences since their word numbers can vary among sentences. Existing DL frameworks need to pad variable-length sequences to the maximal length, which, however, leads to significant memory and computational overhead. In this paper, we present ByteTransformer, a high-performance transformer boosted for variable-length inputs. We propose a zero padding algorithm that enables the whole transformer to be free from redundant computations on useless padded tokens. Besides the algorithmic level optimization, we provide architectural-aware optimizations for transformer functioning modules, especially the performance-critical algorithm, multi-head attention (MHA). Experimental results on an NVIDIA A100 GPU with variable-length sequence inputs validate that our fused MHA (FMHA) outperforms the standard PyTorch MHA by 6.13X. The end-to-end performance of ByteTransformer for a standard BERT transformer model surpasses the state-of-the-art Transformer frameworks, such as PyTorch JIT, TensorFlow XLA, Tencent TurboTransformer and NVIDIA FasterTransformer, by 87\%, 131\%, 138\% and 46\%, respectively.  ( 2 min )
    Online Self-Concordant and Relatively Smooth Minimization, With Applications to Online Portfolio Selection and Learning Quantum States. (arXiv:2210.00997v2 [stat.ML] UPDATED)
    Consider an online convex optimization problem where the loss functions are self-concordant barriers, smooth relative to a convex function $h$, and possibly non-Lipschitz. We analyze the regret of online mirror descent with $h$. Then, based on the result, we prove the following in a unified manner. Denote by $T$ the time horizon and $d$ the parameter dimension. 1. For online portfolio selection, the regret of $\widetilde{\text{EG}}$, a variant of exponentiated gradient due to Helmbold et al., is $\tilde{O} ( T^{2/3} d^{1/3} )$ when $T > 4 d / \log d$. This improves on the original $\tilde{O} ( T^{3/4} d^{1/2} )$ regret bound for $\widetilde{\text{EG}}$. 2. For online portfolio selection, the regret of online mirror descent with the logarithmic barrier is $\tilde{O}(\sqrt{T d})$. The regret bound is the same as that of Soft-Bayes due to Orseau et al. up to logarithmic terms. 3. For online learning quantum states with the logarithmic loss, the regret of online mirror descent with the log-determinant function is also $\tilde{O} ( \sqrt{T d} )$. Its per-iteration time is shorter than all existing algorithms we know.  ( 2 min )
    Discovering Limitations of Image Quality Assessments with Noised Deep Learning Image Sets. (arXiv:2210.10249v2 [cs.CV] UPDATED)
    Image quality is important, and can affect overall performance in image processing and computer vision as well as for numerous other reasons. Image quality assessment (IQA) is consequently a vital task in different applications from aerial photography interpretation to object detection to medical image analysis. In previous research, the BRISQUE algorithm and the PSNR algorithm were evaluated with high resolution (atleast 512x384 pixels), but relatively small image sets (no more than 4,744 images). However, scientists have not evaluated IQA algorithms on low resolution (no more than 32x32 pixels), multi-perturbation, big image sets (for example, tleast 60,000 different images not counting their perturbations). This study explores these two IQA algorithms through experimental investigation. We first chose two deep learning image sets, CIFAR-10 and MNIST. Then, we added 68 perturbations that add noise to the images in specific sequences and noise intensities. In addition, we tracked the performance outputs of the two IQA algorithms with singly and multiply noised images. After quantitatively analyzing experimental results, we report the limitations of the two IQAs with these noised CIFAR-10 and MNIST image sets. We also explain three potential root causes for performance degradation. These findings point out weaknesses of the two IQA algorithms. The research results provide guidance to scientists and engineers developing accurate, robust IQA algorithms. All source codes, related image sets, and figures are shared on the website (https://github.com/caperock/imagequality) to support future scientific and industrial projects.  ( 2 min )
    Policy-Adaptive Estimator Selection for Off-Policy Evaluation. (arXiv:2211.13904v2 [cs.LG] UPDATED)
    Off-policy evaluation (OPE) aims to accurately evaluate the performance of counterfactual policies using only offline logged data. Although many estimators have been developed, there is no single estimator that dominates the others, because the estimators' accuracy can vary greatly depending on a given OPE task such as the evaluation policy, number of actions, and noise level. Thus, the data-driven estimator selection problem is becoming increasingly important and can have a significant impact on the accuracy of OPE. However, identifying the most accurate estimator using only the logged data is quite challenging because the ground-truth estimation accuracy of estimators is generally unavailable. This paper studies this challenging problem of estimator selection for OPE for the first time. In particular, we enable an estimator selection that is adaptive to a given OPE task, by appropriately subsampling available logged data and constructing pseudo policies useful for the underlying estimator selection task. Comprehensive experiments on both synthetic and real-world company data demonstrate that the proposed procedure substantially improves the estimator selection compared to a non-adaptive heuristic.  ( 2 min )
    Don't Play Favorites: Minority Guidance for Diffusion Models. (arXiv:2301.12334v1 [cs.LG])
    We explore the problem of generating minority samples using diffusion models. The minority samples are instances that lie on low-density regions of a data manifold. Generating sufficient numbers of such minority instances is important, since they often contain some unique attributes of the data. However, the conventional generation process of the diffusion models mostly yields majority samples (that lie on high-density regions of the manifold) due to their high likelihoods, making themselves highly ineffective and time-consuming for the task. In this work, we present a novel framework that can make the generation process of the diffusion models focus on the minority samples. We first provide a new insight on the majority-focused nature of the diffusion models: they denoise in favor of the majority samples. The observation motivates us to introduce a metric that describes the uniqueness of a given sample. To address the inherent preference of the diffusion models w.r.t. the majority samples, we further develop minority guidance, a sampling technique that can guide the generation process toward regions with desired likelihood levels. Experiments on benchmark real datasets demonstrate that our minority guidance can greatly improve the capability of generating the low-likelihood minority samples over existing generative frameworks including the standard diffusion sampler.  ( 2 min )
    Quantum Machine Learning for Decentralized Quantum Protocols with Local Operations and Noisy Classical Communications. (arXiv:2207.11354v2 [quant-ph] UPDATED)
    Distributed quantum information processing protocols such as quantum entanglement distillation and quantum state discrimination rely on local operations and classical communications (LOCC). Existing LOCC-based protocols typically assume the availability of ideal, noiseless, communication channels. In this paper, we study the case in which classical communication takes place over noisy channels, and we propose to address the design of LOCC protocols in this setting via the use of quantum machine learning tools. We specifically focus on the important tasks of quantum entanglement distillation and quantum state discrimination, and implement local processing through parameterized quantum circuits (PQCs) that are optimized to maximize the average fidelity and average success probability in the respective tasks, while accounting for communication errors. The introduced approach, Noise Aware-LOCCNet (NA-LOCCNet), is shown to have significant advantages over existing protocols designed for noiseless communications.  ( 2 min )
    Efficient Enumeration of Markov Equivalent DAGs. (arXiv:2301.12212v1 [cs.AI])
    Enumerating the directed acyclic graphs (DAGs) of a Markov equivalence class (MEC) is an important primitive in causal analysis. The central resource from the perspective of computational complexity is the delay, that is, the time an algorithm that lists all members of the class requires between two consecutive outputs. Commonly used algorithms for this task utilize the rules proposed by Meek (1995) or the transformational characterization by Chickering (1995), both resulting in superlinear delay. In this paper, we present the first linear-time delay algorithm. On the theoretical side, we show that our algorithm can be generalized to enumerate DAGs represented by models that incorporate background knowledge, such as MPDAGs; on the practical side, we provide an efficient implementation and evaluate it in a series of experiments. Complementary to the linear-time delay algorithm, we also provide intriguing insights into Markov equivalence itself: All members of an MEC can be enumerated such that two successive DAGs have structural Hamming distance at most three.  ( 2 min )
    Large Language Models are Zero-Shot Reasoners. (arXiv:2205.11916v4 [cs.CL] UPDATED)
    Pretrained large language models (LLMs) are widely used in many sub-fields of natural language processing (NLP) and generally known as excellent few-shot learners with task-specific exemplars. Notably, chain of thought (CoT) prompting, a recent technique for eliciting complex multi-step reasoning through step-by-step answer examples, achieved the state-of-the-art performances in arithmetics and symbolic reasoning, difficult system-2 tasks that do not follow the standard scaling laws for LLMs. While these successes are often attributed to LLMs' ability for few-shot learning, we show that LLMs are decent zero-shot reasoners by simply adding "Let's think step by step" before each answer. Experimental results demonstrate that our Zero-shot-CoT, using the same single prompt template, significantly outperforms zero-shot LLM performances on diverse benchmark reasoning tasks including arithmetics (MultiArith, GSM8K, AQUA-RAT, SVAMP), symbolic reasoning (Last Letter, Coin Flip), and other logical reasoning tasks (Date Understanding, Tracking Shuffled Objects), without any hand-crafted few-shot examples, e.g. increasing the accuracy on MultiArith from 17.7% to 78.7% and GSM8K from 10.4% to 40.7% with large InstructGPT model (text-davinci-002), as well as similar magnitudes of improvements with another off-the-shelf large model, 540B parameter PaLM. The versatility of this single prompt across very diverse reasoning tasks hints at untapped and understudied fundamental zero-shot capabilities of LLMs, suggesting high-level, multi-task broad cognitive capabilities may be extracted by simple prompting. We hope our work not only serves as the minimal strongest zero-shot baseline for the challenging reasoning benchmarks, but also highlights the importance of carefully exploring and analyzing the enormous zero-shot knowledge hidden inside LLMs before crafting finetuning datasets or few-shot exemplars.  ( 3 min )
    Learning Locality and Isotropy in Dialogue Modeling. (arXiv:2205.14583v2 [cs.CL] UPDATED)
    Existing dialogue modeling methods have achieved promising performance on various dialogue tasks with the aid of Transformer and the large-scale pre-trained language models. However, some recent studies revealed that the context representations produced by these methods suffer the problem of anisotropy. In this paper, we find that the generated representations are also not conversational, losing the conversation structure information during the context modeling stage. To this end, we identify two properties in dialogue modeling, i.e., locality and isotropy, and present a simple method for dialogue representation calibration, namely SimDRC, to build isotropic and conversational feature spaces. Experimental results show that our approach significantly outperforms the current state-of-the-art models on three dialogue tasks across the automatic and human evaluation metrics. More in-depth analyses further confirm the effectiveness of our proposed approach.
    Scalable and Equivariant Spherical CNNs by Discrete-Continuous (DISCO) Convolutions. (arXiv:2209.13603v3 [cs.CV] UPDATED)
    No existing spherical convolutional neural network (CNN) framework is both computationally scalable and rotationally equivariant. Continuous approaches capture rotational equivariance but are often prohibitively computationally demanding. Discrete approaches offer more favorable computational performance but at the cost of equivariance. We develop a hybrid discrete-continuous (DISCO) group convolution that is simultaneously equivariant and computationally scalable to high-resolution. While our framework can be applied to any compact group, we specialize to the sphere. Our DISCO spherical convolutions exhibit $\text{SO}(3)$ rotational equivariance, where $\text{SO}(n)$ is the special orthogonal group representing rotations in $n$-dimensions. When restricting rotations of the convolution to the quotient space $\text{SO}(3)/\text{SO}(2)$ for further computational enhancements, we recover a form of asymptotic $\text{SO}(3)$ rotational equivariance. Through a sparse tensor implementation we achieve linear scaling in number of pixels on the sphere for both computational cost and memory usage. For 4k spherical images we realize a saving of $10^9$ in computational cost and $10^4$ in memory usage when compared to the most efficient alternative equivariant spherical convolution. We apply the DISCO spherical CNN framework to a number of benchmark dense-prediction problems on the sphere, such as semantic segmentation and depth estimation, on all of which we achieve the state-of-the-art performance.  ( 2 min )
    RNNs of RNNs: Recursive Construction of Stable Assemblies of Recurrent Neural Networks. (arXiv:2106.08928v6 [cs.LG] UPDATED)
    Recurrent neural networks (RNNs) are widely used throughout neuroscience as models of local neural activity. Many properties of single RNNs are well characterized theoretically, but experimental neuroscience has moved in the direction of studying multiple interacting areas, and RNN theory needs to be likewise extended. We take a constructive approach towards this problem, leveraging tools from nonlinear control theory and machine learning to characterize when combinations of stable RNNs will themselves be stable. Importantly, we derive conditions which allow for massive feedback connections between interacting RNNs. We parameterize these conditions for easy optimization using gradient-based techniques, and show that stability-constrained "networks of networks" can perform well on challenging sequential-processing benchmark tasks. Altogether, our results provide a principled approach towards understanding distributed, modular function in the brain.  ( 2 min )
    A One-shot Framework for Distributed Clustered Learning in Heterogeneous Environments. (arXiv:2209.10866v3 [cs.LG] UPDATED)
    The paper proposes a family of communication efficient methods for distributed learning in heterogeneous environments in which users obtain data from one of $K$ different data distributions. In the proposed setup, the grouping of users based on the data distributions they sample, as well as the underlying statistical properties of the distributions are apriori unknown. A family of One-shot Distributed Clustered Learning methods (ODCL-$\mathcal{C}$) is proposed, parametrized by the set of admissible clustering algorithms $\mathcal{C}$, with the objective of learning the true model at each user. The admissible clustering methods include $K$-means (KM) and convex clustering (CC), giving rise to various one-shot methods within the proposed family, such as ODCL-KM and ODCL-CC. The proposed one-shot approach, based on local computations at the users and a clustering based aggregation step at the server is shown to provide strong learning guarantees. In particular, for strongly convex problems it is shown that, as long as the number of data points per user is above a threshold, the proposed approach achieves order-optimal mean-squared error (MSE) rates in terms of the sample size. An explicit characterization of the threshold is provided in terms of the problem parameters. Numerical experiments illustrate the findings and corroborate the performance of the proposed methods. We also highlight the trade-offs with respect to selecting various clustering methods (ODCL-CC, ODCL-KM) and demonstrate significant improvements over state-of-the-art.  ( 2 min )
    Likelihood-Free Frequentist Inference: Confidence Sets with Correct Conditional Coverage. (arXiv:2107.03920v5 [stat.ML] UPDATED)
    Many areas of science make extensive use of computer simulators that implicitly encode likelihood functions of complex systems. Classical statistical methods are poorly suited for these so-called likelihood-free inference (LFI) settings, particularly outside asymptotic and low-dimensional regimes. Although new machine learning methods, such as normalizing flows, have revolutionized the sample efficiency and capacity of LFI methods, it remains an open question whether they produce confidence sets with correct conditional coverage for small sample sizes. This paper unifies classical statistics with modern machine learning to present (i) a practical procedure for the Neyman construction of confidence sets with finite-sample guarantees of nominal coverage, and (ii) diagnostics that estimate conditional coverage over the entire parameter space. We refer to our framework as likelihood-free frequentist inference (LF2I). Any method that defines a test statistic, like the likelihood ratio, can leverage the LF2I machinery to create valid confidence sets and diagnostics without costly Monte Carlo samples at fixed parameter settings. We study the power of two test statistics (ACORE and BFF), which, respectively, maximize versus integrate an odds function over the parameter space. Our paper discusses the benefits and challenges of LF2I, with a breakdown of the sources of errors in LF2I confidence sets.
    Inference on the Optimal Assortment in the Multinomial Logit Model. (arXiv:2301.12254v1 [stat.ML])
    Assortment optimization has received active explorations in the past few decades due to its practical importance. Despite the extensive literature dealing with optimization algorithms and latent score estimation, uncertainty quantification for the optimal assortment still needs to be explored and is of great practical significance. Instead of estimating and recovering the complete optimal offer set, decision makers may only be interested in testing whether a given property holds true for the optimal assortment, such as whether they should include several products of interest in the optimal set, or how many categories of products the optimal set should include. This paper proposes a novel inferential framework for testing such properties. We consider the widely adopted multinomial logit (MNL) model, where we assume that each customer will purchase an item within the offered products with a probability proportional to the underlying preference score associated with the product. We reduce inferring a general optimal assortment property to quantifying the uncertainty associated with the sign change point detection of the marginal revenue gaps. We show the asymptotic normality of the marginal revenue gap estimator, and construct a maximum statistic via the gap estimators to detect the sign change point. By approximating the distribution of the maximum statistic with multiplier bootstrap techniques, we propose a valid testing procedure. We also conduct numerical experiments to assess the performance of our method.
    TOAST: Topological Algorithm for Singularity Tracking. (arXiv:2210.00069v2 [cs.LG] UPDATED)
    The manifold hypothesis, which assumes that data lies on or close to an unknown manifold of low intrinsic dimension, is a staple of modern machine learning research. However, recent work has shown that real-world data exhibits distinct non-manifold structures, i.e. singularities, that can lead to erroneous findings. Detecting such singularities is therefore crucial as a precursor to interpolation and inference tasks. We address this issue by developing a topological framework that (i) quantifies the local intrinsic dimension, and (ii) yields a Euclidicity score for assessing the 'manifoldness' of a point along multiple scales. Our approach identifies singularities of complex spaces, while also capturing singular structures and local geometric complexity in image data.  ( 2 min )
    Decentralized Entropic Optimal Transport for Privacy-preserving Distributed Distribution Comparison. (arXiv:2301.12065v1 [cs.LG])
    Privacy-preserving distributed distribution comparison measures the distance between the distributions whose data are scattered across different agents in a distributed system and cannot be shared among the agents. In this study, we propose a novel decentralized entropic optimal transport (EOT) method, which provides a privacy-preserving and communication-efficient solution to this problem with theoretical guarantees. In particular, we design a mini-batch randomized block-coordinate descent (MRBCD) scheme to optimize the decentralized EOT distance in its dual form. The dual variables are scattered across different agents and updated locally and iteratively with limited communications among partial agents. The kernel matrix involved in the gradients of the dual variables is estimated by a distributed kernel approximation method, and each agent only needs to approximate and store a sub-kernel matrix by one-shot communication and without sharing raw data. We analyze our method's communication complexity and provide a theoretical bound for the approximation error caused by the convergence error, the approximated kernel, and the mismatch between the storage and communication protocols. Experiments on synthetic data and real-world distributed domain adaptation tasks demonstrate the effectiveness of our method.
    Data Heterogeneity Differential Privacy: From Theory to Algorithm. (arXiv:2002.08578v2 [cs.LG] UPDATED)
    Traditionally, the random noise is equally injected when training with different data instances in the field of differential privacy (DP). In this paper, we first give sharper excess risk bounds of DP stochastic gradient descent (SGD) method. Considering most of the previous methods are under convex conditions, we use Polyak-{\L}ojasiewicz condition to relax it in this paper. Then, after observing that different training data instances affect the machine learning model to different extent, we consider the heterogeneity of training data and attempt to improve the performance of DP-SGD from a new perspective. Specifically, by introducing the influence function (IF), we quantitatively measure the contributions of various training data on the final machine learning model. If the contribution made by a single data instance is so little that attackers cannot infer anything from the model, we do not add noise when training with it. Based on this observation, we design a `Performance Improving' DP-SGD algorithm: PIDP-SGD. Theoretical and experimental results show that our proposed PIDP-SGD improves the performance significantly.
    How Powerful are Shallow Neural Networks with Bandlimited Random Weights?. (arXiv:2008.08427v2 [cs.LG] UPDATED)
    We investigate the expressive power of depth-2 bandlimited random neural networks. A random net is a neural network where the hidden layer parameters are frozen with random assignment, and only the output layer parameters are trained by loss minimization. Using random weights for a hidden layer is an effective method to avoid non-convex optimization in standard gradient descent learning. It has also been adopted in recent deep learning theories. Despite the well-known fact that a neural network is a universal approximator, in this study, we mathematically show that when hidden parameters are distributed in a bounded domain, the network may not achieve zero approximation error. In particular, we derive a new nontrivial approximation error lower bound. The proof utilizes the technique of ridgelet analysis, a harmonic analysis method designed for neural networks. This method is inspired by fundamental principles in classical signal processing, specifically the idea that signals with limited bandwidth may not always be able to perfectly recreate the original signal. We corroborate our theoretical results with various simulation studies, and generally, two main take-home messages are offered: (i) Not any distribution for selecting random weights is feasible to build a universal approximator; (ii) A suitable assignment of random weights exists but to some degree is associated with the complexity of the target function.
    Machine Learning Methods for Cancer Classification Using Gene Expression Data: A Review. (arXiv:2301.12222v1 [cs.LG])
    Cancer is a term that denotes a group of diseases caused by abnormal growth of cells that can spread in different parts of the body. According to the World Health Organization (WHO), cancer is the second major cause of death after cardiovascular diseases. Gene expression can play a fundamental role in the early detection of cancer, as it is indicative of the biochemical processes in tissue and cells, as well as the genetic characteristics of an organism. Deoxyribonucleic Acid (DNA) microarrays and Ribonucleic Acid (RNA)- sequencing methods for gene expression data allow quantifying the expression levels of genes and produce valuable data for computational analysis. This study reviews recent progress in gene expression analysis for cancer classification using machine learning methods. Both conventional and deep learning-based approaches are reviewed, with an emphasis on the ap-plication of deep learning models due to their comparative advantages for identifying gene patterns that are distinctive for various types of cancers. Relevant works that employ the most commonly used deep neural network architectures are covered, including multi-layer perceptrons, convolutional, recurrent, graph, and transformer networks. This survey also presents an overview of the data collection methods for gene expression analysis and lists important datasets that are commonly used for supervised machine learning for this task. Furthermore, reviewed are pertinent techniques for feature engineering and data preprocessing that are typically used to handle the high dimensionality of gene expression data, caused by a large number of genes present in data samples. The paper concludes with a discussion of future research directions for machine learning-based gene expression analysis for cancer classification.
    Probable Domain Generalization via Quantile Risk Minimization. (arXiv:2207.09944v3 [stat.ML] UPDATED)
    Domain generalization (DG) seeks predictors which perform well on unseen test distributions by leveraging data drawn from multiple related training distributions or domains. To achieve this, DG is commonly formulated as an average- or worst-case problem over the set of possible domains. However, predictors that perform well on average lack robustness while predictors that perform well in the worst case tend to be overly-conservative. To address this, we propose a new probabilistic framework for DG where the goal is to learn predictors that perform well with high probability. Our key idea is that distribution shifts seen during training should inform us of probable shifts at test time, which we realize by explicitly relating training and test domains as draws from the same underlying meta-distribution. To achieve probable DG, we propose a new optimization problem called Quantile Risk Minimization (QRM). By minimizing the $\alpha$-quantile of predictor's risk distribution over domains, QRM seeks predictors that perform well with probability $\alpha$. To solve QRM in practice, we propose the Empirical QRM (EQRM) algorithm and provide: (i) a generalization bound for EQRM; and (ii) the conditions under which EQRM recovers the causal predictor as $\alpha \to 1$. In our experiments, we introduce a more holistic quantile-focused evaluation protocol for DG and demonstrate that EQRM outperforms state-of-the-art baselines on datasets from WILDS and DomainBed.  ( 2 min )
    Practical Differentially Private Hyperparameter Tuning with Subsampling. (arXiv:2301.11989v1 [cs.LG])
    Tuning all the hyperparameters of differentially private (DP) machine learning (ML) algorithms often requires use of sensitive data and this may leak private information via hyperparameter values. Recently, Papernot and Steinke (2022) proposed a certain class of DP hyperparameter tuning algorithms, where the number of random search samples is randomized itself. Commonly, these algorithms still considerably increase the DP privacy parameter $\varepsilon$ over non-tuned DP ML model training and can be computationally heavy as evaluating each hyperparameter candidate requires a new training run. We focus on lowering both the DP bounds and the computational complexity of these methods by using only a random subset of the sensitive data for the hyperparameter tuning and by extrapolating the optimal values from the small dataset to a larger dataset. We provide a R\'enyi differential privacy analysis for the proposed method and experimentally show that it consistently leads to better privacy-utility trade-off than the baseline method by Papernot and Steinke (2022).
    (Private) Kernelized Bandits with Distributed Biased Feedback. (arXiv:2301.12061v1 [cs.LG])
    In this paper, we study kernelized bandits with distributed biased feedback. This problem is motivated by several real-world applications (such as dynamic pricing, cellular network configuration, and policy making), where users from a large population contribute to the reward of the action chosen by a central entity, but it is difficult to collect feedback from all users. Instead, only biased feedback (due to user heterogeneity) from a subset of users may be available. In addition to such partial biased feedback, we are also faced with two practical challenges due to communication cost and computation complexity. To tackle these challenges, we carefully design a new \emph{distributed phase-then-batch-based elimination (\texttt{DPBE})} algorithm, which samples users in phases for collecting feedback to reduce the bias and employs \emph{maximum variance reduction} to select actions in batches within each phase. By properly choosing the phase length, the batch size, and the confidence width used for eliminating suboptimal actions, we show that \texttt{DPBE} achieves a sublinear regret of $\tilde{O}(T^{1-\alpha/2}+\sqrt{\gamma_T T})$, where $\alpha\in (0,1)$ is the user-sampling parameter one can tune. Moreover, \texttt{DPBE} can significantly reduce both communication cost and computation complexity in distributed kernelized bandits, compared to some variants of the state-of-the-art algorithms (originally developed for standard kernelized bandits). Furthermore, by incorporating various \emph{differential privacy} models (including the central, local, and shuffle models), we generalize \texttt{DPBE} to provide privacy guarantees for users participating in the distributed learning process. Finally, we conduct extensive simulations to validate our theoretical results and evaluate the empirical performance.
    SaFormer: A Conditional Sequence Modeling Approach to Offline Safe Reinforcement Learning. (arXiv:2301.12203v1 [cs.LG])
    Offline safe RL is of great practical relevance for deploying agents in real-world applications. However, acquiring constraint-satisfying policies from the fixed dataset is non-trivial for conventional approaches. Even worse, the learned constraints are stationary and may become invalid when the online safety requirement changes. In this paper, we present a novel offline safe RL approach referred to as SaFormer, which tackles the above issues via conditional sequence modeling. In contrast to existing sequence models, we propose cost-related tokens to restrict the action space and a posterior safety verification to enforce the constraint explicitly. Specifically, SaFormer performs a two-stage auto-regression conditioned by the maximum remaining cost to generate feasible candidates. It then filters out unsafe attempts and executes the optimal action with the highest expected return. Extensive experiments demonstrate the efficacy of SaFormer featuring (1) competitive returns with tightened constraint satisfaction; (2) adaptability to the in-range cost values of the offline data without retraining; (3) generalizability for constraints beyond the current dataset.  ( 2 min )
    Efficient Latency-Aware CNN Depth Compression via Two-Stage Dynamic Programming. (arXiv:2301.12187v1 [cs.LG])
    Recent works on neural network pruning advocate that reducing the depth of the network is more effective in reducing run-time memory usage and accelerating inference latency than reducing the width of the network through channel pruning. In this regard, some recent works propose depth compression algorithms that merge convolution layers. However, the existing algorithms have a constricted search space and rely on human-engineered heuristics. In this paper, we propose a novel depth compression algorithm which targets general convolution operations. We propose a subset selection problem that replaces inefficient activation layers with identity functions and optimally merges consecutive convolution operations into shallow equivalent convolution operations for efficient end-to-end inference latency. Since the proposed subset selection problem is NP-hard, we formulate a surrogate optimization problem that can be solved exactly via two-stage dynamic programming within a few seconds. We evaluate our methods and baselines by TensorRT for a fair inference latency comparison. Our method outperforms the baseline method with higher accuracy and faster inference speed in MobileNetV2 on the ImageNet dataset. Specifically, we achieve $1.61\times$speed-up with only $0.62$\%p accuracy drop in MobileNetV2-1.4 on the ImageNet.
    A VAE-Bayesian Deep Learning Scheme for Solar Generation Forecasting based on Dimensionality Reduction. (arXiv:2103.12969v2 [cs.LG] UPDATED)
    The advancement of distributed generation technologies in modern power systems has led to a widespread integration of renewable power generation at customer side. However, the intermittent nature of renewable energy poses new challenges to the network operational planning with underlying uncertainties. This paper proposes a novel Bayesian probabilistic technique for forecasting renewable solar generation by addressing data and model uncertainties by integrating bidirectional long short-term memory (BiLSTM) neural networks while compressing the weight parameters using variational autoencoder (VAE). Existing Bayesian deep learning methods suffer from high computational complexities as they require to draw a large number of samples from weight parameters expressed in the form of probability distributions. The proposed method can deal with uncertainty present in model and data in a more computationally efficient manner by reducing the dimensionality of model parameters. The proposed method is evaluated using quantile loss, reconstruction error, and deterministic forecasting evaluation metrics such as root-mean square error. It is inferred from the numerical results that VAE-Bayesian BiLSTM outperforms other probabilistic and deterministic deep learning methods for solar power forecasting in terms of accuracy and computational efficiency for different sizes of the dataset.
    Leveraging Importance Weights in Subset Selection. (arXiv:2301.12052v1 [cs.LG])
    We present a subset selection algorithm designed to work with arbitrary model families in a practical batch setting. In such a setting, an algorithm can sample examples one at a time but, in order to limit overhead costs, is only able to update its state (i.e. further train model weights) once a large enough batch of examples is selected. Our algorithm, IWeS, selects examples by importance sampling where the sampling probability assigned to each example is based on the entropy of models trained on previously selected batches. IWeS admits significant performance improvement compared to other subset selection algorithms for seven publicly available datasets. Additionally, it is competitive in an active learning setting, where the label information is not available at selection time. We also provide an initial theoretical analysis to support our importance weighting approach, proving generalization and sampling rate bounds.
    Relational Reasoning Networks. (arXiv:2106.00393v3 [cs.AI] UPDATED)
    Neuro-symbolic methods integrate neural architectures, knowledge representation and reasoning. However, they have been struggling at both dealing with the intrinsic uncertainty of the observations and scaling to real-world applications. This paper presents Relational Reasoning Networks (R2N), a novel end-to-end model that performs relational reasoning in the latent space of a deep learner architecture, where the representations of constants, ground atoms and their manipulations are learned in an integrated fashion. Unlike flat architectures like Knowledge Graph Embedders, which can only represent relations between entities, R2Ns define an additional computational structure, accounting for higher-level relations among the ground atoms. The considered relations can be explicitly known, like the ones defined by logic formulas, or defined as unconstrained correlations among groups of ground atoms. R2Ns can be applied to purely symbolic tasks or as a neuro-symbolic platform to integrate learning and reasoning in heterogeneous problems with both symbolic and feature-based represented entities. The proposed model overtakes the limitations of previous neuro-symbolic methods that have been either limited in terms of scalability or expressivity. The proposed methodology is shown to achieve state-of-the-art results in different experimental settings.
    APAC: Authorized Probability-controlled Actor-Critic For Offline Reinforcement Learning. (arXiv:2301.12130v1 [cs.LG])
    Due to the inability to interact with the environment, offline reinforcement learning (RL) methods face the challenge of estimating the Out-of-Distribution (OOD) points. Most existing methods exclude the OOD areas or restrict the value of $Q$ function. However, these methods either are over-conservative or suffer from model uncertainty prediction. In this paper, we propose an authorized probabilistic-control policy learning (APAC) method. The proposed method learns the distribution characteristics of the feasible states/actions by utilizing the flow-GAN model. Specifically, APAC avoids taking action in the low probability density region of behavior policy, while allows exploration in the authorized high probability density region. Theoretical proofs are provided to justify the advantage of APAC. Empirically, APAC outperforms existing alternatives on a variety of simulated tasks, and yields higher expected returns.
    Anticipate, Ensemble and Prune: Improving Convolutional Neural Networks via Aggregated Early Exits. (arXiv:2301.12168v1 [cs.LG])
    Today, artificial neural networks are the state of the art for solving a variety of complex tasks, especially in image classification. Such architectures consist of a sequence of stacked layers with the aim of extracting useful information and having it processed by a classifier to make accurate predictions. However, intermediate information within such models is often left unused. In other cases, such as in edge computing contexts, these architectures are divided into multiple partitions that are made functional by including early exits, i.e. intermediate classifiers, with the goal of reducing the computational and temporal load without extremely compromising the accuracy of the classifications. In this paper, we present Anticipate, Ensemble and Prune (AEP), a new training technique based on weighted ensembles of early exits, which aims at exploiting the information in the structure of networks to maximise their performance. Through a comprehensive set of experiments, we show how the use of this approach can yield average accuracy improvements of up to 15% over traditional training. In its hybrid-weighted configuration, AEP's internal pruning operation also allows reducing the number of parameters by up to 41%, lowering the number of multiplications and additions by 18% and the latency time to make inference by 16%. By using AEP, it is also possible to learn weights that allow early exits to achieve better accuracy values than those obtained from single-output reference models.
    Towards Lossless ANN-SNN Conversion under Ultra-Low Latency with Dual-Phase Optimization. (arXiv:2205.07473v2 [cs.NE] UPDATED)
    Spiking neural network (SNN) operating with asynchronous discrete events shows higher energy efficiency. A popular approach to implementing deep SNNs is ANN-SNN conversion combining both efficient training of ANNs and efficient inference of SNNs. However, due to the intrinsic difference between ANNs and SNNs, the accuracy loss is usually non-negligible, especially under low simulating steps. It restricts the applications of SNN on latency-sensitive edge devices greatly. In this paper, we identify such performance degradation stems from the misrepresentation of the negative or overflow residual membrane potential in SNNs. Inspired by this, we systematically analyze the conversion error between SNNs and ANNs, and then decompose it into three folds: quantization error, clipping error, and residual membrane potential representation error. With such insights, we propose a dual-phase conversion algorithm to minimize those errors separately. Besides, we show each phase achieves significant performance gains in a complementary manner. We evaluate our method on challenging datasets including CIFAR-10, CIFAR-100, and ImageNet datasets. The experimental results show the proposed method achieves the state-of-the-art in terms of both accuracy and latency with promising energy preservation compared to ANNs. For instance, our method achieves an accuracy of 73.20% on CIFAR-100 in only 2 time steps with 15.7$\times$ less energy consumption.
    Unearthing InSights into Mars: unsupervised source separation with limited data. (arXiv:2301.11981v1 [cs.LG])
    Source separation entails the ill-posed problem of retrieving a set of source signals observed through a mixing operator. Solving this problem requires prior knowledge, which is commonly incorporated by imposing regularity conditions on the source signals or implicitly learned in supervised or unsupervised methods from existing data. While data-driven methods have shown great promise in source separation, they are often dependent on large amounts of data, which rarely exists in planetary space missions. Considering this challenge, we propose an unsupervised source separation scheme for domains with limited data access that involves solving an optimization problem in the wavelet scattering representation space$\unicode{x2014}$an interpretable low-dimensional representation of stationary processes. We present a real-data example in which we remove transient thermally induced microtilts, known as glitches, from data recorded by a seismometer during NASA's InSight mission on Mars. Owing to the wavelet scattering covariances' ability to capture non-Gaussian properties of stochastic processes, we are able to separate glitches using only a few glitch-free data snippets.
    Multi-Level Visual Similarity Based Personalized Tourist Attraction Recommendation Using Geo-Tagged Photos. (arXiv:2109.08275v2 [cs.MM] UPDATED)
    Geo-tagged photo based tourist attraction recommendation can discover users' travel preferences from their taken photos, so as to recommend suitable tourist attractions to them. However, existing visual content based methods cannot fully exploit the user and tourist attraction information of photos to extract visual features, and do not differentiate the significances of different photos. In this paper, we propose multi-level visual similarity based personalized tourist attraction recommendation using geo-tagged photos (MEAL). MEAL utilizes the visual contents of photos and interaction behavior data to obtain the final embeddings of users and tourist attractions, which are then used to predict the visit probabilities. Specifically, by crossing the user and tourist attraction information of photos, we define four visual similarity levels and introduce a corresponding quintuplet loss to embed the visual contents of photos. In addition, to capture the significances of different photos, we exploit the self-attention mechanism to obtain the visual representations of users and tourist attractions. We conducted experiments on a dataset crawled from Flickr, and the experimental results proved the advantage of this method.
    Neural Temporal Point Process for Forecasting Higher Order and Directional Interactions. (arXiv:2301.12210v1 [cs.LG])
    Real-world systems are made of interacting entities that evolve with time. Creating models that can forecast interactions by learning the dynamics of entities is an important problem in numerous fields. Earlier works used dynamic graph models to achieve this. However, real-world interactions are more complex than pairwise, as they involve more than two entities, and many of these higher-order interactions have directional components. Examples of these can be seen in communication networks such as email exchanges that involve a sender, and multiple recipients, citation networks, where authors draw upon the work of others, and so on. In this paper, we solve the problem of higher-order directed interaction forecasting by proposing a deep neural network-based model \textit{Directed HyperNode Temporal Point Process} for directed hyperedge event forecasting, as hyperedge provides native framework for modeling relationships among the variable number of nodes. Our proposed technique reduces the search space of possible candidate hyperedges by first forecasting the nodes at which events will be observed, based on which it generates candidate hyperedges. To demonstrate the efficiency of our model, we curated four datasets and conducted an extensive empirical study. We believe that this is the first work that solves the problem of forecasting higher-order directional interactions.
    Prompt-Based Editing for Text Style Transfer. (arXiv:2301.11997v1 [cs.CL])
    Prompting approaches have been recently explored in text style transfer, where a textual prompt is used to query a pretrained language model to generate style-transferred texts word by word in an autoregressive manner. However, such a generation process is less controllable and early prediction errors may affect future word predictions. In this paper, we present a prompt-based editing approach for text style transfer. Specifically, we prompt a pretrained language model for style classification and use the classification probability to compute a style score. Then, we perform discrete search with word-level editing to maximize a comprehensive scoring function for the style-transfer task. In this way, we transform a prompt-based generation problem into a classification one, which is a training-free process and more controllable than the autoregressive generation of sentences. In our experiments, we performed both automatic and human evaluation on three style-transfer benchmark datasets, and show that our approach largely outperforms the state-of-the-art systems that have 20 times more parameters. Additional empirical analyses further demonstrate the effectiveness of our approach.
    AutoPEFT: Automatic Configuration Search for Parameter-Efficient Fine-Tuning. (arXiv:2301.12132v1 [cs.CL])
    Large pretrained language models have been widely used in downstream NLP tasks via task-specific fine-tuning. Recently, an array of Parameter-Efficient Fine-Tuning (PEFT) methods have also achieved strong task performance while updating a much smaller number of parameters compared to full model tuning. However, it is non-trivial to make informed per-task design choices (i.e., to create PEFT configurations) concerning the selection of PEFT architectures and modules, the number of tunable parameters, and even the layers in which the PEFT modules are inserted. Consequently, it is highly likely that the current, manually set PEFT configurations might be suboptimal for many tasks from the perspective of the performance-to-efficiency trade-off. To address the core question of the PEFT configuration selection that aims to control and maximise the balance between performance and parameter efficiency, we first define a rich configuration search space spanning multiple representative PEFT modules along with finer-grained configuration decisions over the modules (e.g., parameter budget, insertion layer). We then propose AutoPEFT, a novel framework to traverse this configuration space: it automatically configures multiple PEFT modules via high-dimensional Bayesian optimisation. We show the resource scalability and task transferability of AutoPEFT-found configurations, outperforming existing PEFT methods on average on the standard GLUE benchmark while conducting the configuration search on a single task. The per-task AutoPEFT-based configuration search even outperforms full-model fine-tuning.
    STEERING: Stein Information Directed Exploration for Model-Based Reinforcement Learning. (arXiv:2301.12038v1 [cs.LG])
    Directed Exploration is a crucial challenge in reinforcement learning (RL), especially when rewards are sparse. Information-directed sampling (IDS), which optimizes the information ratio, seeks to do so by augmenting regret with information gain. However, estimating information gain is computationally intractable or relies on restrictive assumptions which prohibit its use in many practical instances. In this work, we posit an alternative exploration incentive in terms of the integral probability metric (IPM) between a current estimate of the transition model and the unknown optimal, which under suitable conditions, can be computed in closed form with the kernelized Stein discrepancy (KSD). Based on KSD, we develop a novel algorithm STEERING: \textbf{STE}in information dir\textbf{E}cted exploration for model-based \textbf{R}einforcement Learn\textbf{ING}. To enable its derivation, we develop fundamentally new variants of KSD for discrete conditional distributions. We further establish that STEERING archives sublinear Bayesian regret, improving upon prior learning rates of information-augmented MBRL, IDS included. Experimentally, we show that the proposed algorithm is computationally affordable and outperforms several prior approaches.
    Vertex-based reachability analysis for verifying ReLU deep neural networks. (arXiv:2301.12001v1 [cs.LG])
    Neural networks achieved high performance over different tasks, i.e. image identification, voice recognition and other applications. Despite their success, these models are still vulnerable regarding small perturbations, which can be used to craft the so-called adversarial examples. Different approaches have been proposed to circumvent their vulnerability, including formal verification systems, which employ a variety of techniques, including reachability, optimization and search procedures, to verify that the model satisfies some property. In this paper we propose three novel reachability algorithms for verifying deep neural networks with ReLU activations. The first and third algorithms compute an over-approximation for the reachable set, whereas the second one computes the exact reachable set. Differently from previously proposed approaches, our algorithms take as input a V-polytope. Our experiments on the ACAS Xu problem show that the Exact Polytope Network Mapping (EPNM) reachability algorithm proposed in this work surpass the state-of-the-art results from the literature, specially in relation to other reachability methods.
    TIDo: Source-free Task Incremental Learning in Non-stationary Environments. (arXiv:2301.12055v1 [cs.LG])
    This work presents an incremental learning approach for autonomous agents to learn new tasks in a non-stationary environment. Updating a DNN model-based agent to learn new target tasks requires us to store past training data and needs a large labeled target task dataset. Few-shot task incremental learning methods overcome the limitation of labeled target datasets by adapting trained models to learn private target classes using a few labeled representatives and a large unlabeled target dataset. However, the methods assume that the source and target tasks are stationary. We propose a one-shot task incremental learning approach that can adapt to non-stationary source and target tasks. Our approach minimizes adversarial discrepancy between the model's feature space and incoming incremental data to learn an updated hypothesis. We also use distillation loss to reduce catastrophic forgetting of previously learned tasks. Finally, we use Gaussian prototypes to generate exemplar instances eliminating the need to store past training data. Unlike current work in task incremental learning, our model can learn both source and target task updates incrementally. We evaluate our method on various problem settings for incremental object detection and disease prediction model update. We evaluate our approach by measuring the performance of shared class and target private class prediction. Our results show that our approach achieved improved performance compared to existing state-of-the-art task incremental learning methods.
    Analyzing Robustness of the Deep Reinforcement Learning Algorithm in Ramp Metering Applications Considering False Data Injection Attack and Defense. (arXiv:2301.12036v1 [cs.LG])
    Decades of practices of ramp metering, by controlling downstream volume and smoothing the interweaving traffic, have proved that ramp metering can decrease total travel time, mitigate shockwaves, decrease rear-end collisions, reduce pollution, etc. Besides traditional methods like ALIENA algorithms, Deep Reinforcement Learning algorithms have been established recently to build finer control on ramp metering. However, those Deep Learning models may be venerable to adversarial attacks. Thus, it is important to investigate the robustness of those models under False Data Injection adversarial attack. Furthermore, algorithms capable of detecting anomaly data from clean data are the key to safeguard Deep Learning algorithm. In this study, an online algorithm that can distinguish adversarial data from clean data are tested. Results found that in most cases anomaly data can be distinguished from clean data, although their difference is too small to be manually distinguished by humans. In practice, whenever adversarial/hazardous data is detected, the system can fall back to a fixed control program, and experts should investigate the detectors status or security protocols afterwards before real damages happen.
    Autoencoder-Based Unequal Error Protection Codes. (arXiv:2301.12231v1 [cs.IT])
    Most of today's communication systems are designed to target reliable message recovery after receiving the entire encoded message (codeword). However, in many practical scenarios, the transmission process may be interrupted before receiving the complete codeword. This paper proposes a novel rateless autoencoder (AE)-based code design suitable for decoding the transmitted message before the noisy codeword is fully received. Using particular dropout strategies applied during the training process, rateless AE codes allow to trade off between decoding delay and reliability, providing a graceful improvement of the latter with each additionally received codeword symbol. The proposed rateless AEs significantly outperform the conventional AE designs for scenarios where it is desirable to trade off reliability for lower decoding delay.
    SEGA: Instructing Diffusion using Semantic Dimensions. (arXiv:2301.12247v1 [cs.CV])
    Text-to-image diffusion models have recently received a lot of interest for their astonishing ability to produce high-fidelity images from text only. However, achieving one-shot generation that aligns with the user's intent is nearly impossible, yet small changes to the input prompt often result in very different images. This leaves the user with little semantic control. To put the user in control, we show how to interact with the diffusion process to flexibly steer it along semantic directions. This semantic guidance (SEGA) allows for subtle and extensive edits, changes in composition and style, as well as optimizing the overall artistic conception. We demonstrate SEGA's effectiveness on a variety of tasks and provide evidence for its versatility and flexibility.  ( 2 min )
    Violation-Aware Contextual Bayesian Optimization for Controller Performance Optimization with Unmodeled Constraints. (arXiv:2301.12099v1 [cs.LG])
    We study the problem of performance optimization of closed-loop control systems with unmodeled dynamics. Bayesian optimization (BO) has been demonstrated to be effective for improving closed-loop performance by automatically tuning controller gains or reference setpoints in a model-free manner. However, BO methods have rarely been tested on dynamical systems with unmodeled constraints and time-varying ambient conditions. In this paper, we propose a violation-aware contextual BO algorithm (VACBO) that optimizes closed-loop performance while simultaneously learning constraint-feasible solutions under time-varying ambient conditions. Unlike classical constrained BO methods which allow unlimited constraint violations, or 'safe' BO algorithms that are conservative and try to operate with near-zero violations, we allow budgeted constraint violations to improve constraint learning and accelerate optimization. We demonstrate the effectiveness of our proposed VACBO method for energy minimization of industrial vapor compression systems under time-varying ambient temperature and humidity.
    Reachability Analysis of Neural Network Control Systems. (arXiv:2301.12100v1 [cs.LG])
    Neural network controllers (NNCs) have shown great promise in autonomous and cyber-physical systems. Despite the various verification approaches for neural networks, the safety analysis of NNCs remains an open problem. Existing verification approaches for neural network control systems (NNCSs) either can only work on a limited type of activation functions, or result in non-trivial over-approximation errors with time evolving. This paper proposes a verification framework for NNCS based on Lipschitzian optimisation, called DeepNNC. We first prove the Lipschitz continuity of closed-loop NNCSs by unrolling and eliminating the loops. We then reveal the working principles of applying Lipschitzian optimisation on NNCS verification and illustrate it by verifying an adaptive cruise control model. Compared to state-of-the-art verification approaches, DeepNNC shows superior performance in terms of efficiency and accuracy over a wide range of NNCs. We also provide a case study to demonstrate the capability of DeepNNC to handle a real-world, practical, and complex system. Our tool \textbf{DeepNNC} is available at \url{https://github.com/TrustAI/DeepNNC}.
    Chaos as an interpretable benchmark for forecasting and data-driven modelling. (arXiv:2110.05266v2 [cs.LG] UPDATED)
    The striking fractal geometry of strange attractors underscores the generative nature of chaos: like probability distributions, chaotic systems can be repeatedly measured to produce arbitrarily-detailed information about the underlying attractor. Chaotic systems thus pose a unique challenge to modern statistical learning techniques, while retaining quantifiable mathematical properties that make them controllable and interpretable as benchmarks. Here, we present a growing database currently comprising 131 known chaotic dynamical systems spanning fields such as astrophysics, climatology, and biochemistry. Each system is paired with precomputed multivariate and univariate time series. Our dataset has comparable scale to existing static time series databases; however, our systems can be re-integrated to produce additional datasets of arbitrary length and granularity. Our dataset is annotated with known mathematical properties of each system, and we perform feature analysis to broadly categorize the diverse dynamics present across the collection. Chaotic systems inherently challenge forecasting models, and across extensive benchmarks we correlate forecasting performance with the degree of chaos present. We also exploit the unique generative properties of our dataset in several proof-of-concept experiments: surrogate transfer learning to improve time series classification, importance sampling to accelerate model training, and benchmarking symbolic regression algorithms.
    One-Shot Adaptation of GAN in Just One CLIP. (arXiv:2203.09301v4 [cs.CV] UPDATED)
    There are many recent research efforts to fine-tune a pre-trained generator with a few target images to generate images of a novel domain. Unfortunately, these methods often suffer from overfitting or under-fitting when fine-tuned with a single target image. To address this, here we present a novel single-shot GAN adaptation method through unified CLIP space manipulations. Specifically, our model employs a two-step training strategy: reference image search in the source generator using a CLIP-guided latent optimization, followed by generator fine-tuning with a novel loss function that imposes CLIP space consistency between the source and adapted generators. To further improve the adapted model to produce spatially consistent samples with respect to the source generator, we also propose contrastive regularization for patchwise relationships in the CLIP space. Experimental results show that our model generates diverse outputs with the target texture and outperforms the baseline models both qualitatively and quantitatively. Furthermore, we show that our CLIP space manipulation strategy allows more effective attribute editing.  ( 2 min )
    Node Injection for Class-specific Network Poisoning. (arXiv:2301.12277v1 [cs.LG])
    Graph Neural Networks (GNNs) are powerful in learning rich network representations that aid the performance of downstream tasks. However, recent studies showed that GNNs are vulnerable to adversarial attacks involving node injection and network perturbation. Among these, node injection attacks are more practical as they don't require manipulation in the existing network and can be performed more realistically. In this paper, we propose a novel problem statement - a class-specific poison attack on graphs in which the attacker aims to misclassify specific nodes in the target class into a different class using node injection. Additionally, nodes are injected in such a way that they camouflage as benign nodes. We propose NICKI, a novel attacking strategy that utilizes an optimization-based approach to sabotage the performance of GNN-based node classifiers. NICKI works in two phases - it first learns the node representation and then generates the features and edges of the injected nodes. Extensive experiments and ablation studies on four benchmark networks show that NICKI is consistently better than four baseline attacking strategies for misclassifying nodes in the target class. We also show that the injected nodes are properly camouflaged as benign, thus making the poisoned graph indistinguishable from its clean version w.r.t various topological properties.
    Inequality Constrained Stochastic Nonlinear Optimization via Active-Set Sequential Quadratic Programming. (arXiv:2109.11502v3 [math.OC] UPDATED)
    We study nonlinear optimization problems with a stochastic objective and deterministic equality and inequality constraints, which emerge in numerous applications including finance, manufacturing, power systems and, recently, deep neural networks. We propose an active-set stochastic sequential quadratic programming (StoSQP) algorithm that utilizes a differentiable exact augmented Lagrangian as the merit function. The algorithm adaptively selects the penalty parameters of the augmented Lagrangian and performs a stochastic line search to decide the stepsize. The global convergence is established: for any initialization, the KKT residuals converge to zero almost surely. Our algorithm and analysis further develop the prior work of Na et al., (2022). Specifically, we allow nonlinear inequality constraints without requiring the strict complementary condition; refine some of the designs in Na et al., (2022) such as the feasibility error condition and the monotonically increasing sample size; strengthen the global convergence guarantee; and improve the sample complexity on the objective Hessian. We demonstrate the performance of the designed algorithm on a subset of nonlinear problems collected in CUTEst test set and on constrained logistic regression problems.
    Better Uncertainty Calibration via Proper Scores for Classification and Beyond. (arXiv:2203.07835v3 [cs.LG] UPDATED)
    With model trustworthiness being crucial for sensitive real-world applications, practitioners are putting more and more focus on improving the uncertainty calibration of deep neural networks. Calibration errors are designed to quantify the reliability of probabilistic predictions but their estimators are usually biased and inconsistent. In this work, we introduce the framework of proper calibration errors, which relates every calibration error to a proper score and provides a respective upper bound with optimal estimation properties. This relationship can be used to reliably quantify the model calibration improvement. We theoretically and empirically demonstrate the shortcomings of commonly used estimators compared to our approach. Due to the wide applicability of proper scores, this gives a natural extension of recalibration beyond classification.
    CAPITAL: Optimal Subgroup Identification via Constrained Policy Tree Search. (arXiv:2110.05636v3 [stat.ML] UPDATED)
    Personalized medicine, a paradigm of medicine tailored to a patient's characteristics, is an increasingly attractive field in health care. An important goal of personalized medicine is to identify a subgroup of patients, based on baseline covariates, that benefits more from the targeted treatment than other comparative treatments. Most of the current subgroup identification methods only focus on obtaining a subgroup with an enhanced treatment effect without paying attention to subgroup size. Yet, a clinically meaningful subgroup learning approach should identify the maximum number of patients who can benefit from the better treatment. In this paper, we present an optimal subgroup selection rule (SSR) that maximizes the number of selected patients, and in the meantime, achieves the pre-specified clinically meaningful mean outcome, such as the average treatment effect. We derive two equivalent theoretical forms of the optimal SSR based on the contrast function that describes the treatment-covariates interaction in the outcome. We further propose a ConstrAined PolIcy Tree seArch aLgorithm (CAPITAL) to find the optimal SSR within the interpretable decision tree class. The proposed method is flexible to handle multiple constraints that penalize the inclusion of patients with negative treatment effects, and to address time to event data using the restricted mean survival time as the clinically interesting mean outcome. Extensive simulations, comparison studies, and real data applications are conducted to demonstrate the validity and utility of our method.
    A Closer Look at Few-shot Classification Again. (arXiv:2301.12246v1 [cs.LG])
    Few-shot classification consists of a training phase where a model is learned on a relatively large dataset and an adaptation phase where the learned model is adapted to previously-unseen tasks with limited labeled samples. In this paper, we empirically prove that the training algorithm and the adaptation algorithm can be completely disentangled, which allows algorithm analysis and design to be done individually for each phase. Our meta-analysis for each phase reveals several interesting insights that may help better understand key aspects of few-shot classification and connections with other fields such as visual representation learning and transfer learning. We hope the insights and research challenges revealed in this paper can inspire future work in related directions.
    Simulation-Based Inference with Waldo: Confidence Regions by Leveraging Prediction Algorithms or Posterior Estimators for Inverse Problems. (arXiv:2205.15680v3 [stat.ML] UPDATED)
    Predictive algorithms, such as deep neural networks (DNNs), are used in many domain sciences to directly estimate internal parameters of interest in simulator-based models, especially in settings where the observations include images or other complex high-dimensional data. In parallel, modern neural density estimators, such as normalizing flows, are becoming increasingly popular for uncertainty quantification, especially when both parameters and observations are high-dimensional. However, parameter inference is an inverse problem and not a prediction task; thus, an open challenge is to construct conditionally valid and precise confidence regions, with a guaranteed probability of covering the true parameters of the data-generating process, no matter what the (unknown) parameter values are, and without relying on large-sample theory. Many simulator-based inference (SBI) methods are indeed known to produce biased or overly confident parameter regions, yielding misleading uncertainty estimates. This paper presents WALDO, a novel method for constructing confidence regions with finite-sample conditional validity by leveraging prediction algorithms or posterior estimators that are currently widely adopted in SBI. WALDO reframes the well-known Wald test statistic, and uses a computationally efficient regression-based machinery for classical Neyman inversion of hypothesis tests. We apply our method to a recent high-energy physics problem, where prediction with DNNs has previously led to estimates with prediction bias. We also illustrate how our approach can correct overly confident posterior regions computed with normalizing flows.
    Improved knowledge distillation by utilizing backward pass knowledge in neural networks. (arXiv:2301.12006v1 [cs.LG])
    Knowledge distillation (KD) is one of the prominent techniques for model compression. In this method, the knowledge of a large network (teacher) is distilled into a model (student) with usually significantly fewer parameters. KD tries to better-match the output of the student model to that of the teacher model based on the knowledge extracts from the forward pass of the teacher network. Although conventional KD is effective for matching the two networks over the given data points, there is no guarantee that these models would match in other areas for which we do not have enough training samples. In this work, we address that problem by generating new auxiliary training samples based on extracting knowledge from the backward pass of the teacher in the areas where the student diverges greatly from the teacher. We compute the difference between the teacher and the student and generate new data samples that maximize the divergence. This is done by perturbing data samples in the direction of the gradient of the difference between the student and the teacher. Augmenting the training set by adding this auxiliary improves the performance of KD significantly and leads to a closer match between the student and the teacher. Using this approach, when data samples come from a discrete domain, such as applications of natural language processing (NLP) and language understanding, is not trivial. However, we show how this technique can be used successfully in such applications. We evaluated the performance of our method on various tasks in computer vision and NLP domains and got promising results.
    Controlling Steering with Energy-Based Models. (arXiv:2301.12264v1 [cs.RO])
    So-called implicit behavioral cloning with energy-based models has shown promising results in robotic manipulation tasks. We tested if the method's advantages carry on to controlling the steering of a real self-driving car with an end-to-end driving model. We performed an extensive comparison of the implicit behavioral cloning approach with explicit baseline approaches, all sharing the same neural network backbone architecture. Baseline explicit models were trained with regression (MAE) loss, classification loss (softmax and cross-entropy on a discretization), or as mixture density networks (MDN). While models using the energy-based formulation performed comparably to baseline approaches in terms of safety driver interventions, they had a higher whiteness measure, indicating higher jerk. To alleviate this, we show two methods that can be used to improve the smoothness of steering. We confirmed that energy-based models handle multimodalities slightly better than simple regression, but this did not translate to significantly better driving ability. We argue that the steering-only road-following task has too few multimodalities to benefit from energy-based models. This shows that applying implicit behavioral cloning to real-world tasks can be challenging, and further investigation is needed to bring out the theoretical advantages of energy-based models.
    Optimization for Amortized Inverse Problems. (arXiv:2210.13983v3 [cs.LG] UPDATED)
    Incorporating a deep generative model as the prior distribution in inverse problems has established substantial success in reconstructing images from corrupted observations. Notwithstanding, the existing optimization approaches use gradient descent largely without adapting to the non-convex nature of the problem and can be sensitive to initial values, impeding further performance improvement. In this paper, we propose an efficient amortized optimization scheme for inverse problems with a deep generative prior. Specifically, the optimization task with high degrees of difficulty is decomposed into optimizing a sequence of much easier ones. We provide a theoretical guarantee of the proposed algorithm and empirically validate it on different inverse problems. As a result, our approach outperforms baseline methods qualitatively and quantitatively by a large margin.
    Policy Gradient Methods for Distortion Risk Measures. (arXiv:2107.04422v6 [cs.LG] UPDATED)
    We propose policy gradient algorithms which learn risk-sensitive policies in a reinforcement learning (RL) framework. Our proposed algorithms maximize the distortion risk measure (DRM) of the cumulative reward in an episodic Markov decision process in on-policy as well as off-policy RL settings. We derive a variant of the policy gradient theorem that caters to the DRM objective, and use this theorem in conjunction with a likelihood ratio-based gradient estimation scheme. We derive non-asymptotic bounds that establish the convergence of our proposed algorithms to an approximate stationary point of the DRM objective.  ( 2 min )
    Lossy Image Compression with Conditional Diffusion Models. (arXiv:2209.06950v4 [eess.IV] UPDATED)
    This paper outlines an end-to-end optimized lossy image compression framework using diffusion generative models. The approach relies on the transform coding paradigm, where an image is mapped into a latent space for entropy coding and, from there, mapped back to the data space for reconstruction. In contrast to VAE-based neural compression, where the (mean) decoder is a deterministic neural network, our decoder is a conditional diffusion model. Our approach thus introduces an additional ``content'' latent variable on which the reverse diffusion process is conditioned and uses this variable to store information about the image. The remaining ``texture'' latent variables characterizing the diffusion process are synthesized (stochastically or deterministically) at decoding time. We show that the model's performance can be tuned toward perceptual metrics of interest. Our extensive experiments involving five datasets and sixteen image quality assessment metrics show that our approach yields the strongest reported FID scores while also yielding competitive performance with state-of-the-art models in several SIM-based reference metrics.  ( 2 min )
    Refining Generative Process with Discriminator Guidance in Score-based Diffusion Models. (arXiv:2211.17091v2 [cs.CV] UPDATED)
    While there are many score-based models with various diffusing strategies as well as many numerical schemes of the denoising process, only a few works have explored the score part of the generative SDE. This paper introduces a new generative SDE with score adjustment using an auxiliary discriminator. The goal is to improve the original generative process of a pre-trained diffusion model by estimating the gap between the pre-trained score estimation and the true data score. This is done by training a discriminator that classifies diffused real data and diffused sample data. The gap estimation is then used to adjust the pre-trained score network. In experiments, the method enables new SOTA FIDs of 1.77/1.64 on unconditional/conditional CIFAR-10, and new SOTA FID/sFID of 3.18/4.53 on ImageNet 256x256.  ( 2 min )
    Online Markov Decision Processes with Non-oblivious Strategic Adversary. (arXiv:2110.03604v3 [cs.LG] UPDATED)
    We study a novel setting in Online Markov Decision Processes (OMDPs) where the loss function is chosen by a non-oblivious strategic adversary who follows a no-external regret algorithm. In this setting, we first demonstrate that MDP-Expert, an existing algorithm that works well with oblivious adversaries can still apply and achieve a policy regret bound of $\mathcal{O}(\sqrt{T \log(L)}+\tau^2\sqrt{ T \log(|A|)})$ where $L$ is the size of adversary's pure strategy set and $|A|$ denotes the size of agent's action space. Considering real-world games where the support size of a NE is small, we further propose a new algorithm: MDP-Online Oracle Expert (MDP-OOE), that achieves a policy regret bound of $\mathcal{O}(\sqrt{T\log(L)}+\tau^2\sqrt{ T k \log(k)})$ where $k$ depends only on the support size of the NE. MDP-OOE leverages the key benefit of Double Oracle in game theory and thus can solve games with prohibitively large action space. Finally, to better understand the learning dynamics of no-regret methods, under the same setting of no-external regret adversary in OMDPs, we introduce an algorithm that achieves last-round convergence result to a NE. To our best knowledge, this is first work leading to the last iteration result in OMDPs.  ( 2 min )
    A semi-agnostic ansatz with variable structure for quantum machine learning. (arXiv:2103.06712v3 [quant-ph] UPDATED)
    Quantum machine learning (QML) offers a powerful, flexible paradigm for programming near-term quantum computers, with applications in chemistry, metrology, materials science, data science, and mathematics. Here, one trains an ansatz, in the form of a parameterized quantum circuit, to accomplish a task of interest. However, challenges have recently emerged suggesting that deep ansatzes are difficult to train, due to flat training landscapes caused by randomness or by hardware noise. This motivates our work, where we present a variable structure approach to build ansatzes for QML. Our approach, called VAns (Variable Ansatz), applies a set of rules to both grow and (crucially) remove quantum gates in an informed manner during the optimization. Consequently, VAns is ideally suited to mitigate trainability and noise-related issues by keeping the ansatz shallow. We employ VAns in the variational quantum eigensolver for condensed matter and quantum chemistry applications, in the quantum autoencoder for data compression and in unitary compilation problems showing successful results in all cases.  ( 2 min )
    Perona: Robust Infrastructure Fingerprinting for Resource-Efficient Big Data Analytics. (arXiv:2211.08227v2 [cs.DC] UPDATED)
    Choosing a good resource configuration for big data analytics applications can be challenging, especially in cloud environments. Automated approaches are desirable as poor decisions can reduce performance and raise costs. The majority of existing automated approaches either build performance models from previous workload executions or conduct iterative resource configuration profiling until a near-optimal solution has been found. In doing so, they only obtain an implicit understanding of the underlying infrastructure, which is difficult to transfer to alternative infrastructures and, thus, profiling and modeling insights are not sustained beyond very specific situations. We present Perona, a novel approach to robust infrastructure fingerprinting for usage in the context of big data analytics. Perona employs common sets and configurations of benchmarking tools for target resources, so that resulting benchmark metrics are directly comparable and ranking is enabled. Insignificant benchmark metrics are discarded by learning a low-dimensional representation of the input metric vector, and previous benchmark executions are taken into consideration for context-awareness as well, allowing to detect resource degradation. We evaluate our approach both on data gathered from our own experiments as well as within related works for resource configuration optimization, demonstrating that Perona captures the characteristics from benchmark runs in a compact manner and produces representations that can be used directly.  ( 2 min )
    Multi-Center Federated Learning: Clients Clustering for Better Personalization. (arXiv:2108.08647v3 [cs.LG] UPDATED)
    Personalized decision-making can be implemented in a Federated learning (FL) framework that can collaboratively train a decision model by extracting knowledge across intelligent clients, e.g. smartphones or enterprises. FL can mitigate the data privacy risk of collaborative training since it merely collects local gradients from users without access to their data. However, FL is fragile in the presence of statistical heterogeneity that is commonly encountered in personalized decision-making, e.g., non-IID data over different clients. Existing FL approaches usually update a single global model to capture the shared knowledge of all users by aggregating their gradients, regardless of the discrepancy between their data distributions. By comparison, a mixture of multiple global models could capture the heterogeneity across various clients if assigning the client to different global models (i.e., centers) in FL. To this end, we propose a novel multi-center aggregation mechanism to cluster clients using their models' parameters. It learns multiple global models from data as the cluster centers, and simultaneously derives the optimal matching between users and centers. We then formulate it as an optimization problem that can be efficiently solved by a stochastic expectation maximization (EM) algorithm. Experiments on multiple benchmark datasets of FL show that our method outperforms several popular baseline methods. The experimental source codes are publicly available on the Github repository https://github.com/mingxuts/multi-center-fed-learning .  ( 2 min )
    Neural Integral Equations. (arXiv:2209.15190v3 [cs.LG] UPDATED)
    Integral equations (IEs) are equations that model spatiotemporal systems with non-local interactions. They have found important applications throughout theoretical and applied sciences, including in physics, chemistry, biology, and engineering. While efficient algorithms exist for solving given IEs, no method exists that can learn an IE and its associated dynamics from data alone. In this paper, we introduce Neural Integral Equations (NIE), a method that learns an unknown integral operator from data through an IE solver. We also introduce Attentional Neural Integral Equations (ANIE), where the integral is replaced by self-attention, which improves scalability and model capacity. We demonstrate that (A)NIE outperforms other methods in both speed and accuracy on several benchmark tasks in ODE, PDE, and IE systems of synthetic and real-world data.  ( 2 min )
    Async-HFL: Efficient and Robust Asynchronous Federated Learning in Hierarchical IoT Networks. (arXiv:2301.06646v2 [cs.LG] UPDATED)
    Federated Learning (FL) has gained increasing interest in recent years as a distributed on-device learning paradigm. However, multiple challenges remain to be addressed for deploying FL in real-world Internet-of-Things (IoT) networks with hierarchies. Although existing works have proposed various approaches to account data heterogeneity, system heterogeneity, unexpected stragglers and scalibility, none of them provides a systematic solution to address all of the challenges in a hierarchical and unreliable IoT network. In this paper, we propose an asynchronous and hierarchical framework (Async-HFL) for performing FL in a common three-tier IoT network architecture. In response to the largely varied delays, Async-HFL employs asynchronous aggregations at both the gateway and the cloud levels thus avoids long waiting time. To fully unleash the potential of Async-HFL in converging speed under system heterogeneities and stragglers, we design device selection at the gateway level and device-gateway association at the cloud level. Device selection chooses edge devices to trigger local training in real-time while device-gateway association determines the network topology periodically after several cloud epochs, both satisfying bandwidth limitation. We evaluate Async-HFL's convergence speedup using large-scale simulations based on ns-3 and a network topology from NYCMesh. Our results show that Async-HFL converges 1.08-1.31x faster in wall-clock time and saves up to 21.6% total communication cost compared to state-of-the-art asynchronous FL algorithms (with client selection). We further validate Async-HFL on a physical deployment and observe robust convergence under unexpected stragglers.  ( 2 min )
    Machine Learning Accelerators in 2.5D Chiplet Platforms with Silicon Photonics. (arXiv:2301.12252v1 [cs.AR])
    Domain-specific machine learning (ML) accelerators such as Google's TPU and Apple's Neural Engine now dominate CPUs and GPUs for energy-efficient ML processing. However, the evolution of electronic accelerators is facing fundamental limits due to the limited computation density of monolithic processing chips and the reliance on slow metallic interconnects. In this paper, we present a vision of how optical computation and communication can be integrated into 2.5D chiplet platforms to drive an entirely new class of sustainable and scalable ML hardware accelerators. We describe how cross-layer design and fabrication of optical devices, circuits, and architectures, and hardware/software codesign can help design efficient photonics-based 2.5D chiplet platforms to accelerate emerging ML workloads.  ( 2 min )
    Mutual Wasserstein Discrepancy Minimization for Sequential Recommendation. (arXiv:2301.12197v1 [cs.LG])
    Self-supervised sequential recommendation significantly improves recommendation performance by maximizing mutual information with well-designed data augmentations. However, the mutual information estimation is based on the calculation of Kullback Leibler divergence with several limitations, including asymmetrical estimation, the exponential need of the sample size, and training instability. Also, existing data augmentations are mostly stochastic and can potentially break sequential correlations with random modifications. These two issues motivate us to investigate an alternative robust mutual information measurement capable of modeling uncertainty and alleviating KL divergence limitations. To this end, we propose a novel self-supervised learning framework based on Mutual WasserStein discrepancy minimization MStein for the sequential recommendation. We propose the Wasserstein Discrepancy Measurement to measure the mutual information between augmented sequences. Wasserstein Discrepancy Measurement builds upon the 2-Wasserstein distance, which is more robust, more efficient in small batch sizes, and able to model the uncertainty of stochastic augmentation processes. We also propose a novel contrastive learning loss based on Wasserstein Discrepancy Measurement. Extensive experiments on four benchmark datasets demonstrate the effectiveness of MStein over baselines. More quantitative analyses show the robustness against perturbations and training efficiency in batch size. Finally, improvements analysis indicates better representations of popular users or items with significant uncertainty. The source code is at https://github.com/zfan20/MStein.  ( 2 min )
    Stochastic Dimension-reduced Second-order Methods for Policy Optimization. (arXiv:2301.12174v1 [math.OC])
    In this paper, we propose several new stochastic second-order algorithms for policy optimization that only require gradient and Hessian-vector product in each iteration, making them computationally efficient and comparable to policy gradient methods. Specifically, we propose a dimension-reduced second-order method (DR-SOPO) which repeatedly solves a projected two-dimensional trust region subproblem. We show that DR-SOPO obtains an $\mathcal{O}(\epsilon^{-3.5})$ complexity for reaching approximate first-order stationary condition and certain subspace second-order stationary condition. In addition, we present an enhanced algorithm (DVR-SOPO) which further improves the complexity to $\mathcal{O}(\epsilon^{-3})$ based on the variance reduction technique. Preliminary experiments show that our proposed algorithms perform favorably compared with stochastic and variance-reduced policy gradient methods.  ( 2 min )
    Adapting Neural Link Predictors for Complex Query Answering. (arXiv:2301.12313v1 [cs.LG])
    Answering complex queries on incomplete knowledge graphs is a challenging task where a model needs to answer complex logical queries in the presence of missing knowledge. Recently, Arakelyan et al. (2021); Minervini et al. (2022) showed that neural link predictors could also be used for answering complex queries: their Continuous Query Decomposition (CQD) method works by decomposing complex queries into atomic sub-queries, answers them using neural link predictors and aggregates their scores via t-norms for ranking the answers to each complex query. However, CQD does not handle negations and only uses the training signal from atomic training queries: neural link prediction scores are not calibrated to interact together via fuzzy logic t-norms during complex query answering. In this work, we propose to address this problem by training a parameter-efficient score adaptation model to re-calibrate neural link prediction scores: this new component is trained on complex queries by back-propagating through the complex query-answering process. Our method, CQD$^{A}$, produces significantly more accurate results than current state-of-the-art methods, improving from $34.4$ to $35.1$ Mean Reciprocal Rank values averaged across all datasets and query types while using $\leq 35\%$ of the available training query types. We further show that CQD$^{A}$ is data-efficient, achieving competitive results with only $1\%$ of the training data, and robust in out-of-domain evaluations.
    Robust Stochastic Linear Contextual Bandits Under Adversarial Attacks. (arXiv:2106.02978v3 [stat.ML] UPDATED)
    Stochastic linear contextual bandit algorithms have substantial applications in practice, such as recommender systems, online advertising, clinical trials, etc. Recent works show that optimal bandit algorithms are vulnerable to adversarial attacks and can fail completely in the presence of attacks. Existing robust bandit algorithms only work for the non-contextual setting under the attack of rewards and cannot improve the robustness in the general and popular contextual bandit environment. In addition, none of the existing methods can defend against attacked context. In this work, we provide the first robust bandit algorithm for stochastic linear contextual bandit setting under a fully adaptive and omniscient attack with sub-linear regret. Our algorithm not only works under the attack of rewards, but also under attacked context. Moreover, it does not need any information about the attack budget or the particular form of the attack. We provide theoretical guarantees for our proposed algorithm and show by experiments that our proposed algorithm improves the robustness against various kinds of popular attacks.
    Unbiased and Efficient Self-Supervised Incremental Contrastive Learning. (arXiv:2301.12104v1 [cs.LG])
    Contrastive Learning (CL) has been proved to be a powerful self-supervised approach for a wide range of domains, including computer vision and graph representation learning. However, the incremental learning issue of CL has rarely been studied, which brings the limitation in applying it to real-world applications. Contrastive learning identifies the samples with the negative ones from the noise distribution that changes in the incremental scenarios. Therefore, only fitting the change of data without noise distribution causes bias, and directly retraining results in low efficiency. To bridge this research gap, we propose a self-supervised Incremental Contrastive Learning (ICL) framework consisting of (i) a novel Incremental InfoNCE (NCE-II) loss function by estimating the change of noise distribution for old data to guarantee no bias with respect to the retraining, (ii) a meta-optimization with deep reinforced Learning Rate Learning (LRL) mechanism which can adaptively learn the learning rate according to the status of the training processes and achieve fast convergence which is critical for incremental learning. Theoretically, the proposed ICL is equivalent to retraining, which is based on solid mathematical derivation. In practice, extensive experiments in different domains demonstrate that, without retraining a new model, ICL achieves up to 16.7x training speedup and 16.8x faster convergence with competitive results.  ( 2 min )
    EmbedDistill: A Geometric Knowledge Distillation for Information Retrieval. (arXiv:2301.12005v1 [cs.LG])
    Large neural models (such as Transformers) achieve state-of-the-art performance for information retrieval (IR). In this paper, we aim to improve distillation methods that pave the way for the deployment of such models in practice. The proposed distillation approach supports both retrieval and re-ranking stages and crucially leverages the relative geometry among queries and documents learned by the large teacher model. It goes beyond existing distillation methods in the IR literature, which simply rely on the teacher's scalar scores over the training data, on two fronts: providing stronger signals about local geometry via embedding matching and attaining better coverage of data manifold globally via query generation. Embedding matching provides a stronger signal to align the representations of the teacher and student models. At the same time, query generation explores the data manifold to reduce the discrepancies between the student and teacher where training data is sparse. Our distillation approach is theoretically justified and applies to both dual encoder (DE) and cross-encoder (CE) models. Furthermore, for distilling a CE model to a DE model via embedding matching, we propose a novel dual pooling-based scorer for the CE model that facilitates a distillation-friendly embedding geometry, especially for DE student models.  ( 2 min )
    Deep Operator Learning Lessens the Curse of Dimensionality for PDEs. (arXiv:2301.12227v1 [cs.LG])
    Deep neural networks (DNNs) have seen tremendous success in many fields and their developments in PDE-related problems are rapidly growing. This paper provides an estimate for the generalization error of learning Lipschitz operators over Banach spaces using DNNs with applications to various PDE solution operators. The goal is to specify DNN width, depth, and the number of training samples needed to guarantee a certain testing error. Under mild assumptions on data distributions or operator structures, our analysis shows that deep operator learning can have a relaxed dependence on the discretization resolution of PDEs and, hence, lessen the curse of dimensionality in many PDE-related problems. We apply our results to various PDEs, including elliptic equations, parabolic equations, and Burgers equations.
    Meta-Learning Parameterized Skills. (arXiv:2206.03597v2 [cs.LG] UPDATED)
    We propose a novel parameterized skill-learning algorithm that aims to learn transferable parameterized skills and synthesize them into a new action space that supports efficient learning in long-horizon tasks. We propose to leverage off-policy Meta-RL combined with a trajectory-centric smoothness term to learn a set of parameterized skills. Our agent can use these learned skills to construct a three-level hierarchical framework that models a Temporally-extended Parameterized Action Markov Decision Process. We empirically demonstrate that the proposed algorithms enable an agent to solve a set of difficult long-horizon (obstacle-course and robot manipulation) tasks.  ( 2 min )
    MetaNO: How to Transfer Your Knowledge on Learning Hidden Physics. (arXiv:2301.12095v1 [cs.LG])
    Gradient-based meta-learning methods have primarily been applied to classical machine learning tasks such as image classification. Recently, PDE-solving deep learning methods, such as neural operators, are starting to make an important impact on learning and predicting the response of a complex physical system directly from observational data. Since the data acquisition in this context is commonly challenging and costly, the call of utilization and transfer of existing knowledge to new and unseen physical systems is even more acute. Herein, we propose a novel meta-learning approach for neural operators, which can be seen as transferring the knowledge of solution operators between governing (unknown) PDEs with varying parameter fields. Our approach is a provably universal solution operator for multiple PDE solving tasks, with a key theoretical observation that underlying parameter fields can be captured in the first layer of neural operator models, in contrast to typical final-layer transfer in existing meta-learning methods. As applications, we demonstrate the efficacy of our proposed approach on PDE-based datasets and a real-world material modeling problem, illustrating that our method can handle complex and nonlinear physical response learning tasks while greatly improving the sampling efficiency in unseen tasks.  ( 2 min )
    Variational Neural Networks. (arXiv:2207.01524v3 [cs.LG] UPDATED)
    Bayesian Neural Networks (BNNs) provide a tool to estimate the uncertainty of a neural network by considering a distribution over weights and sampling different models for each input. In this paper, we propose a method for uncertainty estimation in neural networks which, instead of considering a distribution over weights, samples outputs of each layer from a corresponding Gaussian distribution, parametrized by the predictions of mean and variance sub-layers. In uncertainty quality estimation experiments, we show that the proposed method achieves better uncertainty quality than other single-bin Bayesian Model Averaging methods, such as Monte Carlo Dropout or Bayes By Backpropagation methods.  ( 2 min )
    Selecting Models based on the Risk of Damage Caused by Adversarial Attacks. (arXiv:2301.12151v1 [cs.LG])
    Regulation, legal liabilities, and societal concerns challenge the adoption of AI in safety and security-critical applications. One of the key concerns is that adversaries can cause harm by manipulating model predictions without being detected. Regulation hence demands an assessment of the risk of damage caused by adversaries. Yet, there is no method to translate this high-level demand into actionable metrics that quantify the risk of damage. In this article, we propose a method to model and statistically estimate the probability of damage arising from adversarial attacks. We show that our proposed estimator is statistically consistent and unbiased. In experiments, we demonstrate that the estimation results of our method have a clear and actionable interpretation and outperform conventional metrics. We then show how operators can use the estimation results to reliably select the model with the lowest risk.  ( 2 min )
    Temporal Context Mining for Learned Video Compression. (arXiv:2111.13850v2 [cs.CV] UPDATED)
    We address end-to-end learned video compression with a special focus on better learning and utilizing temporal contexts. For temporal context mining, we propose to store not only the previously reconstructed frames, but also the propagated features into the generalized decoded picture buffer. From the stored propagated features, we propose to learn multi-scale temporal contexts, and re-fill the learned temporal contexts into the modules of our compression scheme, including the contextual encoder-decoder, the frame generator, and the temporal context encoder. Our scheme discards the parallelization-unfriendly auto-regressive entropy model to pursue a more practical decoding time. We compare our scheme with x264 and x265 (representing industrial software for H.264 and H.265, respectively) as well as the official reference software for H.264, H.265, and H.266 (JM, HM, and VTM, respectively). When intra period is 32 and oriented to PSNR, our scheme outperforms H.265--HM by 14.4% bit rate saving; when oriented to MS-SSIM, our scheme outperforms H.266--VTM by 21.1% bit rate saving.  ( 2 min )
    Complexity-Based Prompting for Multi-Step Reasoning. (arXiv:2210.00720v2 [cs.CL] UPDATED)
    We study the task of prompting large-scale language models to perform multi-step reasoning. Existing work shows that when prompted with a chain of thoughts (CoT), sequences of short sentences describing intermediate reasoning steps towards a final answer, large language models can generate new reasoning chains and predict answers for new inputs. A central question is which reasoning examples make the most effective prompts. In this work, we propose complexity-based prompting, a simple and effective example selection scheme for multi-step reasoning. We show that prompts with higher reasoning complexity, i.e., chains with more reasoning steps, achieve substantially better performance on multi-step reasoning tasks over strong baselines. We further extend our complexity-based criteria from prompting (selecting inputs) to decoding (selecting outputs), where we sample multiple reasoning chains from the model, then choose the majority of generated answers from complex reasoning chains (over simple chains). When used to prompt GPT-3 and Codex, our approach substantially improves multi-step reasoning accuracy and achieves new state-of-the-art (SOTA) performance on three math benchmarks (GSM8K, MultiArith, and MathQA) and two BigBenchHard tasks (Date Understanding and Penguins), with an average +5.3 and up to +18 accuracy improvements. Compared with existing example selection schemes like manual tuning or retrieval-based selection, selection based on reasoning complexity is intuitive, easy to implement, and annotation-efficient. Further results demonstrate the robustness of performance gains from complex prompts under format perturbation and distribution shift.  ( 2 min )
    Graph Neural Networks Intersect Probabilistic Graphical Models: A Survey. (arXiv:2206.06089v3 [cs.AI] UPDATED)
    Graphs are a powerful data structure to represent relational data and are widely used to describe complex real-world data structures. Probabilistic Graphical Models (PGMs) have been well-developed in the past years to mathematically model real-world scenarios in compact graphical representations of distributions of variables. Graph Neural Networks (GNNs) are new inference methods developed in recent years and are attracting growing attention due to their effectiveness and flexibility in solving inference and learning problems over graph-structured data. These two powerful approaches have different advantages in capturing relations from observations and how they conduct message passing, and they can benefit each other in various tasks. In this survey, we broadly study the intersection of GNNs and PGMs. Specifically, we first discuss how GNNs can benefit from learning structured representations in PGMs, generate explainable predictions by PGMs, and how PGMs can infer object relationships. Then we discuss how GNNs are implemented in PGMs for more efficient inference and structure learning. In the end, we summarize the benchmark datasets used in recent studies and discuss promising future directions.  ( 2 min )
    Doubly Robust Interval Estimation for Optimal Policy Evaluation in Online Learning. (arXiv:2110.15501v3 [stat.ML] UPDATED)
    Evaluating the performance of an ongoing policy plays a vital role in many areas such as medicine and economics, to provide crucial instruction on the early-stop of the online experiment and timely feedback from the environment. Policy evaluation in online learning thus attracts increasing attention by inferring the mean outcome of the optimal policy (i.e., the value) in real-time. Yet, such a problem is particularly challenging due to the dependent data generated in the online environment, the unknown optimal policy, and the complex exploration and exploitation trade-off in the adaptive experiment. In this paper, we aim to overcome these difficulties in policy evaluation for online learning. We explicitly derive the probability of exploration that quantifies the probability of exploring the non-optimal actions under commonly used bandit algorithms. We use this probability to conduct valid inference on the online conditional mean estimator under each action and develop the doubly robust interval estimation (DREAM) method to infer the value under the estimated optimal policy in online learning. The proposed value estimator provides double protection on the consistency and is asymptotically normal with a Wald-type confidence interval provided. Extensive simulations and real data applications are conducted to demonstrate the empirical validity of the proposed DREAM method.  ( 2 min )
    ClusterFuG: Clustering Fully connected Graphs by Multicut. (arXiv:2301.12159v1 [cs.CV])
    We propose a graph clustering formulation based on multicut (a.k.a. weighted correlation clustering) on the complete graph. Our formulation does not need specification of the graph topology as in the original sparse formulation of multicut, making our approach simpler and potentially better performing. In contrast to unweighted correlation clustering we allow for a more expressive weighted cost structure. In dense multicut, the clustering objective is given in a factorized form as inner products of node feature vectors. This allows for an efficient formulation and inference in contrast to multicut/weighted correlation clustering, which has at least quadratic representation and computation complexity when working on the complete graph. We show how to rewrite classical greedy algorithms for multicut in our dense setting and how to modify them for greater efficiency and solution quality. In particular, our algorithms scale to graphs with tens of thousands of nodes. Empirical evidence on instance segmentation on Cityscapes and clustering of ImageNet datasets shows the merits of our approach.  ( 2 min )
    Masked Contrastive Learning for Anomaly Detection. (arXiv:2105.08793v2 [cs.LG] UPDATED)
    Detecting anomalies is one fundamental aspect of a safety-critical software system, however, it remains a long-standing problem. Numerous branches of works have been proposed to alleviate the complication and have demonstrated their efficiencies. In particular, self-supervised learning based methods are spurring interest due to their capability of learning diverse representations without additional labels. Among self-supervised learning tactics, contrastive learning is one specific framework validating their superiority in various fields, including anomaly detection. However, the primary objective of contrastive learning is to learn task-agnostic features without any labels, which is not entirely suited to discern anomalies. In this paper, we propose a task-specific variant of contrastive learning named masked contrastive learning, which is more befitted for anomaly detection. Moreover, we propose a new inference method dubbed self-ensemble inference that further boosts performance by leveraging the ability learned through auxiliary self-supervision tasks. By combining our models, we can outperform previous state-of-the-art methods by a significant margin on various benchmark datasets.  ( 2 min )
    Protein Representation Learning by Geometric Structure Pretraining. (arXiv:2203.06125v5 [cs.LG] UPDATED)
    Learning effective protein representations is critical in a variety of tasks in biology such as predicting protein function or structure. Existing approaches usually pretrain protein language models on a large number of unlabeled amino acid sequences and then finetune the models with some labeled data in downstream tasks. Despite the effectiveness of sequence-based approaches, the power of pretraining on known protein structures, which are available in smaller numbers only, has not been explored for protein property prediction, though protein structures are known to be determinants of protein function. In this paper, we propose to pretrain protein representations according to their 3D structures. We first present a simple yet effective encoder to learn the geometric features of a protein. We pretrain the protein graph encoder by leveraging multiview contrastive learning and different self-prediction tasks. Experimental results on both function prediction and fold classification tasks show that our proposed pretraining methods outperform or are on par with the state-of-the-art sequence-based methods, while using much less pretraining data. Our implementation is available at https://github.com/DeepGraphLearning/GearNet.  ( 2 min )
    CyclicFL: A Cyclic Model Pre-Training Approach to Efficient Federated Learning. (arXiv:2301.12193v1 [cs.LG])
    Since random initial models in Federated Learning (FL) can easily result in unregulated Stochastic Gradient Descent (SGD) processes, existing FL methods greatly suffer from both slow convergence and poor accuracy, especially for non-IID scenarios. To address this problem, we propose a novel FL method named CyclicFL, which can quickly derive effective initial models to guide the SGD processes, thus improving the overall FL training performance. Based on the concept of Continual Learning (CL), we prove that CyclicFL approximates existing centralized pre-training methods in terms of classification and prediction performance. Meanwhile, we formally analyze the significance of data consistency between the pre-training and training stages of CyclicFL, showing the limited Lipschitzness of loss for the pre-trained models by CyclicFL. Unlike traditional centralized pre-training methods that require public proxy data, CyclicFL pre-trains initial models on selected clients cyclically without exposing their local data. Therefore, they can be easily integrated into any security-critical FL methods. Comprehensive experimental results show that CyclicFL can not only improve the classification accuracy by up to 16.21%, but also significantly accelerate the overall FL training processes.  ( 2 min )
    Norm-based Generalization Bounds for Compositionally Sparse Neural Networks. (arXiv:2301.12033v1 [cs.LG])
    In this paper, we investigate the Rademacher complexity of deep sparse neural networks, where each neuron receives a small number of inputs. We prove generalization bounds for multilayered sparse ReLU neural networks, including convolutional neural networks. These bounds differ from previous ones, as they consider the norms of the convolutional filters instead of the norms of the associated Toeplitz matrices, independently of weight sharing between neurons. As we show theoretically, these bounds may be orders of magnitude better than standard norm-based generalization bounds and empirically, they are almost non-vacuous in estimating generalization in various simple classification problems. Taken together, these results suggest that compositional sparsity of the underlying target function is critical to the success of deep neural networks.  ( 2 min )
    On the Sample Complexity of Actor-Critic Method for Reinforcement Learning with Function Approximation. (arXiv:1910.08412v3 [cs.LG] UPDATED)
    Reinforcement learning, mathematically described by Markov Decision Problems, may be approached either through dynamic programming or policy search. Actor-critic algorithms combine the merits of both approaches by alternating between steps to estimate the value function and policy gradient updates. Due to the fact that the updates exhibit correlated noise and biased gradient updates, only the asymptotic behavior of actor-critic is known by connecting its behavior to dynamical systems. This work puts forth a new variant of actor-critic that employs Monte Carlo rollouts during the policy search updates, which results in controllable bias that depends on the number of critic evaluations. As a result, we are able to provide for the first time the convergence rate of actor-critic algorithms when the policy search step employs policy gradient, agnostic to the choice of policy evaluation technique. In particular, we establish conditions under which the sample complexity is comparable to stochastic gradient method for non-convex problems or slower as a result of the critic estimation error, which is the main complexity bottleneck. These results hold in continuous state and action spaces with linear function approximation for the value function. We then specialize these conceptual results to the case where the critic is estimated by Temporal Difference, Gradient Temporal Difference, and Accelerated Gradient Temporal Difference. These learning rates are then corroborated on a navigation problem involving an obstacle and the pendulum problem which provide insight into the interplay between optimization and generalization in reinforcement learning.  ( 2 min )
    Sparse Oblique Decision Trees: A Tool to Understand and Manipulate Neural Net Features. (arXiv:2104.02922v2 [cs.LG] UPDATED)
    The widespread deployment of deep nets in practical applications has lead to a growing desire to understand how and why such black-box methods perform prediction. Much work has focused on understanding what part of the input pattern (an image, say) is responsible for a particular class being predicted, and how the input may be manipulated to predict a different class. We focus instead on understanding which of the internal features computed by the neural net are responsible for a particular class. We achieve this by mimicking part of the neural net with an oblique decision tree having sparse weight vectors at the decision nodes. Using the recently proposed Tree Alternating Optimization (TAO) algorithm, we are able to learn trees that are both highly accurate and interpretable. Such trees can faithfully mimic the part of the neural net they replaced, and hence they can provide insights into the deep net black box. Further, we show we can easily manipulate the neural net features in order to make the net predict, or not predict, a given class, thus showing that it is possible to carry out adversarial attacks at the level of the features. These insights and manipulations apply globally to the entire training and test set, not just at a local (single-instance) level. We demonstrate this robustly in the MNIST and ImageNet datasets with LeNet5 and VGG networks.  ( 2 min )
    Off-Policy Evaluation in Partially Observed Markov Decision Processes under Sequential Ignorability. (arXiv:2110.12343v3 [cs.LG] UPDATED)
    We consider off-policy evaluation of dynamic treatment rules under sequential ignorability, given an assumption that the underlying system can be modeled as a partially observed Markov decision process (POMDP). We propose an estimator, partial history importance weighting, and show that it can consistently estimate the stationary mean rewards of a target policy given long enough draws from the behavior policy. We provide an upper bound on its error that decays polynomially in the number of observations (i.e., the number of trajectories times their length), with an exponent that depends on the overlap of the target and behavior policies, and on the mixing time of the underlying system. Furthermore, we show that this rate of convergence is minimax given only our assumptions on mixing and overlap. Our results establish that off-policy evaluation in POMDPs is strictly harder than off-policy evaluation in (fully observed) Markov decision processes, but strictly easier than model-free off-policy evaluation.  ( 2 min )
    Pragmatic Fairness: Developing Policies with Outcome Disparity Control. (arXiv:2301.12278v1 [cs.LG])
    We introduce a causal framework for designing optimal policies that satisfy fairness constraints. We take a pragmatic approach asking what we can do with an action space available to us and only with access to historical data. We propose two different fairness constraints: a moderation breaking constraint which aims at blocking moderation paths from the action and sensitive attribute to the outcome, and by that at reducing disparity in outcome levels as much as the provided action space permits; and an equal benefit constraint which aims at distributing gain from the new and maximized policy equally across sensitive attribute levels, and thus at keeping pre-existing preferential treatment in place or avoiding the introduction of new disparity. We introduce practical methods for implementing the constraints and illustrate their uses on experiments with semi-synthetic models.  ( 2 min )
    Heterogeneous Datasets for Federated Survival Analysis Simulation. (arXiv:2301.12166v1 [cs.LG])
    Survival analysis studies time-modeling techniques for an event of interest occurring for a population. Survival analysis found widespread applications in healthcare, engineering, and social sciences. However, the data needed to train survival models are often distributed, incomplete, censored, and confidential. In this context, federated learning can be exploited to tremendously improve the quality of the models trained on distributed data while preserving user privacy. However, federated survival analysis is still in its early development, and there is no common benchmarking dataset to test federated survival models. This work proposes a novel technique for constructing realistic heterogeneous datasets by starting from existing non-federated datasets in a reproducible way. Specifically, we provide two novel dataset-splitting algorithms based on the Dirichlet distribution to assign each data sample to a carefully chosen client: quantity-skewed splitting and label-skewed splitting. Furthermore, these algorithms allow for obtaining different levels of heterogeneity by changing a single hyperparameter. Finally, numerical experiments provide a quantitative evaluation of the heterogeneity level using log-rank tests and a qualitative analysis of the generated splits. The implementation of the proposed methods is publicly available in favor of reproducibility and to encourage common practices to simulate federated environments for survival analysis.  ( 2 min )
    GFlowNets and variational inference. (arXiv:2210.00580v2 [cs.LG] UPDATED)
    This paper builds bridges between two families of probabilistic algorithms: (hierarchical) variational inference (VI), which is typically used to model distributions over continuous spaces, and generative flow networks (GFlowNets), which have been used for distributions over discrete structures such as graphs. We demonstrate that, in certain cases, VI algorithms are equivalent to special cases of GFlowNets in the sense of equality of expected gradients of their learning objectives. We then point out the differences between the two families and show how these differences emerge experimentally. Notably, GFlowNets, which borrow ideas from reinforcement learning, are more amenable than VI to off-policy training without the cost of high gradient variance induced by importance sampling. We argue that this property of GFlowNets can provide advantages for capturing diversity in multimodal target distributions.  ( 2 min )
    Scalable Spatiotemporally Varying Coefficient Modelling with Bayesian Kernelized Tensor Regression. (arXiv:2109.00046v3 [stat.ML] UPDATED)
    As a regression technique in spatial statistics, the spatiotemporally varying coefficient model (STVC) is an important tool for discovering nonstationary and interpretable response-covariate associations over both space and time. However, it is difficult to apply STVC for large-scale spatiotemporal analyses due to its high computational cost. To address this challenge, we summarize the spatiotemporally varying coefficients using a third-order tensor structure and propose to reformulate the spatiotemporally varying coefficient model as a special low-rank tensor regression problem. The low-rank decomposition can effectively model the global patterns of the large data sets with a substantially reduced number of parameters. To further incorporate the local spatiotemporal dependencies, we use Gaussian process (GP) priors on the spatial and temporal factor matrices. We refer to the overall framework as Bayesian Kernelized Tensor Regression (BKTR). For model inference, we develop an efficient Markov chain Monte Carlo (MCMC) algorithm, which uses Gibbs sampling to update factor matrices and slice sampling to update kernel hyperparameters. We conduct extensive experiments on both synthetic and real-world data sets, and our results confirm the superior performance and efficiency of BKTR for model estimation and parameter inference.  ( 2 min )
    A Dependable Hybrid Machine Learning Model for Network Intrusion Detection. (arXiv:2212.04546v2 [cs.CR] UPDATED)
    Network intrusion detection systems (NIDSs) play an important role in computer network security. There are several detection mechanisms where anomaly-based automated detection outperforms others significantly. Amid the sophistication and growing number of attacks, dealing with large amounts of data is a recognized issue in the development of anomaly-based NIDS. However, do current models meet the needs of today's networks in terms of required accuracy and dependability? In this research, we propose a new hybrid model that combines machine learning and deep learning to increase detection rates while securing dependability. Our proposed method ensures efficient pre-processing by combining SMOTE for data balancing and XGBoost for feature selection. We compared our developed method to various machine learning and deep learning algorithms to find a more efficient algorithm to implement in the pipeline. Furthermore, we chose the most effective model for network intrusion based on a set of benchmarked performance analysis criteria. Our method produces excellent results when tested on two datasets, KDDCUP'99 and CIC-MalMem-2022, with an accuracy of 99.99% and 100% for KDDCUP'99 and CIC-MalMem-2022, respectively, and no overfitting or Type-1 and Type-2 issues.  ( 2 min )
    Jump Interval-Learning for Individualized Decision Making. (arXiv:2111.08885v2 [stat.ME] UPDATED)
    An individualized decision rule (IDR) is a decision function that assigns each individual a given treatment based on his/her observed characteristics. Most of the existing works in the literature consider settings with binary or finitely many treatment options. In this paper, we focus on the continuous treatment setting and propose a jump interval-learning to develop an individualized interval-valued decision rule (I2DR) that maximizes the expected outcome. Unlike IDRs that recommend a single treatment, the proposed I2DR yields an interval of treatment options for each individual, making it more flexible to implement in practice. To derive an optimal I2DR, our jump interval-learning method estimates the conditional mean of the outcome given the treatment and the covariates via jump penalized regression, and derives the corresponding optimal I2DR based on the estimated outcome regression function. The regressor is allowed to be either linear for clear interpretation or deep neural network to model complex treatment-covariates interactions. To implement jump interval-learning, we develop a searching algorithm based on dynamic programming that efficiently computes the outcome regression function. Statistical properties of the resulting I2DR are established when the outcome regression function is either a piecewise or continuous function over the treatment space. We further develop a procedure to infer the mean outcome under the (estimated) optimal policy. Extensive simulations and a real data application to a warfarin study are conducted to demonstrate the empirical validity of the proposed I2DR.  ( 2 min )
    Understanding INT4 Quantization for Transformer Models: Latency Speedup, Composability, and Failure Cases. (arXiv:2301.12017v1 [cs.CL])
    Improving the deployment efficiency of transformer-based language models has been challenging given their high computation and memory cost. While INT8 quantization has recently been shown to be effective in reducing both the memory cost and latency while preserving model accuracy, it remains unclear whether we can leverage INT4 (which doubles peak hardware throughput) to achieve further latency improvement. In this work, we fully investigate the feasibility of using INT4 quantization for language models, and show that using INT4 introduces no or negligible accuracy degradation for encoder-only and encoder-decoder models, but causes a significant accuracy drop for decoder-only models. To materialize the performance gain using INT4, we develop a highly-optimized end-to-end INT4 encoder inference pipeline supporting different quantization strategies. Our INT4 pipeline is $8.5\times$ faster for latency-oriented scenarios and up to $3\times$ for throughput-oriented scenarios compared to the inference of FP16, and improves the SOTA BERT INT8 performance from FasterTransformer by up to $1.7\times$. We also provide insights into the failure cases when applying INT4 to decoder-only models, and further explore the compatibility of INT4 quantization with other compression techniques, like pruning and layer reduction.  ( 2 min )
    On the Lipschitz Constant of Deep Networks and Double Descent. (arXiv:2301.12309v1 [cs.LG])
    Existing bounds on the generalization error of deep networks assume some form of smooth or bounded dependence on the input variable, falling short of investigating the mechanisms controlling such factors in practice. In this work, we present an extensive experimental study of the empirical Lipschitz constant of deep networks undergoing double descent, and highlight non-monotonic trends strongly correlating with the test error. Building a connection between parameter-space and input-space gradients for SGD around a critical point, we isolate two important factors -- namely loss landscape curvature and distance of parameters from initialization -- respectively controlling optimization dynamics around a critical point and bounding model function complexity, even beyond the training data. Our study presents novels insights on implicit regularization via overparameterization, and effective model complexity for networks trained in practice.  ( 2 min )
    Continual Graph Learning: A Survey. (arXiv:2301.12230v1 [cs.LG])
    Research on continual learning (CL) mainly focuses on data represented in the Euclidean space, while research on graph-structured data is scarce. Furthermore, most graph learning models are tailored for static graphs. However, graphs usually evolve continually in the real world. Catastrophic forgetting also emerges in graph learning models when being trained incrementally. This leads to the need to develop robust, effective and efficient continual graph learning approaches. Continual graph learning (CGL) is an emerging area aiming to realize continual learning on graph-structured data. This survey is written to shed light on this emerging area. It introduces the basic concepts of CGL and highlights two unique challenges brought by graphs. Then it reviews and categorizes recent state-of-the-art approaches, analyzing their strategies to tackle the unique challenges in CGL. Besides, it discusses the main concerns in each family of CGL methods, offering potential solutions. Finally, it explores the open issues and potential applications of CGL.  ( 2 min )
    Zero-shot causal learning. (arXiv:2301.12292v1 [cs.LG])
    Predicting how different interventions will causally affect a specific individual is important in a variety of domains such as personalized medicine, public policy, and online marketing. However, most existing causal methods cannot generalize to predicting the effects of previously unseen interventions (e.g., a newly invented drug), because they require data for individuals who received the intervention. Here, we consider zero-shot causal learning: predicting the personalized effects of novel, previously unseen interventions. To tackle this problem, we propose CaML, a causal meta-learning framework which formulates the personalized prediction of each intervention's effect as a task. Rather than training a separate model for each intervention, CaML trains as a single meta-model across thousands of tasks, each constructed by sampling an intervention and individuals who either did or did not receive it. By leveraging both intervention information (e.g., a drug's attributes) and individual features (e.g., a patient's history), CaML is able to predict the personalized effects of unseen interventions. Experimental results on real world datasets in large-scale medical claims and cell-line perturbations demonstrate the effectiveness of our approach. Most strikingly, CaML zero-shot predictions outperform even strong baselines which have direct access to data of considered target interventions.  ( 2 min )
    Solving high-dimensional Hamilton-Jacobi-Bellman PDEs using neural networks: perspectives from the theory of controlled diffusions and measures on path space. (arXiv:2005.05409v2 [math.OC] UPDATED)
    Optimal control of diffusion processes is intimately connected to the problem of solving certain Hamilton-Jacobi-Bellman equations. Building on recent machine learning inspired approaches towards high-dimensional PDEs, we investigate the potential of $\textit{iterative diffusion optimisation}$ techniques, in particular considering applications in importance sampling and rare event simulation, and focusing on problems without diffusion control, with linearly controlled drift and running costs that depend quadratically on the control. More generally, our methods apply to nonlinear parabolic PDEs with a certain shift invariance. The choice of an appropriate loss function being a central element in the algorithmic design, we develop a principled framework based on divergences between path measures, encompassing various existing methods. Motivated by connections to forward-backward SDEs, we propose and study the novel $\textit{log-variance}$ divergence, showing favourable properties of corresponding Monte Carlo estimators. The promise of the developed approach is exemplified by a range of high-dimensional and metastable numerical examples.  ( 2 min )
    Alignment with human representations supports robust few-shot learning. (arXiv:2301.11990v1 [cs.LG])
    Should we care whether AI systems have representations of the world that are similar to those of humans? We provide an information-theoretic analysis that suggests that there should be a U-shaped relationship between the degree of representational alignment with humans and performance on few-shot learning tasks. We confirm this prediction empirically, finding such a relationship in an analysis of the performance of 491 computer vision models. We also show that highly-aligned models are more robust to both adversarial attacks and domain shifts. Our results suggest that human-alignment is often a sufficient, but not necessary, condition for models to make effective use of limited data, be robust, and generalize well.  ( 2 min )
    Turbulence control in plane Couette flow using low-dimensional neural ODE-based models and deep reinforcement learning. (arXiv:2301.12098v1 [physics.flu-dyn])
    The high dimensionality and complex dynamics of turbulent flows remain an obstacle to the discovery and implementation of control strategies. Deep reinforcement learning (RL) is a promising avenue for overcoming these obstacles, but requires a training phase in which the RL agent iteratively interacts with the flow environment to learn a control policy, which can be prohibitively expensive when the environment involves slow experiments or large-scale simulations. We overcome this challenge using a framework we call "DManD-RL" (data-driven manifold dynamics-RL), which generates a data-driven low-dimensional model of our system that we use for RL training. With this approach, we seek to minimize drag in a direct numerical simulation (DNS) of a turbulent minimal flow unit of plane Couette flow at Re=400 using two slot jets on one wall. We obtain, from DNS data with $\mathcal{O}(10^5)$ degrees of freedom, a 25-dimensional DManD model of the dynamics by combining an autoencoder and neural ordinary differential equation. Using this model as the environment, we train an RL control agent, yielding a 440-fold speedup over training on the DNS, with equivalent control performance. The agent learns a policy that laminarizes 84% of unseen DNS test trajectories within 900 time units, significantly outperforming classical opposition control (58%), despite the actuation authority being much more restricted. The agent often achieves laminarization through a counterintuitive strategy that drives the formation of two low-speed streaks, with a spanwise wavelength that is too small to be self-sustaining. The agent demonstrates the same performance when we limit observations to wall shear rate.  ( 2 min )
    Do Embodied Agents Dream of Pixelated Sheep?: Embodied Decision Making using Language Guided World Modelling. (arXiv:2301.12050v1 [cs.LG])
    Reinforcement learning (RL) agents typically learn tabula rasa, without prior knowledge of the world, which makes learning complex tasks with sparse rewards difficult. If initialized with knowledge of high-level subgoals and transitions between subgoals, RL agents could utilize this Abstract World Model (AWM) for planning and exploration. We propose using few-shot large language models (LLMs) to hypothesize an AWM, that is tested and verified during exploration, to improve sample efficiency in embodied RL agents. Our DECKARD agent applies LLM-guided exploration to item crafting in Minecraft in two phases: (1) the Dream phase where the agent uses an LLM to decompose a task into a sequence of subgoals, the hypothesized AWM; and (2) the Wake phase where the agent learns a modular policy for each subgoal and verifies or corrects the hypothesized AWM on the basis of its experiences. Our method of hypothesizing an AWM with LLMs and then verifying the AWM based on agent experience not only increases sample efficiency over contemporary methods by an order of magnitude but is also robust to and corrects errors in the LLM, successfully blending noisy internet-scale information from LLMs with knowledge grounded in environment dynamics.  ( 2 min )
    Harnessing the Power of Decision Trees to Detect IoT Malware. (arXiv:2301.12039v1 [cs.CR])
    Due to its simple installation and connectivity, the Internet of Things (IoT) is susceptible to malware attacks. Being able to operate autonomously. As IoT devices have become more prevalent, they have become the most tempting targets for malware. Weak, guessable, or hard-coded passwords, and a lack of security measures contribute to these vulnerabilities along with insecure network connections and outdated update procedures. To understand IoT malware, current methods and analysis ,using static methods, are ineffective. The field of deep learning has made great strides in recent years due to their tremendous data mining, learning, and expression capabilities, cybersecurity has enjoyed tremendous growth in recent years. As a result, malware analysts will not have to spend as much time analyzing malware. In this paper, we propose a novel detection and analysis method that harnesses the power and simplicity of decision trees. The experiments are conducted using a real word dataset, MaleVis which is a publicly available dataset. Based on the results, we show that our proposed approach outperforms existing state-of-the-art solutions in that it achieves 97.23% precision and 95.89% recall in terms of detection and classification. A specificity of 96.58%, F1-score of 96.40%, an accuracy of 96.43.  ( 2 min )
    Physics-Inspired Protein Encoder Pre-Training via Siamese Sequence-Structure Diffusion Trajectory Prediction. (arXiv:2301.12068v1 [cs.LG])
    Pre-training methods on proteins are recently gaining interest, leveraging either protein sequences or structures, while modeling their joint energy landscape is largely unexplored. In this work, inspired by the success of denoising diffusion models, we propose the DiffPreT approach to pre-train a protein encoder by sequence-structure multimodal diffusion modeling. DiffPreT guides the encoder to recover the native protein sequences and structures from the perturbed ones along the multimodal diffusion trajectory, which acquires the joint distribution of sequences and structures. Considering the essential protein conformational variations, we enhance DiffPreT by a physics-inspired method called Siamese Diffusion Trajectory Prediction (SiamDiff) to capture the correlation between different conformers of a protein. SiamDiff attains this goal by maximizing the mutual information between representations of diffusion trajectories of structurally-correlated conformers. We study the effectiveness of DiffPreT and SiamDiff on both atom- and residue-level structure-based protein understanding tasks. Experimental results show that the performance of DiffPreT is consistently competitive on all tasks, and SiamDiff achieves new state-of-the-art performance, considering the mean ranks on all tasks. The source code will be released upon acceptance.  ( 2 min )
    On the Feasibility of Machine Learning Augmented Magnetic Resonance for Point-of-Care Identification of Disease. (arXiv:2301.11962v1 [cs.LG])
    Early detection of many life-threatening diseases (e.g., prostate and breast cancer) within at-risk population can improve clinical outcomes and reduce cost of care. While numerous disease-specific "screening" tests that are closer to Point-of-Care (POC) are in use for this task, their low specificity results in unnecessary biopsies, leading to avoidable patient trauma and wasteful healthcare spending. On the other hand, despite the high accuracy of Magnetic Resonance (MR) imaging in disease diagnosis, it is not used as a POC disease identification tool because of poor accessibility. The root cause of poor accessibility of MR stems from the requirement to reconstruct high-fidelity images, as it necessitates a lengthy and complex process of acquiring large quantities of high-quality k-space measurements. In this study we explore the feasibility of an ML-augmented MR pipeline that directly infers the disease sidestepping the image reconstruction process. We hypothesise that the disease classification task can be solved using a very small tailored subset of k-space data, compared to image reconstruction. Towards that end, we propose a method that performs two tasks: 1) identifies a subset of the k-space that maximizes disease identification accuracy, and 2) infers the disease directly using the identified k-space subset, bypassing the image reconstruction step. We validate our hypothesis by measuring the performance of the proposed system across multiple diseases and anatomies. We show that comparable performance to image-based classifiers, trained on images reconstructed with full k-space data, can be achieved using small quantities of data: 8% of the data for detecting multiple abnormalities in prostate and brain scans, and 5% of the data for knee abnormalities. To better understand the proposed approach and instigate future research, we provide an extensive analysis and release code.  ( 2 min )
    RCsearcher: Reaction Center Identification in Retrosynthesis via Deep Q-Learning. (arXiv:2301.12071v1 [cs.LG])
    The reaction center consists of atoms in the product whose local properties are not identical to the corresponding atoms in the reactants. Prior studies on reaction center identification are mainly on semi-templated retrosynthesis methods. Moreover, they are limited to single reaction center identification. However, many reaction centers are comprised of multiple bonds or atoms in reality. We refer to it as the multiple reaction center. This paper presents RCsearcher, a unified framework for single and multiple reaction center identification that combines the advantages of the graph neural network and deep reinforcement learning. The critical insight in this framework is that the single or multiple reaction center must be a node-induced subgraph of the molecular product graph. At each step, it considers choosing one node in the molecular product graph and adding it to the explored node-induced subgraph as an action. Comprehensive experiments demonstrate that RCsearcher consistently outperforms other baselines and can extrapolate the reaction center patterns that have not appeared in the training set. Ablation experiments verify the effectiveness of individual components, including the beam search and one-hop constraint of action space.  ( 2 min )
    Predicting Students' Exam Scores Using Physiological Signals. (arXiv:2301.12051v1 [cs.LG])
    While acute stress has been shown to have both positive and negative effects on performance, not much is known about the impacts of stress on students grades during examinations. To answer this question, we examined whether a correlation could be found between physiological stress signals and exam performance. We conducted this study using multiple physiological signals of ten undergraduate students over three different exams. The study focused on three signals, i.e., skin temperature, heart rate, and electrodermal activity. We extracted statistics as features and fed them into a variety of binary classifiers to predict relatively higher or lower grades. Experimental results showed up to 0.81 ROC-AUC with k-nearest neighbor algorithm among various machine learning algorithms.  ( 2 min )
    Restricted Orthogonal Gradient Projection for Continual Learning. (arXiv:2301.12131v1 [cs.LG])
    Continual learning aims to avoid catastrophic forgetting and effectively leverage learned experiences to master new knowledge. Existing gradient projection approaches impose hard constraints on the optimization space for new tasks to minimize interference, which simultaneously hinders forward knowledge transfer. To address this issue, recent methods reuse frozen parameters with a growing network, resulting in high computational costs. Thus, it remains a challenge whether we can improve forward knowledge transfer for gradient projection approaches using a fixed network architecture. In this work, we propose the Restricted Orthogonal Gradient prOjection (ROGO) framework. The basic idea is to adopt a restricted orthogonal constraint allowing parameters optimized in the direction oblique to the whole frozen space to facilitate forward knowledge transfer while consolidating previous knowledge. Our framework requires neither data buffers nor extra parameters. Extensive experiments have demonstrated the superiority of our framework over several strong baselines. We also provide theoretical guarantees for our relaxing strategy.  ( 2 min )
    ProtST: Multi-Modality Learning of Protein Sequences and Biomedical Texts. (arXiv:2301.12040v1 [q-bio.BM])
    Current protein language models (PLMs) learn protein representations mainly based on their sequences, thereby well capturing co-evolutionary information, but they are unable to explicitly acquire protein functions, which is the end goal of protein representation learning. Fortunately, for many proteins, their textual property descriptions are available, where their various functions are also described. Motivated by this fact, we first build the ProtDescribe dataset to augment protein sequences with text descriptions of their functions and other important properties. Based on this dataset, we propose the ProtST framework to enhance Protein Sequence pre-training and understanding by biomedical Texts. During pre-training, we design three types of tasks, i.e., unimodal mask prediction, multimodal representation alignment and multimodal mask prediction, to enhance a PLM with protein property information with different granularities and, at the same time, preserve the PLM's original representation power. On downstream tasks, ProtST enables both supervised learning and zero-shot prediction. We verify the superiority of ProtST-induced PLMs over previous ones on diverse representation learning benchmarks. Under the zero-shot setting, we show the effectiveness of ProtST on zero-shot protein classification, and ProtST also enables functional protein retrieval from a large-scale database without any function annotation.  ( 2 min )
    A Memory Efficient Deep Reinforcement Learning Approach For Snake Game Autonomous Agents. (arXiv:2301.11977v1 [cs.AI])
    To perform well, Deep Reinforcement Learning (DRL) methods require significant memory resources and computational time. Also, sometimes these systems need additional environment information to achieve a good reward. However, it is more important for many applications and devices to reduce memory usage and computational times than to achieve the maximum reward. This paper presents a modified DRL method that performs reasonably well with compressed imagery data without requiring additional environment information and also uses less memory and time. We have designed a lightweight Convolutional Neural Network (CNN) with a variant of the Q-network that efficiently takes preprocessed image data as input and uses less memory. Furthermore, we use a simple reward mechanism and small experience replay memory so as to provide only the minimum necessary information. Our modified DRL method enables our autonomous agent to play Snake, a classical control game. The results show our model can achieve similar performance as other DRL methods.  ( 2 min )
    Minimizing Trajectory Curvature of ODE-based Generative Models. (arXiv:2301.12003v1 [cs.LG])
    Recent ODE/SDE-based generative models, such as diffusion models and flow matching, define a generative process as a time reversal of a fixed forward process. Even though these models show impressive performance on large-scale datasets, numerical simulation requires multiple evaluations of a neural network, leading to a slow sampling speed. We attribute the reason to the high curvature of the learned generative trajectories, as it is directly related to the truncation error of a numerical solver. Based on the relationship between the forward process and the curvature, here we present an efficient method of training the forward process to minimize the curvature of generative trajectories without any ODE/SDE simulation. Experiments show that our method achieves a lower curvature than previous models and, therefore, decreased sampling costs while maintaining competitive performance. Code is available at https://github.com/sangyun884/fast-ode.  ( 2 min )
    Physics-informed Neural Network: The Effect of Reparameterization in Solving Differential Equations. (arXiv:2301.12118v1 [cs.LG])
    Differential equations are used to model and predict the behaviour of complex systems in a wide range of fields, and the ability to solve them is an important asset for understanding and predicting the behaviour of these systems. Complicated physics mostly involves difficult differential equations, which are hard to solve analytically. In recent years, physics-informed neural networks have been shown to perform very well in solving systems with various differential equations. The main ways to approximate differential equations are through penalty function and reparameterization. Most researchers use penalty functions rather than reparameterization due to the complexity of implementing reparameterization. In this study, we quantitatively compare physics-informed neural network models with and without reparameterization using the approximation error. The performance of reparameterization is demonstrated based on two benchmark mechanical engineering problems, a one-dimensional bar problem and a two-dimensional bending beam problem. Our results show that when dealing with complex differential equations, applying reparameterization results in a lower approximation error.  ( 2 min )
    Reduced-Order Autodifferentiable Ensemble Kalman Filters. (arXiv:2301.11961v1 [stat.ML])
    This paper introduces a computational framework to reconstruct and forecast a partially observed state that evolves according to an unknown or expensive-to-simulate dynamical system. Our reduced-order autodifferentiable ensemble Kalman filters (ROAD-EnKFs) learn a latent low-dimensional surrogate model for the dynamics and a decoder that maps from the latent space to the state space. The learned dynamics and decoder are then used within an ensemble Kalman filter to reconstruct and forecast the state. Numerical experiments show that if the state dynamics exhibit a hidden low-dimensional structure, ROAD-EnKFs achieve higher accuracy at lower computational cost compared to existing methods. If such structure is not expressed in the latent state dynamics, ROAD-EnKFs achieve similar accuracy at lower cost, making them a promising approach for surrogate state reconstruction and forecasting.  ( 2 min )
    Meta Temporal Point Processes. (arXiv:2301.12023v1 [cs.LG])
    A temporal point process (TPP) is a stochastic process where its realization is a sequence of discrete events in time. Recent work in TPPs model the process using a neural network in a supervised learning framework, where a training set is a collection of all the sequences. In this work, we propose to train TPPs in a meta learning framework, where each sequence is treated as a different task, via a novel framing of TPPs as neural processes (NPs). We introduce context sets to model TPPs as an instantiation of NPs. Motivated by attentive NP, we also introduce local history matching to help learn more informative features. We demonstrate the potential of the proposed method on popular public benchmark datasets and tasks, and compare with state-of-the-art TPP methods.  ( 2 min )
    Variational Latent Branching Model for Off-Policy Evaluation. (arXiv:2301.12056v1 [cs.LG])
    Model-based methods have recently shown great potential for off-policy evaluation (OPE); offline trajectories induced by behavioral policies are fitted to transitions of Markov decision processes (MDPs), which are used to rollout simulated trajectories and estimate the performance of policies. Model-based OPE methods face two key challenges. First, as offline trajectories are usually fixed, they tend to cover limited state and action space. Second, the performance of model-based methods can be sensitive to the initialization of their parameters. In this work, we propose the variational latent branching model (VLBM) to learn the transition function of MDPs by formulating the environmental dynamics as a compact latent space, from which the next states and rewards are then sampled. Specifically, VLBM leverages and extends the variational inference framework with the recurrent state alignment (RSA), which is designed to capture as much information underlying the limited training data, by smoothing out the information flow between the variational (encoding) and generative (decoding) part of VLBM. Moreover, we also introduce the branching architecture to improve the model's robustness against randomly initialized model weights. The effectiveness of the VLBM is evaluated on the deep OPE (DOPE) benchmark, from which the training trajectories are designed to result in varied coverage of the state-action space. We show that the VLBM outperforms existing state-of-the-art OPE methods in general.  ( 2 min )
    ZegOT: Zero-shot Segmentation Through Optimal Transport of Text Prompts. (arXiv:2301.12171v1 [cs.CV])
    Recent success of large-scale Contrastive Language-Image Pre-training (CLIP) has led to great promise in zero-shot semantic segmentation by transferring image-text aligned knowledge to pixel-level classification. However, existing methods usually require an additional image encoder or retraining/tuning the CLIP module. Here, we present a cost-effective strategy using text-prompt learning that keeps the entire CLIP module frozen while fully leveraging its rich information. Specifically, we propose a novel Zero-shot segmentation with Optimal Transport (ZegOT) method that matches multiple text prompts with frozen image embeddings through optimal transport, which allows each text prompt to efficiently focus on specific semantic attributes. Additionally, we propose Deep Local Feature Alignment (DLFA) that deeply aligns the text prompts with intermediate local feature of the frozen image encoder layers, which significantly boosts the zero-shot segmentation performance. Through extensive experiments on benchmark datasets, we show that our method achieves the state-of-the-art (SOTA) performance with only x7 lighter parameters compared to previous SOTA approaches.  ( 2 min )
    Folded Optimization for End-to-End Model-Based Learning. (arXiv:2301.12047v1 [cs.LG])
    The integration of constrained optimization models as components in deep networks has led to promising advances in both these domains. A primary challenge in this setting is backpropagation through the optimization mapping, which typically lacks a closed form. A common approach is unrolling, which relies on automatic differentiation through the operations of an iterative solver. While flexible and general, unrolling can encounter accuracy and efficiency issues in practice. These issues can be avoided by differentiating the optimization mapping analytically, but current frameworks impose rigid requirements on the optimization problem's form. This paper provides theoretical insights into the backpropagation of unrolled optimizers, which lead to a system for generating equivalent but efficiently solvable analytical models. Additionally, it proposes a unifying view of unrolling and analytical differentiation through constrained optimization mappings. Experiments over various structured prediction and decision-focused learning tasks illustrate the potential of the approach both computationally and in terms of enhanced expressiveness.  ( 2 min )
    Information loss from dimensionality reduction in 5D-Gaussian spectral data. (arXiv:2301.11923v1 [physics.data-an])
    Understanding the loss of information in spectral analytics is a crucial first step towards finding root causes for failures and uncertainties using spectral data in artificial intelligence models built from modern complex data science applications. Here, we show from a very simple entropy model analysis with quantum statistics of spectral data, that the relative loss of information from dimensionality reduction due to projection of an initial five-dimensional state onto two-dimensional diagrams is less than one percent in the parameter range of small data sets with sample sizes on the order of few hundreds data samples. From our analysis, we also conclude that the density and expectation value of the entropy probability distribution increases with the sample number and sample size using artificial data models derived from random sampling Monte-Carlo simulation methods.  ( 2 min )
    On the Connection Between MPNN and Graph Transformer. (arXiv:2301.11956v1 [cs.LG])
    Graph Transformer (GT) recently has emerged as a new paradigm of graph learning algorithms, outperforming the previously popular Message Passing Neural Network (MPNN) on multiple benchmarks. Previous work (Kim et al., 2022) shows that with proper position embedding, GT can approximate MPNN arbitrarily well, implying that GT is at least as powerful as MPNN. In this paper, we study the inverse connection and show that MPNN with virtual node (VN), a commonly used heuristic with little theoretical understanding, is powerful enough to arbitrarily approximate the self-attention layer of GT. In particular, we first show that if we consider one type of linear transformer, the so-called Performer/Linear Transformer (Choromanski et al., 2020; Katharopoulos et al., 2020), then MPNN + VN with only O(1) depth and O(1) width can approximate a self-attention layer in Performer/Linear Transformer. Next, via a connection between MPNN + VN and DeepSets, we prove the MPNN + VN with O(n^d) width and O(1) depth can approximate the self-attention layer arbitrarily well, where d is the input feature dimension. Lastly, under some assumptions, we provide an explicit construction of MPNN + VN with O(1) width and O(n) depth approximating the self-attention layer in GT arbitrarily well. On the empirical side, we demonstrate that 1) MPNN + VN is a surprisingly strong baseline, outperforming GT on the recently proposed Long Range Graph Benchmark (LRGB) dataset, 2) our MPNN + VN improves over early implementation on a wide range of OGB datasets and 3) MPNN + VN outperforms Linear Transformer and MPNN on the climate modeling task.  ( 2 min )
    Neural Relation Graph for Identifying Problematic Data. (arXiv:2301.12321v1 [cs.LG])
    Diagnosing and cleaning datasets are crucial for building robust machine learning systems. However, identifying problems within large-scale datasets with real-world distributions is difficult due to the presence of complex issues, such as label errors or under-representation of certain types. In this paper, we propose a novel approach for identifying problematic data by utilizing a largely ignored source of information: a relational structure of data in the feature-embedded space. We develop an efficient algorithm for detecting label errors and outlier data points based on the relational graph structure of the dataset. We further introduce a visualization tool for contextualizing data points, which can serve as an effective tool for interactively diagnosing datasets. We evaluate label error and out-of-distribution detection performances on large-scale image and language domain tasks, including ImageNet and GLUE benchmarks, and demonstrate the effectiveness of our approach for debugging datasets and building robust machine learning systems.  ( 2 min )
    Accelerated Training of Physics-Informed Neural Networks (PINNs) using Meshless Discretizations. (arXiv:2205.09332v6 [cs.LG] UPDATED)
    We present a new technique for the accelerated training of physics-informed neural networks (PINNs): discretely-trained PINNs (DT-PINNs). The repeated computation of partial derivative terms in the PINN loss functions via automatic differentiation during training is known to be computationally expensive, especially for higher-order derivatives. DT-PINNs are trained by replacing these exact spatial derivatives with high-order accurate numerical discretizations computed using meshless radial basis function-finite differences (RBF-FD) and applied via sparse-matrix vector multiplication. The use of RBF-FD allows for DT-PINNs to be trained even on point cloud samples placed on irregular domain geometries. Additionally, though traditional PINNs (vanilla-PINNs) are typically stored and trained in 32-bit floating-point (fp32) on the GPU, we show that for DT-PINNs, using fp64 on the GPU leads to significantly faster training times than fp32 vanilla-PINNs with comparable accuracy. We demonstrate the efficiency and accuracy of DT-PINNs via a series of experiments. First, we explore the effect of network depth on both numerical and automatic differentiation of a neural network with random weights and show that RBF-FD approximations of third-order accuracy and above are more efficient while being sufficiently accurate. We then compare the DT-PINNs to vanilla-PINNs on both linear and nonlinear Poisson equations and show that DT-PINNs achieve similar losses with 2-4x faster training times on a consumer GPU. Finally, we also demonstrate that similar results can be obtained for the PINN solution to the heat equation (a space-time problem) by discretizing the spatial derivatives using RBF-FD and using automatic differentiation for the temporal derivative. Our results show that fp64 DT-PINNs offer a superior cost-accuracy profile to fp32 vanilla-PINNs.  ( 3 min )
    Spherical Sliced-Wasserstein. (arXiv:2206.08780v2 [stat.ML] UPDATED)
    Many variants of the Wasserstein distance have been introduced to reduce its original computational burden. In particular the Sliced-Wasserstein distance (SW), which leverages one-dimensional projections for which a closed-form solution of the Wasserstein distance is available, has received a lot of interest. Yet, it is restricted to data living in Euclidean spaces, while the Wasserstein distance has been studied and used recently on manifolds. We focus more specifically on the sphere, for which we define a novel SW discrepancy, which we call spherical Sliced-Wasserstein, making a first step towards defining SW discrepancies on manifolds. Our construction is notably based on closed-form solutions of the Wasserstein distance on the circle, together with a new spherical Radon transform. Along with efficient algorithms and the corresponding implementations, we illustrate its properties in several machine learning use cases where spherical representations of data are at stake: sampling on the sphere, density estimation on real earth data or hyperspherical auto-encoders.  ( 2 min )
    Informational Diversity and Affinity Bias in Team Growth Dynamics. (arXiv:2301.12091v1 [cs.GT])
    Prior work has provided strong evidence that, within organizational settings, teams that bring a diversity of information and perspectives to a task are more effective than teams that do not. If this form of informational diversity confers performance advantages, why do we often see largely homogeneous teams in practice? One canonical argument is that the benefits of informational diversity are in tension with affinity bias. To better understand the impact of this tension on the makeup of teams, we analyze a sequential model of team formation in which individuals care about their team's performance (captured in terms of accurately predicting some future outcome based on a set of features) but experience a cost as a result of interacting with teammates who use different approaches to the prediction task. Our analysis of this simple model reveals a set of subtle behaviors that team-growth dynamics can exhibit: (i) from certain initial team compositions, they can make progress toward better performance but then get stuck partway to optimally diverse teams; while (ii) from other initial compositions, they can also move away from this optimal balance as the majority group tries to crowd out the opinions of the minority. The initial composition of the team can determine whether the dynamics will move toward or away from performance optimality, painting a path-dependent picture of inefficiencies in team compositions. Our results formalize a fundamental limitation of utility-based motivations to drive informational diversity in organizations and hint at interventions that may improve informational diversity and performance simultaneously.  ( 2 min )
    Thompson Sampling for High-Dimensional Sparse Linear Contextual Bandits. (arXiv:2211.05964v2 [stat.ML] UPDATED)
    We consider the stochastic linear contextual bandit problem with high-dimensional features. We analyze the Thompson sampling algorithm using special classes of sparsity-inducing priors (e.g., spike-and-slab) to model the unknown parameter and provide a nearly optimal upper bound on the expected cumulative regret. To the best of our knowledge, this is the first work that provides theoretical guarantees of Thompson sampling in high-dimensional and sparse contextual bandits. For faster computation, we use variational inference instead of Markov Chain Monte Carlo (MCMC) to approximate the posterior distribution. Extensive simulations demonstrate the improved performance of our proposed algorithm over existing ones.  ( 2 min )
    Adversarial Networks and Machine Learning for File Classification. (arXiv:2301.11964v1 [cs.LG])
    Correctly identifying the type of file under examination is a critical part of a forensic investigation. The file type alone suggests the embedded content, such as a picture, video, manuscript, spreadsheet, etc. In cases where a system owner might desire to keep their files inaccessible or file type concealed, we propose using an adversarially-trained machine learning neural network to determine a file's true type even if the extension or file header is obfuscated to complicate its discovery. Our semi-supervised generative adversarial network (SGAN) achieved 97.6% accuracy in classifying files across 11 different types. We also compared our network against a traditional standalone neural network and three other machine learning algorithms. The adversarially-trained network proved to be the most precise file classifier especially in scenarios with few supervised samples available. Our implementation of a file classifier using an SGAN is implemented on GitHub (https://ksaintg.github.io/SGAN-File-Classier).  ( 2 min )
    Neural Gas Network Image Features and Segmentation for Brain Tumor Detection Using Magnetic Resonance Imaging Data. (arXiv:2301.12176v1 [eess.IV])
    Accurate detection of brain tumors could save lots of lives and increasing the accuracy of this binary classification even as much as a few percent has high importance. Neural Gas Networks (NGN) is a fast, unsupervised algorithm that could be used in data clustering, image pattern recognition, and image segmentation. In this research, we used the metaheuristic Firefly Algorithm (FA) for image contrast enhancement as pre-processing and NGN weights for feature extraction and segmentation of Magnetic Resonance Imaging (MRI) data on two brain tumor datasets from the Kaggle platform. Also, tumor classification is conducted by Support Vector Machine (SVM) classification algorithms and compared with a deep learning technique plus other features in train and test phases. Additionally, NGN tumor segmentation is evaluated by famous performance metrics such as Accuracy, F-measure, Jaccard, and more versus ground truth data and compared with traditional segmentation techniques. The proposed method is fast and precise in both tasks of tumor classification and segmentation compared with other methods. A classification accuracy of 95.14 % and segmentation accuracy of 0.977 is achieved by the proposed method.  ( 2 min )
    Byte Pair Encoding for Symbolic Music. (arXiv:2301.11975v1 [cs.LG])
    The symbolic music modality is nowadays mostly represented as discrete and used with sequential models such as Transformers, for deep learning tasks. Recent research put efforts on the tokenization, i.e. the conversion of data into sequences of integers intelligible to such models. This can be achieved by many ways as music can be composed of simultaneous tracks, of simultaneous notes with several attributes. Until now, the proposed tokenizations are based on small vocabularies describing the note attributes and time events, resulting in fairly long token sequences. In this paper, we show how Byte Pair Encoding (BPE) can improve the results of deep learning models while improving its performances. We experiment on music generation and composer classification, and study the impact of BPE on how models learn the embeddings, and show that it can help to increase their isotropy, i.e., the uniformity of the variance of their positions in the space.  ( 2 min )
    Arbitrarily Accurate Classification Applied to Specific Emitter Identification. (arXiv:2211.10379v2 [eess.SP] UPDATED)
    This article introduces a method of evaluating subsamples until any prescribed level of classification accuracy is attained, thus obtaining arbitrary accuracy. A logarithmic reduction in error rate is obtained with a linear increase in sample count. The technique is applied to specific emitter identification on a published dataset of physically recorded over-the-air signals from 16 ostensibly identical high-performance radios. The technique uses a multi-channel deep learning convolutional neural network acting on the bispectra of I/Q signal subsamples each consisting of 56 parts per million (ppm) of the original signal duration. High levels of accuracy are obtained with minimal computation time: in this application, each addition of eight samples decreases error by one order of magnitude.  ( 2 min )
    Multi-task Highly Adaptive Lasso. (arXiv:2301.12029v1 [stat.ML])
    We propose a novel, fully nonparametric approach for the multi-task learning, the Multi-task Highly Adaptive Lasso (MT-HAL). MT-HAL simultaneously learns features, samples and task associations important for the common model, while imposing a shared sparse structure among similar tasks. Given multiple tasks, our approach automatically finds a sparse sharing structure. The proposed MTL algorithm attains a powerful dimension-free convergence rate of $o_p(n^{-1/4})$ or better. We show that MT-HAL outperforms sparsity-based MTL competitors across a wide range of simulation studies, including settings with nonlinear and linear relationships, varying levels of sparsity and task correlations, and different numbers of covariates and sample size.  ( 2 min )
    Skin Lesion Analysis: A Survey, Systematic Review, and Future Trends. (arXiv:2208.12232v2 [eess.IV] UPDATED)
    The Computer-aided Diagnosis or Detection (CAD) approach for skin lesion analysis is an emerging field of research that has the potential to alleviate the burden and cost of skin cancer screening. Researchers have recently indicated increasing interest in developing such CAD systems, with the intention of providing a user-friendly tool to dermatologists to reduce the challenges encountered or associated with manual inspection. This article aims to provide a comprehensive literature survey and review of a total of 594 publications (356 for skin lesion segmentation and 238 for skin lesion classification) published between 2011 and 2022. These articles are analyzed and summarized in a number of different ways to contribute vital information regarding the methods for the development of CAD systems. These ways include relevant and essential definitions and theories, input data (dataset utilization, preprocessing, augmentations, and fixing imbalance problems), method configuration (techniques, architectures, module frameworks, and losses), training tactics (hyperparameter settings), and evaluation criteria. We intend to investigate a variety of performance-enhancing approaches, including ensemble and post-processing. We also discuss these dimensions to reveal their current trends based on utilization frequencies. In addition, we highlight the primary difficulties associated with evaluating skin lesion segmentation and classification systems using minimal datasets, as well as the potential solutions to these difficulties. Findings, recommendations, and trends are disclosed to inform future research on developing an automated and robust CAD system for skin lesion analysis.  ( 2 min )
    Neighborhood Gradient Clustering: An Efficient Decentralized Learning Method for Non-IID Data Distributions. (arXiv:2209.14390v4 [cs.LG] UPDATED)
    Decentralized learning over distributed datasets can have significantly different data distributions across the agents. The current state-of-the-art decentralized algorithms mostly assume the data distributions to be Independent and Identically Distributed. This paper focuses on improving decentralized learning over non-IID data. We propose \textit{Neighborhood Gradient Clustering (NGC)}, a novel decentralized learning algorithm that modifies the local gradients of each agent using self- and cross-gradient information. Cross-gradients for a pair of neighboring agents are the derivatives of the model parameters of an agent with respect to the dataset of the other agent. In particular, the proposed method replaces the local gradients of the model with the weighted mean of the self-gradients, model-variant cross-gradients (derivatives of the neighbors' parameters with respect to the local dataset), and data-variant cross-gradients (derivatives of the local model with respect to its neighbors' datasets). The data-variant cross-gradients are aggregated through an additional communication round without breaking the privacy constraints. Further, we present \textit{CompNGC}, a compressed version of \textit{NGC} that reduces the communication overhead by $32 \times$. We theoretically analyze the convergence rate of the proposed algorithm and demonstrate its efficiency over non-IID data sampled from {various vision and language} datasets trained. Our experiments demonstrate that \textit{NGC} and \textit{CompNGC} outperform (by $0-6\%$) the existing SoTA decentralized learning algorithm over non-IID data with significantly less compute and memory requirements. Further, our experiments show that the model-variant cross-gradient information available locally at each agent can improve the performance over non-IID data by $1-35\%$ without additional communication cost.  ( 3 min )
    Principled Acceleration of Iterative Numerical Methods Using Machine Learning. (arXiv:2206.08594v2 [math.NA] UPDATED)
    Iterative methods are ubiquitous in large-scale scientific computing applications, and a number of approaches based on meta-learning have been recently proposed to accelerate them. However, a systematic study of these approaches and how they differ from meta-learning is lacking. In this paper, we propose a framework to analyze such learning-based acceleration approaches, where one can immediately identify a departure from classical meta-learning. We show that this departure may lead to arbitrary deterioration of model performance. Based on our analysis, we introduce a novel training method for learning-based acceleration of iterative methods. Furthermore, we theoretically prove that the proposed method improves upon the existing methods, and demonstrate its significant advantage and versatility through various numerical applications.  ( 2 min )
    Quantum Ridgelet Transform: Winning Lottery Ticket of Neural Networks with Quantum Computation. (arXiv:2301.11936v1 [quant-ph])
    Ridgelet transform has been a fundamental mathematical tool in the theoretical studies of neural networks. However, the practical applicability of ridgelet transform to conducting learning tasks was limited since its numerical implementation by conventional classical computation requires an exponential runtime $\exp(O(D))$ as data dimension $D$ increases. To address this problem, we develop a quantum ridgelet transform (QRT), which implements the ridgelet transform of a quantum state within a linear runtime $O(D)$ of quantum computation. As an application, we also show that one can use QRT as a fundamental subroutine for quantum machine learning (QML) to efficiently find a sparse trainable subnetwork of large shallow wide neural networks without conducting large-scale optimization of the original network. This application discovers an efficient way in this regime to demonstrate the lottery ticket hypothesis on finding such a sparse trainable neural network. These results open an avenue of QML for accelerating learning tasks with commonly used classical neural networks.  ( 2 min )
    Supervision Complexity and its Role in Knowledge Distillation. (arXiv:2301.12245v1 [cs.LG])
    Despite the popularity and efficacy of knowledge distillation, there is limited understanding of why it helps. In order to study the generalization behavior of a distilled student, we propose a new theoretical framework that leverages supervision complexity: a measure of alignment between teacher-provided supervision and the student's neural tangent kernel. The framework highlights a delicate interplay among the teacher's accuracy, the student's margin with respect to the teacher predictions, and the complexity of the teacher predictions. Specifically, it provides a rigorous justification for the utility of various techniques that are prevalent in the context of distillation, such as early stopping and temperature scaling. Our analysis further suggests the use of online distillation, where a student receives increasingly more complex supervision from teachers in different stages of their training. We demonstrate efficacy of online distillation and validate the theoretical findings on a range of image classification benchmarks and model architectures.  ( 2 min )
    Microstructural parameter estimation using spherical convolutional neural networks. (arXiv:2211.09887v2 [eess.IV] UPDATED)
    Diffusion-weighted magnetic resonance imaging is sensitive to the microstructural properties of brain tissue. However, estimating clinically and scientifically relevant microstructural properties from the measured signals remains a highly challenging inverse problem that deep learning may help solve. This study investigated if recently developed orientationally invariant spherical convolutional neural networks can improve microstructural parameter estimation. A spherical convolutional neural network was trained to predict the ground-truth parameter values from simulated noisy data and applied to imaging data acquired in a clinical setting to generate microstructural parameter maps. The spherical convolutional neural network was more accurate and less orientationally variant than the benchmark methods (multi-layer perceptrons and the spherical mean technique). Our results show that spherical convolutional neural networks can be a compelling alternative to predicting parameters from powder-averaged data (i.e., data averaged over the acquired diffusion encoding directions). While we focused on constrained two- and three-compartment models of neuronal tissue, the presented network and training pipeline are generalizable and can be used to estimate the parameters of other Gaussian compartment models.  ( 2 min )
    Flip Initial Features: Generalization of Neural Networks Under Sparse Features for Semi-supervised Node Classification. (arXiv:2211.15081v4 [cs.LG] UPDATED)
    Graph neural networks (GNNs) have been widely used under semi-supervised settings. Prior studies have mainly focused on finding appropriate graph filters (e.g., aggregation schemes) to generalize well for both homophilic and heterophilic graphs. Even though these approaches are essential and effective, they still suffer from the sparsity in initial node features inherent in the bag-of-words representation. Common in semi-supervised learning where the training samples often fail to cover the entire dimensions of graph filters (hyperplanes), this can precipitate over-fitting of specific dimensions in the first projection matrix. To deal with this problem, we suggest a simple and novel strategy; create additional space by flipping the initial features and hyperplane simultaneously. Training in both the original and in the flip space can provide precise updates of learnable parameters. To the best of our knowledge, this is the first attempt that effectively moderates the overfitting problem in GNN. Extensive experiments on real-world datasets demonstrate that the proposed technique improves the node classification accuracy up to 40.2 %  ( 2 min )
    Neural Error Mitigation of Near-Term Quantum Simulations. (arXiv:2105.08086v2 [quant-ph] UPDATED)
    Near-term quantum computers provide a promising platform for finding ground states of quantum systems, which is an essential task in physics, chemistry, and materials science. Near-term approaches, however, are constrained by the effects of noise as well as the limited resources of near-term quantum hardware. We introduce "neural error mitigation," which uses neural networks to improve estimates of ground states and ground-state observables obtained using near-term quantum simulations. To demonstrate our method's broad applicability, we employ neural error mitigation to find the ground states of the H$_2$ and LiH molecular Hamiltonians, as well as the lattice Schwinger model, prepared via the variational quantum eigensolver (VQE). Our results show that neural error mitigation improves numerical and experimental VQE computations to yield low energy errors, high fidelities, and accurate estimations of more-complex observables like order parameters and entanglement entropy, without requiring additional quantum resources. Furthermore, neural error mitigation is agnostic with respect to the quantum state preparation algorithm used, the quantum hardware it is implemented on, and the particular noise channel affecting the experiment, contributing to its versatility as a tool for quantum simulation.  ( 2 min )
    Reversible Gromov-Monge Sampler for Simulation-Based Inference. (arXiv:2109.14090v4 [stat.ME] UPDATED)
    This paper introduces a new simulation-based inference procedure to model and sample from multi-dimensional probability distributions given access to i.i.d.\ samples, circumventing the usual approaches of explicitly modeling the density function or designing Markov chain Monte Carlo. Motivated by the seminal work on distance and isomorphism between metric measure spaces, we propose a new notion called the Reversible Gromov-Monge (RGM) distance and study how RGM can be used to design new transform samplers to perform simulation-based inference. Our RGM sampler can also estimate optimal alignments between two heterogeneous metric measure spaces $(\cX, \mu, c_{\cX})$ and $(\cY, \nu, c_{\cY})$ from empirical data sets, with estimated maps that approximately push forward one measure $\mu$ to the other $\nu$, and vice versa. We study the analytic properties of the RGM distance and derive that under mild conditions, RGM equals the classic Gromov-Wasserstein distance. Curiously, drawing a connection to Brenier's polar factorization, we show that the RGM sampler induces bias towards strong isomorphism with proper choices of $c_{\cX}$ and $c_{\cY}$. Statistical rate of convergence, representation, and optimization questions regarding the induced sampler are studied. Synthetic and real-world examples showcasing the effectiveness of the RGM sampler are also demonstrated.  ( 2 min )
    Cross-Subject Deep Transfer Models for Evoked Potentials in Brain-Computer Interface. (arXiv:2301.12322v1 [cs.LG])
    Brain Computer Interface (BCI) technologies have the potential to improve the lives of millions of people around the world, whether through assistive technologies or clinical diagnostic tools. Despite advancements in the field, however, at present consumer and clinical viability remains low. A key reason for this is that many of the existing BCI deployments require substantial data collection per end-user, which can be cumbersome, tedious, and error-prone to collect. We address this challenge via a deep learning model, which, when trained across sufficient data from multiple subjects, offers reasonable performance out-of-the-box, and can be customized to novel subjects via a transfer learning process. We demonstrate the fundamental viability of our approach by repurposing an older but well-curated electroencephalography (EEG) dataset and benchmarking against several common approaches/techniques. We then partition this dataset into a transfer learning benchmark and demonstrate that our approach significantly reduces data collection burden per-subject. This suggests that our model and methodology may yield improvements to BCI technologies and enhance their consumer/clinical viability.  ( 2 min )
    DALI: Dynamically Adjusted Label Importance for Noisy Partial Label Learning. (arXiv:2301.12077v1 [cs.CV])
    Noisy partial label learning (noisy PLL) is an important branch of weakly supervised learning. Unlike PLL where the ground-truth label must reside in the candidate set, noisy PLL relaxes this constraint and allows the ground-truth label may not be in the candidate set. To address this problem, existing works attempt to detect noisy samples and estimate the ground-truth label for each noisy sample. However, detection errors are inevitable, and these errors will accumulate during training and continuously affect model optimization. To address this challenge, we propose a novel framework for noisy PLL, called ``Dynamically Adjusted Label Importance (DALI)''. It aims to reduce the negative impact of detection errors by trading off the initial candidate set and model outputs with theoretical guarantees. Experimental results on multiple datasets demonstrate that our DALI succeeds over existing state-of-the-art approaches on noisy PLL. Our code will soon be publicly available.  ( 2 min )
    Gradient Shaping: Enhancing Backdoor Attack Against Reverse Engineering. (arXiv:2301.12318v1 [cs.CR])
    Most existing methods to detect backdoored machine learning (ML) models take one of the two approaches: trigger inversion (aka. reverse engineer) and weight analysis (aka. model diagnosis). In particular, the gradient-based trigger inversion is considered to be among the most effective backdoor detection techniques, as evidenced by the TrojAI competition, Trojan Detection Challenge and backdoorBench. However, little has been done to understand why this technique works so well and, more importantly, whether it raises the bar to the backdoor attack. In this paper, we report the first attempt to answer this question by analyzing the change rate of the backdoored model around its trigger-carrying inputs. Our study shows that existing attacks tend to inject the backdoor characterized by a low change rate around trigger-carrying inputs, which are easy to capture by gradient-based trigger inversion. In the meantime, we found that the low change rate is not necessary for a backdoor attack to succeed: we design a new attack enhancement called \textit{Gradient Shaping} (GRASP), which follows the opposite direction of adversarial training to reduce the change rate of a backdoored model with regard to the trigger, without undermining its backdoor effect. Also, we provide a theoretic analysis to explain the effectiveness of this new technique and the fundamental weakness of gradient-based trigger inversion. Finally, we perform both theoretical and experimental analysis, showing that the GRASP enhancement does not reduce the effectiveness of the stealthy attacks against the backdoor detection methods based on weight analysis, as well as other backdoor mitigation methods without using detection.  ( 2 min )
    Deciphering the Projection Head: Representation Evaluation Self-supervised Learning. (arXiv:2301.12189v1 [cs.LG])
    Self-supervised learning (SSL) aims to learn intrinsic features without labels. Despite the diverse architectures of SSL methods, the projection head always plays an important role in improving the performance of the downstream task. In this work, we systematically investigate the role of the projection head in SSL. Specifically, the projection head targets the uniformity part of SSL, which pushes the dissimilar samples away from each other, thus enabling the encoder to focus on extracting semantic features. Based on this understanding, we propose a Representation Evaluation Design (RED) in SSL models in which a shortcut connection between the representation and the projection vectors is built. Extensive experiments with different architectures, including SimCLR, MoCo-V2, and SimSiam, on various datasets, demonstrate that the representation evaluation design can consistently improve the baseline models in the downstream tasks. The learned representation from the RED-SSL models shows superior robustness to unseen augmentations and out-of-distribution data.  ( 2 min )
    In-Distribution Barrier Functions: Self-Supervised Policy Filters that Avoid Out-of-Distribution States. (arXiv:2301.12012v1 [cs.RO])
    Learning-based control approaches have shown great promise in performing complex tasks directly from high-dimensional perception data for real robotic systems. Nonetheless, the learned controllers can behave unexpectedly if the trajectories of the system divert from the training data distribution, which can compromise safety. In this work, we propose a control filter that wraps any reference policy and effectively encourages the system to stay in-distribution with respect to offline-collected safe demonstrations. Our methodology is inspired by Control Barrier Functions (CBFs), which are model-based tools from the nonlinear control literature that can be used to construct minimally invasive safe policy filters. While existing methods based on CBFs require a known low-dimensional state representation, our proposed approach is directly applicable to systems that rely solely on high-dimensional visual observations by learning in a latent state-space. We demonstrate that our method is effective for two different visuomotor control tasks in simulation environments, including both top-down and egocentric view settings.  ( 2 min )
    Learning Optimal Features via Partial Invariance. (arXiv:2301.12067v1 [cs.LG])
    Learning models that are robust to test-time distribution shifts is a key concern in domain generalization, and in the wider context of their real-life applicability. Invariant Risk Minimization (IRM) is one particular framework that aims to learn deep invariant features from multiple domains and has subsequently led to further variants. A key assumption for the success of these methods requires that the underlying causal mechanisms/features remain invariant across domains and the true invariant features be sufficient to learn the optimal predictor. In practical problem settings, these assumptions are often not satisfied, which leads to IRM learning a sub-optimal predictor for that task. In this work, we propose the notion of partial invariance as a relaxation of the IRM framework. Under our problem setting, we first highlight the sub-optimality of the IRM solution. We then demonstrate how partitioning the training domains, assuming access to some meta-information about the domains, can help improve the performance of invariant models via partial invariance. Finally, we conduct several experiments, both in linear settings as well as with classification tasks in language and images with deep models, which verify our conclusions.  ( 2 min )
    Beyond Exponentially Fast Mixing in Average-Reward Reinforcement Learning via Multi-Level Monte Carlo Actor-Critic. (arXiv:2301.12083v1 [cs.LG])
    Many existing reinforcement learning (RL) methods employ stochastic gradient iteration on the back end, whose stability hinges upon a hypothesis that the data-generating process mixes exponentially fast with a rate parameter that appears in the step-size selection. Unfortunately, this assumption is violated for large state spaces or settings with sparse rewards, and the mixing time is unknown, making the step size inoperable. In this work, we propose an RL methodology attuned to the mixing time by employing a multi-level Monte Carlo estimator for the critic, the actor, and the average reward embedded within an actor-critic (AC) algorithm. This method, which we call \textbf{M}ulti-level \textbf{A}ctor-\textbf{C}ritic (MAC), is developed especially for infinite-horizon average-reward settings and neither relies on oracle knowledge of the mixing time in its parameter selection nor assumes its exponential decay; it, therefore, is readily applicable to applications with slower mixing times. Nonetheless, it achieves a convergence rate comparable to the state-of-the-art AC algorithms. We experimentally show that these alleviated restrictions on the technical conditions required for stability translate to superior performance in practice for RL problems with sparse rewards.  ( 2 min )
    Using uncertainty-aware machine learning models to study aerosol-cloud interactions. (arXiv:2301.11921v1 [physics.data-an])
    Aerosol-cloud interactions (ACI) include various effects that result from aerosols entering a cloud, and affecting cloud properties. In general, an increase in aerosol concentration results in smaller droplet sizes which leads to larger, brighter, longer-lasting clouds that reflect more sunlight and cool the Earth. The strength of the effect is however heterogeneous, meaning it depends on the surrounding environment, making ACI one of the most uncertain effects in our current climate models. In our work, we use causal machine learning to estimate ACI from satellite observations by reframing the problem as a treatment (aerosol) and outcome (change in droplet radius). We predict the causal effect of aerosol on clouds with uncertainty bounds depending on the unknown factors that may be influencing the impact of aerosol. Of the three climate models evaluated, we find that only one plausibly recreates the trend, lending more credence to its estimate cooling due to ACI.  ( 2 min )
    Statistical whitening of neural populations with gain-modulating interneurons. (arXiv:2301.11955v1 [q-bio.NC])
    Statistical whitening transformations play a fundamental role in many computational systems, and may also play an important role in biological sensory systems. Individual neurons appear to rapidly and reversibly alter their input-output gains, approximately normalizing the variance of their responses. Populations of neurons appear to regulate their joint responses, reducing correlations between neural activities. It is natural to see whitening as the objective that guides these behaviors, but the mechanism for such joint changes is unknown, and direct adjustment of synaptic interactions would seem to be both too slow, and insufficiently reversible. Motivated by the extensive neuroscience literature on rapid gain modulation, we propose a recurrent network architecture in which joint whitening is achieved through modulation of gains within the circuit. Specifically, we derive an online statistical whitening algorithm that regulates the joint second-order statistics of a multi-dimensional input by adjusting the marginal variances of an overcomplete set of interneuron projections. The gains of these interneurons are adjusted individually, using only local signals, and feed back onto the primary neurons. The network converges to a state in which the responses of the primary neurons are whitened. We demonstrate through simulations that the behavior of the network is robust to poor conditioning or noise when the gains are sign-constrained, and can be generalized to achieve a form of local whitening in convolutional populations, such as those found throughout the visual or auditory system.  ( 2 min )
    Towards Learning Rubik's Cube with N-tuple-based Reinforcement Learning. (arXiv:2301.12167v1 [cs.LG])
    This work describes in detail how to learn and solve the Rubik's cube game (or puzzle) in the General Board Game (GBG) learning and playing framework. We cover the cube sizes 2x2x2 and 3x3x3. We describe in detail the cube's state representation, how to transform it with twists, whole-cube rotations and color transformations and explain the use of symmetries in Rubik's cube. Next, we discuss different n-tuple representations for the cube, how we train the agents by reinforcement learning and how we improve the trained agents during evaluation by MCTS wrapping. We present results for agents that learn Rubik's cube from scratch, with and without MCTS wrapping, with and without symmetries and show that both, MCTS wrapping and symmetries, increase computational costs, but lead at the same time to much better results. We can solve the 2x2x2 cube completely, and the 3x3x3 cube in the majority of the cases for scrambled cubes up to p = 15 (QTM). We cannot yet reliably solve 3x3x3 cubes with more than 15 scrambling twists. Although our computational costs are higher with MCTS wrapping and with symmetries than without, they are still considerably lower than in the approaches of McAleer et al. (2018, 2019) and Agostinelli et al. (2019) who provide the best Rubik's cube learning agents so far.  ( 2 min )
  • Open

    Theoretical Perspectives on Deep Learning Methods in Inverse Problems. (arXiv:2206.14373v2 [stat.ML] UPDATED)
    In recent years, there have been significant advances in the use of deep learning methods in inverse problems such as denoising, compressive sensing, inpainting, and super-resolution. While this line of works has predominantly been driven by practical algorithms and experiments, it has also given rise to a variety of intriguing theoretical problems. In this paper, we survey some of the prominent theoretical developments in this line of works, focusing in particular on generative priors, untrained neural network priors, and unfolding algorithms. In addition to summarizing existing results in these topics, we highlight several ongoing challenges and open problems.  ( 2 min )
    Asymptotic Inference for Multi-Stage Stationary Treatment Policy with High Dimensional Features. (arXiv:2301.12553v1 [stat.ML])
    Dynamic treatment rules or policies are a sequence of decision functions over multiple stages that are tailored to individual features. One important class of treatment policies for practice, namely multi-stage stationary treatment policies, prescribe treatment assignment probabilities using the same decision function over stages, where the decision is based on the same set of features consisting of both baseline variables (e.g., demographics) and time-evolving variables (e.g., routinely collected disease biomarkers). Although there has been extensive literature to construct valid inference for the value function associated with the dynamic treatment policies, little work has been done for the policies themselves, especially in the presence of high dimensional feature variables. We aim to fill in the gap in this work. Specifically, we first estimate the multistage stationary treatment policy based on an augmented inverse probability weighted estimator for the value function to increase the asymptotic efficiency, and further apply a penalty to select important feature variables. We then construct one-step improvement of the policy parameter estimators. Theoretically, we show that the improved estimators are asymptotically normal, even if nuisance parameters are estimated at a slow convergence rate and the dimension of the feature variables increases exponentially with the sample size. Our numerical studies demonstrate that the proposed method has satisfactory performance in small samples, and that the performance can be improved with a choice of the augmentation term that approximates the rewards or minimizes the variance of the value function.  ( 2 min )
    Solving high-dimensional Hamilton-Jacobi-Bellman PDEs using neural networks: perspectives from the theory of controlled diffusions and measures on path space. (arXiv:2005.05409v2 [math.OC] UPDATED)
    Optimal control of diffusion processes is intimately connected to the problem of solving certain Hamilton-Jacobi-Bellman equations. Building on recent machine learning inspired approaches towards high-dimensional PDEs, we investigate the potential of $\textit{iterative diffusion optimisation}$ techniques, in particular considering applications in importance sampling and rare event simulation, and focusing on problems without diffusion control, with linearly controlled drift and running costs that depend quadratically on the control. More generally, our methods apply to nonlinear parabolic PDEs with a certain shift invariance. The choice of an appropriate loss function being a central element in the algorithmic design, we develop a principled framework based on divergences between path measures, encompassing various existing methods. Motivated by connections to forward-backward SDEs, we propose and study the novel $\textit{log-variance}$ divergence, showing favourable properties of corresponding Monte Carlo estimators. The promise of the developed approach is exemplified by a range of high-dimensional and metastable numerical examples.  ( 2 min )
    Distributed Stochastic Optimization under a General Variance Condition. (arXiv:2301.12677v1 [math.OC])
    Distributed stochastic optimization has drawn great attention recently due to its effectiveness in solving large-scale machine learning problems. However, despite that numerous algorithms have been proposed with empirical successes, their theoretical guarantees are restrictive and rely on certain boundedness conditions on the stochastic gradients, varying from uniform boundedness to the relaxed growth condition. In addition, how to characterize the data heterogeneity among the agents and its impacts on the algorithmic performance remains challenging. In light of such motivations, we revisit the classical FedAvg algorithm for solving the distributed stochastic optimization problem and establish the convergence results under only a mild variance condition on the stochastic gradients for smooth nonconvex objective functions. Almost sure convergence to a stationary point is also established under the condition. Moreover, we discuss a more informative measurement for data heterogeneity as well as its implications.  ( 2 min )
    Robust Stochastic Linear Contextual Bandits Under Adversarial Attacks. (arXiv:2106.02978v3 [stat.ML] UPDATED)
    Stochastic linear contextual bandit algorithms have substantial applications in practice, such as recommender systems, online advertising, clinical trials, etc. Recent works show that optimal bandit algorithms are vulnerable to adversarial attacks and can fail completely in the presence of attacks. Existing robust bandit algorithms only work for the non-contextual setting under the attack of rewards and cannot improve the robustness in the general and popular contextual bandit environment. In addition, none of the existing methods can defend against attacked context. In this work, we provide the first robust bandit algorithm for stochastic linear contextual bandit setting under a fully adaptive and omniscient attack with sub-linear regret. Our algorithm not only works under the attack of rewards, but also under attacked context. Moreover, it does not need any information about the attack budget or the particular form of the attack. We provide theoretical guarantees for our proposed algorithm and show by experiments that our proposed algorithm improves the robustness against various kinds of popular attacks.  ( 2 min )
    Singularity-aware Reinforcement Learning. (arXiv:2301.13152v1 [stat.ML])
    Batch reinforcement learning (RL) aims at finding an optimal policy in a dynamic environment in order to maximize the expected total rewards by leveraging pre-collected data. A fundamental challenge behind this task is the distributional mismatch between the batch data generating process and the distribution induced by target policies. Nearly all existing algorithms rely on the absolutely continuous assumption on the distribution induced by target policies with respect to the data distribution so that the batch data can be used to calibrate target policies via the change of measure. However, the absolute continuity assumption could be violated in practice, especially when the state-action space is large or continuous. In this paper, we propose a new batch RL algorithm without requiring absolute continuity in the setting of an infinite-horizon Markov decision process with continuous states and actions. We call our algorithm STEEL: SingulariTy-awarE rEinforcement Learning. Our algorithm is motivated by a new error analysis on off-policy evaluation, where we use maximum mean discrepancy, together with distributionally robust optimization, to characterize the error of off-policy evaluation caused by the possible singularity and to enable the power of model extrapolation. By leveraging the idea of pessimism and under some mild conditions, we derive a finite-sample regret guarantee for our proposed algorithm without imposing absolute continuity. Compared with existing algorithms, STEEL only requires some minimal data-coverage assumption and thus greatly enhances the applicability and robustness of batch RL. Extensive simulation studies and one real experiment on personalized pricing demonstrate the superior performance of our method when facing possible singularity in batch RL.  ( 2 min )
    Risk-Averse Model Uncertainty for Distributionally Robust Safe Reinforcement Learning. (arXiv:2301.12593v1 [cs.LG])
    Many real-world domains require safe decision making in the presence of uncertainty. In this work, we propose a deep reinforcement learning framework for approaching this important problem. We consider a risk-averse perspective towards model uncertainty through the use of coherent distortion risk measures, and we show that our formulation is equivalent to a distributionally robust safe reinforcement learning problem with robustness guarantees on performance and safety. We propose an efficient implementation that only requires access to a single training environment, and we demonstrate that our framework produces robust, safe performance on a variety of continuous control tasks with safety constraints in the Real-World Reinforcement Learning Suite.  ( 2 min )
    MetaStackVis: Visually-Assisted Performance Evaluation of Metamodels. (arXiv:2212.03539v2 [cs.LG] UPDATED)
    Stacking (or stacked generalization) is an ensemble learning method with one main distinctiveness from the rest: even though several base models are trained on the original data set, their predictions are further used as input data for one or more metamodels arranged in at least one extra layer. Composing a stack of models can produce high-performance outcomes, but it usually involves a trial-and-error process. Therefore, our previously developed visual analytics system, StackGenVis, was mainly designed to assist users in choosing a set of top-performing and diverse models by measuring their predictive performance. However, it only employs a single logistic regression metamodel. In this paper, we investigate the impact of alternative metamodels on the performance of stacking ensembles using a novel visualization tool, called MetaStackVis. Our interactive tool helps users to visually explore different singular and pairs of metamodels according to their predictive probabilities and multiple validation metrics, as well as their ability to predict specific problematic data instances. MetaStackVis was evaluated with a usage scenario based on a medical data set and via expert interviews.  ( 2 min )
    Deep Riemannian Networks for EEG Decoding. (arXiv:2212.10426v3 [cs.LG] UPDATED)
    State-of-the-art performance in electroencephalography (EEG) decoding tasks is currently often achieved with either Deep-Learning or Riemannian-Geometry-based decoders. Recently, there is growing interest in Deep Riemannian Networks (DRNs) possibly combining the advantages of both previous classes of methods. However, there are still a range of topics where additional insight is needed to pave the way for a more widespread application of DRNs in EEG. These include architecture design questions such as network size and end-to-end ability as well as model training questions. How these factors affect model performance has not been explored. Additionally, it is not clear how the data within these networks is transformed, and whether this would correlate with traditional EEG decoding. Our study aims to lay the groundwork in the area of these topics through the analysis of DRNs for EEG with a wide range of hyperparameters. Networks were tested on two public EEG datasets and compared with state-of-the-art ConvNets. Here we propose end-to-end EEG SPDNet (EE(G)-SPDNet), and we show that this wide, end-to-end DRN can outperform the ConvNets, and in doing so use physiologically plausible frequency regions. We also show that the end-to-end approach learns more complex filters than traditional band-pass filters targeting the classical alpha, beta, and gamma frequency bands of the EEG, and that performance can benefit from channel specific filtering approaches. Additionally, architectural analysis revealed areas for further improvement due to the possible loss of Riemannian specific information throughout the network. Our study thus shows how to design and train DRNs to infer task-related information from the raw EEG without the need of handcrafted filterbanks and highlights the potential of end-to-end DRNs such as EE(G)-SPDNet for high-performance EEG decoding.  ( 3 min )
    Thompson Sampling for High-Dimensional Sparse Linear Contextual Bandits. (arXiv:2211.05964v2 [stat.ML] UPDATED)
    We consider the stochastic linear contextual bandit problem with high-dimensional features. We analyze the Thompson sampling algorithm using special classes of sparsity-inducing priors (e.g., spike-and-slab) to model the unknown parameter and provide a nearly optimal upper bound on the expected cumulative regret. To the best of our knowledge, this is the first work that provides theoretical guarantees of Thompson sampling in high-dimensional and sparse contextual bandits. For faster computation, we use variational inference instead of Markov Chain Monte Carlo (MCMC) to approximate the posterior distribution. Extensive simulations demonstrate the improved performance of our proposed algorithm over existing ones.  ( 2 min )
    Inequality Constrained Stochastic Nonlinear Optimization via Active-Set Sequential Quadratic Programming. (arXiv:2109.11502v3 [math.OC] UPDATED)
    We study nonlinear optimization problems with a stochastic objective and deterministic equality and inequality constraints, which emerge in numerous applications including finance, manufacturing, power systems and, recently, deep neural networks. We propose an active-set stochastic sequential quadratic programming (StoSQP) algorithm that utilizes a differentiable exact augmented Lagrangian as the merit function. The algorithm adaptively selects the penalty parameters of the augmented Lagrangian and performs a stochastic line search to decide the stepsize. The global convergence is established: for any initialization, the KKT residuals converge to zero almost surely. Our algorithm and analysis further develop the prior work of Na et al., (2022). Specifically, we allow nonlinear inequality constraints without requiring the strict complementary condition; refine some of the designs in Na et al., (2022) such as the feasibility error condition and the monotonically increasing sample size; strengthen the global convergence guarantee; and improve the sample complexity on the objective Hessian. We demonstrate the performance of the designed algorithm on a subset of nonlinear problems collected in CUTEst test set and on constrained logistic regression problems.  ( 2 min )
    Generalization on the Unseen, Logic Reasoning and Degree Curriculum. (arXiv:2301.13105v1 [cs.LG])
    This paper considers the learning of logical (Boolean) functions with focus on the generalization on the unseen (GOTU) setting, a strong case of out-of-distribution generalization. This is motivated by the fact that the rich combinatorial nature of data in certain reasoning tasks (e.g., arithmetic/logic) makes representative data sampling challenging, and learning successfully under GOTU gives a first vignette of an 'extrapolating' or 'reasoning' learner. We then study how different network architectures trained by (S)GD perform under GOTU and provide both theoretical and experimental evidence that for a class of network models including instances of Transformers, random features models, and diagonal linear networks, a min-degree-interpolator (MDI) is learned on the unseen. We also provide evidence that other instances with larger learning rates or mean-field networks reach leaky MDIs. These findings lead to two implications: (1) we provide an explanation to the length generalization problem (e.g., Anil et al. 2022); (2) we introduce a curriculum learning algorithm called Degree-Curriculum that learns monomials more efficiently by incrementing supports.  ( 2 min )
    Mirror Sinkhorn: Fast Online Optimization on Transport Polytopes. (arXiv:2211.10420v2 [cs.LG] UPDATED)
    Optimal transport is an important tool in machine learning, allowing to capture geometric properties of the data through a linear program on transport polytopes. We present a single-loop optimization algorithm for minimizing general convex objectives on these domains, utilizing the principles of Sinkhorn matrix scaling and mirror descent. The proposed algorithm is robust to noise, and can be used in an online setting. We provide theoretical guarantees for convex objectives and experimental results showcasing it effectiveness on both synthetic and real-world data.  ( 2 min )
    SeaD: End-to-end Text-to-SQL Generation with Schema-aware Denoising. (arXiv:2105.07911v2 [cs.CL] UPDATED)
    In text-to-SQL task, seq-to-seq models often lead to sub-optimal performance due to limitations in their architecture. In this paper, we present a simple yet effective approach that adapts transformer-based seq-to-seq model to robust text-to-SQL generation. Instead of inducing constraint to decoder or reformat the task as slot-filling, we propose to train seq-to-seq model with Schema aware Denoising (SeaD), which consists of two denoising objectives that train model to either recover input or predict output from two novel erosion and shuffle noises. These denoising objectives acts as the auxiliary tasks for better modeling the structural data in S2S generation. In addition, we improve and propose a clause-sensitive execution guided (EG) decoding strategy to overcome the limitation of EG decoding for generative model. The experiments show that the proposed method improves the performance of seq-to-seq model in both schema linking and grammar correctness and establishes new state-of-the-art on WikiSQL benchmark. The results indicate that the capacity of vanilla seq-to-seq architecture for text-to-SQL may have been under-estimated.  ( 2 min )
    Jump Interval-Learning for Individualized Decision Making. (arXiv:2111.08885v2 [stat.ME] UPDATED)
    An individualized decision rule (IDR) is a decision function that assigns each individual a given treatment based on his/her observed characteristics. Most of the existing works in the literature consider settings with binary or finitely many treatment options. In this paper, we focus on the continuous treatment setting and propose a jump interval-learning to develop an individualized interval-valued decision rule (I2DR) that maximizes the expected outcome. Unlike IDRs that recommend a single treatment, the proposed I2DR yields an interval of treatment options for each individual, making it more flexible to implement in practice. To derive an optimal I2DR, our jump interval-learning method estimates the conditional mean of the outcome given the treatment and the covariates via jump penalized regression, and derives the corresponding optimal I2DR based on the estimated outcome regression function. The regressor is allowed to be either linear for clear interpretation or deep neural network to model complex treatment-covariates interactions. To implement jump interval-learning, we develop a searching algorithm based on dynamic programming that efficiently computes the outcome regression function. Statistical properties of the resulting I2DR are established when the outcome regression function is either a piecewise or continuous function over the treatment space. We further develop a procedure to infer the mean outcome under the (estimated) optimal policy. Extensive simulations and a real data application to a warfarin study are conducted to demonstrate the empirical validity of the proposed I2DR.  ( 2 min )
    Doubly Robust Interval Estimation for Optimal Policy Evaluation in Online Learning. (arXiv:2110.15501v3 [stat.ML] UPDATED)
    Evaluating the performance of an ongoing policy plays a vital role in many areas such as medicine and economics, to provide crucial instruction on the early-stop of the online experiment and timely feedback from the environment. Policy evaluation in online learning thus attracts increasing attention by inferring the mean outcome of the optimal policy (i.e., the value) in real-time. Yet, such a problem is particularly challenging due to the dependent data generated in the online environment, the unknown optimal policy, and the complex exploration and exploitation trade-off in the adaptive experiment. In this paper, we aim to overcome these difficulties in policy evaluation for online learning. We explicitly derive the probability of exploration that quantifies the probability of exploring the non-optimal actions under commonly used bandit algorithms. We use this probability to conduct valid inference on the online conditional mean estimator under each action and develop the doubly robust interval estimation (DREAM) method to infer the value under the estimated optimal policy in online learning. The proposed value estimator provides double protection on the consistency and is asymptotically normal with a Wald-type confidence interval provided. Extensive simulations and real data applications are conducted to demonstrate the empirical validity of the proposed DREAM method.  ( 2 min )
    Spherical Sliced-Wasserstein. (arXiv:2206.08780v2 [stat.ML] UPDATED)
    Many variants of the Wasserstein distance have been introduced to reduce its original computational burden. In particular the Sliced-Wasserstein distance (SW), which leverages one-dimensional projections for which a closed-form solution of the Wasserstein distance is available, has received a lot of interest. Yet, it is restricted to data living in Euclidean spaces, while the Wasserstein distance has been studied and used recently on manifolds. We focus more specifically on the sphere, for which we define a novel SW discrepancy, which we call spherical Sliced-Wasserstein, making a first step towards defining SW discrepancies on manifolds. Our construction is notably based on closed-form solutions of the Wasserstein distance on the circle, together with a new spherical Radon transform. Along with efficient algorithms and the corresponding implementations, we illustrate its properties in several machine learning use cases where spherical representations of data are at stake: sampling on the sphere, density estimation on real earth data or hyperspherical auto-encoders.  ( 2 min )
    Simulation-Based Inference with Waldo: Confidence Regions by Leveraging Prediction Algorithms or Posterior Estimators for Inverse Problems. (arXiv:2205.15680v3 [stat.ML] UPDATED)
    Predictive algorithms, such as deep neural networks (DNNs), are used in many domain sciences to directly estimate internal parameters of interest in simulator-based models, especially in settings where the observations include images or other complex high-dimensional data. In parallel, modern neural density estimators, such as normalizing flows, are becoming increasingly popular for uncertainty quantification, especially when both parameters and observations are high-dimensional. However, parameter inference is an inverse problem and not a prediction task; thus, an open challenge is to construct conditionally valid and precise confidence regions, with a guaranteed probability of covering the true parameters of the data-generating process, no matter what the (unknown) parameter values are, and without relying on large-sample theory. Many simulator-based inference (SBI) methods are indeed known to produce biased or overly confident parameter regions, yielding misleading uncertainty estimates. This paper presents WALDO, a novel method for constructing confidence regions with finite-sample conditional validity by leveraging prediction algorithms or posterior estimators that are currently widely adopted in SBI. WALDO reframes the well-known Wald test statistic, and uses a computationally efficient regression-based machinery for classical Neyman inversion of hypothesis tests. We apply our method to a recent high-energy physics problem, where prediction with DNNs has previously led to estimates with prediction bias. We also illustrate how our approach can correct overly confident posterior regions computed with normalizing flows.  ( 2 min )
    A semi-agnostic ansatz with variable structure for quantum machine learning. (arXiv:2103.06712v3 [quant-ph] UPDATED)
    Quantum machine learning (QML) offers a powerful, flexible paradigm for programming near-term quantum computers, with applications in chemistry, metrology, materials science, data science, and mathematics. Here, one trains an ansatz, in the form of a parameterized quantum circuit, to accomplish a task of interest. However, challenges have recently emerged suggesting that deep ansatzes are difficult to train, due to flat training landscapes caused by randomness or by hardware noise. This motivates our work, where we present a variable structure approach to build ansatzes for QML. Our approach, called VAns (Variable Ansatz), applies a set of rules to both grow and (crucially) remove quantum gates in an informed manner during the optimization. Consequently, VAns is ideally suited to mitigate trainability and noise-related issues by keeping the ansatz shallow. We employ VAns in the variational quantum eigensolver for condensed matter and quantum chemistry applications, in the quantum autoencoder for data compression and in unitary compilation problems showing successful results in all cases.  ( 2 min )
    Online Self-Concordant and Relatively Smooth Minimization, With Applications to Online Portfolio Selection and Learning Quantum States. (arXiv:2210.00997v2 [stat.ML] UPDATED)
    Consider an online convex optimization problem where the loss functions are self-concordant barriers, smooth relative to a convex function $h$, and possibly non-Lipschitz. We analyze the regret of online mirror descent with $h$. Then, based on the result, we prove the following in a unified manner. Denote by $T$ the time horizon and $d$ the parameter dimension. 1. For online portfolio selection, the regret of $\widetilde{\text{EG}}$, a variant of exponentiated gradient due to Helmbold et al., is $\tilde{O} ( T^{2/3} d^{1/3} )$ when $T > 4 d / \log d$. This improves on the original $\tilde{O} ( T^{3/4} d^{1/2} )$ regret bound for $\widetilde{\text{EG}}$. 2. For online portfolio selection, the regret of online mirror descent with the logarithmic barrier is $\tilde{O}(\sqrt{T d})$. The regret bound is the same as that of Soft-Bayes due to Orseau et al. up to logarithmic terms. 3. For online learning quantum states with the logarithmic loss, the regret of online mirror descent with the log-determinant function is also $\tilde{O} ( \sqrt{T d} )$. Its per-iteration time is shorter than all existing algorithms we know.  ( 2 min )
    Integrating Earth Observation Data into Causal Inference: Challenges and Opportunities. (arXiv:2301.12985v1 [stat.ML])
    Observational studies require adjustment for confounding factors that are correlated with both the treatment and outcome. In the setting where the observed variables are tabular quantities such as average income in a neighborhood, tools have been developed for addressing such confounding. However, in many parts of the developing world, features about local communities may be scarce. In this context, satellite imagery can play an important role, serving as a proxy for the confounding variables otherwise unobserved. In this paper, we study confounder adjustment in this non-tabular setting, where patterns or objects found in satellite images contribute to the confounder bias. Using the evaluation of anti-poverty aid programs in Africa as our running example, we formalize the challenge of performing causal adjustment with such unstructured data -- what conditions are sufficient to identify causal effects, how to perform estimation, and how to quantify the ways in which certain aspects of the unstructured image object are most predictive of the treatment decision. Via simulation, we also explore the sensitivity of satellite image-based observational inference to image resolution and to misspecification of the image-associated confounder. Finally, we apply these tools in estimating the effect of anti-poverty interventions in African communities from satellite imagery.  ( 2 min )
    Lossy Image Compression with Conditional Diffusion Models. (arXiv:2209.06950v4 [eess.IV] UPDATED)
    This paper outlines an end-to-end optimized lossy image compression framework using diffusion generative models. The approach relies on the transform coding paradigm, where an image is mapped into a latent space for entropy coding and, from there, mapped back to the data space for reconstruction. In contrast to VAE-based neural compression, where the (mean) decoder is a deterministic neural network, our decoder is a conditional diffusion model. Our approach thus introduces an additional ``content'' latent variable on which the reverse diffusion process is conditioned and uses this variable to store information about the image. The remaining ``texture'' latent variables characterizing the diffusion process are synthesized (stochastically or deterministically) at decoding time. We show that the model's performance can be tuned toward perceptual metrics of interest. Our extensive experiments involving five datasets and sixteen image quality assessment metrics show that our approach yields the strongest reported FID scores while also yielding competitive performance with state-of-the-art models in several SIM-based reference metrics.  ( 2 min )
    Data Heterogeneity Differential Privacy: From Theory to Algorithm. (arXiv:2002.08578v2 [cs.LG] UPDATED)
    Traditionally, the random noise is equally injected when training with different data instances in the field of differential privacy (DP). In this paper, we first give sharper excess risk bounds of DP stochastic gradient descent (SGD) method. Considering most of the previous methods are under convex conditions, we use Polyak-{\L}ojasiewicz condition to relax it in this paper. Then, after observing that different training data instances affect the machine learning model to different extent, we consider the heterogeneity of training data and attempt to improve the performance of DP-SGD from a new perspective. Specifically, by introducing the influence function (IF), we quantitatively measure the contributions of various training data on the final machine learning model. If the contribution made by a single data instance is so little that attackers cannot infer anything from the model, we do not add noise when training with it. Based on this observation, we design a `Performance Improving' DP-SGD algorithm: PIDP-SGD. Theoretical and experimental results show that our proposed PIDP-SGD improves the performance significantly.  ( 2 min )
    Efficient functional estimation and the super-oracle phenomenon. (arXiv:1904.09347v2 [math.ST] UPDATED)
    We consider the estimation of two-sample integral functionals, of the type that occur naturally, for example, when the object of interest is a divergence between unknown probability densities. Our first main result is that, in wide generality, a weighted nearest neighbour estimator is efficient, in the sense of achieving the local asymptotic minimax lower bound. Moreover, we also prove a corresponding central limit theorem, which facilitates the construction of asymptotically valid confidence intervals for the functional, having asymptotically minimal width. One interesting consequence of our results is the discovery that, for certain functionals, the worst-case performance of our estimator may improve on that of the natural `oracle' estimator, which is given access to the values of the unknown densities at the observations.  ( 2 min )
    Policy Targeting under Network Interference. (arXiv:1906.10258v13 [econ.EM] UPDATED)
    This paper studies the problem of optimally allocating treatments in the presence of spillover effects, using information from a (quasi-)experiment. I introduce a method that maximizes the sample analog of average social welfare when spillovers occur. I construct semi-parametric welfare estimators with known and unknown propensity scores and cast the optimization problem into a mixed-integer linear program, which can be solved using off-the-shelf algorithms. I derive a strong set of guarantees on regret, i.e., the difference between the maximum attainable welfare and the welfare evaluated at the estimated policy. The proposed method presents attractive features for applications: (i) it does not require network information of the target population; (ii) it exploits heterogeneity in treatment effects for targeting individuals; (iii) it does not rely on the correct specification of a particular structural model; and (iv) it accommodates constraints on the policy function. An application for targeting information on social networks illustrates the advantages of the method.  ( 2 min )
    Learning Mixtures of Markov Chains and MDPs. (arXiv:2211.09403v2 [stat.ML] UPDATED)
    We present an algorithm for learning mixtures of Markov chains and Markov decision processes (MDPs) from short unlabeled trajectories. Specifically, our method handles mixtures of Markov chains with optional control input by going through a multi-step process, involving (1) a subspace estimation step, (2) spectral clustering of trajectories using "pairwise distance estimators," along with refinement using the EM algorithm, (3) a model estimation step, and (4) a classification step for predicting labels of new trajectories. We provide end-to-end performance guarantees, where we only explicitly require the length of trajectories to be linear in the number of states and the number of trajectories to be linear in a mixing time parameter. Experimental results support these guarantees, where we attain 96.6% average accuracy on a mixture of two MDPs in gridworld, outperforming the EM algorithm with random initialization (73.2% average accuracy).  ( 2 min )
    Technical Reports Compilation: Detecting the Fire Drill Anti-pattern Using Source Code and Issue-Tracking Data. (arXiv:2104.15090v8 [cs.SE] UPDATED)
    Detecting the presence of project management anti-patterns (AP) currently requires experts on the matter and is an expensive endeavor. Worse, experts may introduce their individual subjectivity or bias. Using the Fire Drill AP, we first introduce a novel way to translate descriptions into detectable AP that are comprised of arbitrary metrics and events such as logged time or maintenance activities, which are mined from the underlying source code or issue-tracking data, thus making the description objective as it becomes data-based. Secondly, we demonstrate a novel method to quantify and score the deviations of real-world projects to data-based AP descriptions. Using fifteen real-world projects that exhibit a Fire Drill to some degree, we show how to further enhance the translated AP. The ground truth in these projects was extracted from two individual experts and consensus was found between them. We introduce a novel method called automatic calibration, that optimizes a pattern such that only necessary and important scores remain that suffice to confidently detect the degree to which the AP is present. Without automatic calibration, the proposed patterns show only weak potential for detecting the presence. Enriching the AP with data from real-world projects significantly improves the potential. We also introduce a no-pattern approach that exploits the ground truth for establishing a new, quantitative understanding of the phenomenon, as well as for finding gray-/black-box predictive models. We conclude that the presence detection and severity assessment of the Fire Drill anti-pattern, as well as some of its related and similar patterns, is certainly possible using some of the presented approaches.  ( 3 min )
    TOAST: Topological Algorithm for Singularity Tracking. (arXiv:2210.00069v2 [cs.LG] UPDATED)
    The manifold hypothesis, which assumes that data lies on or close to an unknown manifold of low intrinsic dimension, is a staple of modern machine learning research. However, recent work has shown that real-world data exhibits distinct non-manifold structures, i.e. singularities, that can lead to erroneous findings. Detecting such singularities is therefore crucial as a precursor to interpolation and inference tasks. We address this issue by developing a topological framework that (i) quantifies the local intrinsic dimension, and (ii) yields a Euclidicity score for assessing the 'manifoldness' of a point along multiple scales. Our approach identifies singularities of complex spaces, while also capturing singular structures and local geometric complexity in image data.  ( 2 min )
    Factor-augmented tree ensembles. (arXiv:2111.14000v4 [stat.ML] UPDATED)
    This manuscript proposes to extend the information set of time-series regression trees with latent stationary factors extracted via state-space methods. In doing so, this approach generalises time-series regression trees on two dimensions. First, it allows to handle predictors that exhibit measurement error, non-stationary trends, seasonality and/or irregularities such as missing observations. Second, it gives a transparent way for using domain-specific theory to inform time-series regression trees. As a byproduct, this technique sets the foundations for structuring powerful ensembles. Their real-world applicability is studied under the lenses of empirical macro-finance.  ( 2 min )
    Transfer learning for chemically accurate interatomic neural network potentials. (arXiv:2212.03916v2 [physics.comp-ph] UPDATED)
    Developing machine learning-based interatomic potentials from ab-initio electronic structure methods remains a challenging task for computational chemistry and materials science. This work studies the capability of transfer learning, in particular discriminative fine-tuning, for efficiently generating chemically accurate interatomic neural network potentials on organic molecules from the MD17 and ANI data sets. We show that pre-training the network parameters on data obtained from density functional calculations considerably improves the sample efficiency of models trained on more accurate ab-initio data. Additionally, we show that fine-tuning with energy labels alone can suffice to obtain accurate atomic forces and run large-scale atomistic simulations, provided a well-designed fine-tuning data set. We also investigate possible limitations of transfer learning, especially regarding the design and size of the pre-training and fine-tuning data sets. Finally, we provide GM-NN potentials pre-trained and fine-tuned on the ANI-1x and ANI-1ccx data sets, which can easily be fine-tuned on and applied to organic molecules.  ( 2 min )
    Robust empirical risk minimization via Newton's method. (arXiv:2301.13192v1 [stat.ML])
    We study a variant of Newton's method for empirical risk minimization, where at each iteration of the optimization algorithm, we replace the gradient and Hessian of the objective function by robust estimators taken from existing literature on robust mean estimation for multivariate data. After proving a general theorem about the convergence of successive iterates to a small ball around the population-level minimizer, we study consequences of our theory in generalized linear models, when data are generated from Huber's epsilon-contamination model and/or heavy-tailed distributions. We also propose an algorithm for obtaining robust Newton directions based on the conjugate gradient method, which may be more appropriate for high-dimensional settings, and provide conjectures about the convergence of the resulting algorithm. Compared to the robust gradient descent algorithm proposed by Prasad et al. (2020), our algorithm enjoys the faster rates of convergence for successive iterates often achieved by second-order algorithms for convex problems, i.e., quadratic convergence in a neighborhood of the optimum, with a stepsize that may be chosen adaptively via backtracking linesearch.  ( 2 min )
    A Novel Framework for Policy Mirror Descent with General Parametrization and Linear Convergence. (arXiv:2301.13139v1 [stat.ML])
    Modern policy optimization methods in applied reinforcement learning are often inspired by the trust region policy optimization algorithm, which can be interpreted as a particular instance of policy mirror descent. While theoretical guarantees have been established for this framework, particularly in the tabular setting, the use of a general parametrization scheme remains mostly unjustified. In this work, we introduce a novel framework for policy optimization based on mirror descent that naturally accommodates general parametrizations. The policy class induced by our scheme recovers known classes, e.g. tabular softmax, log-linear, and neural policies. It also generates new ones, depending on the choice of the mirror map. For a general mirror map and parametrization function, we establish the quasi-monotonicity of the updates in value function, global linear convergence rates, and we bound the total variation of the algorithm along its path. To showcase the ability of our framework to accommodate general parametrization schemes, we present a case study involving shallow neural networks.  ( 2 min )
    GFlowNets and variational inference. (arXiv:2210.00580v2 [cs.LG] UPDATED)
    This paper builds bridges between two families of probabilistic algorithms: (hierarchical) variational inference (VI), which is typically used to model distributions over continuous spaces, and generative flow networks (GFlowNets), which have been used for distributions over discrete structures such as graphs. We demonstrate that, in certain cases, VI algorithms are equivalent to special cases of GFlowNets in the sense of equality of expected gradients of their learning objectives. We then point out the differences between the two families and show how these differences emerge experimentally. Notably, GFlowNets, which borrow ideas from reinforcement learning, are more amenable than VI to off-policy training without the cost of high gradient variance induced by importance sampling. We argue that this property of GFlowNets can provide advantages for capturing diversity in multimodal target distributions.  ( 2 min )
    Selecting time-series hyperparameters with the artificial jackknife. (arXiv:2002.04697v5 [stat.ME] UPDATED)
    This article proposes a generalisation of the delete-$d$ jackknife to solve hyperparameter selection problems for time series. I call it artificial delete-$d$ jackknife to stress that this approach substitutes the classic removal step with a fictitious deletion, wherein observed datapoints are replaced with artificial missing values. This procedure keeps the data order intact and allows plain compatibility with time series. This manuscript justifies the use of this approach asymptotically and shows its finite-sample advantages through simulation studies. Besides, this article describes its real-world advantages by regulating high-dimensional forecasting models for foreign exchange rates.  ( 2 min )
    Scalable Spatiotemporally Varying Coefficient Modelling with Bayesian Kernelized Tensor Regression. (arXiv:2109.00046v3 [stat.ML] UPDATED)
    As a regression technique in spatial statistics, the spatiotemporally varying coefficient model (STVC) is an important tool for discovering nonstationary and interpretable response-covariate associations over both space and time. However, it is difficult to apply STVC for large-scale spatiotemporal analyses due to its high computational cost. To address this challenge, we summarize the spatiotemporally varying coefficients using a third-order tensor structure and propose to reformulate the spatiotemporally varying coefficient model as a special low-rank tensor regression problem. The low-rank decomposition can effectively model the global patterns of the large data sets with a substantially reduced number of parameters. To further incorporate the local spatiotemporal dependencies, we use Gaussian process (GP) priors on the spatial and temporal factor matrices. We refer to the overall framework as Bayesian Kernelized Tensor Regression (BKTR). For model inference, we develop an efficient Markov chain Monte Carlo (MCMC) algorithm, which uses Gibbs sampling to update factor matrices and slice sampling to update kernel hyperparameters. We conduct extensive experiments on both synthetic and real-world data sets, and our results confirm the superior performance and efficiency of BKTR for model estimation and parameter inference.  ( 2 min )
    Gaussian Process Hydrodynamics. (arXiv:2209.10707v3 [physics.flu-dyn] UPDATED)
    We present a Gaussian Process (GP) approach (Gaussian Process Hydrodynamics, GPH) for approximating the solution of the Euler and Navier-Stokes equations. As in Smoothed Particle Hydrodynamics (SPH), GPH is a Lagrangian particle-based approach involving the tracking of a finite number of particles transported by the flow. However, these particles do not represent mollified particles of matter but carry discrete/partial information about the continuous flow. Closure is achieved by placing a divergence-free GP prior $\xi$ on the velocity field and conditioning on vorticity at particle locations. Known physics (e.g., the Richardson cascade and velocity-increments power laws) is incorporated into the GP prior through physics-informed additive kernels. This approach allows us to coarse-grain turbulence in a statistical manner rather than a deterministic one. By enforcing incompressibility and fluid/structure boundary conditions through the selection of the kernel, GPH requires much fewer particles than SPH. Since GPH has a natural probabilistic interpretation, numerical results come with uncertainty estimates enabling their incorporation into a UQ pipeline and the adding/removing of particles (quantas of information) in an adapted manner. The proposed approach is amenable to analysis, it inherits the complexity of state-of-the-art solvers for dense kernel matrices, and it leads to a natural definition of turbulence as information loss. Numerical experiments support the importance of selecting physics-informed kernels and illustrate the major impact of such kernels on accuracy and stability. Since the proposed approach has a Bayesian interpretation, it naturally enables data assimilation and making predictions and estimations based on mixing simulation data with experimental data.  ( 2 min )
    Variational Neural Networks. (arXiv:2207.01524v3 [cs.LG] UPDATED)
    Bayesian Neural Networks (BNNs) provide a tool to estimate the uncertainty of a neural network by considering a distribution over weights and sampling different models for each input. In this paper, we propose a method for uncertainty estimation in neural networks which, instead of considering a distribution over weights, samples outputs of each layer from a corresponding Gaussian distribution, parametrized by the predictions of mean and variance sub-layers. In uncertainty quality estimation experiments, we show that the proposed method achieves better uncertainty quality than other single-bin Bayesian Model Averaging methods, such as Monte Carlo Dropout or Bayes By Backpropagation methods.  ( 2 min )
    SGD and Weight Decay Provably Induce a Low-Rank Bias in Neural Networks. (arXiv:2206.05794v3 [cs.LG] UPDATED)
    In this paper, we study the bias of Stochastic Gradient Descent (SGD) to learn low-rank weight matrices when training deep ReLU neural networks. Our results show that training neural networks with mini-batch SGD and weight decay causes a bias towards rank minimization over the weight matrices. Specifically, we show, both theoretically and empirically, that this bias is more pronounced when using smaller batch sizes, higher learning rates, or increased weight decay. Additionally, we predict and observe empirically that weight decay is necessary to achieve this bias. Finally, we empirically investigate the connection between this bias and generalization, finding that it has a marginal effect on generalization. Our analysis is based on a minimal set of assumptions and applies to neural networks of any width or depth, including those with residual connections and convolutional layers.  ( 2 min )
    Large-scale Model Personalization via Low Rank and Sparse decomposition. (arXiv:2210.03505v2 [cs.LG] UPDATED)
    Personalization of machine learning (ML) predictions for individual users/domains/enterprises is critical for practical recommendation style systems. Standard personalization approaches involve learning a user/domain specific embedding that is fed into a fixed global model which can be limiting. On the other hand, personalizing/fine-tuning model itself for each user/domain -- a.k.a meta-learning -- has high storage/infrastructure cost. We propose a novel meta-learning style approach that models network weights as a sum of low-rank and sparse matrices. This captures common information from multiple individuals/users together in the low-rank part while sparse part captures user-specific idiosyncrasies. Furthermore, the framework is up to two orders of magnitude more scalable (in terms of storage/infrastructure cost) than user-specific finetuning of model. We then study the framework in the linear setting, where the problem reduces to that of estimating the sum of a rank-$r$ and a $k$-column sparse matrix using a small number of linear measurements. We propose an alternating minimization method with iterative hard thresholding -- AMHT-LRS -- to learn the low-rank and sparse part. For the realizable, Gaussian data setting, we show that AMHT-LRS solves the problem efficiently with nearly optimal samples. A significant challenge in personalization is ensuring privacy of each user's sensitive data. We alleviate this problem by proposing a differentially private variant of our method that also is equipped with strong generalization guarantees. Finally, on multiple standard recommendation datasets, we demonstrate that our approach allows personalized models to obtain superior performance in sparse data regime.  ( 2 min )
    Safe and Adaptive Decision-Making for Optimization of Safety-Critical Systems: The ARTEO Algorithm. (arXiv:2211.05495v2 [cs.LG] UPDATED)
    We consider the problem of decision-making under uncertainty in an environment with safety constraints. Many business and industrial applications rely on real-time optimization to improve key performance indicators. In the case of unknown characteristics, real-time optimization becomes challenging, particularly because of the satisfaction of safety constraints. We propose the ARTEO algorithm, where we cast multi-armed bandits as a mathematical programming problem subject to safety constraints and learn the unknown characteristics through exploration while optimizing the targets. We quantify the uncertainty in unknown characteristics by using Gaussian processes and incorporate it into the cost function as a contribution which drives exploration. We adaptively control the size of this contribution in accordance with the requirements of the environment. We guarantee the safety of our algorithm with a high probability through confidence bounds constructed under the regularity assumptions of Gaussian processes. We demonstrate the safety and efficiency of our approach with two case studies: optimization of electric motor current and real-time bidding problems. We further evaluate the performance of ARTEO compared to a safe variant of upper confidence bound based algorithms. ARTEO achieves less cumulative regret with accurate and safe decisions.  ( 2 min )
    Improved High-Probability Regret for Adversarial Bandits with Time-Varying Feedback Graphs. (arXiv:2210.01376v2 [cs.LG] UPDATED)
    We study high-probability regret bounds for adversarial $K$-armed bandits with time-varying feedback graphs over $T$ rounds. For general strongly observable graphs, we develop an algorithm that achieves the optimal regret $\widetilde{\mathcal{O}}((\sum_{t=1}^T\alpha_t)^{1/2}+\max_{t\in[T]}\alpha_t)$ with high probability, where $\alpha_t$ is the independence number of the feedback graph at round $t$. Compared to the best existing result [Neu, 2015] which only considers graphs with self-loops for all nodes, our result not only holds more generally, but importantly also removes any $\text{poly}(K)$ dependence that can be prohibitively large for applications such as contextual bandits. Furthermore, we also develop the first algorithm that achieves the optimal high-probability regret bound for weakly observable graphs, which even improves the best expected regret bound of [Alon et al., 2015] by removing the $\mathcal{O}(\sqrt{KT})$ term with a refined analysis. Our algorithms are based on the online mirror descent framework, but importantly with an innovative combination of several techniques. Notably, while earlier works use optimistic biased loss estimators for achieving high-probability bounds, we find it important to use a pessimistic one for nodes without self-loop in a strongly observable graph.  ( 2 min )
    Fast Computation of Optimal Transport via Entropy-Regularized Extragradient Methods. (arXiv:2301.13006v1 [cs.LG])
    Efficient computation of the optimal transport distance between two distributions serves as an algorithm subroutine that empowers various applications. This paper develops a scalable first-order optimization-based method that computes optimal transport to within $\varepsilon$ additive accuracy with runtime $\widetilde{O}( n^2/\varepsilon)$, where $n$ denotes the dimension of the probability distributions of interest. Our algorithm achieves the state-of-the-art computational guarantees among all first-order methods, while exhibiting favorable numerical performance compared to classical algorithms like Sinkhorn and Greenkhorn. Underlying our algorithm designs are two key elements: (a) converting the original problem into a bilinear minimax problem over probability distributions; (b) exploiting the extragradient idea -- in conjunction with entropy regularization and adaptive learning rates -- to accelerate convergence.  ( 2 min )
    Refined Regret for Adversarial MDPs with Linear Function Approximation. (arXiv:2301.12942v1 [cs.LG])
    We consider learning in an adversarial Markov Decision Process (MDP) where the loss functions can change arbitrarily over $K$ episodes and the state space can be arbitrarily large. We assume that the Q-function of any policy is linear in some known features, that is, a linear function approximation exists. The best existing regret upper bound for this setting (Luo et al., 2021) is of order $\tilde{\mathcal O}(K^{2/3})$ (omitting all other dependencies), given access to a simulator. This paper provides two algorithms that improve the regret to $\tilde{\mathcal O}(\sqrt K)$ in the same setting. Our first algorithm makes use of a refined analysis of the Follow-the-Regularized-Leader (FTRL) algorithm with the log-barrier regularizer. This analysis allows the loss estimators to be arbitrarily negative and might be of independent interest. Our second algorithm develops a magnitude-reduced loss estimator, further removing the polynomial dependency on the number of actions in the first algorithm and leading to the optimal regret bound (up to logarithmic terms and dependency on the horizon). Moreover, we also extend the first algorithm to simulator-free linear MDPs, which achieves $\tilde{\mathcal O}(K^{8/9})$ regret and greatly improves over the best existing bound $\tilde{\mathcal O}(K^{14/15})$. This algorithm relies on a better alternative to the Matrix Geometric Resampling procedure by Neu & Olkhovskaya (2020), which could again be of independent interest.  ( 2 min )
    Prediction of Customer Churn in Banking Industry. (arXiv:2301.13099v1 [stat.ML])
    With the growing competition in banking industry, banks are required to follow customer retention strategies while they are trying to increase their market share by acquiring new customers. This study compares the performance of six supervised classification techniques to suggest an efficient model to predict customer churn in banking industry, given 10 demographic and personal attributes from 10000 customers of European banks. The effect of feature selection, class imbalance, and outliers will be discussed for ANN and random forest as the two competing models. As shown, unlike random forest, ANN does not reveal any serious concern regarding overfitting and is also robust to noise. Therefore, ANN structure with five nodes in a single hidden layer is recognized as the best performing classifier.  ( 2 min )
    MixFlows: principled variational inference via mixed flows. (arXiv:2205.07475v3 [stat.ML] UPDATED)
    This work presents mixed variational flows (MixFlows), a new variational family that consists of a mixture of repeated applications of a map to an initial reference distribution. First, we provide efficient algorithms for i.i.d. sampling, density evaluation, and unbiased ELBO estimation. We then show that MixFlows have MCMC-like convergence guarantees when the flow map is ergodic and measure-preserving, and provide bounds on the accumulation of error for practical implementations where the flow map is approximated. Finally, we develop an implementation of MixFlows based on uncorrected discretized Hamiltonian dynamics combined with deterministic momentum refreshment. Simulated and real data experiments show that MixFlows can provide more reliable posterior approximations than several black-box normalizing flows, as well as samples of comparable quality to those obtained from state-of-the-art MCMC methods.  ( 2 min )
    Reversible Gromov-Monge Sampler for Simulation-Based Inference. (arXiv:2109.14090v4 [stat.ME] UPDATED)
    This paper introduces a new simulation-based inference procedure to model and sample from multi-dimensional probability distributions given access to i.i.d.\ samples, circumventing the usual approaches of explicitly modeling the density function or designing Markov chain Monte Carlo. Motivated by the seminal work on distance and isomorphism between metric measure spaces, we propose a new notion called the Reversible Gromov-Monge (RGM) distance and study how RGM can be used to design new transform samplers to perform simulation-based inference. Our RGM sampler can also estimate optimal alignments between two heterogeneous metric measure spaces $(\cX, \mu, c_{\cX})$ and $(\cY, \nu, c_{\cY})$ from empirical data sets, with estimated maps that approximately push forward one measure $\mu$ to the other $\nu$, and vice versa. We study the analytic properties of the RGM distance and derive that under mild conditions, RGM equals the classic Gromov-Wasserstein distance. Curiously, drawing a connection to Brenier's polar factorization, we show that the RGM sampler induces bias towards strong isomorphism with proper choices of $c_{\cX}$ and $c_{\cY}$. Statistical rate of convergence, representation, and optimization questions regarding the induced sampler are studied. Synthetic and real-world examples showcasing the effectiveness of the RGM sampler are also demonstrated.  ( 2 min )
    Better Uncertainty Calibration via Proper Scores for Classification and Beyond. (arXiv:2203.07835v3 [cs.LG] UPDATED)
    With model trustworthiness being crucial for sensitive real-world applications, practitioners are putting more and more focus on improving the uncertainty calibration of deep neural networks. Calibration errors are designed to quantify the reliability of probabilistic predictions but their estimators are usually biased and inconsistent. In this work, we introduce the framework of proper calibration errors, which relates every calibration error to a proper score and provides a respective upper bound with optimal estimation properties. This relationship can be used to reliably quantify the model calibration improvement. We theoretically and empirically demonstrate the shortcomings of commonly used estimators compared to our approach. Due to the wide applicability of proper scores, this gives a natural extension of recalibration beyond classification.  ( 2 min )
    Accelerating Kernel Classifiers Through Borders Mapping. (arXiv:1708.05917v6 [stat.ML] UPDATED)
    Support vector machines (SVM) and other kernel techniques represent a family of powerful statistical classification methods with high accuracy and broad applicability. Because they use all or a significant portion of the training data, however, they can be slow, especially for large problems. Piecewise linear classifiers are similarly versatile, yet have the additional advantages of simplicity, ease of interpretation and, if the number of component linear classifiers is not too large, speed. Here we show how a simple, piecewise linear classifier can be trained from a kernel-based classifier in order to improve the classification speed. The method works by finding the root of the difference in conditional probabilities between pairs of opposite classes to build up a representation of the decision boundary. When tested on 17 different datasets, it succeeded in improving the classification speed of a SVM for 12 of them by up to two orders-of-magnitude. Of these, two were less accurate than a simple, linear classifier. The method is best suited to problems with continuum features data and smooth probability functions. Because the component linear classifiers are built up individually from an existing classifier, rather than through a simultaneous optimization procedure, the classifier is also fast to train.  ( 2 min )
    Stationary Kernels and Gaussian Processes on Lie Groups and their Homogeneous Spaces II: non-compact symmetric spaces. (arXiv:2301.13088v1 [stat.ME])
    Gaussian processes are arguably the most important class of spatiotemporal models within machine learning. They encode prior information about the modeled function and can be used for exact or approximate Bayesian learning. In many applications, particularly in physical sciences and engineering, but also in areas such as geostatistics and neuroscience, invariance to symmetries is one of the most fundamental forms of prior information one can consider. The invariance of a Gaussian process' covariance to such symmetries gives rise to the most natural generalization of the concept of stationarity to such spaces. In this work, we develop constructive and practical techniques for building stationary Gaussian processes on a very large class of non-Euclidean spaces arising in the context of symmetries. Our techniques make it possible to (i) calculate covariance kernels and (ii) sample from prior and posterior Gaussian processes defined on such spaces, both in a practical manner. This work is split into two parts, each involving different technical considerations: part I studies compact spaces, while part II studies non-compact spaces possessing certain structure. Our contributions make the non-Euclidean Gaussian process models we study compatible with well-understood computational techniques available in standard Gaussian process software packages, thereby making them accessible to practitioners.  ( 2 min )
    On the Sample Complexity of Actor-Critic Method for Reinforcement Learning with Function Approximation. (arXiv:1910.08412v3 [cs.LG] UPDATED)
    Reinforcement learning, mathematically described by Markov Decision Problems, may be approached either through dynamic programming or policy search. Actor-critic algorithms combine the merits of both approaches by alternating between steps to estimate the value function and policy gradient updates. Due to the fact that the updates exhibit correlated noise and biased gradient updates, only the asymptotic behavior of actor-critic is known by connecting its behavior to dynamical systems. This work puts forth a new variant of actor-critic that employs Monte Carlo rollouts during the policy search updates, which results in controllable bias that depends on the number of critic evaluations. As a result, we are able to provide for the first time the convergence rate of actor-critic algorithms when the policy search step employs policy gradient, agnostic to the choice of policy evaluation technique. In particular, we establish conditions under which the sample complexity is comparable to stochastic gradient method for non-convex problems or slower as a result of the critic estimation error, which is the main complexity bottleneck. These results hold in continuous state and action spaces with linear function approximation for the value function. We then specialize these conceptual results to the case where the critic is estimated by Temporal Difference, Gradient Temporal Difference, and Accelerated Gradient Temporal Difference. These learning rates are then corroborated on a navigation problem involving an obstacle and the pendulum problem which provide insight into the interplay between optimization and generalization in reinforcement learning.  ( 2 min )
    Benchmarking optimality of time series classification methods in distinguishing diffusions. (arXiv:2301.13112v1 [stat.ML])
    Performance benchmarking is a crucial component of time series classification (TSC) algorithm design, and a fast-growing number of datasets have been established for empirical benchmarking. However, the empirical benchmarks are costly and do not guarantee statistical optimality. This study proposes to benchmark the optimality of TSC algorithms in distinguishing diffusion processes by the likelihood ratio test (LRT). The LRT is optimal in the sense of the Neyman-Pearson lemma: it has the smallest false positive rate among classifiers with a controlled level of false negative rate. The LRT requires the likelihood ratio of the time series to be computable. The diffusion processes from stochastic differential equations provide such time series and are flexible in design for generating linear or nonlinear time series. We demonstrate the benchmarking with three scalable state-of-the-art TSC algorithms: random forest, ResNet, and ROCKET. Test results show that they can achieve LRT optimality for univariate time series and multivariate Gaussian processes. However, these model-agnostic algorithms are suboptimal in classifying nonlinear multivariate time series from high-dimensional stochastic interacting particle systems. Additionally, the LRT benchmark provides tools to analyze the dependence of classification accuracy on the time length, dimension, temporal sampling frequency, and randomness of the time series. Thus, the LRT with diffusion processes can systematically and efficiently benchmark the optimality of TSC algorithms and may guide their future improvements.  ( 2 min )
    Revisiting Over-smoothing and Over-squashing using Ollivier-Ricci Curvature. (arXiv:2211.15779v2 [cs.LG] UPDATED)
    Graph Neural Networks (GNNs) had been demonstrated to be inherently susceptible to the problems of over-smoothing and over-squashing. These issues prohibit the ability of GNNs to model complex graph interactions by limiting their effectiveness in taking into account distant information. Our study reveals the key connection between the local graph geometry and the occurrence of both of these issues, thereby providing a unified framework for studying them at a local scale using the Ollivier-Ricci curvature. Specifically, we demonstrate that over-smoothing is linked to positive graph curvature, while over-squashing is linked to negative graph curvature. Based on our theory, we propose the Batch Ollivier-Ricci Flow, a novel rewiring algorithm capable of simultaneously addressing both over-smoothing and over-squashing.  ( 2 min )
    One-Shot Adaptation of GAN in Just One CLIP. (arXiv:2203.09301v4 [cs.CV] UPDATED)
    There are many recent research efforts to fine-tune a pre-trained generator with a few target images to generate images of a novel domain. Unfortunately, these methods often suffer from overfitting or under-fitting when fine-tuned with a single target image. To address this, here we present a novel single-shot GAN adaptation method through unified CLIP space manipulations. Specifically, our model employs a two-step training strategy: reference image search in the source generator using a CLIP-guided latent optimization, followed by generator fine-tuning with a novel loss function that imposes CLIP space consistency between the source and adapted generators. To further improve the adapted model to produce spatially consistent samples with respect to the source generator, we also propose contrastive regularization for patchwise relationships in the CLIP space. Experimental results show that our model generates diverse outputs with the target texture and outperforms the baseline models both qualitatively and quantitatively. Furthermore, we show that our CLIP space manipulation strategy allows more effective attribute editing.  ( 2 min )
    On student-teacher deviations in distillation: does it pay to disobey?. (arXiv:2301.12923v1 [cs.LG])
    Knowledge distillation has been widely-used to improve the performance of a "student" network by hoping to mimic soft probabilities of a "teacher" network. Yet, for self-distillation to work, the student must somehow deviate from the teacher (Stanton et al., 2021). But what is the nature of these deviations, and how do they relate to gains in generalization? We investigate these questions through a series of experiments across image and language classification datasets. First, we observe that distillation consistently deviates in a characteristic way: on points where the teacher has low confidence, the student achieves even lower confidence than the teacher. Secondly, we find that deviations in the initial dynamics of training are not crucial -- simply switching to distillation loss in the middle of training can recover much of its gains. We then provide two parallel theoretical perspectives to understand the role of student-teacher deviations in our experiments, one casting distillation as a regularizer in eigenspace, and another as a gradient denoiser. Our analysis bridges several gaps between existing theory and practice by (a) focusing on gradient-descent training, (b) by avoiding label noise assumptions, and (c) by unifying several disjoint empirical and theoretical findings.  ( 2 min )
    Cause-Effect Inference in Location-Scale Noise Models: Maximum Likelihood vs. Independence Testing. (arXiv:2301.12930v1 [cs.LG])
    Location-scale noise models (LSNMs) are a class of heteroscedastic structural causal models with wide applicability, closely related to affine flow models. Recent likelihood-based methods designed for LSNMs that infer cause-effect relationships achieve state-of-the-art accuracy, when their assumptions are satisfied concerning the noise distributions. However, under misspecification their accuracy deteriorates sharply, especially when the conditional variance in the anti-causal direction is smaller than that in the causal direction. In this paper, we demonstrate the misspecification problem and analyze why and when it occurs. We show that residual independence testing is much more robust to misspecification than likelihood-based cause-effect inference. Our empirical evaluation includes 580 synthetic and 99 real-world datasets.  ( 2 min )
    Curvature Filtrations for Graph Generative Model Evaluation. (arXiv:2301.12906v1 [cs.LG])
    Graph generative model evaluation necessitates understanding differences between graphs on the distributional level. This entails being able to harness salient attributes of graphs in an efficient manner. Curvature constitutes one such property of graphs, and has recently started to prove useful in characterising graphs. Its expressive properties, stability, and practical utility in model evaluation remain largely unexplored, however. We combine graph curvature descriptors with cutting-edge methods from topological data analysis to obtain robust, expressive descriptors for evaluating graph generative models.  ( 2 min )
    Massively Scaling Heteroscedastic Classifiers. (arXiv:2301.12860v1 [cs.LG])
    Heteroscedastic classifiers, which learn a multivariate Gaussian distribution over prediction logits, have been shown to perform well on image classification problems with hundreds to thousands of classes. However, compared to standard classifiers, they introduce extra parameters that scale linearly with the number of classes. This makes them infeasible to apply to larger-scale problems. In addition heteroscedastic classifiers introduce a critical temperature hyperparameter which must be tuned. We propose HET-XL, a heteroscedastic classifier whose parameter count when compared to a standard classifier scales independently of the number of classes. In our large-scale settings, we show that we can remove the need to tune the temperature hyperparameter, by directly learning it on the training data. On large image classification datasets with up to 4B images and 30k classes our method requires 14X fewer additional parameters, does not require tuning the temperature on a held-out set and performs consistently better than the baseline heteroscedastic classifier. HET-XL improves ImageNet 0-shot classification in a multimodal contrastive learning setup which can be viewed as a 3.5 billion class classification problem.  ( 2 min )
    How Powerful are Shallow Neural Networks with Bandlimited Random Weights?. (arXiv:2008.08427v2 [cs.LG] UPDATED)
    We investigate the expressive power of depth-2 bandlimited random neural networks. A random net is a neural network where the hidden layer parameters are frozen with random assignment, and only the output layer parameters are trained by loss minimization. Using random weights for a hidden layer is an effective method to avoid non-convex optimization in standard gradient descent learning. It has also been adopted in recent deep learning theories. Despite the well-known fact that a neural network is a universal approximator, in this study, we mathematically show that when hidden parameters are distributed in a bounded domain, the network may not achieve zero approximation error. In particular, we derive a new nontrivial approximation error lower bound. The proof utilizes the technique of ridgelet analysis, a harmonic analysis method designed for neural networks. This method is inspired by fundamental principles in classical signal processing, specifically the idea that signals with limited bandwidth may not always be able to perfectly recreate the original signal. We corroborate our theoretical results with various simulation studies, and generally, two main take-home messages are offered: (i) Not any distribution for selecting random weights is feasible to build a universal approximator; (ii) A suitable assignment of random weights exists but to some degree is associated with the complexity of the target function.  ( 2 min )
    Fair and Optimal Classification via Post-Processing Predictors. (arXiv:2211.01528v2 [cs.LG] UPDATED)
    To address the bias exhibited by machine learning models, fairness criteria impose statistical constraints for ensuring equal treatment to all demographic groups, but typically at a cost to model performance. Understanding this tradeoff, therefore, underlies the design of fair and effective algorithms. This paper completes the characterization of the inherent tradeoff of demographic parity on classification problems in the most general multigroup, multiclass, and noisy setting. Specifically, we show that the minimum error rate is given by the optimal value of a Wasserstein-barycenter problem. More practically, this reformulation leads to a simple procedure for post-processing any pre-trained predictors to satisfy demographic parity in the general setting, which, in particular, yields the optimal fair classifier when applied to the Bayes predictor. We provide suboptimality and finite sample analyses for our procedure, and demonstrate precise control of the tradeoff of error rate for fairness on real-world datasets provided sufficient data.  ( 2 min )
    Improved Regret for Efficient Online Reinforcement Learning with Linear Function Approximation. (arXiv:2301.13087v1 [cs.LG])
    We study reinforcement learning with linear function approximation and adversarially changing cost functions, a setup that has mostly been considered under simplifying assumptions such as full information feedback or exploratory conditions.We present a computationally efficient policy optimization algorithm for the challenging general setting of unknown dynamics and bandit feedback, featuring a combination of mirror-descent and least squares policy evaluation in an auxiliary MDP used to compute exploration bonuses.Our algorithm obtains an $\widetilde O(K^{6/7})$ regret bound, improving significantly over previous state-of-the-art of $\widetilde O (K^{14/15})$ in this setting. In addition, we present a version of the same algorithm under the assumption a simulator of the environment is available to the learner (but otherwise no exploratory assumptions are made), and prove it obtains state-of-the-art regret of $\widetilde O (K^{2/3})$.  ( 2 min )
    CAPITAL: Optimal Subgroup Identification via Constrained Policy Tree Search. (arXiv:2110.05636v3 [stat.ML] UPDATED)
    Personalized medicine, a paradigm of medicine tailored to a patient's characteristics, is an increasingly attractive field in health care. An important goal of personalized medicine is to identify a subgroup of patients, based on baseline covariates, that benefits more from the targeted treatment than other comparative treatments. Most of the current subgroup identification methods only focus on obtaining a subgroup with an enhanced treatment effect without paying attention to subgroup size. Yet, a clinically meaningful subgroup learning approach should identify the maximum number of patients who can benefit from the better treatment. In this paper, we present an optimal subgroup selection rule (SSR) that maximizes the number of selected patients, and in the meantime, achieves the pre-specified clinically meaningful mean outcome, such as the average treatment effect. We derive two equivalent theoretical forms of the optimal SSR based on the contrast function that describes the treatment-covariates interaction in the outcome. We further propose a ConstrAined PolIcy Tree seArch aLgorithm (CAPITAL) to find the optimal SSR within the interpretable decision tree class. The proposed method is flexible to handle multiple constraints that penalize the inclusion of patients with negative treatment effects, and to address time to event data using the restricted mean survival time as the clinically interesting mean outcome. Extensive simulations, comparison studies, and real data applications are conducted to demonstrate the validity and utility of our method.  ( 2 min )
    Probable Domain Generalization via Quantile Risk Minimization. (arXiv:2207.09944v3 [stat.ML] UPDATED)
    Domain generalization (DG) seeks predictors which perform well on unseen test distributions by leveraging data drawn from multiple related training distributions or domains. To achieve this, DG is commonly formulated as an average- or worst-case problem over the set of possible domains. However, predictors that perform well on average lack robustness while predictors that perform well in the worst case tend to be overly-conservative. To address this, we propose a new probabilistic framework for DG where the goal is to learn predictors that perform well with high probability. Our key idea is that distribution shifts seen during training should inform us of probable shifts at test time, which we realize by explicitly relating training and test domains as draws from the same underlying meta-distribution. To achieve probable DG, we propose a new optimization problem called Quantile Risk Minimization (QRM). By minimizing the $\alpha$-quantile of predictor's risk distribution over domains, QRM seeks predictors that perform well with probability $\alpha$. To solve QRM in practice, we propose the Empirical QRM (EQRM) algorithm and provide: (i) a generalization bound for EQRM; and (ii) the conditions under which EQRM recovers the causal predictor as $\alpha \to 1$. In our experiments, we introduce a more holistic quantile-focused evaluation protocol for DG and demonstrate that EQRM outperforms state-of-the-art baselines on datasets from WILDS and DomainBed.  ( 2 min )
    Likelihood-Free Frequentist Inference: Confidence Sets with Correct Conditional Coverage. (arXiv:2107.03920v5 [stat.ML] UPDATED)
    Many areas of science make extensive use of computer simulators that implicitly encode likelihood functions of complex systems. Classical statistical methods are poorly suited for these so-called likelihood-free inference (LFI) settings, particularly outside asymptotic and low-dimensional regimes. Although new machine learning methods, such as normalizing flows, have revolutionized the sample efficiency and capacity of LFI methods, it remains an open question whether they produce confidence sets with correct conditional coverage for small sample sizes. This paper unifies classical statistics with modern machine learning to present (i) a practical procedure for the Neyman construction of confidence sets with finite-sample guarantees of nominal coverage, and (ii) diagnostics that estimate conditional coverage over the entire parameter space. We refer to our framework as likelihood-free frequentist inference (LF2I). Any method that defines a test statistic, like the likelihood ratio, can leverage the LF2I machinery to create valid confidence sets and diagnostics without costly Monte Carlo samples at fixed parameter settings. We study the power of two test statistics (ACORE and BFF), which, respectively, maximize versus integrate an odds function over the parameter space. Our paper discusses the benefits and challenges of LF2I, with a breakdown of the sources of errors in LF2I confidence sets.  ( 2 min )
    Scalable Set Encoding with Universal Mini-Batch Consistency and Unbiased Full Set Gradient Approximation. (arXiv:2208.12401v3 [cs.LG] UPDATED)
    Recent work on mini-batch consistency (MBC) for set functions has brought attention to the need for sequentially processing and aggregating chunks of a partitioned set while guaranteeing the same output for all partitions. However, existing constraints on MBC architectures lead to models with limited expressive power. Additionally, prior work has not addressed how to deal with large sets during training when the full set gradient is required. To address these issues, we propose a Universally MBC (UMBC) class of set functions which can be used in conjunction with arbitrary non-MBC components while still satisfying MBC, enabling a wider range of function classes to be used in MBC settings. Furthermore, we propose an efficient MBC training algorithm which gives an unbiased approximation of the full set gradient and has a constant memory overhead for any set size for both train- and test-time. We conduct extensive experiments including image completion, text classification, unsupervised clustering, and cancer detection on high-resolution images to verify the efficiency and efficacy of our scalable set encoding framework.  ( 2 min )
    Interpolating between BSDEs and PINNs: deep learning for elliptic and parabolic boundary value problems. (arXiv:2112.03749v2 [math.NA] UPDATED)
    Solving high-dimensional partial differential equations is a recurrent challenge in economics, science and engineering. In recent years, a great number of computational approaches have been developed, most of them relying on a combination of Monte Carlo sampling and deep learning based approximation. For elliptic and parabolic problems, existing methods can broadly be classified into those resting on reformulations in terms of $\textit{backward stochastic differential equations}$ (BSDEs) and those aiming to minimize a regression-type $L^2$-error ($\textit{physics-informed neural networks}$, PINNs). In this paper, we review the literature and suggest a methodology based on the novel $\textit{diffusion loss}$ that interpolates between BSDEs and PINNs. Our contribution opens the door towards a unified understanding of numerical approaches for high-dimensional PDEs, as well as for implementations that combine the strengths of BSDEs and PINNs. The diffusion loss furthermore bears close similarities to $\textit{(least squares) temporal difference}$ objectives found in reinforcement learning. We also discuss eigenvalue problems and perform extensive numerical studies, including calculations of the ground state for nonlinear Schr\"odinger operators and committor functions relevant in molecular dynamics.  ( 2 min )
    A theory of continuous generative flow networks. (arXiv:2301.12594v1 [cs.LG])
    Generative flow networks (GFlowNets) are amortized variational inference algorithms that are trained to sample from unnormalized target distributions over compositional objects. A key limitation of GFlowNets until this time has been that they are restricted to discrete spaces. We present a theory for generalized GFlowNets, which encompasses both existing discrete GFlowNets and ones with continuous or hybrid state spaces, and perform experiments with two goals in mind. First, we illustrate critical points of the theory and the importance of various assumptions. Second, we empirically demonstrate how observations about discrete GFlowNets transfer to the continuous case and show strong results compared to non-GFlowNet baselines on several previously studied tasks. This work greatly widens the perspectives for the application of GFlowNets in probabilistic inference and various modeling settings.  ( 2 min )
    Intrinsic Bayesian Optimisation on Complex Constrained Domain. (arXiv:2301.12581v1 [stat.ML])
    Motivated by the success of Bayesian optimisation algorithms in the Euclidean space, we propose a novel approach to construct Intrinsic Bayesian optimisation (In-BO) on manifolds with a primary focus on complex constrained domains or irregular-shaped spaces arising as submanifolds of R2, R3 and beyond. Data may be collected in a spatial domain but restricted to a complex or intricately structured region corresponding to a geographic feature, such as lakes. Traditional Bayesian Optimisation (Tra-BO) defined with a Radial basis function (RBF) kernel cannot accommodate these complex constrained conditions. The In-BO uses the Sparse Intrinsic Gaussian Processes (SIn-GP) surrogate model to take into account the geometric structure of the manifold. SInGPs are constructed using the heat kernel of the manifold which is estimated as the transition density of the Brownian Motion on manifolds. The efficiency of In-BO is demonstrated through simulation studies on a U-shaped domain, a Bitten torus, and a real dataset from the Aral sea. Its performance is compared to that of traditional BO, which is defined in Euclidean space.  ( 2 min )
    Are Random Decompositions all we need in High Dimensional Bayesian Optimisation?. (arXiv:2301.12844v1 [cs.LG])
    Learning decompositions of expensive-to-evaluate black-box functions promises to scale Bayesian optimisation (BO) to high-dimensional problems. However, the success of these techniques depends on finding proper decompositions that accurately represent the black-box. While previous works learn those decompositions based on data, we investigate data-independent decomposition sampling rules in this paper. We find that data-driven learners of decompositions can be easily misled towards local decompositions that do not hold globally across the search space. Then, we formally show that a random tree-based decomposition sampler exhibits favourable theoretical guarantees that effectively trade off maximal information gain and functional mismatch between the actual black-box and its surrogate as provided by the decomposition. Those results motivate the development of the random decomposition upper-confidence bound algorithm (RDUCB) that is straightforward to implement - (almost) plug-and-play - and, surprisingly, yields significant empirical gains compared to the previous state-of-the-art on a comprehensive set of benchmarks. We also confirm the plug-and-play nature of our modelling component by integrating our method with HEBO, showing improved practical gains in the highest dimensional tasks from Bayesmark.  ( 2 min )
    PAC-Bayesian Soft Actor-Critic Learning. (arXiv:2301.12776v1 [cs.LG])
    Actor-critic algorithms address the dual goals of reinforcement learning, policy evaluation and improvement, via two separate function approximators. The practicality of this approach comes at the expense of training instability, caused mainly by the destructive effect of the approximation errors of the critic on the actor. We tackle this bottleneck by employing an existing Probably Approximately Correct (PAC) Bayesian bound for the first time as the critic training objective of the Soft Actor-Critic (SAC) algorithm. We further demonstrate that the online learning performance improves significantly when a stochastic actor explores multiple futures by critic-guided random search. We observe our resulting algorithm to compare favorably to the state of the art on multiple classical control and locomotion tasks in both sample efficiency and asymptotic performance.  ( 2 min )
    Bagging Provides Assumption-free Stability. (arXiv:2301.12600v1 [stat.ML])
    Bagging is an important technique for stabilizing machine learning models. In this paper, we derive a finite-sample guarantee on the stability of bagging for any model with bounded outputs. Our result places no assumptions on the distribution of the data, on the properties of the base algorithm, or on the dimensionality of the covariates. Our guarantee applies to many variants of bagging and is optimal up to a constant.  ( 2 min )
    Imbalanced Mixed Linear Regression. (arXiv:2301.12559v1 [stat.ML])
    We consider the problem of mixed linear regression (MLR), where each observed sample belongs to one of $K$ unknown linear models. In practical applications, the proportions of the $K$ components are often imbalanced. Unfortunately, most MLR methods do not perform well in such settings. Motivated by this practical challenge, in this work we propose Mix-IRLS, a novel, simple and fast algorithm for MLR with excellent performance on both balanced and imbalanced mixtures. In contrast to popular approaches that recover the $K$ models simultaneously, Mix-IRLS does it sequentially using tools from robust regression. Empirically, Mix-IRLS succeeds in a broad range of settings where other methods fail. These include imbalanced mixtures, small sample sizes, presence of outliers, and an unknown number of models $K$. In addition, Mix-IRLS outperforms competing methods on several real-world datasets, in some cases by a large margin. We complement our empirical results by deriving a recovery guarantee for Mix-IRLS, which highlights its advantage on imbalanced mixtures.  ( 2 min )
    Compression, Generalization and Learning. (arXiv:2301.12767v1 [cs.LG])
    A compression function is a map that slims down an observational set into a subset of reduced size, while preserving its informational content. In multiple applications, the condition that one new observation makes the compressed set change is interpreted that this observation brings in extra information and, in learning theory, this corresponds to misclassification, or misprediction. In this paper, we lay the foundations of a new theory that allows one to keep control on the probability of change of compression (called the "risk"). We identify conditions under which the cardinality of the compressed set is a consistent estimator for the risk (without any upper limit on the size of the compressed set) and prove unprecedentedly tight bounds to evaluate the risk under a generally applicable condition of preference. All results are usable in a fully agnostic setup, without requiring any a priori knowledge on the probability distribution of the observations. Not only these results offer a valid support to develop trust in observation-driven methodologies, they also play a fundamental role in learning techniques as a tool for hyper-parameter tuning.  ( 2 min )
    On Second-Order Scoring Rules for Epistemic Uncertainty Quantification. (arXiv:2301.12736v1 [cs.LG])
    It is well known that accurate probabilistic predictors can be trained through empirical risk minimisation with proper scoring rules as loss functions. While such learners capture so-called aleatoric uncertainty of predictions, various machine learning methods have recently been developed with the goal to let the learner also represent its epistemic uncertainty, i.e., the uncertainty caused by a lack of knowledge and data. An emerging branch of the literature proposes the use of a second-order learner that provides predictions in terms of distributions on probability distributions. However, recent work has revealed serious theoretical shortcomings for second-order predictors based on loss minimisation. In this paper, we generalise these findings and prove a more fundamental result: There seems to be no loss function that provides an incentive for a second-order learner to faithfully represent its epistemic uncertainty in the same manner as proper scoring rules do for standard (first-order) learners. As a main mathematical tool to prove this result, we introduce the generalised notion of second-order scoring rules.  ( 2 min )
    Machine Learning with High-Cardinality Categorical Features in Actuarial Applications. (arXiv:2301.12710v1 [stat.ML])
    High-cardinality categorical features are pervasive in actuarial data (e.g. occupation in commercial property insurance). Standard categorical encoding methods like one-hot encoding are inadequate in these settings. In this work, we present a novel _Generalised Linear Mixed Model Neural Network_ ("GLMMNet") approach to the modelling of high-cardinality categorical features. The GLMMNet integrates a generalised linear mixed model in a deep learning framework, offering the predictive power of neural networks and the transparency of random effects estimates, the latter of which cannot be obtained from the entity embedding models. Further, its flexibility to deal with any distribution in the exponential dispersion (ED) family makes it widely applicable to many actuarial contexts and beyond. We illustrate and compare the GLMMNet against existing approaches in a range of simulation experiments as well as in a real-life insurance case study. Notably, we find that the GLMMNet often outperforms or at least performs comparably with an entity embedded neural network, while providing the additional benefit of transparency, which is particularly valuable in practical applications. Importantly, while our model was motivated by actuarial applications, it can have wider applicability. The GLMMNet would suit any applications that involve high-cardinality categorical variables and where the response cannot be sufficiently modelled by a Gaussian distribution.  ( 2 min )
    Kernelized Cumulants: Beyond Kernel Mean Embeddings. (arXiv:2301.12466v1 [stat.ML])
    In $\mathbb R^d$, it is well-known that cumulants provide an alternative to moments that can achieve the same goals with numerous benefits such as lower variance estimators. In this paper we extend cumulants to reproducing kernel Hilbert spaces (RKHS) using tools from tensor algebras and show that they are computationally tractable by a kernel trick. These kernelized cumulants provide a new set of all-purpose statistics; the classical maximum mean discrepancy and Hilbert-Schmidt independence criterion arise as the degree one objects in our general construction. We argue both theoretically and empirically (on synthetic, environmental, and traffic data analysis) that going beyond degree one has several advantages and can be achieved with the same computational complexity and minimal overhead in our experiments.  ( 2 min )
    Implicit Regularization for Group Sparsity. (arXiv:2301.12540v1 [stat.ML])
    We study the implicit regularization of gradient descent towards structured sparsity via a novel neural reparameterization, which we call a diagonally grouped linear neural network. We show the following intriguing property of our reparameterization: gradient descent over the squared regression loss, without any explicit regularization, biases towards solutions with a group sparsity structure. In contrast to many existing works in understanding implicit regularization, we prove that our training trajectory cannot be simulated by mirror descent. We analyze the gradient dynamics of the corresponding regression problem in the general noise setting and obtain minimax-optimal error rates. Compared to existing bounds for implicit sparse regularization using diagonal linear networks, our analysis with the new reparameterization shows improved sample complexity. In the degenerate case of size-one groups, our approach gives rise to a new algorithm for sparse linear regression. Finally, we demonstrate the efficacy of our approach with several numerical experiments.  ( 2 min )
    Don't Play Favorites: Minority Guidance for Diffusion Models. (arXiv:2301.12334v1 [cs.LG])
    We explore the problem of generating minority samples using diffusion models. The minority samples are instances that lie on low-density regions of a data manifold. Generating sufficient numbers of such minority instances is important, since they often contain some unique attributes of the data. However, the conventional generation process of the diffusion models mostly yields majority samples (that lie on high-density regions of the manifold) due to their high likelihoods, making themselves highly ineffective and time-consuming for the task. In this work, we present a novel framework that can make the generation process of the diffusion models focus on the minority samples. We first provide a new insight on the majority-focused nature of the diffusion models: they denoise in favor of the majority samples. The observation motivates us to introduce a metric that describes the uniqueness of a given sample. To address the inherent preference of the diffusion models w.r.t. the majority samples, we further develop minority guidance, a sampling technique that can guide the generation process toward regions with desired likelihood levels. Experiments on benchmark real datasets demonstrate that our minority guidance can greatly improve the capability of generating the low-likelihood minority samples over existing generative frameworks including the standard diffusion sampler.  ( 2 min )
    On Enhancing Expressive Power via Compositions of Single Fixed-Size ReLU Network. (arXiv:2301.12353v1 [cs.LG])
    This paper studies the expressive power of deep neural networks from the perspective of function compositions. We show that repeated compositions of a single fixed-size ReLU network can produce super expressive power. In particular, we prove by construction that $\mathcal{L}_2\circ \boldsymbol{g}^{\circ r}\circ \boldsymbol{\mathcal{L}}_1$ can approximate $1$-Lipschitz continuous functions on $[0,1]^d$ with an error $\mathcal{O}(r^{-1/d})$, where $\boldsymbol{g}$ is realized by a fixed-size ReLU network, $\boldsymbol{\mathcal{L}}_1$ and $\mathcal{L}_2$ are two affine linear maps matching the dimensions, and $\boldsymbol{g}^{\circ r}$ means the $r$-times composition of $\boldsymbol{g}$. Furthermore, we extend such a result to generic continuous functions on $[0,1]^d$ with the approximation error characterized by the modulus of continuity. Our results reveal that a continuous-depth network generated via a dynamical system has good approximation power even if its dynamics function is time-independent and realized by a fixed-size ReLU network.  ( 2 min )
    3D Object Detection in LiDAR Point Clouds using Graph Neural Networks. (arXiv:2301.12519v1 [cs.CV])
    LiDAR (Light Detection and Ranging) is an advanced active remote sensing technique working on the principle of time of travel (ToT) for capturing highly accurate 3D information of the surroundings. LiDAR has gained wide attention in research and development with the LiDAR industry expected to reach 2.8 billion $ by 2025. Although the LiDAR dataset is of rich density and high spatial resolution, it is challenging to process LiDAR data due to its inherent 3D geometry and massive volume. But such a high-resolution dataset possesses immense potential in many applications and has great potential in 3D object detection and recognition. In this research we propose Graph Neural Network (GNN) based framework to learn and identify the objects in the 3D LiDAR point clouds. GNNs are class of deep learning which learns the patterns and objects based on the principle of graph learning which have shown success in various 3D computer vision tasks.  ( 2 min )
    SPEED: Experimental Design for Policy Evaluation in Linear Heteroscedastic Bandits. (arXiv:2301.12357v1 [stat.ML])
    In this paper, we study the problem of optimal data collection for policy evaluation in linear bandits. In policy evaluation, we are given a target policy and asked to estimate the expected cumulative reward it will obtain when executed in an environment formalized as a multi-armed bandit. In this paper, we focus on linear bandit setting with heteroscedastic reward noise. This is the first work that focuses on such an optimal data collection strategy for policy evaluation involving heteroscedastic reward noise in the linear bandit setting. We first formulate an optimal design for weighted least squares estimates in the heteroscedastic linear bandit setting that reduces the MSE of the target policy. We term this as policy-weighted least square estimation and use this formulation to derive the optimal behavior policy for data collection. We then propose a novel algorithm SPEED (Structured Policy Evaluation Experimental Design) that tracks the optimal behavior policy and derive its regret with respect to the optimal behavior policy. Finally, we empirically validate that SPEED leads to policy evaluation with mean squared error comparable to the oracle strategy and significantly lower than simply running the target policy.  ( 2 min )
    On Heterogeneous Treatment Effects in Heterogeneous Causal Graphs. (arXiv:2301.12383v1 [stat.ME])
    Heterogeneity and comorbidity are two interwoven challenges associated with various healthcare problems that greatly hampered research on developing effective treatment and understanding of the underlying neurobiological mechanism. Very few studies have been conducted to investigate heterogeneous causal effects (HCEs) in graphical contexts due to the lack of statistical methods. To characterize this heterogeneity, we first conceptualize heterogeneous causal graphs (HCGs) by generalizing the causal graphical model with confounder-based interactions and multiple mediators. Such confounders with an interaction with the treatment are known as moderators. This allows us to flexibly produce HCGs given different moderators and explicitly characterize HCEs from the treatment or potential mediators on the outcome. We establish the theoretical forms of HCEs and derive their properties at the individual level in both linear and nonlinear models. An interactive structural learning is developed to estimate the complex HCGs and HCEs with confidence intervals provided. Our method is empirically justified by extensive simulations and its practical usefulness is illustrated by exploring causality among psychiatric disorders for trauma survivors.  ( 2 min )
    On Learning Necessary and Sufficient Causal Graphs. (arXiv:2301.12389v1 [cs.LG])
    The causal revolution has spurred interest in understanding complex relationships in various fields. Most existing methods aim to discover causal relationships among all variables in a large-scale complex graph. However, in practice, only a small number of variables in the graph are relevant for the outcomes of interest. As a result, causal estimation with the full causal graph -- especially given limited data -- could lead to many falsely discovered, spurious variables that may be highly correlated with but have no causal impact on the target outcome. In this paper, we propose to learn a class of necessary and sufficient causal graphs (NSCG) that only contains causally relevant variables for an outcome of interest, which we term causal features. The key idea is to utilize probabilities of causation to systematically evaluate the importance of features in the causal graph, allowing us to identify a subgraph that is relevant to the outcome of interest. To learn NSCG from data, we develop a score-based necessary and sufficient causal structural learning (NSCSL) algorithm, by establishing theoretical relationships between probabilities of causation and causal effects of features. Across empirical studies of simulated and real data, we show that the proposed NSCSL algorithm outperforms existing algorithms and can reveal important yeast genes for target heritable traits of interest.  ( 2 min )
    Multi-task Highly Adaptive Lasso. (arXiv:2301.12029v1 [stat.ML])
    We propose a novel, fully nonparametric approach for the multi-task learning, the Multi-task Highly Adaptive Lasso (MT-HAL). MT-HAL simultaneously learns features, samples and task associations important for the common model, while imposing a shared sparse structure among similar tasks. Given multiple tasks, our approach automatically finds a sparse sharing structure. The proposed MTL algorithm attains a powerful dimension-free convergence rate of $o_p(n^{-1/4})$ or better. We show that MT-HAL outperforms sparsity-based MTL competitors across a wide range of simulation studies, including settings with nonlinear and linear relationships, varying levels of sparsity and task correlations, and different numbers of covariates and sample size.  ( 2 min )
    Variational Latent Branching Model for Off-Policy Evaluation. (arXiv:2301.12056v1 [cs.LG])
    Model-based methods have recently shown great potential for off-policy evaluation (OPE); offline trajectories induced by behavioral policies are fitted to transitions of Markov decision processes (MDPs), which are used to rollout simulated trajectories and estimate the performance of policies. Model-based OPE methods face two key challenges. First, as offline trajectories are usually fixed, they tend to cover limited state and action space. Second, the performance of model-based methods can be sensitive to the initialization of their parameters. In this work, we propose the variational latent branching model (VLBM) to learn the transition function of MDPs by formulating the environmental dynamics as a compact latent space, from which the next states and rewards are then sampled. Specifically, VLBM leverages and extends the variational inference framework with the recurrent state alignment (RSA), which is designed to capture as much information underlying the limited training data, by smoothing out the information flow between the variational (encoding) and generative (decoding) part of VLBM. Moreover, we also introduce the branching architecture to improve the model's robustness against randomly initialized model weights. The effectiveness of the VLBM is evaluated on the deep OPE (DOPE) benchmark, from which the training trajectories are designed to result in varied coverage of the state-action space. We show that the VLBM outperforms existing state-of-the-art OPE methods in general.  ( 2 min )
    Decentralized Entropic Optimal Transport for Privacy-preserving Distributed Distribution Comparison. (arXiv:2301.12065v1 [cs.LG])
    Privacy-preserving distributed distribution comparison measures the distance between the distributions whose data are scattered across different agents in a distributed system and cannot be shared among the agents. In this study, we propose a novel decentralized entropic optimal transport (EOT) method, which provides a privacy-preserving and communication-efficient solution to this problem with theoretical guarantees. In particular, we design a mini-batch randomized block-coordinate descent (MRBCD) scheme to optimize the decentralized EOT distance in its dual form. The dual variables are scattered across different agents and updated locally and iteratively with limited communications among partial agents. The kernel matrix involved in the gradients of the dual variables is estimated by a distributed kernel approximation method, and each agent only needs to approximate and store a sub-kernel matrix by one-shot communication and without sharing raw data. We analyze our method's communication complexity and provide a theoretical bound for the approximation error caused by the convergence error, the approximated kernel, and the mismatch between the storage and communication protocols. Experiments on synthetic data and real-world distributed domain adaptation tasks demonstrate the effectiveness of our method.  ( 2 min )
    Beyond Exponentially Fast Mixing in Average-Reward Reinforcement Learning via Multi-Level Monte Carlo Actor-Critic. (arXiv:2301.12083v1 [cs.LG])
    Many existing reinforcement learning (RL) methods employ stochastic gradient iteration on the back end, whose stability hinges upon a hypothesis that the data-generating process mixes exponentially fast with a rate parameter that appears in the step-size selection. Unfortunately, this assumption is violated for large state spaces or settings with sparse rewards, and the mixing time is unknown, making the step size inoperable. In this work, we propose an RL methodology attuned to the mixing time by employing a multi-level Monte Carlo estimator for the critic, the actor, and the average reward embedded within an actor-critic (AC) algorithm. This method, which we call \textbf{M}ulti-level \textbf{A}ctor-\textbf{C}ritic (MAC), is developed especially for infinite-horizon average-reward settings and neither relies on oracle knowledge of the mixing time in its parameter selection nor assumes its exponential decay; it, therefore, is readily applicable to applications with slower mixing times. Nonetheless, it achieves a convergence rate comparable to the state-of-the-art AC algorithms. We experimentally show that these alleviated restrictions on the technical conditions required for stability translate to superior performance in practice for RL problems with sparse rewards.  ( 2 min )
    Inference on the Optimal Assortment in the Multinomial Logit Model. (arXiv:2301.12254v1 [stat.ML])
    Assortment optimization has received active explorations in the past few decades due to its practical importance. Despite the extensive literature dealing with optimization algorithms and latent score estimation, uncertainty quantification for the optimal assortment still needs to be explored and is of great practical significance. Instead of estimating and recovering the complete optimal offer set, decision makers may only be interested in testing whether a given property holds true for the optimal assortment, such as whether they should include several products of interest in the optimal set, or how many categories of products the optimal set should include. This paper proposes a novel inferential framework for testing such properties. We consider the widely adopted multinomial logit (MNL) model, where we assume that each customer will purchase an item within the offered products with a probability proportional to the underlying preference score associated with the product. We reduce inferring a general optimal assortment property to quantifying the uncertainty associated with the sign change point detection of the marginal revenue gaps. We show the asymptotic normality of the marginal revenue gap estimator, and construct a maximum statistic via the gap estimators to detect the sign change point. By approximating the distribution of the maximum statistic with multiplier bootstrap techniques, we propose a valid testing procedure. We also conduct numerical experiments to assess the performance of our method.  ( 2 min )
    ZegOT: Zero-shot Segmentation Through Optimal Transport of Text Prompts. (arXiv:2301.12171v1 [cs.CV])
    Recent success of large-scale Contrastive Language-Image Pre-training (CLIP) has led to great promise in zero-shot semantic segmentation by transferring image-text aligned knowledge to pixel-level classification. However, existing methods usually require an additional image encoder or retraining/tuning the CLIP module. Here, we present a cost-effective strategy using text-prompt learning that keeps the entire CLIP module frozen while fully leveraging its rich information. Specifically, we propose a novel Zero-shot segmentation with Optimal Transport (ZegOT) method that matches multiple text prompts with frozen image embeddings through optimal transport, which allows each text prompt to efficiently focus on specific semantic attributes. Additionally, we propose Deep Local Feature Alignment (DLFA) that deeply aligns the text prompts with intermediate local feature of the frozen image encoder layers, which significantly boosts the zero-shot segmentation performance. Through extensive experiments on benchmark datasets, we show that our method achieves the state-of-the-art (SOTA) performance with only x7 lighter parameters compared to previous SOTA approaches.  ( 2 min )
    STEERING: Stein Information Directed Exploration for Model-Based Reinforcement Learning. (arXiv:2301.12038v1 [cs.LG])
    Directed Exploration is a crucial challenge in reinforcement learning (RL), especially when rewards are sparse. Information-directed sampling (IDS), which optimizes the information ratio, seeks to do so by augmenting regret with information gain. However, estimating information gain is computationally intractable or relies on restrictive assumptions which prohibit its use in many practical instances. In this work, we posit an alternative exploration incentive in terms of the integral probability metric (IPM) between a current estimate of the transition model and the unknown optimal, which under suitable conditions, can be computed in closed form with the kernelized Stein discrepancy (KSD). Based on KSD, we develop a novel algorithm STEERING: \textbf{STE}in information dir\textbf{E}cted exploration for model-based \textbf{R}einforcement Learn\textbf{ING}. To enable its derivation, we develop fundamentally new variants of KSD for discrete conditional distributions. We further establish that STEERING archives sublinear Bayesian regret, improving upon prior learning rates of information-augmented MBRL, IDS included. Experimentally, we show that the proposed algorithm is computationally affordable and outperforms several prior approaches.  ( 2 min )
    Minimizing Trajectory Curvature of ODE-based Generative Models. (arXiv:2301.12003v1 [cs.LG])
    Recent ODE/SDE-based generative models, such as diffusion models and flow matching, define a generative process as a time reversal of a fixed forward process. Even though these models show impressive performance on large-scale datasets, numerical simulation requires multiple evaluations of a neural network, leading to a slow sampling speed. We attribute the reason to the high curvature of the learned generative trajectories, as it is directly related to the truncation error of a numerical solver. Based on the relationship between the forward process and the curvature, here we present an efficient method of training the forward process to minimize the curvature of generative trajectories without any ODE/SDE simulation. Experiments show that our method achieves a lower curvature than previous models and, therefore, decreased sampling costs while maintaining competitive performance. Code is available at https://github.com/sangyun884/fast-ode.  ( 2 min )
    Quantum Ridgelet Transform: Winning Lottery Ticket of Neural Networks with Quantum Computation. (arXiv:2301.11936v1 [quant-ph])
    Ridgelet transform has been a fundamental mathematical tool in the theoretical studies of neural networks. However, the practical applicability of ridgelet transform to conducting learning tasks was limited since its numerical implementation by conventional classical computation requires an exponential runtime $\exp(O(D))$ as data dimension $D$ increases. To address this problem, we develop a quantum ridgelet transform (QRT), which implements the ridgelet transform of a quantum state within a linear runtime $O(D)$ of quantum computation. As an application, we also show that one can use QRT as a fundamental subroutine for quantum machine learning (QML) to efficiently find a sparse trainable subnetwork of large shallow wide neural networks without conducting large-scale optimization of the original network. This application discovers an efficient way in this regime to demonstrate the lottery ticket hypothesis on finding such a sparse trainable neural network. These results open an avenue of QML for accelerating learning tasks with commonly used classical neural networks.  ( 2 min )
    Alignment with human representations supports robust few-shot learning. (arXiv:2301.11990v1 [cs.LG])
    Should we care whether AI systems have representations of the world that are similar to those of humans? We provide an information-theoretic analysis that suggests that there should be a U-shaped relationship between the degree of representational alignment with humans and performance on few-shot learning tasks. We confirm this prediction empirically, finding such a relationship in an analysis of the performance of 491 computer vision models. We also show that highly-aligned models are more robust to both adversarial attacks and domain shifts. Our results suggest that human-alignment is often a sufficient, but not necessary, condition for models to make effective use of limited data, be robust, and generalize well.  ( 2 min )
    Leveraging Importance Weights in Subset Selection. (arXiv:2301.12052v1 [cs.LG])
    We present a subset selection algorithm designed to work with arbitrary model families in a practical batch setting. In such a setting, an algorithm can sample examples one at a time but, in order to limit overhead costs, is only able to update its state (i.e. further train model weights) once a large enough batch of examples is selected. Our algorithm, IWeS, selects examples by importance sampling where the sampling probability assigned to each example is based on the entropy of models trained on previously selected batches. IWeS admits significant performance improvement compared to other subset selection algorithms for seven publicly available datasets. Additionally, it is competitive in an active learning setting, where the label information is not available at selection time. We also provide an initial theoretical analysis to support our importance weighting approach, proving generalization and sampling rate bounds.  ( 2 min )
    Reduced-Order Autodifferentiable Ensemble Kalman Filters. (arXiv:2301.11961v1 [stat.ML])
    This paper introduces a computational framework to reconstruct and forecast a partially observed state that evolves according to an unknown or expensive-to-simulate dynamical system. Our reduced-order autodifferentiable ensemble Kalman filters (ROAD-EnKFs) learn a latent low-dimensional surrogate model for the dynamics and a decoder that maps from the latent space to the state space. The learned dynamics and decoder are then used within an ensemble Kalman filter to reconstruct and forecast the state. Numerical experiments show that if the state dynamics exhibit a hidden low-dimensional structure, ROAD-EnKFs achieve higher accuracy at lower computational cost compared to existing methods. If such structure is not expressed in the latent state dynamics, ROAD-EnKFs achieve similar accuracy at lower cost, making them a promising approach for surrogate state reconstruction and forecasting.  ( 2 min )

  • Open

    [R] Are there any Machine Learning Journals that accept Viewpoint Papers (~1500+ words)?
    Basically the title. I have a sequence of two papers - a viewpoint and a complete paper in the works - that I'm looking to submit, the viewpoint outlining the theoretical premise for the latter. I've currently had no luck finding any ML-specific journals that allow viewpoint submissions (with the exception of simply posting to arXiv), and was wondering if anyone here was familiar with any. Thanks :D submitted by /u/Adi-Dewan [link] [comments]  ( 42 min )
    [P] Python wrapper of [ New AI classifier for indicating AI-written text from openai tool ]
    Openai is developing a new tool to help distinguish between AI-written and human-written text. Here is an unofficial python wrapper of openai model to detect if a text is written by #chatgpt , #gpt3 , #gpt etc Github: https://github.com/promptslab/openai-detector https://preview.redd.it/f45ggu45tgfa1.png?width=1122&format=png&auto=webp&s=4cb5ae70d7194cc3c070f3ad2dcbc968a804d4a3 submitted by /u/StoicBatman [link] [comments]  ( 42 min )
    Introducing NoRef-ER: A Multi-Language Referenceless ASR Metric (on HuggingFace) [R] [P]
    I am proud to announce the release of NoRefER, a multi-language referenceless ASR metric based on a fine-tuned language model, for public use on HuggingFace. This metric allows for evaluating the outputs of ASR models without needing a reference transcript, making it a valuable tool for a/b testing multiple ASR models or model versions, or even ensembling their outputs. ASR is an important technology with various applications, but the quality of ASR systems can vary greatly. It's important to accurately evaluate and compare the performance of different ASR models, traditionally done using reference-based ASR quality evaluation metrics. However, obtaining those ground-truth transcriptions from human annotators is time-consuming and costly. Referenceless quality evaluation is becoming impo…  ( 44 min )
    [N] Vincent Warmerdam: Calmcode, Explosion, Open Source and Data Science | Learning From Machine Learning #2
    https://www.youtube.com/watch?v=yvgxRzqx1Jg ​ Contents 00:00 Learning from Machine Learning Intro 00:21 Vincent Warmerdam Intro 01:18 Career Journey 03:25 What roles have you played? 05:44 Academic Background: Operations Research and Design 06:52 Operations Research 08:13 Mathematics 09:19 What attracted you to Machine Learning? 10:40 Calmcode 14:08 Calmcode, Do you use it? 15:22 Calmdcode, *args, **kwargs 16:23 If there were no constraints, what would you do to improve calmcode? 18:10 Open Source Projects: bulk, embetter, human-learn 19:10 Open Source: evol, scikit-lego 20:00 Rasa: Chatbots, Benchmarking 20:47 Unit Tests 21:42 Open Source: Creating Packages 24:10 Bulk, human-learn 26:20 27:03 Bulk in a notebook, bulk as a webapp 27:45 Human in the loop 29:03 Understanding the problem; Beans, Beef and Bread 32:56 Algorithm on the wrong problem 34:55 Module Improvement vs System Improvement 37:20 Does your answer make sense? 39:04 What's an important question that you believe remains unanswered in ML? 41:48 How do you view the gap between the hype and reality of AI? 46:28 Generative Models vs. Predictive Models 49:18 Jumping to solutions 50:08 Model vs. System 50:48 51:10 Who has influenced you in the field? 55:18 Humble, Caring Presenters 56:38 What's one piece of advice that you've received that's helped you? 01:00:18 Advice for people just starting in the field 01:03:15 What has a career in machine learning taught you about life? 01:05:16 SpaCy 01:06:10 Data-Centric Approach 01:06:50 Wrap-up 01:07:15 Follow, Explosion 01:07:48 Outro submitted by /u/NLPnerd [link] [comments]  ( 43 min )
    [D] Generative Model FOr Facts Extraction
    Is it possible to finetune a generative model (like T5) to do something like this: { inputs: "XYZ XYZ was born in ABC. They now live in DEF.", targets: "XYZ born in ABC XYZ lives in DEF" } Like the transformer model fom this paper if so how should I go about approaching the problem? Is this task as simple as feeding it the inputs and targets or do you guys think it has more to it? submitted by /u/Zetsu-Eiyu-O [link] [comments]  ( 43 min )
    [D] Open-source auto-ml services
    What're some good open-source auto-ml services mainly for image classification that're similar to Google's Vertex-AI submitted by /u/binaryshrey [link] [comments]  ( 42 min )
    [N] Monitor OpenAI API Latency, Tokens, Rate Limits, and More with Graphsignal
    Relying on hosted inference with LLMs in productions, such as via OpenAI API, has some challenges. The use of APIs should be designed around unstable latency, rate limits, token counts, costs, etc. To make it observable we've built tracing and monitoring specifically for AI apps. For example, the OpenAI Python library is monitored automatically, no need to do anything. We'll be adding support for more libraries. Here is a blog post with more info and screenshots: Monitor OpenAI API Latency, Tokens, Rate Limits, and More. And the GitHub repo. submitted by /u/l0g1cs [link] [comments]  ( 42 min )
    [P] Fine Tuning Whisper in another language
    Hi all, I'm trying to fine-tune Whisper AI to transcribe albanian speech to text but I have a problem in that I don't know how the dataset for training whisper model should look like. I already have voice audios and the transcript for that audio file but I need to know how to reformat it into a valid dataset for training Whisper. Thanks in advance! submitted by /u/ruizard [link] [comments]  ( 43 min )
    [D] Have researchers given up on traditional machine learning methods?
    This may be a silly question for those familiar with the field, but don't machine learning researchers expect any more prospects for traditional methods (I mean, "traditional" is other than deep learning)? I feel that most of the time when people talk about machine learning in the world today, they are referring to deep learning, but is this the same in the academic world? Have people who have been studying traditional methods switched to neural networks? I know that many researchers are excited about deep learning, but I am wondering what they think about other methods. submitted by /u/fujidaiti [link] [comments]  ( 52 min )
  • Open

    Snoop Dogg Giving a speech about Zombies | AI Animation
    submitted by /u/oridnary_artist [link] [comments]  ( 40 min )
    Is it possible to find a specific group of layers where vanishing gradients is starting and cascading ?
    I was curious to know if there are any particular layers which could be contributing to gradients vanishing and these vanishing gradients cascading down to subsequent lower layers, how to find these layers and if there has been any research work done on it ? submitted by /u/V1bicycle [link] [comments]  ( 41 min )
    Feature selection
    Hello, hoping I can get a simple question answered. How are features selected for neural nets where each instance in time has too much data to take in? Let’s say a 4K movie is going to play and we want to know if the next 3 seconds will contain X. We also don’t have a computer strong enough to take in all data at all times. Let’s say 20% of data is maxing our system. I have watched a ton of videos about how layers work but how does the NN take samples to put those samples into its filter? Is it up to the programmer to find clever ways of filtering down the data into certain “indicators” or something? submitted by /u/Joebone87 [link] [comments]  ( 41 min )
    Biologically plausible neural networks?
    Biologically plausible neural networks? So I'm wondering if you could possibly design ANN to simulate action potentials of neurons, and train them to do various tasks. I would hope that you can accurately simulate specific patterns of neural activity, when running a task. I'm sure this has already been done, but I'm wondering how big of a task it is to accomplish. Thanks! submitted by /u/daddydilly694-20 [link] [comments]  ( 42 min )
    Recurrent neural network in python (keras) error: ValueError: `logits` and `labels` must have the same shape, received ((None, 90, 1) vs (None,))
    I'm developing a recurrent neural network in python using keras to do binary classification on roulette wheel data. I'm trying to compile my code but it's crashing, could you help me fix the code please? Here is my code from keras.models import Sequential from keras.layers import Dense, Dropout from sklearn.preprocessing import MinMaxScaler import numpy as np import pandas as pd columns = ['data', 'resultado'] base = pd.read_csv("blaze_values_27_01_2023_VERMELHO_1.csv", header = None, names = columns) base = base.dropna() base_treinamento = base.iloc[:, 1:2] normalizador = MinMaxScaler(feature_range=[0,1]) base_treinamento_normalizada = normalizador.fit_transform(base_treinamento) previsores = [] saida_real = [] for i in range (90,1809): previsores.append(base_treinamento_normalizada[i-9…  ( 42 min )
  • Open

    Snoop Dogg Giving a speech about Zombies | AI Animation
    submitted by /u/oridnary_artist [link] [comments]  ( 40 min )
    "Accurate and Explainable Image-based Prediction Using a Lightweight Generative Model"
    submitted by /u/pasticciociccio [link] [comments]  ( 40 min )
    100 000 cells simulated, how many will we need to form something that looks like consciousness?
    submitted by /u/blob_evol_sim [link] [comments]  ( 41 min )
    NeuralShare - make any AI model accessible for free from any device, regardless of its hardware resources
    Hello humans! A project is about to start (https://github.com/neuralshare) which will allow any device, even low-cpu, to use and interact with AI models that they could not use since they require so many hardware resources. The project is called NeuralShare and will use the Stellar network to achieve its goal, as you can read on the README on Github at the link on this post. More specifically, it will use the Futurenet network and to achieve its purpose it will use the NEUR token, also on Futurenet, which was already created a little while ago and has no real value, has no speculative objective but only has one purpose, namely that of the project. We plan to distribute this token as an airdrop, totally free of course, and anyone who has this token will be able to use and interact with GPT-3 or other AI models for free simply by sending a NEUR transaction with a text memo attached containing the prompt. In return you will receive back a transaction with a text memo containing the response. However, this will only be possible when there are enough nodes (go to github to understand why),the precise number of nodes that must be active before the response can be received is still to be defined . For more details and updates you can add yourself to Discord, find the links on the README file on GitHub, as well as more details on the functioning of this method which will also allow you to use GPT-3 for free and without an api-key submitted by /u/0ut0flin3 [link] [comments]  ( 43 min )
    AI Related Newsletter
    I do not even remember signing up for an AI newsletter but I got an issue today from “AINow” that was actually pretty insightful/informative. Thought I’d share it as I see frequent posts asking where ppl get their info/news from. Newsletter Here submitted by /u/iwjahshehbs [link] [comments]  ( 40 min )
    AI Still Feels Artificial. What Are We Missing?
    submitted by /u/jrowley [link] [comments]  ( 40 min )
    Stable Diffusion + Dream Fusion + Text-to-Motion. This animation has been made in 5 minutes with the AI-Game Development platform I'm building. No coding or design skills needed, just text prompt engineering. Assets exportable in Unity. Seeking alpha testers
    submitted by /u/SpeaKrLipSync [link] [comments]  ( 41 min )
    This ended super convincing: O Captain! My Captain by Benedict Cumberbatch (ElevenLabs)
    submitted by /u/citizentim [link] [comments]  ( 41 min )
    OpenAI releases AI text detector for ChatGPT and other models
    submitted by /u/much_successes [link] [comments]  ( 40 min )
    Anthropic's Claude: Ex-OpenAI Employees Launches ChatGPT Rival
    submitted by /u/bukowski3000 [link] [comments]  ( 40 min )
    Which are the opensource SOTA for voice conversion and/or voice cloning for the Indic languages ?
    Lot of ok-ish voice cloning and conversion tools are available in market, but most have an American English tone to them. What are the opensource SOTA for voice conversion and/or voice cloning for the Indic languages ? submitted by /u/Maleficent_Suit1591 [link] [comments]  ( 40 min )
    [Searchcolab] BREAKING!! And update to ChatGPT just launched.
    submitted by /u/Maleficent_Suit1591 [link] [comments]  ( 40 min )
    What tech jobs will be safe from AI over the next decade?
    I'm studying computer networking but I'm certain that AI can easily set up virtual networks and cloud computing solutions on their own. Should I have followed my dream to become a bricklayer? submitted by /u/black_linux_guy [link] [comments]  ( 41 min )
    AI Music Videos
    Does anybody know how people on Instagram and TikTok are creating these AI music videos? I see them everywhere but they gatekeep what ever they’re using. submitted by /u/mynameisbob1011 [link] [comments]  ( 40 min )
    📌[Searchcolab] "Gotham during Recession" Link in comments.
    submitted by /u/Maleficent_Suit1591 [link] [comments]  ( 40 min )
    Where do you go for news updates and insights? Not Including Here
    Hello all! ​ I am new to this subreddit, but I have begun the fascinating descent into learning about all things AI and its ilk. I would like your help. ​ I want to turn on the firehose and consume all the content I can. ​ - What are your favorite news sources and media creators? ​ - How do you find the latest tools and research papers? ​ - How do you see where the conversation is going and stay updated with new opinions? ​ This would be incredibly helpful, and I hope I posted right. submitted by /u/WaffleHouseBaby [link] [comments]  ( 41 min )
    AI Certifications
    Hi All -- I've been in IT for about 15 years. I started as a sysadmin, segued into dev work, and have been specializing in systems & data integration and business process automation for the past 5ish years. I also studied Computer Science for 4 years but never graduated (that's a whole other sobstory -- won't bore anyone with that) I've been considering the direction I want to go for the next leg of my career and I feel like AI/ML is the logical next step. I've started building a portfolio of my AI/ML projects and I'd like to pick up a cert or two to compliment them. My ultimate goal is solutions architecture, but I love the engineering side of things and expect to start there. What I want to know is what certifications everyone would recommend -- I've been eyeing the IBM-sponsored certificate tracks on Coursera but I've seen a few others and they all have their merits. I'm just not sure how much value they carry in the job market. Are there any worth staying well away from? submitted by /u/am_i_the_rabbit [link] [comments]  ( 41 min )
    DHT(BitTorrent) network that replaces websites/dns
    Soon we will have very powerful assistants.They can interact with databases directly.E.g. you ask for news articles from BBC.In the case of assistants they don't need human readable url.They can sql query database on 192.168.1.1 with ease.And then they present the data in appropriate format.And dht is just for discovery. submitted by /u/nikitastaf1996 [link] [comments]  ( 41 min )
    ChatGPT Could Destroy Google In A Few Years, According To Gmail’s Creator
    submitted by /u/liquidocelotYT [link] [comments]  ( 40 min )
    Princeton computer science professor says don’t panic over ‘bullshit generator’ ChatGPT
    submitted by /u/Mental_Character7367 [link] [comments]  ( 40 min )
    A tool to run GPT3 responses on your offline data and custom data sources
    At DocuStack, our goal is to help organisations turn their internal knowledge into an answering bot, saving time and increasing efficiency for teams. If you're familiar with the challenges of managing internal knowledge and answering repetitive questions, I believe you will find DocuStack to be a valuable tool. With features such as a searchable database, customisable answers, multi-language support, and offline file upload capabilities, DocuStack makes it easy to manage your internal knowledge and provide quick, accurate answers to your team. Here is the link https://www.docustack.ai/chat This is our first launch of a wider product for customer support. would love to hear your thoughts and do not forget join the waitlist, if you would want to try the beta version. submitted by /u/titansaurabh [link] [comments]  ( 41 min )
    Science and Engineering Fair Project regarding AI
    Hello everyone. I'm passionate about AI and just computer science in general. I would like to participate in a Science and Engineering Fair, but I want to have an idea of what I'm going to do first. What can I do involving AI and do very well in the fair? If not AI, then computer science in general. Thank you submitted by /u/Sufficient-Ad-8881 [link] [comments]  ( 42 min )
    The generative AI revolution has begun—how did we get here? | "A new class of incredibly powerful AI models has made recent breakthroughs possible."
    submitted by /u/Tao_Dragon [link] [comments]  ( 40 min )
    [D] I have been working in UI development for around 12 years having worked in pretty much everything that is there in UI. But lately I feel AI is most definitely something that has the max scope and potential in the future. So my question is how do I get into the field of AI? what would be roadmap?
    submitted by /u/anuratya [link] [comments]  ( 43 min )
  • Open

    Learning an action using PPO reinforcement learning that is also a negative reward?
    I’m doing RL on a problem where I learn 2 actions and my reward = action 1 - action 2. Since action 2 is getting subtracted, the agent learns to output 0.0 value for action 2 (action space for both the actions is between 0.0 and 0.1). Can someone please advice how can I make the agent explore non-zero values for action 2. submitted by /u/HonestScratch1827 [link] [comments]  ( 41 min )
    Multi-Agent RL for Ranged Army Combat Micro-Management (Like Dragon PvP Fight in StarCraft)
    I would like to invite interested people to collaborate on this hobby project of mine. This is still in an early-stage, and I believe it can be significantly improved together. The GitHub repository link is here: https://github.com/kayuksel/multi-rl-crowd-sim Note: The difference from StarCraft is that Dragons can hide behind each other. They also reduce their strength of hitting, propotional to decrease of their health. https://preview.redd.it/wrpcaz782dfa1.png?width=640&format=png&auto=webp&s=1dede69acb78e874a80bd532af85b269c7117f9f submitted by /u/k_yuksel [link] [comments]  ( 41 min )
    Autotuned temperature for SAC
    Has anyone ever monitored the behavior of alpha in autotuned SAC ? I implemented it and it seems to work, but I would be interested in seeing a commented graph of the evolution of alpha during the learning process, and I could not find a contribution including this. submitted by /u/Scrimbibete [link] [comments]  ( 41 min )
    Odd Reward behavior
    Hi all, I'm training an Agent (to control a platform to maintain attitude) but I'm having problems understanding the following behavior: R = A - penalty I thought adding 1.0 would increase the cumulative reward but that's not the case. R1 = A - penalty + 1.0 R1 ends up being less than R. ​ In light of this, I multiplied penalty by 10 to see what happens: R2 = A - 10.0*penalty This, increases cumulative reward (R2 > R). ​ Note that 'A' and 'penalty' are always positive values. Any idea what this means (and how to go about shaping R)? submitted by /u/XecutionStyle [link] [comments]  ( 46 min )
  • Open

    Cyberpunk 2077 Brings a Taste of the Future With DLSS
    Analyst reports. Academic papers. Ph.D. programs. There are a lot of places you can go to get a glimpse of the future. But the best place might just be El Coyote Cojo, a whiskey-soaked dive bar that doesn’t exist in real life. Fire up Cyberpunk 2077 and you’ll see much more than the watering hole’s Read article >  ( 6 min )
    Broadcaster ‘Nilson1489’ Shares Livestreaming Techniques and More This Week ‘In the NVIDIA Studio’
    Broadcasters have an arsenal of new features and technologies at their disposal; the eighth-generation NVIDIA video encoder on RTX 40 Series GPUs with support for the open AV1 video-coding format; new NVIDIA Broadcast app effects like Eye Contact and Vignette; and support for AV1 streaming in Discord.  ( 7 min )
  • Open

    New AI classifier for indicating AI-written text
    We’re launching a classifier trained to distinguish between AI-written and human-written text. We’ve trained a classifier to distinguish between text written by a human and text written by AIs from a variety of providers. While it is impossible to reliably detect all AI-written text, we believe  ( 3 min )
  • Open

    DSC Weekly 31 January 2023 – Data Models for the Weather
    Announcements Data Models for the Weather With January coming to an end, we here in the Northeast let out a collective sigh of relief as the month ends without any major snowstorms that tend to happen in the first month of the year. Weather forecasting is a centuries-old practice that has its roots in divination… Read More »DSC Weekly 31 January 2023 – Data Models for the Weather The post DSC Weekly 31 January 2023 – Data Models for the Weather appeared first on Data Science Central.  ( 19 min )
    Explaining FAIR Data to Aunt Doris
    I’m sure you’ve run into this situation yourself. You’re at a family gathering, and someone at the table asks you exactly what you do for a living. Maybe it’s your uncle, a grandparent, or a child. You try to describe in simple terms what you do, but they get a mystified expression on their face.… Read More »Explaining FAIR Data to Aunt Doris The post Explaining FAIR Data to Aunt Doris appeared first on Data Science Central.  ( 21 min )
    Java in Cloud Native Environment: All You Need To Know
    Java has been a prevalent programming language. Even today, it remains one of the top three most-used languages for developing enterprise software. New cloud-native Java runtimes must provide developers with the following four significant benefits. It helps build cloud-native, microservices, and serverless Java applications: Traditional Java applications run as containers on hardware servers that control… Read More »Java in Cloud Native Environment: All You Need To Know The post Java in Cloud Native Environment: All You Need To Know appeared first on Data Science Central.  ( 20 min )
    NIST Artificial Intelligence Risk Management Framework
    The National Institute of Standards and Technology (NIST) has released it’s Artificial Intelligence Risk Management Framework (AI RMF 1.0), a guidance document for voluntary use by organizations designing, developing, deploying, or using AI systems to help manage the many risks of AI technologies. The NIST AI Risk standards provide a practical and bipartisan perspective for… Read More »NIST Artificial Intelligence Risk Management Framework The post NIST Artificial Intelligence Risk Management Framework appeared first on Data Science Central.  ( 19 min )
    Exploding vs. Imploding: What the NFL Has to Teach Us About Managing Agile Enterprises, Part II
    In the previous article, we looked at two Ever-Successful NFL teams, the Kansas City Chiefs and the San Francisco 49ers, who seem to be able to win consistently even while things change around them and players and coaches come and go.  Then, we looked at two Never-Successful teams, the Arizona Cardinals and the Cleveland Browns,… Read More »Exploding vs. Imploding: What the NFL Has to Teach Us About Managing Agile Enterprises, Part II The post Exploding vs. Imploding: What the NFL Has to Teach Us About Managing Agile Enterprises, Part II appeared first on Data Science Central.  ( 26 min )
  • Open

    Avoid having to integrate by parts twice
    Suppose f(x) and g(x) are functions that are each proportional to their second derivative. These include exponential, circular, and hyperbolic functions. Then the integral of f(x) g(x) can be computed in closed form with a moderate amount of work. The first time you see how such integrals are computed, it’s an interesting trick. I wrote […] Avoid having to integrate by parts twice first appeared on John D. Cook.  ( 5 min )
  • Open

    Mo\^usai: Text-to-Music Generation with Long-Context Latent Diffusion. (arXiv:2301.11757v1 [cs.CL])
    The recent surge in popularity of diffusion models for image generation has brought new attention to the potential of these models in other areas of media synthesis. One area that has yet to be fully explored is the application of diffusion models to music generation. Music generation requires to handle multiple aspects, including the temporal dimension, long-term structure, multiple layers of overlapping sounds, and nuances that only trained listeners can detect. In our work, we investigate the potential of diffusion models for text-conditional music generation. We develop a cascading latent diffusion approach that can generate multiple minutes of high-quality stereo music at 48kHz from textual descriptions. For each model, we make an effort to maintain reasonable inference speed, targeting real-time on a single consumer GPU. In addition to trained models, we provide a collection of open-source libraries with the hope of facilitating future work in the field. We open-source the following: - Music samples for this paper: https://bit.ly/anonymous-mousai - All music samples for all models: https://bit.ly/audio-diffusion - Codes: https://github.com/archinetai/audio-diffusion-pytorch  ( 2 min )
    FedPop: A Bayesian Approach for Personalised Federated Learning. (arXiv:2206.03611v2 [cs.LG] UPDATED)
    Personalised federated learning (FL) aims at collaboratively learning a machine learning model taylored for each client. Albeit promising advances have been made in this direction, most of existing approaches works do not allow for uncertainty quantification which is crucial in many applications. In addition, personalisation in the cross-device setting still involves important issues, especially for new clients or those having small number of observations. This paper aims at filling these gaps. To this end, we propose a novel methodology coined FedPop by recasting personalised FL into the population modeling paradigm where clients' models involve fixed common population parameters and random effects, aiming at explaining data heterogeneity. To derive convergence guarantees for our scheme, we introduce a new class of federated stochastic optimisation algorithms which relies on Markov chain Monte Carlo methods. Compared to existing personalised FL methods, the proposed methodology has important benefits: it is robust to client drift, practical for inference on new clients, and above all, enables uncertainty quantification under mild computational and memory overheads. We provide non-asymptotic convergence guarantees for the proposed algorithms and illustrate their performances on various personalised federated learning tasks.  ( 2 min )
    Input Perturbation Reduces Exposure Bias in Diffusion Models. (arXiv:2301.11706v1 [cs.LG])
    Denoising Diffusion Probabilistic Models have shown an impressive generation quality, although their long sampling chain leads to high computational costs. In this paper, we observe that a long sampling chain also leads to an error accumulation phenomenon, which is similar to the \textbf{exposure bias} problem in autoregressive text generation. Specifically, we note that there is a discrepancy between training and testing, since the former is conditioned on the ground truth samples, while the latter is conditioned on the previously generated results. To alleviate this problem, we propose a very simple but effective training regularization, consisting in perturbing the ground truth samples to simulate the inference time prediction errors. We empirically show that the proposed input perturbation leads to a significant improvement of the sample quality while reducing both the training and the inference times. For instance, on CelebA 64$\times$64, we achieve a new state-of-the-art FID score of 1.27, while saving 37.5% of the training time.  ( 2 min )
    Learning Visual Representations for Transfer Learning by Suppressing Texture. (arXiv:2011.01901v3 [cs.CV] UPDATED)
    Recent literature has shown that features obtained from supervised training of CNNs may over-emphasize texture rather than encoding high-level information. In self-supervised learning in particular, texture as a low-level cue may provide shortcuts that prevent the network from learning higher level representations. To address these problems we propose to use classic methods based on anisotropic diffusion to augment training using images with suppressed texture. This simple method helps retain important edge information and suppress texture at the same time. We empirically show that our method achieves state-of-the-art results on object detection and image classification with eight diverse datasets in either supervised or self-supervised learning tasks such as MoCoV2 and Jigsaw. Our method is particularly effective for transfer learning tasks and we observed improved performance on five standard transfer learning datasets. The large improvements (up to 11.49\%) on the Sketch-ImageNet dataset, DTD dataset and additional visual analyses with saliency maps suggest that our approach helps in learning better representations that better transfer.  ( 2 min )
    DBGSL: Dynamic Brain Graph Structure Learning. (arXiv:2209.13513v2 [cs.LG] UPDATED)
    Recently, graph neural networks (GNNs) have shown success at learning representations of brain graphs derived from functional magnetic resonance imaging (fMRI) data. The majority of existing GNN methods, however, assume brain graphs are static over time and the graph adjacency matrix is known prior to model training. These assumptions are at odds with neuroscientific evidence that brain graphs are time-varying with a connectivity structure that depends on the choice of functional connectivity measure. Noisy brain graphs that do not truly represent the underling fMRI data can have a detrimental impact on the performance of GNNs. As a solution, we propose Dynamic Brain Graph Structure Learning (DBGSL), a novel method for learning the optimal time-varying dependency structure of fMRI data induced by a downstream prediction task. Experiments demonstrate DBGSL achieves state-of-the-art performance for sex classification using real-world resting-state and task fMRI data. Moreover, analysis of the learnt dynamic graphs highlights prediction-related brain regions which align with existing neuroscience literature.  ( 2 min )
    Normality-Guided Distributional Reinforcement Learning for Continuous Control. (arXiv:2208.13125v2 [cs.LG] UPDATED)
    Learning a predictive model of the mean return, or value function, plays a critical role in many reinforcement learning algorithms. Distributional reinforcement learning (DRL) methods instead model the value distribution, which has been shown to improve performance in many settings. In this paper, we model the value distribution as approximately normal using the Markov Chain central limit theorem. We analytically compute quantile bars to provide a new DRL target that is informed by the decrease in standard deviation that occurs over the course of an episode. In addition, we propose a policy update strategy based on uncertainty as measured by structural characteristics of the value distribution not present in the standard value function. The approach we outline is compatible with many DRL structures. We use two representative on-policy algorithms, PPO and TRPO, as testbeds and show that our methods produce performance improvements in continuous control tasks.  ( 2 min )
    Artificial Replay: A Meta-Algorithm for Harnessing Historical Data in Bandits. (arXiv:2210.00025v2 [cs.LG] UPDATED)
    How best to incorporate historical data to "warm start" bandit algorithms is an open question: naively initializing reward estimates using all historical samples can suffer from spurious data and imbalanced data coverage, leading to computational and storage issues $\unicode{x2014}$ particularly salient in continuous action spaces. We propose Artificial Replay, a meta-algorithm for incorporating historical data into any arbitrary base bandit algorithm. Artificial Replay uses only a fraction of the historical data compared to a full warm-start approach, while still achieving identical regret for base algorithms that satisfy independence of irrelevant data (IIData), a novel and broadly applicable property that we introduce. We complement these theoretical results with experiments on $K$-armed and continuous combinatorial bandit algorithms, including a green security domain using real poaching data. We show the practical benefits of Artificial Replay, including for base algorithms that do not satisfy IIData.  ( 2 min )
    Neural Additive Models for Location Scale and Shape: A Framework for Interpretable Neural Regression Beyond the Mean. (arXiv:2301.11862v1 [stat.ML])
    Deep neural networks (DNNs) have proven to be highly effective in a variety of tasks, making them the go-to method for problems requiring high-level predictive power. Despite this success, the inner workings of DNNs are often not transparent, making them difficult to interpret or understand. This lack of interpretability has led to increased research on inherently interpretable neural networks in recent years. Models such as Neural Additive Models (NAMs) achieve visual interpretability through the combination of classical statistical methods with DNNs. However, these approaches only concentrate on mean response predictions, leaving out other properties of the response distribution of the underlying data. We propose Neural Additive Models for Location Scale and Shape (NAMLSS), a modelling framework that combines the predictive power of classical deep learning models with the inherent advantages of distributional regression while maintaining the interpretability of additive models.  ( 2 min )
    Challenging Common Assumptions in Convex Reinforcement Learning. (arXiv:2202.01511v3 [cs.LG] UPDATED)
    The classic Reinforcement Learning (RL) formulation concerns the maximization of a scalar reward function. More recently, convex RL has been introduced to extend the RL formulation to all the objectives that are convex functions of the state distribution induced by a policy. Notably, convex RL covers several relevant applications that do not fall into the scalar formulation, including imitation learning, risk-averse RL, and pure exploration. In classic RL, it is common to optimize an infinite trials objective, which accounts for the state distribution instead of the empirical state visitation frequencies, even though the actual number of trajectories is always finite in practice. This is theoretically sound since the infinite trials and finite trials objectives can be proved to coincide and thus lead to the same optimal policy. In this paper, we show that this hidden assumption does not hold in the convex RL setting. In particular, we show that erroneously optimizing the infinite trials objective in place of the actual finite trials one, as it is usually done, can lead to a significant approximation error. Since the finite trials setting is the default in both simulated and real-world RL, we believe shedding light on this issue will lead to better approaches and methodologies for convex RL, impacting relevant research areas such as imitation learning, risk-averse RL, and pure exploration among others.  ( 2 min )
    Bayesian Self-Supervised Contrastive Learning. (arXiv:2301.11673v1 [cs.LG])
    Recent years have witnessed many successful applications of contrastive learning in diverse domains, yet its self-supervised version still remains many exciting challenges. As the negative samples are drawn from unlabeled datasets, a randomly selected sample may be actually a false negative to an anchor, leading to incorrect encoder training. This paper proposes a new self-supervised contrastive loss called the BCL loss that still uses random samples from the unlabeled data while correcting the resulting bias with importance weights. The key idea is to design the desired sampling distribution for sampling hard true negative samples under the Bayesian framework. The prominent advantage lies in that the desired sampling distribution is a parametric structure, with a location parameter for debiasing false negative and concentration parameter for mining hard negative, respectively. Experiments validate the effectiveness and superiority of the BCL loss.  ( 2 min )
    Rethinking Assumptions in Deep Anomaly Detection. (arXiv:2006.00339v3 [cs.LG] UPDATED)
    Though anomaly detection (AD) can be viewed as a classification problem (nominal vs. anomalous) it is usually treated in an unsupervised manner since one typically does not have access to, or it is infeasible to utilize, a dataset that sufficiently characterizes what it means to be "anomalous." In this paper we present results demonstrating that this intuition surprisingly seems not to extend to deep AD on images. For a recent AD benchmark on ImageNet, classifiers trained to discern between normal samples and just a few (64) random natural images are able to outperform the current state of the art in deep AD. Experimentally we discover that the multiscale structure of image data makes example anomalies exceptionally informative.  ( 2 min )
    Lifelong Reinforcement Learning with Modulating Masks. (arXiv:2212.11110v2 [cs.LG] UPDATED)
    Lifelong learning aims to create AI systems that continuously and incrementally learn during a lifetime, similar to biological learning. Attempts so far have met problems, including catastrophic forgetting, interference among tasks, and the inability to exploit previous knowledge. While considerable research has focused on learning multiple input distributions, typically in classification, lifelong reinforcement learning (LRL) must also deal with variations in the state and transition distributions, and in the reward functions. Modulating masks, recently developed for classification, are particularly suitable to deal with such a large spectrum of task variations. In this paper, we adapted modulating masks to work with deep LRL, specifically PPO and IMPALA agents. The comparison with LRL baselines in both discrete and continuous RL tasks shows superior performance. We further investigated the use of a linear combination of previously learned masks to exploit previous knowledge when learning new tasks: not only is learning faster, the algorithm solves tasks that we could not otherwise solve from scratch due to extremely sparse rewards. The results suggest that RL with modulating masks is a promising approach to lifelong learning, to the composition of knowledge to learn increasingly complex tasks, and to knowledge reuse for efficient and faster learning.  ( 2 min )
    The Stochastic Proximal Distance Algorithm. (arXiv:2210.12277v3 [stat.ML] UPDATED)
    Stochastic versions of proximal methods have gained much attention in statistics and machine learning. These algorithms tend to admit simple, scalable forms, and enjoy numerical stability via implicit updates. In this work, we propose and analyze a stochastic version of the recently proposed proximal distance algorithm, a class of iterative optimization methods that recover a desired constrained estimation problem as a penalty parameter $\rho \rightarrow \infty$. By uncovering connections to related stochastic proximal methods and interpreting the penalty parameter as the learning rate, we justify heuristics used in practical manifestations of the proximal distance method, establishing their convergence guarantees for the first time. Moreover, we extend recent theoretical devices to establish finite error bounds and a complete characterization of convergence rates regimes. We validate our analysis via a thorough empirical study, also showing that unsurprisingly, the proposed method outpaces batch versions on popular learning tasks.  ( 2 min )
    Can We Faithfully Represent Absence States to Compute Shapley Values on a DNN?. (arXiv:2105.10719v3 [cs.LG] UPDATED)
    Although many methods have been proposed to estimate attributions of input variables, there still exists a significant theoretical flaw in masking-based attribution methods, i.e., it is hard to examine whether the masking method faithfully represents the absence of input variables. Specifically, for masking-based attributions, setting an input variable to the baseline value is a typical way of representing the absence of the variable. However, there are no studies investigating how to represent the absence of input variables and verify the faithfulness of baseline values. Therefore, we revisit the feature representation of a DNN in terms of causality, and propose to use causal patterns to examine whether the masking method faithfully removes information encoded in input variables. More crucially, it is proven that the causality can be explained as the elementary rationale of the Shapley value. Furthermore, we define the optimal baseline value from the perspective of causality, and we propose a method to learn the optimal baseline value. Experimental results have demonstrated the effectiveness of our method.  ( 2 min )
    Theoretical Analysis of Offline Imitation With Supplementary Dataset. (arXiv:2301.11687v1 [cs.LG])
    Behavioral cloning (BC) can recover a good policy from abundant expert data, but may fail when expert data is insufficient. This paper considers a situation where, besides the small amount of expert data, a supplementary dataset is available, which can be collected cheaply from sub-optimal policies. Imitation learning with a supplementary dataset is an emergent practical framework, but its theoretical foundation remains under-developed. To advance understanding, we first investigate a direct extension of BC, called NBCU, that learns from the union of all available data. Our analysis shows that, although NBCU suffers an imitation gap that is larger than BC in the worst case, there exist special cases where NBCU performs better than or equally well as BC. This discovery implies that noisy data can also be helpful if utilized elaborately. Therefore, we further introduce a discriminator-based importance sampling technique to re-weight the supplementary data, proposing the WBCU method. With our newly developed landscape-based analysis, we prove that WBCU can outperform BC in mild conditions. Empirical studies show that WBCU simultaneously achieves the best performance on two challenging tasks where prior state-of-the-art methods fail.  ( 2 min )
    Interpreting learning in biological neural networks as zero-order optimization method. (arXiv:2301.11777v1 [cs.LG])
    Recently, significant progress has been made regarding the statistical understanding of artificial neural networks (ANNs). ANNs are motivated by the functioning of the brain, but differ in several crucial aspects. In particular, it is biologically implausible that the learning of the brain is based on gradient descent. In this work we look at the brain as a statistical method for supervised learning. The main contribution is to relate the local updating rule of the connection parameters in biological neural networks (BNNs) to a zero-order optimization method.  ( 2 min )
    Reinforcement Learning from Diverse Human Preferences. (arXiv:2301.11774v1 [cs.LG])
    The complexity of designing reward functions has been a major obstacle to the wide application of deep reinforcement learning (RL) techniques. Describing an agent's desired behaviors and properties can be difficult, even for experts. A new paradigm called reinforcement learning from human preferences (or preference-based RL) has emerged as a promising solution, in which reward functions are learned from human preference labels among behavior trajectories. However, existing methods for preference-based RL are limited by the need for accurate oracle preference labels. This paper addresses this limitation by developing a method for crowd-sourcing preference labels and learning from diverse human preferences. The key idea is to stabilize reward learning through regularization and correction in a latent space. To ensure temporal consistency, a strong constraint is imposed on the reward model that forces its latent space to be close to the prior distribution. Additionally, a confidence-based reward model ensembling method is designed to generate more stable and reliable predictions. The proposed method is tested on a variety of tasks in DMcontrol and Meta-world and has shown consistent and significant improvements over existing preference-based RL algorithms when learning from diverse feedback, paving the way for real-world applications of RL methods.  ( 2 min )
    Myriad: a real-world testbed to bridge trajectory optimization and deep learning. (arXiv:2202.10600v2 [cs.LG] UPDATED)
    We present Myriad, a testbed written in JAX for learning and planning in real-world continuous environments. The primary contributions of Myriad are threefold. First, Myriad provides machine learning practitioners access to trajectory optimization techniques for application within a typical automatic differentiation workflow. Second, Myriad presents many real-world optimal control problems, ranging from biology to medicine to engineering, for use by the machine learning community. Formulated in continuous space and time, these environments retain some of the complexity of real-world systems often abstracted away by standard benchmarks. As such, Myriad strives to serve as a stepping stone towards application of modern machine learning techniques for impactful real-world tasks. Finally, we use the Myriad repository to showcase a novel approach for learning and control tasks. Trained in a fully end-to-end fashion, our model leverages an implicit planning module over neural ordinary differential equations, enabling simultaneous learning and planning with complex environment dynamics.
    On the Relationship Between Explanation and Prediction: A Causal View. (arXiv:2212.06925v3 [cs.LG] UPDATED)
    Explainability has become a central requirement for the development, deployment, and adoption of machine learning (ML) models and we are yet to understand what explanation methods can and cannot do. Several factors such as data, model prediction, hyperparameters used in training the model, and random initialization can all influence downstream explanations. While previous work empirically hinted that explanations (E) may have little relationship with the prediction (Y), there is a lack of conclusive study to quantify this relationship. Our work borrows tools from causal inference to systematically assay this relationship. More specifically, we measure the relationship between E and Y by measuring the treatment effect when intervening on their causal ancestors (hyperparameters) (inputs to generate saliency-based Es or Ys). We discover that Y's relative direct influence on E follows an odd pattern; the influence is higher in the lowest-performing models than in mid-performing models, and it then decreases in the top-performing models. We believe our work is a promising first step towards providing better guidance for practitioners who can make more informed decisions in utilizing these explanations by knowing what factors are at play and how they relate to their end task.
    CADet: Fully Self-Supervised Out-Of-Distribution Detection With Contrastive Learning. (arXiv:2210.01742v2 [cs.LG] UPDATED)
    Handling out-of-distribution (OOD) samples has become a major stake in the real-world deployment of machine learning systems. This work explores the application of self-supervised contrastive learning to the simultaneous detection of two types of OOD samples: unseen classes and adversarial perturbations. Since in practice the distribution of such samples is not known in advance, we do not assume access to OOD examples. We first show that similarity functions trained with contrastive learning can be leveraged with the maximum mean discrepancy (MMD) two-sample test to verify whether two independent sets of samples are drawn from the same distribution. Inspired by this approach, we introduce CADet (Contrastive Anomaly Detection), a method based on contrastive transformations to perform anomaly detection on single samples. CADet compares favorably to adversarial detection methods to detect adversarially perturbed samples on ImageNet. Simultaneously, it achieves comparable performance to unseen label detection methods on two challenging benchmarks: ImageNet-O and iNaturalist. CADet is fully self-supervised and requires neither labels for in-distribution samples nor access to OOD examples.
    Model Ratatouille: Recycling Diverse Models for Out-of-Distribution Generalization. (arXiv:2212.10445v2 [cs.LG] UPDATED)
    Foundation models are redefining how AI systems are built. Practitioners now follow a standard procedure to build their machine learning solutions: from a pre-trained foundation model, they fine-tune the weights on the target task of interest. So, the Internet is swarmed by a handful of foundation models fine-tuned on many diverse tasks: these individual fine-tunings exist in isolation without benefiting from each other. In our opinion, this is a missed opportunity, as these specialized models contain rich and diverse features. In this paper, we thus propose model ratatouille, a new strategy to recycle the multiple fine-tunings of the same foundation model on diverse auxiliary tasks. Specifically, we repurpose these auxiliary weights as initializations for multiple parallel fine-tunings on the target task; then, we average all fine-tuned weights to obtain the final model. This recycling strategy aims at maximizing the diversity in weights by leveraging the diversity in auxiliary tasks. Empirically, it improves the state of the art on the reference DomainBed benchmark for out-of-distribution generalization. Looking forward, this work contributes to the emerging paradigm of updatable machine learning where, akin to open-source software development, the community collaborates to reliably update machine learning models.
    A kernel Stein test of goodness of fit for sequential models. (arXiv:2210.10741v2 [stat.ML] UPDATED)
    We propose a goodness-of-fit measure for probability densities modeling observations with varying dimensionality, such as text documents of differing lengths or variable-length sequences. The proposed measure is an instance of the kernel Stein discrepancy (KSD), which has been used to construct goodness-of-fit tests for unnormalized densities. The KSD is defined by its Stein operator: current operators used in testing apply to fixed-dimensional spaces. As our main contribution, we extend the KSD to the variable-dimension setting by identifying appropriate Stein operators, and propose a novel KSD goodness-of-fit test. As with the previous variants, the proposed KSD does not require the density to be normalized, allowing the evaluation of a large class of models. Our test is shown to perform well in practice on discrete sequential data benchmarks.
    Adapting Step-size: A Unified Perspective to Analyze and Improve Gradient-based Methods for Adversarial Attacks. (arXiv:2301.11546v1 [cs.LG])
    Learning adversarial examples can be formulated as an optimization problem of maximizing the loss function with some box-constraints. However, for solving this induced optimization problem, the state-of-the-art gradient-based methods such as FGSM, I-FGSM and MI-FGSM look different from their original methods especially in updating the direction, which makes it difficult to understand them and then leaves some theoretical issues to be addressed in viewpoint of optimization. In this paper, from the perspective of adapting step-size, we provide a unified theoretical interpretation of these gradient-based adversarial learning methods. We show that each of these algorithms is in fact a specific reformulation of their original gradient methods but using the step-size rules with only current gradient information. Motivated by such analysis, we present a broad class of adaptive gradient-based algorithms based on the regular gradient methods, in which the step-size strategy utilizing information of the accumulated gradients is integrated. Such adaptive step-size strategies directly normalize the scale of the gradients rather than use some empirical operations. The important benefit is that convergence for the iterative algorithms is guaranteed and then the whole optimization process can be stabilized. The experiments demonstrate that our AdaI-FGM consistently outperforms I-FGSM and AdaMI-FGM remains competitive with MI-FGSM for black-box attacks.
    Synthetic A/B Testing using Synthetic Interventions. (arXiv:2006.07691v5 [econ.EM] UPDATED)
    Suppose there are $N$ units and $D$ interventions. We aim to learn the average potential outcome associated with every unit-intervention pair, i.e., $N \times D$ causal parameters. While running $N \times D$ experiments is conceivable, it can be expensive or infeasible. This work introduces an experiment design, synthetic A/B testing, and the synthetic interventions (SI) estimator to recover all $N \times D$ causal parameters while observing each unit under at most two interventions, independent of $D$. Under a novel tensor factor model for potential outcomes across units, measurements, and interventions, we establish the identification of each parameter. Further, we show the SI estimator is finite-sample consistent and asymptotically normal. Collectively, these also lead to novel results for panel data settings, particularly for synthetic controls. We empirically validate our experiment design using real e-commerce data from a large-scale A/B test.
    Uplink Scheduling in Federated Learning: an Importance-Aware Approach via Graph Representation Learning. (arXiv:2301.11903v1 [cs.NI])
    Federated Learning (FL) has emerged as a promising framework for distributed training of AI-based services, applications, and network procedures in 6G. One of the major challenges affecting the performance and efficiency of 6G wireless FL systems is the massive scheduling of user devices over resource-constrained channels. In this work, we argue that the uplink scheduling of FL client devices is a problem with a rich relational structure. To address this challenge, we propose a novel, energy-efficient, and importance-aware metric for client scheduling in FL applications by leveraging Unsupervised Graph Representation Learning (UGRL). Our proposed approach introduces a relational inductive bias in the scheduling process and does not require the collection of training feedback information from client devices, unlike state-of-the-art importance-aware mechanisms. We evaluate our proposed solution against baseline scheduling algorithms based on recently proposed metrics in the literature. Results show that, when considering scenarios of nodes exhibiting spatial relations, our approach can achieve an average gain of up to 10% in model accuracy and up to 17 times in energy efficiency compared to state-of-the-art importance-aware policies.
    Incorporating Background Knowledge in Symbolic Regression using a Computer Algebra System. (arXiv:2301.11919v1 [cs.LG])
    Symbolic Regression (SR) can generate interpretable, concise expressions that fit a given dataset, allowing for more human understanding of the structure than black-box approaches. The addition of background knowledge (in the form of symbolic mathematical constraints) allows for the generation of expressions that are meaningful with respect to theory while also being consistent with data. We specifically examine the addition of constraints to traditional genetic algorithm (GA) based SR (PySR) as well as a Markov-chain Monte Carlo (MCMC) based Bayesian SR architecture (Bayesian Machine Scientist), and apply these to rediscovering adsorption equations from experimental, historical datasets. We find that, while hard constraints prevent GA and MCMC SR from searching, soft constraints can lead to improved performance both in terms of search effectiveness and model meaningfulness, with computational costs increasing by about an order-of-magnitude. If the constraints do not correlate well with the dataset or expected models, they can hinder the search of expressions. We find Bayesian SR is better these constraints (as the Bayesian prior) than by modifying the fitness function in the GA
    Synth-by-Reg (SbR): Contrastive learning for synthesis-based registration of paired images. (arXiv:2107.14449v3 [cs.CV] UPDATED)
    Nonlinear inter-modality registration is often challenging due to the lack of objective functions that are good proxies for alignment. Here we propose a synthesis-by-registration method to convert this problem into an easier intra-modality task. We introduce a registration loss for weakly supervised image translation between domains that does not require perfectly aligned training data. This loss capitalises on a registration U-Net with frozen weights, to drive a synthesis CNN towards the desired translation. We complement this loss with a structure preserving constraint based on contrastive learning, which prevents blurring and content shifts due to overfitting. We apply this method to the registration of histological sections to MRI slices, a key step in 3D histology reconstruction. Results on two different public datasets show improvements over registration based on mutual information (13% reduction in landmark error) and synthesis-based algorithms such as CycleGAN (11% reduction), and are comparable to a registration CNN with label supervision. Code and data are publicly available at \url{https://github.com/acasamitjana/SynthByReg}
    Multi-dimensional concept discovery (MCD): A unifying framework with completeness guarantees. (arXiv:2301.11911v1 [cs.LG])
    The completeness axiom renders the explanation of a post-hoc XAI method only locally faithful to the model, i.e. for a single decision. For the trustworthy application of XAI, in particular for high-stake decisions, a more global model understanding is required. Recently, concept-based methods have been proposed, which are however not guaranteed to be bound to the actual model reasoning. To circumvent this problem, we propose Multi-dimensional Concept Discovery (MCD) as an extension of previous approaches that fulfills a completeness relation on the level of concepts. Our method starts from general linear subspaces as concepts and does neither require reinforcing concept interpretability nor re-training of model parts. We propose sparse subspace clustering to discover improved concepts and fully leverage the potential of multi-dimensional subspaces. MCD offers two complementary analysis tools for concepts in input space: (1) concept activation maps, that show where a concept is expressed within a sample, allowing for concept characterization through prototypical samples, and (2) concept relevance heatmaps, that decompose the model decision into concept contributions. Both tools together enable a detailed understanding of the model reasoning, which is guaranteed to relate to the model via a completeness relation. This paves the way towards more trustworthy concept-based XAI. We empirically demonstrate the superiority of MCD against more constrained concept definitions.
    Communication-Efficient Learning of Deep Networks from Decentralized Data. (arXiv:1602.05629v4 [cs.LG] UPDATED)
    Modern mobile devices have access to a wealth of data suitable for learning models, which in turn can greatly improve the user experience on the device. For example, language models can improve speech recognition and text entry, and image models can automatically select good photos. However, this rich data is often privacy sensitive, large in quantity, or both, which may preclude logging to the data center and training there using conventional approaches. We advocate an alternative that leaves the training data distributed on the mobile devices, and learns a shared model by aggregating locally-computed updates. We term this decentralized approach Federated Learning. We present a practical method for the federated learning of deep networks based on iterative model averaging, and conduct an extensive empirical evaluation, considering five different model architectures and four datasets. These experiments demonstrate the approach is robust to the unbalanced and non-IID data distributions that are a defining characteristic of this setting. Communication costs are the principal constraint, and we show a reduction in required communication rounds by 10-100x as compared to synchronized stochastic gradient descent.
    OccRob: Efficient SMT-Based Occlusion Robustness Verification of Deep Neural Networks. (arXiv:2301.11912v1 [cs.LG])
    Occlusion is a prevalent and easily realizable semantic perturbation to deep neural networks (DNNs). It can fool a DNN into misclassifying an input image by occluding some segments, possibly resulting in severe errors. Therefore, DNNs planted in safety-critical systems should be verified to be robust against occlusions prior to deployment. However, most existing robustness verification approaches for DNNs are focused on non-semantic perturbations and are not suited to the occlusion case. In this paper, we propose the first efficient, SMT-based approach for formally verifying the occlusion robustness of DNNs. We formulate the occlusion robustness verification problem and prove it is NP-complete. Then, we devise a novel approach for encoding occlusions as a part of neural networks and introduce two acceleration techniques so that the extended neural networks can be efficiently verified using off-the-shelf, SMT-based neural network verification tools. We implement our approach in a prototype called OccRob and extensively evaluate its performance on benchmark datasets with various occlusion variants. The experimental results demonstrate our approach's effectiveness and efficiency in verifying DNNs' robustness against various occlusions, and its ability to generate counterexamples when these DNNs are not robust.
    Projected Subnetworks Scale Adaptation. (arXiv:2301.11487v1 [cs.LG])
    Large models support great zero-shot and few-shot capabilities. However, updating these models on new tasks can break performance on previous seen tasks and their zero/few-shot unseen tasks. Our work explores how to update zero/few-shot learners such that they can maintain performance on seen/unseen tasks of previous tasks as well as new tasks. By manipulating the parameter updates of a gradient-based meta learner as the projected task-specific subnetworks, we show improvements for large models to retain seen and zero/few shot task performance in online settings.
    Streaming LifeLong Learning With Any-Time Inference. (arXiv:2301.11892v1 [cs.LG])
    Despite rapid advancements in lifelong learning (LLL) research, a large body of research mainly focuses on improving the performance in the existing \textit{static} continual learning (CL) setups. These methods lack the ability to succeed in a rapidly changing \textit{dynamic} environment, where an AI agent needs to quickly learn new instances in a `single pass' from the non-i.i.d (also possibly temporally contiguous/coherent) data streams without suffering from catastrophic forgetting. For practical applicability, we propose a novel lifelong learning approach, which is streaming, i.e., a single input sample arrives in each time step, single pass, class-incremental, and subject to be evaluated at any moment. To address this challenging setup and various evaluation protocols, we propose a Bayesian framework, that enables fast parameter update, given a single training example, and enables any-time inference. We additionally propose an implicit regularizer in the form of snap-shot self-distillation, which effectively minimizes the forgetting further. We further propose an effective method that efficiently selects a subset of samples for online memory rehearsal and employs a new replay buffer management scheme that significantly boosts the overall performance. Our empirical evaluations and ablations demonstrate that the proposed method outperforms the prior works by large margins.
    Graph-Free Learning in Graph-Structured Data: A More Efficient and Accurate Spatiotemporal Learning Perspective. (arXiv:2301.11742v1 [cs.LG])
    Spatiotemporal learning, which aims at extracting spatiotemporal correlations from the collected spatiotemporal data, is a research hotspot in recent years. And considering the inherent graph structure of spatiotemporal data, recent works focus on capturing spatial dependencies by utilizing Graph Convolutional Networks (GCNs) to aggregate vertex features with the guidance of adjacency matrices. In this paper, with extensive and deep-going experiments, we comprehensively analyze existing spatiotemporal graph learning models and reveal that extracting adjacency matrices with carefully design strategies, which are viewed as the key of enhancing performance on graph learning, are largely ineffective. Meanwhile, based on these experiments, we also discover that the aggregation itself is more important than the way that how vertices are aggregated. With these preliminary, a novel efficient Graph-Free Spatial (GFS) learning module based on layer normalization for capturing spatial correlations in spatiotemporal graph learning. The proposed GFS module can be easily plugged into existing models for replacing all graph convolution components. Rigorous theoretical proof demonstrates that the time complexity of GFS is significantly better than that of graph convolution operation. Extensive experiments verify the superiority of GFS in both the perspectives of efficiency and learning effect in processing graph-structured data especially extreme large scale graph data.  ( 2 min )
    ActiveLab: Active Learning with Re-Labeling by Multiple Annotators. (arXiv:2301.11856v1 [cs.LG])
    In real-world data labeling applications, annotators often provide imperfect labels. It is thus common to employ multiple annotators to label data with some overlap between their examples. We study active learning in such settings, aiming to train an accurate classifier by collecting a dataset with the fewest total annotations. Here we propose ActiveLab, a practical method to decide what to label next that works with any classifier model and can be used in pool-based batch active learning with one or multiple annotators. ActiveLab automatically estimates when it is more informative to re-label examples vs. labeling entirely new ones. This is a key aspect of producing high quality labels and trained models within a limited annotation budget. In experiments on image and tabular data, ActiveLab reliably trains more accurate classifiers with far fewer annotations than a wide variety of popular active learning methods.
    Element selection for functional materials discovery by integrated machine learning of elemental contributions to properties. (arXiv:2202.01051v2 [cond-mat.mtrl-sci] UPDATED)
    Fundamental differences between materials originate from the unique nature of their constituent chemical elements. Before specific differences emerge according to the precise ratios of elements in a given crystal structure, a material can be represented by the set of its constituent chemical elements. By working at the level of the periodic table, assessment of materials at the level of their phase fields reduces the combinatorial complexity to accelerate screening, and circumvents the challenges associated with composition-level approaches such as poor extrapolation within phase fields, and the impossibility of exhaustive sampling. This early stage discrimination combined with evaluation of novelty of phase fields aligns with the outstanding experimental challenge of identifying new areas of chemistry to investigate, by prioritising which elements to combine in a reaction. Here, we demonstrate that phase fields can be assessed with respect to the maximum expected value of a target functional property and ranked according to chemical novelty. We develop and present PhaseSelect, an end-to-end machine learning model that combines the representation, classification, regression and ranking of phase fields. First, PhaseSelect constructs elemental characteristics from the co-occurrence of chemical elements in computationally and experimentally reported materials, then it employs attention mechanisms to learn representation for phase fields and assess their functional performance. At the level of the periodic table, PhaseSelect quantifies the probability of observing a functional property, estimates its value within a phase field and also ranks a phase field novelty, which we demonstrate with significant accuracy for three avenues of materials applications for high-temperature superconductivity, high-temperature magnetism, and targeted bandgap energy.
    Is TinyML Sustainable? Assessing the Environmental Impacts of Machine Learning on Microcontrollers. (arXiv:2301.11899v1 [cs.LG])
    The sustained growth of carbon emissions and global waste elicits significant sustainability concerns for our environment's future. The growing Internet of Things (IoT) has the potential to exacerbate this issue. However, an emerging area known as Tiny Machine Learning (TinyML) has the opportunity to help address these environmental challenges through sustainable computing practices. TinyML, the deployment of machine learning (ML) algorithms onto low-cost, low-power microcontroller systems, enables on-device sensor analytics that unlocks numerous always-on ML applications. This article discusses the potential of these TinyML applications to address critical sustainability challenges. Moreover, the footprint of this emerging technology is assessed through a complete life cycle analysis of TinyML systems. From this analysis, TinyML presents opportunities to offset its carbon emissions by enabling applications that reduce the emissions of other sectors. Nevertheless, when globally scaled, the carbon footprint of TinyML systems is not negligible, necessitating that designers factor in environmental impact when formulating new devices. Finally, research directions for enabling further opportunities for TinyML to contribute to a sustainable future are outlined.
    SWARM Parallelism: Training Large Models Can Be Surprisingly Communication-Efficient. (arXiv:2301.11913v1 [cs.DC])
    Many deep learning applications benefit from using large models with billions of parameters. Training these models is notoriously expensive due to the need for specialized HPC clusters. In this work, we consider alternative setups for training large models: using cheap "preemptible" instances or pooling existing resources from multiple regions. We analyze the performance of existing model-parallel algorithms in these conditions and find configurations where training larger models becomes less communication-intensive. Based on these findings, we propose SWARM parallelism, a model-parallel training algorithm designed for poorly connected, heterogeneous and unreliable devices. SWARM creates temporary randomized pipelines between nodes that are rebalanced in case of failure. We empirically validate our findings and compare SWARM parallelism with existing large-scale training approaches. Finally, we combine our insights with compression strategies to train a large Transformer language model with 1B shared parameters (approximately 13B before sharing) on preemptible T4 GPUs with less than 200Mb/s network.
    Policy-Value Alignment and Robustness in Search-based Multi-Agent Learning. (arXiv:2301.11857v1 [cs.AI])
    Large-scale AI systems that combine search and learning have reached super-human levels of performance in game-playing, but have also been shown to fail in surprising ways. The brittleness of such models limits their efficacy and trustworthiness in real-world deployments. In this work, we systematically study one such algorithm, AlphaZero, and identify two phenomena related to the nature of exploration. First, we find evidence of policy-value misalignment -- for many states, AlphaZero's policy and value predictions contradict each other, revealing a tension between accurate move-selection and value estimation in AlphaZero's objective. Further, we find inconsistency within AlphaZero's value function, which causes it to generalize poorly, despite its policy playing an optimal strategy. From these insights we derive VISA-VIS: a novel method that improves policy-value alignment and value robustness in AlphaZero. Experimentally, we show that our method reduces policy-value misalignment by up to 76%, reduces value generalization error by up to 50%, and reduces average value error by up to 55%.
    Automatic Modulation Classification with Deep Neural Networks. (arXiv:2301.11773v1 [cs.LG])
    Automatic modulation classification is a desired feature in many modern software-defined radios. In recent years, a number of convolutional deep learning architectures have been proposed for automatically classifying the modulation used on observed signal bursts. However, a comprehensive analysis of these differing architectures and importance of each design element has not been carried out. Thus it is unclear what tradeoffs the differing designs of these convolutional neural networks might have. In this research, we investigate numerous architectures for automatic modulation classification and perform a comprehensive ablation study to investigate the impacts of varying hyperparameters and design elements on automatic modulation classification performance. We show that a new state of the art in performance can be achieved using a subset of the studied design elements. In particular, we show that a combination of dilated convolutions, statistics pooling, and squeeze-and-excitation units results in the strongest performing classifier. We further investigate this best performer according to various other criteria, including short signal bursts, common misclassifications, and performance across differing modulation categories and modes.  ( 2 min )
    Conformal inference is (almost) free for neural networks trained with early stopping. (arXiv:2301.11556v1 [stat.ML])
    Early stopping based on hold-out data is a popular regularization technique designed to mitigate overfitting and increase the predictive accuracy of neural networks. Models trained with early stopping often provide relatively accurate predictions, but they generally still lack precise statistical guarantees unless they are further calibrated using independent hold-out data. This paper addresses the above limitation with conformalized early stopping: a novel method that combines early stopping with conformal calibration while efficiently recycling the same hold-out data. This leads to models that are both accurate and able to provide exact predictive inferences without multiple data splits nor overly conservative adjustments. Practical implementations are developed for different learning tasks -- outlier detection, multi-class classification, regression -- and their competitive performance is demonstrated on real data.
    I Prefer not to Say: Are Users Penalized for Protecting Personal Data?. (arXiv:2210.13954v3 [cs.LG] UPDATED)
    We examine the problem of obtaining fair outcomes for individuals who choose to share optional information with machine-learned models and those who do not consent and keep their data undisclosed. We find that these non-consenting users receive significantly lower prediction outcomes than justified by their provided information alone. This observation gives rise to the overlooked problem of how to ensure that users, who protect their personal data, are not penalized. While statistical fairness notions focus on fair outcomes between advantaged and disadvantaged groups, these fairness notions fail to protect the non-consenting users. To address this problem, we formalize protection requirements for models which (i) allow users to benefit from sharing optional information and (ii) do not penalize them if they keep their data undisclosed. We offer the first solution to this problem by proposing the notion of Optional Feature Fairness (OFF), which we prove to be loss-optimal under our protection requirements (i) and (ii). To learn OFF-compliant models, we devise a model-agnostic data augmentation strategy with finite sample convergence guarantees. Finally, we extensively analyze OFF on a variety of challenging real-world tasks, models, and data sets with multiple optional features.
    Aleatoric and Epistemic Discrimination in Classification. (arXiv:2301.11781v1 [cs.LG])
    Machine learning (ML) models can underperform on certain population groups due to choices made during model development and bias inherent in the data. We categorize sources of discrimination in the ML pipeline into two classes: aleatoric discrimination, which is inherent in the data distribution, and epistemic discrimination, which is due to decisions during model development. We quantify aleatoric discrimination by determining the performance limits of a model under fairness constraints, assuming perfect knowledge of the data distribution. We demonstrate how to characterize aleatoric discrimination by applying Blackwell's results on comparing statistical experiments. We then quantify epistemic discrimination as the gap between a model's accuracy given fairness constraints and the limit posed by aleatoric discrimination. We apply this approach to benchmark existing interventions and investigate fairness risks in data with missing values. Our results indicate that state-of-the-art fairness interventions are effective at removing epistemic discrimination. However, when data has missing values, there is still significant room for improvement in handling aleatoric discrimination.  ( 2 min )
    Meta-Learning Mini-Batch Risk Functionals. (arXiv:2301.11724v1 [cs.LG])
    Supervised learning typically optimizes the expected value risk functional of the loss, but in many cases, we want to optimize for other risk functionals. In full-batch gradient descent, this is done by taking gradients of a risk functional of interest, such as the Conditional Value at Risk (CVaR) which ignores some quantile of extreme losses. However, deep learning must almost always use mini-batch gradient descent, and lack of unbiased estimators of various risk functionals make the right optimization procedure unclear. In this work, we introduce a meta-learning-based method of learning an interpretable mini-batch risk functional during model training, in a single shot. When optimizing for various risk functionals, the learned mini-batch risk functions lead to risk reduction of up to 10% over hand-engineered mini-batch risk functionals. Then in a setting where the right risk functional is unknown a priori, our method improves over baseline by 14% relative (~9% absolute). We analyze the learned mini-batch risk functionals at different points through training, and find that they learn a curriculum (including warm-up periods), and that their final form can be surprisingly different from the underlying risk functional that they optimize for.
    Diverse Weight Averaging for Out-of-Distribution Generalization. (arXiv:2205.09739v2 [cs.CV] UPDATED)
    Standard neural networks struggle to generalize under distribution shifts in computer vision. Fortunately, combining multiple networks can consistently improve out-of-distribution generalization. In particular, weight averaging (WA) strategies were shown to perform best on the competitive DomainBed benchmark; they directly average the weights of multiple networks despite their nonlinearities. In this paper, we propose Diverse Weight Averaging (DiWA), a new WA strategy whose main motivation is to increase the functional diversity across averaged models. To this end, DiWA averages weights obtained from several independent training runs: indeed, models obtained from different runs are more diverse than those collected along a single run thanks to differences in hyperparameters and training procedures. We motivate the need for diversity by a new bias-variance-covariance-locality decomposition of the expected error, exploiting similarities between WA and standard functional ensembling. Moreover, this decomposition highlights that WA succeeds when the variance term dominates, which we show occurs when the marginal distribution changes at test time. Experimentally, DiWA consistently improves the state of the art on DomainBed without inference overhead.
    Leveraging the Third Dimension in Contrastive Learning. (arXiv:2301.11790v1 [cs.CV])
    Self-Supervised Learning (SSL) methods operate on unlabeled data to learn robust representations useful for downstream tasks. Most SSL methods rely on augmentations obtained by transforming the 2D image pixel map. These augmentations ignore the fact that biological vision takes place in an immersive three-dimensional, temporally contiguous environment, and that low-level biological vision relies heavily on depth cues. Using a signal provided by a pretrained state-of-the-art monocular RGB-to-depth model (the \emph{Depth Prediction Transformer}, Ranftl et al., 2021), we explore two distinct approaches to incorporating depth signals into the SSL framework. First, we evaluate contrastive learning using an RGB+depth input representation. Second, we use the depth signal to generate novel views from slightly different camera positions, thereby producing a 3D augmentation for contrastive learning. We evaluate these two approaches on three different SSL methods -- BYOL, SimSiam, and SwAV -- using ImageNette (10 class subset of ImageNet), ImageNet-100 and ImageNet-1k datasets. We find that both approaches to incorporating depth signals improve the robustness and generalization of the baseline SSL methods, though the first approach (with depth-channel concatenation) is superior. For instance, BYOL with the additional depth channel leads to an increase in downstream classification accuracy from 85.3\% to 88.0\% on ImageNette and 84.1\% to 87.0\% on ImageNet-C.
    Hyperbolic VAE via Latent Gaussian Distributions. (arXiv:2209.15217v2 [cs.LG] UPDATED)
    We propose a Gaussian manifold variational auto-encoder (GM-VAE) whose latent space consists of a set of Gaussian distributions. It is known that the set of the univariate Gaussian distributions with the Fisher information metric form a hyperbolic space, which we call a Gaussian manifold. To learn the VAE endowed with the Gaussian manifolds, we propose a pseudo-Gaussian manifold normal distribution based on the Kullback-Leibler divergence, a local approximation of the squared Fisher-Rao distance, to define a density over the latent space. In experiments, we demonstrate the efficacy of GM-VAE on two different tasks: density estimation of image datasets and environment modeling in model-based reinforcement learning. GM-VAE outperforms the other variants of hyperbolic- and Euclidean-VAEs on density estimation tasks and shows competitive performance in model-based reinforcement learning. We observe that our model provides strong numerical stability, addressing a common limitation reported in previous hyperbolic-VAEs.
    A Deep Learning Method for Comparing Bayesian Hierarchical Models. (arXiv:2301.11873v1 [stat.ML])
    Bayesian model comparison (BMC) offers a principled approach for assessing the relative merits of competing computational models and propagating uncertainty into model selection decisions. However, BMC is often intractable for the popular class of hierarchical models due to their high-dimensional nested parameter structure. To address this intractability, we propose a deep learning method for performing BMC on any set of hierarchical models which can be instantiated as probabilistic programs. Since our method enables amortized inference, it allows efficient re-estimation of posterior model probabilities and fast performance validation prior to any real-data application. In a series of extensive validation studies, we benchmark the performance of our method against the state-of-the-art bridge sampling method and demonstrate excellent amortized inference across all BMC settings. We then use our method to compare four hierarchical evidence accumulation models that have previously been deemed intractable for BMC due to partly implicit likelihoods. In this application, we corroborate evidence for the recently proposed L\'evy flight model of decision-making and show how transfer learning can be leveraged to enhance training efficiency. Reproducible code for all analyses is provided.
    Provably Efficient Causal Model-Based Reinforcement Learning for Systematic Generalization. (arXiv:2202.06545v2 [cs.LG] UPDATED)
    In the sequential decision making setting, an agent aims to achieve systematic generalization over a large, possibly infinite, set of environments. Such environments are modeled as discrete Markov decision processes with both states and actions represented through a feature vector. The underlying structure of the environments allows the transition dynamics to be factored into two components: one that is environment-specific and another that is shared. Consider a set of environments that share the laws of motion as an example. In this setting, the agent can take a finite amount of reward-free interactions from a subset of these environments. The agent then must be able to approximately solve any planning task defined over any environment in the original set, relying on the above interactions only. Can we design a provably efficient algorithm that achieves this ambitious goal of systematic generalization? In this paper, we give a partially positive answer to this question. First, we provide a tractable formulation of systematic generalization by employing a causal viewpoint. Then, under specific structural assumptions, we provide a simple learning algorithm that guarantees any desired planning error up to an unavoidable sub-optimality term, while showcasing a polynomial sample complexity.
    Deep Multi-modal Fusion of Image and Non-image Data in Disease Diagnosis and Prognosis: A Review. (arXiv:2203.15588v3 [cs.LG] UPDATED)
    The rapid development of diagnostic technologies in healthcare is leading to higher requirements for physicians to handle and integrate the heterogeneous, yet complementary data that are produced during routine practice. For instance, the personalized diagnosis and treatment planning for a single cancer patient relies on the various images (e.g., radiological, pathological, and camera images) and non-image data (e.g., clinical data and genomic data). However, such decision-making procedures can be subjective, qualitative, and have large inter-subject variabilities. With the recent advances in multi-modal deep learning technologies, an increasingly large number of efforts have been devoted to a key question: how do we extract and aggregate multi-modal information to ultimately provide more objective, quantitative computer-aided clinical decision making? This paper reviews the recent studies on dealing with such a question. Briefly, this review will include the (1) overview of current multi-modal learning workflows, (2) summarization of multi-modal fusion methods, (3) discussion of the performance, (4) applications in disease diagnosis and prognosis, and (5) challenges and future directions.
    Constrained Clustering: General Pairwise and Cardinality Constraints. (arXiv:1907.10410v2 [cs.LG] UPDATED)
    We study constrained clustering, where constraints guide the clustering process. In existing works, two categories of constraints have been widely explored, namely pairwise and cardinality constraints. Pairwise constraints enforce the cluster labels of two instances to be the same (must-link constraints) or different (cannot-link constraints). Cardinality constraints encourage cluster sizes to satisfy a user-specified distribution. Most existing constrained clustering models can only utilize one category of constraints at a time. We enforce the above two categories into a unified clustering model starting with the integer program formulation of the standard K-means. As the two categories provide different useful information, utilizing both allow for better clustering performance. However, the optimization is difficult due to the binary and quadratic constraints in the unified formulation. To solve this, we utilize two techniques: equivalently replacing the binary constraints by the intersection of two continuous constraints; the other is transforming the quadratic constraints into bi-linear constraints by introducing extra variables. We derive an equivalent continuous reformulation with simple constraints, which can be efficiently solved by Alternating Direction Method of Multipliers. Extensive experiments on both synthetic and real data demonstrate when: (1) utilizing a single category of constraint, the proposed model is superior to or competitive with SOTA constrained clustering models, and (2) utilizing both categories of constraints jointly, the proposed model shows better performance than the case of the single category. The experiments show that the proposed method exploits the constraints to achieve perfect clustering performance with improved clustering to 2%-5% in classical clustering metrics, e.g. Adjusted Random, Mirkin's, and Huber's, indices outerperfomring other methods.
    Deep Clustering Survival Machines with Interpretable Expert Distributions. (arXiv:2301.11826v1 [cs.LG])
    Conventional survival analysis methods are typically ineffective to characterize heterogeneity in the population while such information can be used to assist predictive modeling. In this study, we propose a hybrid survival analysis method, referred to as deep clustering survival machines, that combines the discriminative and generative mechanisms. Similar to the mixture models, we assume that the timing information of survival data is generatively described by a mixture of certain numbers of parametric distributions, i.e., expert distributions. We learn weights of the expert distributions for individual instances according to their features discriminatively such that each instance's survival information can be characterized by a weighted combination of the learned constant expert distributions. This method also facilitates interpretable subgrouping/clustering of all instances according to their associated expert distributions. Extensive experiments on both real and synthetic datasets have demonstrated that the method is capable of obtaining promising clustering results and competitive time-to-event predicting performance.
    A Game-Theoretic Framework for Managing Risk in Multi-Agent Systems. (arXiv:2205.15434v2 [cs.LG] UPDATED)
    In order for agents in multi-agent systems (MAS) to be safe, they need to take into account the risks posed by the actions of other agents. However, the dominant paradigm in game theory (GT) assumes that agents are not affected by risk from other agents and only strive to maximise their expected utility. For example, in hybrid human-AI driving systems, it is necessary to limit large deviations in reward resulting from car crashes. Although there are equilibrium concepts in game theory that take into account risk aversion, they either assume that agents are risk-neutral with respect to the uncertainty caused by the actions of other agents, or they are not guaranteed to exist. We introduce a new GT-based Risk-Averse Equilibrium (RAE) that always produces a solution that minimises the potential variance in reward accounting for the strategy of other agents. Theoretically and empirically, we show RAE shares many properties with a Nash Equilibrium (NE), establishing convergence properties and generalising to risk-dominant NE in certain cases. To tackle large-scale problems, we extend RAE to the PSRO multi-agent reinforcement learning (MARL) framework. We empirically demonstrate the minimum reward variance benefits of RAE in matrix games with high-risk outcomes. Results on MARL experiments show RAE generalises to risk-dominant NE in a trust dilemma game and that it reduces instances of crashing by 7x in an autonomous driving setting versus the best performing baseline.
    DAG Learning on the Permutahedron. (arXiv:2301.11898v1 [cs.LG])
    We propose a continuous optimization framework for discovering a latent directed acyclic graph (DAG) from observational data. Our approach optimizes over the polytope of permutation vectors, the so-called Permutahedron, to learn a topological ordering. Edges can be optimized jointly, or learned conditional on the ordering via a non-differentiable subroutine. Compared to existing continuous optimization approaches our formulation has a number of advantages including: 1. validity: optimizes over exact DAGs as opposed to other relaxations optimizing approximate DAGs; 2. modularity: accommodates any edge-optimization procedure, edge structural parameterization, and optimization loss; 3. end-to-end: either alternately iterates between node-ordering and edge-optimization, or optimizes them jointly. We demonstrate, on real-world data problems in protein-signaling and transcriptional network discovery, that our approach lies on the Pareto frontier of two key metrics, the SID and SHD.
    Robust variance-regularized risk minimization with concomitant scaling. (arXiv:2301.11584v1 [stat.ML])
    Under losses which are potentially heavy-tailed, we consider the task of minimizing sums of the loss mean and standard deviation, without trying to accurately estimate the variance. By modifying a technique for variance-free robust mean estimation to fit our problem setting, we derive a simple learning procedure which can be easily combined with standard gradient-based solvers to be used in traditional machine learning workflows. Empirically, we verify that our proposed approach, despite its simplicity, performs as well or better than even the best-performing candidates derived from alternative criteria such as CVaR or DRO risks on a variety of datasets.  ( 2 min )
    Sparse Mixture-of-Experts are Domain Generalizable Learners. (arXiv:2206.04046v6 [cs.CV] UPDATED)
    Human visual perception can easily generalize to out-of-distributed visual data, which is far beyond the capability of modern machine learning models. Domain generalization (DG) aims to close this gap, with existing DG methods mainly focusing on the loss function design. In this paper, we propose to explore an orthogonal direction, i.e., the design of the backbone architecture. It is motivated by an empirical finding that transformer-based models trained with empirical risk minimization (ERM) outperform CNN-based models employing state-of-the-art (SOTA) DG algorithms on multiple DG datasets. We develop a formal framework to characterize a network's robustness to distribution shifts by studying its architecture's alignment with the correlations in the dataset. This analysis guides us to propose a novel DG model built upon vision transformers, namely Generalizable Mixture-of-Experts (GMoE). Extensive experiments on DomainBed demonstrate that GMoE trained with ERM outperforms SOTA DG baselines by a large margin. Moreover, GMoE is complementary to existing DG methods and its performance is substantially improved when trained with DG algorithms.
    PECAN: A Deterministic Certified Defense Against Backdoor Attacks. (arXiv:2301.11824v1 [cs.CR])
    Neural networks are vulnerable to backdoor poisoning attacks, where the attackers maliciously poison the training set and insert triggers into the test input to change the prediction of the victim model. Existing defenses for backdoor attacks either provide no formal guarantees or come with expensive-to-compute and ineffective probabilistic guarantees. We present PECAN, an efficient and certified approach for defending against backdoor attacks. The key insight powering PECAN is to apply off-the-shelf test-time evasion certification techniques on a set of neural networks trained on disjoint partitions of the data. We evaluate PECAN on image classification and malware detection datasets. Our results demonstrate that PECAN can (1) significantly outperform the state-of-the-art certified backdoor defense, both in defense strength and efficiency, and (2) on real back-door attacks, PECAN can reduce attack success rate by order of magnitude when compared to a range of baselines from the literature.
    Naive Few-Shot Learning: Uncovering the fluid intelligence of machines. (arXiv:2205.12013v3 [cs.AI] UPDATED)
    In this paper, we aimed to help bridge the gap between human fluid intelligence - the ability to solve novel tasks without prior training - and the performance of deep neural networks, which typically require extensive prior training. An essential cognitive component for solving intelligence tests, which in humans are used to measure fluid intelligence, is the ability to identify regularities in sequences. This motivated us to construct a benchmark task, which we term \textit{sequence consistency evaluation} (SCE), whose solution requires the ability to identify regularities in sequences. Given the proven capabilities of deep networks, their ability to solve such tasks after extensive training is expected. Surprisingly, however, we show that naive (randomly initialized) deep learning models that are trained on a \textit{single} SCE with a \textit{single} optimization step can still solve non-trivial versions of the task relatively well. We extend our findings to solve, without any prior training, real-world anomaly detection tasks in the visual and auditory modalities. These results demonstrate the fluid-intelligent computational capabilities of deep networks. We discuss the implications of our work for constructing fluid-intelligent machines.
    Bi-stochastically normalized graph Laplacian: convergence to manifold Laplacian and robustness to outlier noise. (arXiv:2206.11386v2 [math.ST] UPDATED)
    Bi-stochastic normalization provides an alternative normalization of graph Laplacians in graph-based data analysis and can be computed efficiently by Sinkhorn-Knopp (SK) iterations. This paper proves the convergence of bi-stochastically normalized graph Laplacian to manifold (weighted-)Laplacian with rates, when $n$ data points are i.i.d. sampled from a general $d$-dimensional manifold embedded in a possibly high-dimensional space. Under certain joint limit of $n \to \infty$ and kernel bandwidth $\epsilon \to 0$, the point-wise convergence rate of the graph Laplacian operator (under 2-norm) is proved to be $ O( n^{-1/(d/2+3)})$ at finite large $n$ up to log factors, achieved at the scaling of $\epsilon \sim n^{-1/(d/2+3)} $. When the manifold data are corrupted by outlier noise, we theoretically prove the graph Laplacian point-wise consistency which matches the rate for clean manifold data plus an additional term proportional to the boundedness of the inner-products of the noise vectors among themselves and with data vectors. Motivated by our analysis, which suggests that not exact bi-stochastic normalization but an approximate one will achieve the same consistency rate, we propose an approximate and constrained matrix scaling problem that can be solved by SK iterations with early termination. Numerical experiments support our theoretical results and show the robustness of bi-stochastically normalized graph Laplacian to high-dimensional outlier noise.
    Accelerating Domain-aware Deep Learning Models with Distributed Training. (arXiv:2301.11787v1 [cs.LG])
    Recent advances in data-generating techniques led to an explosive growth of geo-spatiotemporal data. In domains such as hydrology, ecology, and transportation, interpreting the complex underlying patterns of spatiotemporal interactions with the help of deep learning techniques hence becomes the need of the hour. However, applying deep learning techniques without domain-specific knowledge tends to provide sub-optimal prediction performance. Secondly, training such models on large-scale data requires extensive computational resources. To eliminate these challenges, we present a novel distributed domain-aware spatiotemporal network that utilizes domain-specific knowledge with improved model performance. Our network consists of a pixel-contribution block, a distributed multiheaded multichannel convolutional (CNN) spatial block, and a recurrent temporal block. We choose flood prediction in hydrology as a use case to test our proposed method. From our analysis, the network effectively predicts high peaks in discharge measurements at watershed outlets with up to 4.1x speedup and increased prediction performance of up to 93\%. Our approach achieved a 12.6x overall speedup and increased the mean prediction performance by 16\%. We perform extensive experiments on a dataset of 23 watersheds in a northern state of the U.S. and present our findings.
    Overparameterized Linear Regression under Adversarial Attacks. (arXiv:2204.06274v2 [stat.ML] UPDATED)
    We study the error of linear regression in the face of adversarial attacks. In this framework, an adversary changes the input to the regression model in order to maximize the prediction error. We provide bounds on the prediction error in the presence of an adversary as a function of the parameter norm and the error in the absence of such an adversary. We show how these bounds make it possible to study the adversarial error using analysis from non-adversarial setups. The obtained results shed light on the robustness of overparameterized linear models to adversarial attacks. Adding features might be either a source of additional robustness or brittleness. On the one hand, we use asymptotic results to illustrate how double-descent curves can be obtained for the adversarial error. On the other hand, we derive conditions under which the adversarial error can grow to infinity as more features are added, while at the same time, the test error goes to zero. We show this behavior is caused by the fact that the norm of the parameter vector grows with the number of features. It is also established that $\ell_\infty$ and $\ell_2$-adversarial attacks might behave fundamentally differently due to how the $\ell_1$ and $\ell_2$-norms of random projections concentrate. We also show how our reformulation allows for solving adversarial training as a convex optimization problem. This fact is then exploited to establish similarities between adversarial training and parameter-shrinking methods and to study how the training might affect the robustness of the estimated models.
    Feature Selection on Quantum Computers. (arXiv:2203.13261v2 [quant-ph] UPDATED)
    In machine learning, fewer features reduce model complexity. Carefully assessing the influence of each input feature on the model quality is therefore a crucial preprocessing step. We propose a novel feature selection algorithm based on a quadratic unconstrained binary optimization (QUBO) problem, which allows to select a specified number of features based on their importance and redundancy. In contrast to iterative or greedy methods, our direct approach yields higherquality solutions. QUBO problems are particularly interesting because they can be solved on quantum hardware. To evaluate our proposed algorithm, we conduct a series of numerical experiments using a classical computer, a quantum gate computer and a quantum annealer. Our evaluation compares our method to a range of standard methods on various benchmark datasets. We observe competitive performance.  ( 2 min )
    Constrained Monotonic Neural Networks. (arXiv:2205.11775v2 [cs.LG] UPDATED)
    Deep neural networks are becoming increasingly popular in approximating arbitrary functions from noisy data. But wider adoption is being hindered by the need to explain such models and to impose additional constraints on them. Monotonicity constraint is one of the most requested properties in real-world scenarios and is the focus of this paper. One of the oldest ways to construct a monotonic fully connected neural network is to constrain its weights to be non-negative while employing a monotonic activation function. Unfortunately, this construction does not work with popular non-saturated activation functions such as ReLU, ELU, SELU etc, as it can only approximate convex functions. We show this shortcoming can be fixed by employing the original activation function for a part of the neurons in the layer, and employing its point reflection for the other part. Our experiments show this approach of building monotonic deep neural networks have matching or better accuracy when compared to other state-of-the-art methods such as deep lattice networks or monotonic networks obtained by heuristic regularization. This method is the simplest one in the sense of having the least number of parameters, not requiring any modifications to the learning procedure or steps post-learning steps.
    Embrace the Gap: VAEs Perform Independent Mechanism Analysis. (arXiv:2206.02416v3 [stat.ML] UPDATED)
    Variational autoencoders (VAEs) are a popular framework for modeling complex data distributions; they can be efficiently trained via variational inference by maximizing the evidence lower bound (ELBO), at the expense of a gap to the exact (log-)marginal likelihood. While VAEs are commonly used for representation learning, it is unclear why ELBO maximization would yield useful representations, since unregularized maximum likelihood estimation cannot invert the data-generating process. Yet, VAEs often succeed at this task. We seek to elucidate this apparent paradox by studying nonlinear VAEs in the limit of near-deterministic decoders. We first prove that, in this regime, the optimal encoder approximately inverts the decoder -- a commonly used but unproven conjecture -- which we refer to as {\em self-consistency}. Leveraging self-consistency, we show that the ELBO converges to a regularized log-likelihood. This allows VAEs to perform what has recently been termed independent mechanism analysis (IMA): it adds an inductive bias towards decoders with column-orthogonal Jacobians, which helps recovering the true latent factors. The gap between ELBO and log-likelihood is therefore welcome, since it bears unanticipated benefits for nonlinear representation learning. In experiments on synthetic and image data, we show that VAEs uncover the true latent factors when the data generating process satisfies the IMA assumption.
    SOBER: Scalable Batch Bayesian Optimization and Quadrature using Recombination Constraints. (arXiv:2301.11832v1 [cs.LG])
    Batch Bayesian optimisation (BO) has shown to be a sample-efficient method of performing optimisation where expensive-to-evaluate objective functions can be queried in parallel. However, current methods do not scale to large batch sizes -- a frequent desideratum in practice (e.g. drug discovery or simulation-based inference). We present a novel algorithm, SOBER, which permits scalable and diversified batch BO with arbitrary acquisition functions, arbitrary input spaces (e.g. graph), and arbitrary kernels. The key to our approach is to reformulate batch selection for BO as a Bayesian quadrature (BQ) problem, which offers computational advantages. This reformulation is beneficial in solving BQ tasks reciprocally, which introduces the exploitative functionality of BO to BQ. We show that SOBER offers substantive performance gains in synthetic and real-world tasks, including drug discovery and simulation-based inference.
    Image Restoration with Mean-Reverting Stochastic Differential Equations. (arXiv:2301.11699v1 [cs.LG])
    This paper presents a stochastic differential equation (SDE) approach for general-purpose image restoration. The key construction consists in a mean-reverting SDE that transforms a high-quality image into a degraded counterpart as a mean state with fixed Gaussian noise. Then, by simulating the corresponding reverse-time SDE, we are able to restore the origin of the low-quality image without relying on any task-specific prior knowledge. Crucially, the proposed mean-reverting SDE has a closed-form solution, allowing us to compute the ground truth time-dependent score and learn it with a neural network. Moreover, we propose a maximum likelihood objective to learn an optimal reverse trajectory which stabilizes the training and improves the restoration results. In the experiments, we show that our proposed method achieves highly competitive performance in quantitative comparisons on image deraining, deblurring, and denoising, setting a new state-of-the-art on two deraining datasets. Finally, the general applicability of our approach is further demonstrated via qualitative results on image super-resolution, inpainting, and dehazing. Code is available at \url{https://github.com/Algolzw/image-restoration-sde}.
    Efficiently predicting high resolution mass spectra with graph neural networks. (arXiv:2301.11419v1 [cs.LG])
    Identifying a small molecule from its mass spectrum is the primary open problem in computational metabolomics. This is typically cast as information retrieval: an unknown spectrum is matched against spectra predicted computationally from a large database of chemical structures. However, current approaches to spectrum prediction model the output space in ways that force a tradeoff between capturing high resolution mass information and tractable learning. We resolve this tradeoff by casting spectrum prediction as a mapping from an input molecular graph to a probability distribution over molecular formulas. We discover that a large corpus of mass spectra can be closely approximated using a fixed vocabulary constituting only 2% of all observed formulas. This enables efficient spectrum prediction using an architecture similar to graph classification - GrAFF-MS - achieving significantly lower prediction error and orders-of-magnitude faster runtime than state-of-the-art methods.
    Constrained Submodular Optimization for Vaccine Design. (arXiv:2206.08336v2 [q-bio.QM] UPDATED)
    Advances in machine learning have enabled the prediction of immune system responses to prophylactic and therapeutic vaccines. However, the engineering task of designing vaccines remains a challenge. In particular, the genetic variability of the human immune system makes it difficult to design peptide vaccines that provide widespread immunity in vaccinated populations. We introduce a framework for evaluating and designing peptide vaccines that uses probabilistic machine learning models, and demonstrate its ability to produce designs for a SARS-CoV-2 vaccine that outperform previous designs. We provide a theoretical analysis of the approximability, scalability, and complexity of our framework.
    Regret Analysis of Learning-Based MPC with Partially-Unknown Cost Function. (arXiv:2108.02307v2 [math.OC] UPDATED)
    The exploration/exploitation trade-off is an inherent challenge in data-driven adaptive control. Though this trade-off has been studied for multi-armed bandits (MAB's) and reinforcement learning for linear systems; it is less well-studied for learning-based control of nonlinear systems. A significant theoretical challenge in the nonlinear setting is that there is no explicit characterization of an optimal controller for a given set of cost and system parameters. We propose the use of a finite-horizon oracle controller with full knowledge of parameters as a reasonable surrogate to optimal controller. This allows us to develop policies in the context of learning-based MPC and MAB's and conduct a control-theoretic analysis using techniques from MPC- and optimization-theory to show these policies achieve low regret with respect to this finite-horizon oracle. Our simulations exhibit the low regret of our policy on a heating, ventilation, and air-conditioning model with partially-unknown cost function.  ( 2 min )
    Achieving Risk Control in Online Learning Settings. (arXiv:2205.09095v7 [cs.LG] UPDATED)
    To provide rigorous uncertainty quantification for online learning models, we develop a framework for constructing uncertainty sets that provably control risk -- such as coverage of confidence intervals, false negative rate, or F1 score -- in the online setting. This extends conformal prediction to apply to a larger class of online learning problems. Our method guarantees risk control at any user-specified level even when the underlying data distribution shifts drastically, even adversarially, over time in an unknown fashion. The technique we propose is highly flexible as it can be applied with any base online learning algorithm (e.g., a deep neural network trained online), requiring minimal implementation effort and essentially zero additional computational cost. We further extend our approach to control multiple risks simultaneously, so the prediction sets we generate are valid for all given risks. To demonstrate the utility of our method, we conduct experiments on real-world tabular time-series data sets showing that the proposed method rigorously controls various natural risks. Furthermore, we show how to construct valid intervals for an online image-depth estimation problem that previous sequential calibration schemes cannot handle.
    Safe Posterior Sampling for Constrained MDPs with Bounded Constraint Violation. (arXiv:2301.11547v1 [cs.LG])
    Constrained Markov decision processes (CMDPs) model scenarios of sequential decision making with multiple objectives that are increasingly important in many applications. However, the model is often unknown and must be learned online while still ensuring the constraint is met, or at least the violation is bounded with time. Some recent papers have made progress on this very challenging problem but either need unsatisfactory assumptions such as knowledge of a safe policy, or have high cumulative regret. We propose the Safe PSRL (posterior sampling-based RL) algorithm that does not need such assumptions and yet performs very well, both in terms of theoretical regret bounds as well as empirically. The algorithm achieves an efficient tradeoff between exploration and exploitation by use of the posterior sampling principle, and provably suffers only bounded constraint violation by leveraging the idea of pessimism. Our approach is based on a primal-dual approach. We establish a sub-linear $\tilde{\mathcal{ O}}\left(H^{2.5} \sqrt{|\mathcal{S}|^2 |\mathcal{A}| K} \right)$ upper bound on the Bayesian reward objective regret along with a bounded, i.e., $\tilde{\mathcal{O}}\left(1\right)$ constraint violation regret over $K$ episodes for an $|\mathcal{S}|$-state, $|\mathcal{A}|$-action and horizon $H$ CMDP.  ( 2 min )
    Targeted Attacks on Timeseries Forecasting. (arXiv:2301.11544v1 [cs.LG])
    Real-world deep learning models developed for Time Series Forecasting are used in several critical applications ranging from medical devices to the security domain. Many previous works have shown how deep learning models are prone to adversarial attacks and studied their vulnerabilities. However, the vulnerabilities of time series models for forecasting due to adversarial inputs are not extensively explored. While the attack on a forecasting model might aim to deteriorate the performance of the model, it is more effective, if the attack is focused on a specific impact on the model's output. In this paper, we propose a novel formulation of Directional, Amplitudinal, and Temporal targeted adversarial attacks on time series forecasting models. These targeted attacks create a specific impact on the amplitude and direction of the output prediction. We use the existing adversarial attack techniques from the computer vision domain and adapt them for time series. Additionally, we propose a modified version of the Auto Projected Gradient Descent attack for targeted attacks. We examine the impact of the proposed targeted attacks versus untargeted attacks. We use KS-Tests to statistically demonstrate the impact of the attack. Our experimental results show how targeted attacks on time series models are viable and are more powerful in terms of statistical similarity. It is, hence difficult to detect through statistical methods. We believe that this work opens a new paradigm in the time series forecasting domain and represents an important consideration for developing better defenses.  ( 2 min )
    Decentralized Online Bandit Optimization on Directed Graphs with Regret Bounds. (arXiv:2301.11802v1 [cs.LG])
    We consider a decentralized multiplayer game, played over $T$ rounds, with a leader-follower hierarchy described by a directed acyclic graph. For each round, the graph structure dictates the order of the players and how players observe the actions of one another. By the end of each round, all players receive a joint bandit-reward based on their joint action that is used to update the player strategies towards the goal of minimizing the joint pseudo-regret. We present a learning algorithm inspired by the single-player multi-armed bandit problem and show that it achieves sub-linear joint pseudo-regret in the number of rounds for both adversarial and stochastic bandit rewards. Furthermore, we quantify the cost incurred due to the decentralized nature of our problem compared to the centralized setting.
    Multimodal and Explainable Internet Meme Classification. (arXiv:2212.05612v2 [cs.AI] UPDATED)
    Warning: this paper contains content that may be offensive or upsetting. In the current context where online platforms have been effectively weaponized in a variety of geo-political events and social issues, Internet memes make fair content moderation at scale even more difficult. Existing work on meme classification and tracking has focused on black-box methods that do not explicitly consider the semantics of the memes or the context of their creation. In this paper, we pursue a modular and explainable architecture for Internet meme understanding. We design and implement multimodal classification methods that perform example- and prototype-based reasoning over training cases, while leveraging both textual and visual SOTA models to represent the individual cases. We study the relevance of our modular and explainable models in detecting harmful memes on two existing tasks: Hate Speech Detection and Misogyny Classification. We compare the performance between example- and prototype-based methods, and between text, vision, and multimodal models, across different categories of harmfulness (e.g., stereotype and objectification). We devise a user-friendly interface that facilitates the comparative analysis of examples retrieved by all of our models for any given meme, informing the community about the strengths and limitations of these explainable methods.  ( 2 min )
    Certified Invertibility in Neural Networks via Mixed-Integer Programming. (arXiv:2301.11783v1 [cs.LG])
    Neural networks are notoriously vulnerable to adversarial attacks -- small imperceptible perturbations that can change the network's output drastically. In the reverse direction, there may exist large, meaningful perturbations that leave the network's decision unchanged (excessive invariance, nonivertibility). We study the latter phenomenon in two contexts: (a) discrete-time dynamical system identification, as well as (b) calibration of the output of one neural network to the output of another (neural network matching). For ReLU networks and $L_p$ norms ($p=1,2,\infty$), we formulate these optimization problems as mixed-integer programs (MIPs) that apply to neural network approximators of dynamical systems. We also discuss the applicability of our results to invertibility certification in transformations between neural networks (e.g. at different levels of pruning).
    Personalised Federated Learning On Heterogeneous Feature Spaces. (arXiv:2301.11447v1 [cs.LG])
    Most personalised federated learning (FL) approaches assume that raw data of all clients are defined in a common subspace i.e. all clients store their data according to the same schema. For real-world applications, this assumption is restrictive as clients, having their own systems to collect and then store data, may use heterogeneous data representations. We aim at filling this gap. To this end, we propose a general framework coined FLIC that maps client's data onto a common feature space via local embedding functions. The common feature space is learnt in a federated manner using Wasserstein barycenters while the local embedding functions are trained on each client via distribution alignment. We integrate this distribution alignement mechanism into a federated learning approach and provide the algorithmics of FLIC. We compare its performances against FL benchmarks involving heterogeneous input features spaces. In addition, we provide theoretical insights supporting the relevance of our methodology.  ( 2 min )
    Distributionally Robust Multi-objective Bayesian Optimization under Uncertain Environments. (arXiv:2301.11588v1 [stat.ML])
    In this study, we address the problem of optimizing multi-output black-box functions under uncertain environments. We formulate this problem as the estimation of the uncertain Pareto-frontier (PF) of a multi-output Bayesian surrogate model with two types of variables: design variables and environmental variables. We consider this problem within the context of Bayesian optimization (BO) under uncertain environments, where the design variables are controllable, whereas the environmental variables are assumed to be random and not controllable. The challenge of this problem is to robustly estimate the PF when the distribution of the environmental variables is unknown, that is, to estimate the PF when the environmental variables are generated from the worst possible distribution. We propose a method for solving the BO problem by appropriately incorporating the uncertainties of the environmental variables and their probability distribution. We demonstrate that the proposed method can find an arbitrarily accurate PF with high probability in a finite number of iterations. We also evaluate the performance of the proposed method through numerical experiments.  ( 2 min )
    Big portfolio selection by graph-based conditional moments method. (arXiv:2301.11697v1 [stat.ML])
    How to do big portfolio selection is very important but challenging for both researchers and practitioners. In this paper, we propose a new graph-based conditional moments (GRACE) method to do portfolio selection based on thousands of stocks or more. The GRACE method first learns the conditional quantiles and mean of stock returns via a factor-augmented temporal graph convolutional network, which guides the learning procedure through a factor-hypergraph built by the set of stock-to-stock relations from the domain knowledge as well as the set of factor-to-stock relations from the asset pricing knowledge. Next, the GRACE method learns the conditional variance, skewness, and kurtosis of stock returns from the learned conditional quantiles by using the quantiled conditional moment (QCM) method. The QCM method is a supervised learning procedure to learn these conditional higher-order moments, so it largely overcomes the computational difficulty from the classical high-dimensional GARCH-type methods. Moreover, the QCM method allows the mis-specification in modeling conditional quantiles to some extent, due to its regression-based nature. Finally, the GRACE method uses the learned conditional mean, variance, skewness, and kurtosis to construct several performance measures, which are criteria to sort the stocks to proceed the portfolio selection in the well-known 10-decile framework. An application to NASDAQ and NYSE stock markets shows that the GRACE method performs much better than its competitors, particularly when the performance measures are comprised of conditional variance, skewness, and kurtosis.
    Collaborative Regret Minimization in Multi-Armed Bandits. (arXiv:2301.11442v1 [cs.LG])
    In this paper, we study the collaborative learning model, which concerns the tradeoff between parallelism and communication overhead in multi-agent reinforcement learning. For a fundamental problem in bandit theory, regret minimization in multi-armed bandits, we present the first and almost tight tradeoffs between the number of rounds of communication between the agents and the regret of the collaborative learning process.  ( 2 min )
    Generalized Munchausen Reinforcement Learning using Tsallis KL Divergence. (arXiv:2301.11476v1 [cs.LG])
    Many policy optimization approaches in reinforcement learning incorporate a Kullback-Leilbler (KL) divergence to the previous policy, to prevent the policy from changing too quickly. This idea was initially proposed in a seminal paper on Conservative Policy Iteration, with approximations given by algorithms like TRPO and Munchausen Value Iteration (MVI). We continue this line of work by investigating a generalized KL divergence -- called the Tsallis KL divergence -- which use the $q$-logarithm in the definition. The approach is a strict generalization, as $q = 1$ corresponds to the standard KL divergence; $q > 1$ provides a range of new options. We characterize the types of policies learned under the Tsallis KL, and motivate when $q >1$ could be beneficial. To obtain a practical algorithm that incorporates Tsallis KL regularization, we extend MVI, which is one of the simplest approaches to incorporate KL regularization. We show that this generalized MVI($q$) obtains significant improvements over the standard MVI($q = 1$) across 35 Atari games.
    Neural Abstractions. (arXiv:2301.11683v1 [cs.LO])
    We present a novel method for the safety verification of nonlinear dynamical models that uses neural networks to represent abstractions of their dynamics. Neural networks have extensively been used before as approximators; in this work, we make a step further and use them for the first time as abstractions. For a given dynamical model, our method synthesises a neural network that overapproximates its dynamics by ensuring an arbitrarily tight, formally certified bound on the approximation error. For this purpose, we employ a counterexample-guided inductive synthesis procedure. We show that this produces a neural ODE with non-deterministic disturbances that constitutes a formal abstraction of the concrete model under analysis. This guarantees a fundamental property: if the abstract model is safe, i.e., free from any initialised trajectory that reaches an undesirable state, then the concrete model is also safe. By using neural ODEs with ReLU activation functions as abstractions, we cast the safety verification problem for nonlinear dynamical models into that of hybrid automata with affine dynamics, which we verify using SpaceEx. We demonstrate that our approach performs comparably to the mature tool Flow* on existing benchmark nonlinear models. We additionally demonstrate and that it is effective on models that do not exhibit local Lipschitz continuity, which are out of reach to the existing technologies.
    Semi-Parametric Video-Grounded Text Generation. (arXiv:2301.11507v1 [cs.CV])
    Efficient video-language modeling should consider the computational cost because of a large, sometimes intractable, number of video frames. Parametric approaches such as the attention mechanism may not be ideal since its computational cost quadratically increases as the video length increases. Rather, previous studies have relied on offline feature extraction or frame sampling to represent the video efficiently, focusing on cross-modal modeling in short video clips. In this paper, we propose a semi-parametric video-grounded text generation model, SeViT, a novel perspective on scalable video-language modeling toward long untrimmed videos. Treating a video as an external data store, SeViT includes a non-parametric frame retriever to select a few query-relevant frames from the data store for a given query and a parametric generator to effectively aggregate the frames with the query via late fusion methods. Experimental results demonstrate our method has a significant advantage in longer videos and causal video understanding. Moreover, our model achieves the new state of the art on four video-language datasets, iVQA (+4.8), Next-QA (+6.9), and Activitynet-QA (+4.8) in accuracy, and MSRVTT-Caption (+3.6) in CIDEr.
    Single-Trajectory Distributionally Robust Reinforcement Learning. (arXiv:2301.11721v1 [stat.ML])
    As a framework for sequential decision-making, Reinforcement Learning (RL) has been regarded as an essential component leading to Artificial General Intelligence (AGI). However, RL is often criticized for having the same training environment as the test one, which also hinders its application in the real world. To mitigate this problem, Distributionally Robust RL (DRRL) is proposed to improve the worst performance in a set of environments that may contain the unknown test environment. Due to the nonlinearity of the robustness goal, most of the previous work resort to the model-based approach, learning with either an empirical distribution learned from the data or a simulator that can be sampled infinitely, which limits their applications in simple dynamics environments. In contrast, we attempt to design a DRRL algorithm that can be trained along a single trajectory, i.e., no repeated sampling from a state. Based on the standard Q-learning, we propose distributionally robust Q-learning with the single trajectory (DRQ) and its average-reward variant named differential DRQ. We provide asymptotic convergence guarantees and experiments for both settings, demonstrating their superiority in the perturbed environments against the non-robust ones.
    Outcome-directed Reinforcement Learning by Uncertainty & Temporal Distance-Aware Curriculum Goal Generation. (arXiv:2301.11741v1 [cs.LG])
    Current reinforcement learning (RL) often suffers when solving a challenging exploration problem where the desired outcomes or high rewards are rarely observed. Even though curriculum RL, a framework that solves complex tasks by proposing a sequence of surrogate tasks, shows reasonable results, most of the previous works still have difficulty in proposing curriculum due to the absence of a mechanism for obtaining calibrated guidance to the desired outcome state without any prior domain knowledge. To alleviate it, we propose an uncertainty & temporal distance-aware curriculum goal generation method for the outcome-directed RL via solving a bipartite matching problem. It could not only provide precisely calibrated guidance of the curriculum to the desired outcome states but also bring much better sample efficiency and geometry-agnostic curriculum goal proposal capability compared to previous curriculum RL methods. We demonstrate that our algorithm significantly outperforms these prior methods in a variety of challenging navigation tasks and robotic manipulation tasks in a quantitative and qualitative way.
    Semi-Supervised Machine Learning: a Homological Approach. (arXiv:2301.11658v1 [cs.LG])
    In this paper we describe the mathematical foundations of a new approach to semi-supervised Machine Learning. Using techniques of Symbolic Computation and Computer Algebra, we apply the concept of persistent homology to obtain a new semi-supervised learning method.
    Modeling human road crossing decisions as reward maximization with visual perception limitations. (arXiv:2301.11737v1 [cs.LG])
    Understanding the interaction between different road users is critical for road safety and automated vehicles (AVs). Existing mathematical models on this topic have been proposed based mostly on either cognitive or machine learning (ML) approaches. However, current cognitive models are incapable of simulating road user trajectories in general scenarios, and ML models lack a focus on the mechanisms generating the behavior and take a high-level perspective which can cause failures to capture important human-like behaviors. Here, we develop a model of human pedestrian crossing decisions based on computational rationality, an approach using deep reinforcement learning (RL) to learn boundedly optimal behavior policies given human constraints, in our case a model of the limited human visual system. We show that the proposed combined cognitive-RL model captures human-like patterns of gap acceptance and crossing initiation time. Interestingly, our model's decisions are sensitive to not only the time gap, but also the speed of the approaching vehicle, something which has been described as a "bias" in human gap acceptance behavior. However, our results suggest that this is instead a rational adaption to human perceptual limitations. Moreover, we demonstrate an approach to accounting for individual differences in computational rationality models, by conditioning the RL policy on the parameters of the human constraints. Our results demonstrate the feasibility of generating more human-like road user behavior by combining RL with cognitive models.
    LegendreTron: Uprising Proper Multiclass Loss Learning. (arXiv:2301.11695v1 [stat.ML])
    Loss functions serve as the foundation of supervised learning and are often chosen prior to model development. To avoid potentially ad hoc choices of losses, statistical decision theory describes a desirable property for losses known as \emph{properness}, which asserts that Bayes' rule is optimal. Recent works have sought to \emph{learn losses} and models jointly. Existing methods do this by fitting an inverse canonical link function which monotonically maps $\mathbb{R}$ to $[0,1]$ to estimate probabilities for binary problems. In this paper, we extend monotonicity to maps between $\mathbb{R}^{C-1}$ and the projected probability simplex $\tilde{\Delta}^{C-1}$ by using monotonicity of gradients of convex functions. We present {\sc LegendreTron} as a novel and practical method that jointly learns \emph{proper canonical losses} and probabilities for multiclass problems. Tested on a benchmark of domains with up to 1,000 classes, our experimental results show that our method consistently outperforms the natural multiclass baseline under a $t$-test at 99% significance on all datasets with greater than 10 classes.
    Synopsis: Sequential Decision Problems with Weak Feedback. (arXiv:2212.11599v2 [cs.LG] UPDATED)
    This thesis considers sequential decision problems, where the loss/reward incurred by selecting an action may not be inferred from observed feedback. A major part of this thesis focuses on the unsupervised sequential selection problem, where one can not infer the loss incurred for selecting an action from observed feedback. We also introduce a new setup named Censored Semi Bandits, where the loss incurred for selecting an action can be observed under certain conditions. Finally, we study the channel selection problem in the communication networks, where the reward for an action is only observed when no other player selects that action to play in the round. These problems find applications in many fields like healthcare, crowd-sourcing, security, adaptive resource allocation, among many others. This thesis aims to address the above-described sequential decision problems by exploiting specific structures these problems exhibit. We develop provably optimal algorithms for each of these setups with weak feedback and validate their empirical performance on different problem instances derived from synthetic and real datasets.  ( 2 min )
    SLCNN: Sentence-Level Convolutional Neural Network for Text Classification. (arXiv:2301.11696v1 [cs.CL])
    Text classification is a fundamental task in natural language processing (NLP). Several recent studies show the success of deep learning on text processing. Convolutional neural network (CNN), as a popular deep learning model, has shown remarkable success in the task of text classification. In this paper, new baseline models have been studied for text classification using CNN. In these models, documents are fed to the network as a three-dimensional tensor representation to provide sentence-level analysis. Applying such a method enables the models to take advantage of the positional information of the sentences in the text. Besides, analysing adjacent sentences allows extracting additional features. The proposed models have been compared with the state-of-the-art models using several datasets. The results have shown that the proposed models have better performance, particularly in the longer documents.
    Uplifting Message Passing Neural Network with Graph Original Information. (arXiv:2210.05382v2 [cs.LG] UPDATED)
    Message passing neural networks (MPNNs) learn the representation of graph-structured data based on graph original information, including node features and graph structures, and have shown astonishing improvement in node classification tasks. However, the expressive power of MPNNs is upper bounded by the first-order Weisfeiler-Leman test and its accuracy still has room for improvement. This work studies how to improve MPNNs' expressiveness and generalizability by fully exploiting graph original information both theoretically and empirically. It further proposes a new GNN model called INGNN (INformation-enhanced Graph Neural Network) that leverages the insights to improve node classification performance. Extensive experiments on both synthetic and real datasets demonstrate the superiority (average rank 1.78) of our INGNN compared with state-of-the-art methods.  ( 2 min )
    Soft Labels for Rapid Satellite Object Detection. (arXiv:2212.00585v3 [cs.CV] UPDATED)
    Soft labels in image classification are vector representations of an image's true classification. In this paper, we investigate soft labels in the context of satellite object detection. We propose using detections as the basis for a new dataset of soft labels. Much of the effort in creating a high-quality model is gathering and annotating the training data. If we could use a model to generate a dataset for us, we could not only rapidly create datasets, but also supplement existing open-source datasets. Using a subset of the xView dataset, we train a YOLOv5 model to detect cars, planes, and ships. We then use that model to generate soft labels for the second training set which we then train and compare to the original model. We show that soft labels can be used to train a model that is almost as accurate as a model trained on the original data.  ( 2 min )
    Feasibility and Transferability of Transfer Learning: A Mathematical Framework. (arXiv:2301.11542v1 [cs.LG])
    Transfer learning is an emerging and popular paradigm for utilizing existing knowledge from previous learning tasks to improve the performance of new ones. Despite its numerous empirical successes, theoretical analysis for transfer learning is limited. In this paper we build for the first time, to the best of our knowledge, a mathematical framework for the general procedure of transfer learning. Our unique reformulation of transfer learning as an optimization problem allows for the first time, analysis of its feasibility. Additionally, we propose a novel concept of transfer risk to evaluate transferability of transfer learning. Our numerical studies using the Office-31 dataset demonstrate the potential and benefits of incorporating transfer risk in the evaluation of transfer learning performance.
    A Robust Optimisation Perspective on Counterexample-Guided Repair of Neural Networks. (arXiv:2301.11342v1 [cs.LG])
    Counterexample-guided repair aims at creating neural networks with mathematical safety guarantees, facilitating the application of neural networks in safety-critical domains. However, whether counterexample-guided repair is guaranteed to terminate remains an open question. We approach this question by showing that counterexample-guided repair can be viewed as a robust optimisation algorithm. While termination guarantees for neural network repair itself remain beyond our reach, we prove termination for more restrained machine learning models and disprove termination in a general setting. We empirically study the practical implications of our theoretical results, demonstrating the suitability of common verifiers and falsifiers for repair despite a disadvantageous theoretical result. Additionally, we use our theoretical insights to devise a novel algorithm for repairing linear regression models, surpassing existing approaches.  ( 2 min )
    MLExchange: A web-based platform enabling exchangeable machine learning workflows for scientific studies. (arXiv:2208.09751v4 [cs.LG] UPDATED)
    Machine learning (ML) algorithms are showing a growing trend in helping the scientific communities across different disciplines and institutions to address large and diverse data problems. However, many available ML tools are programmatically demanding and computationally costly. The MLExchange project aims to build a collaborative platform equipped with enabling tools that allow scientists and facility users who do not have a profound ML background to use ML and computational resources in scientific discovery. At the high level, we are targeting a full user experience where managing and exchanging ML algorithms, workflows, and data are readily available through web applications. Since each component is an independent container, the whole platform or its individual service(s) can be easily deployed at servers of different scales, ranging from a personal device (laptop, smart phone, etc.) to high performance clusters (HPC) accessed (simultaneously) by many users. Thus, MLExchange renders flexible using scenarios -- users could either access the services and resources from a remote server or run the whole platform or its individual service(s) within their local network.  ( 2 min )
    Finite-time analysis of single-timescale actor-critic. (arXiv:2210.09921v2 [cs.LG] UPDATED)
    Actor-critic methods have achieved significant success in many challenging applications. However, its finite-time convergence is still poorly understood in its most practical form. Existing works on analyzing single-timescale actor-critic only focus on the i.i.d. sampling or tabular setting for simplicity. We consider the more practical online single-timescale actor-critic algorithm on continuous state space, where the critic is updated with a single Markovian sample per actor step. Existing analysis cannot conclude the convergence for such a challenging case. We prove that the online single-timescale actor-critic method is guaranteed to find an $\epsilon$-approximate stationary point with $\widetilde{\mathcal{O}}(\epsilon^{-2})$ sample complexity under standard assumptions, which can be further improved to $\mathcal{O}(\epsilon^{-2})$ under the i.i.d. sampling. We develop a novel framework that evaluates and controls the error propagation between actor and critic systematically. To our knowledge, this is the first finite-time analysis for the online single-timescale actor-critic method. Our results compare favorably to the existing literature in terms of considering the most practical yet challenging settings and requiring weaker assumptions.  ( 2 min )
    Phy-Q as a measure for physical reasoning intelligence. (arXiv:2108.13696v3 [cs.AI] UPDATED)
    Humans are well-versed in reasoning about the behaviors of physical objects and choosing actions accordingly to accomplish tasks, while it remains a major challenge for AI. To facilitate research addressing this problem, we propose a new testbed that requires an agent to reason about physical scenarios and take an action appropriately. Inspired by the physical knowledge acquired in infancy and the capabilities required for robots to operate in real-world environments, we identify 15 essential physical scenarios. We create a wide variety of distinct task templates, and we ensure all the task templates within the same scenario can be solved by using one specific strategic physical rule. By having such a design, we evaluate two distinct levels of generalization, namely the local generalization and the broad generalization. We conduct an extensive evaluation with human players, learning agents with varying input types and architectures, and heuristic agents with different strategies. Inspired by how human IQ is calculated, we define the physical reasoning quotient (Phy-Q score) that reflects the physical reasoning intelligence of an agent using the physical scenarios we considered. Our evaluation shows that 1) all agents are far below human performance, and 2) learning agents, even with good local generalization ability, struggle to learn the underlying physical reasoning rules and fail to generalize broadly. We encourage the development of intelligent agents that can reach the human level Phy-Q score. Website: https://github.com/phy-q/benchmark  ( 2 min )
    When Do Flat Minima Optimizers Work?. (arXiv:2202.00661v5 [cs.LG] UPDATED)
    Recently, flat-minima optimizers, which seek to find parameters in low-loss neighborhoods, have been shown to improve a neural network's generalization performance over stochastic and adaptive gradient-based optimizers. Two methods have received significant attention due to their scalability: 1. Stochastic Weight Averaging (SWA), and 2. Sharpness-Aware Minimization (SAM). However, there has been limited investigation into their properties and no systematic benchmarking of them across different domains. We fill this gap here by comparing the loss surfaces of the models trained with each method and through broad benchmarking across computer vision, natural language processing, and graph representation learning tasks. We discover several surprising findings from these results, which we hope will help researchers further improve deep learning optimizers, and practitioners identify the right optimizer for their problem.  ( 2 min )
    Differential Privacy has Bounded Impact on Fairness in Classification. (arXiv:2210.16242v2 [cs.LG] UPDATED)
    We theoretically study the impact of differential privacy on fairness in classification. We prove that, given a class of models, popular group fairness measures are pointwise Lipschitz-continuous with respect to the parameters of the model. This result is a consequence of a more general statement on accuracy conditioned on an arbitrary event (such as membership to a sensitive group), which may be of independent interest. We use the aforementioned Lipschitz property to prove a high probability bound showing that, given enough examples, the fairness level of private models is close to the one of their non-private counterparts.  ( 2 min )
    Optimized Sparse Matrix Operations for Reverse Mode Automatic Differentiation. (arXiv:2212.05159v2 [cs.LG] UPDATED)
    Sparse matrix representations are ubiquitous in computational science and machine learning, leading to significant reductions in compute time, in comparison to dense representation, for problems that have local connectivity. The adoption of sparse representation in leading ML frameworks such as PyTorch is incomplete, however, with support for both automatic differentiation and GPU acceleration missing. In this work, we present an implementation of a CSR-based sparse matrix wrapper for PyTorch with CUDA acceleration for basic matrix operations, as well as automatic differentiability. We also present several applications of the resulting sparse kernels to optimization problems, demonstrating ease of implementation and performance measurements versus their dense counterparts.  ( 2 min )
    On Instance-Dependent Bounds for Offline Reinforcement Learning with Linear Function Approximation. (arXiv:2211.13208v2 [cs.LG] UPDATED)
    Sample-efficient offline reinforcement learning (RL) with linear function approximation has recently been studied extensively. Much of prior work has yielded the minimax-optimal bound of $\tilde{\mathcal{O}}(\frac{1}{\sqrt{K}})$, with $K$ being the number of episodes in the offline data. In this work, we seek to understand instance-dependent bounds for offline RL with function approximation. We present an algorithm called Bootstrapped and Constrained Pessimistic Value Iteration (BCP-VI), which leverages data bootstrapping and constrained optimization on top of pessimism. We show that under a partial data coverage assumption, that of \emph{concentrability} with respect to an optimal policy, the proposed algorithm yields a fast rate of $\tilde{\mathcal{O}}(\frac{1}{K})$ for offline RL when there is a positive gap in the optimal Q-value functions, even when the offline data were adaptively collected. Moreover, when the linear features of the optimal actions in the states reachable by an optimal policy span those reachable by the behavior policy and the optimal actions are unique, offline RL achieves absolute zero sub-optimality error when $K$ exceeds a (finite) instance-dependent threshold. To the best of our knowledge, these are the first $\tilde{\mathcal{O}}(\frac{1}{K})$ bound and absolute zero sub-optimality bound respectively for offline RL with linear function approximation from adaptive data with partial coverage. We also provide instance-agnostic and instance-dependent information-theoretical lower bounds to complement our upper bounds.  ( 2 min )
    Fast Bayesian Inference with Batch Bayesian Quadrature via Kernel Recombination. (arXiv:2206.04734v4 [cs.LG] UPDATED)
    Calculation of Bayesian posteriors and model evidences typically requires numerical integration. Bayesian quadrature (BQ), a surrogate-model-based approach to numerical integration, is capable of superb sample efficiency, but its lack of parallelisation has hindered its practical applications. In this work, we propose a parallelised (batch) BQ method, employing techniques from kernel quadrature, that possesses an empirically exponential convergence rate. Additionally, just as with Nested Sampling, our method permits simultaneous inference of both posteriors and model evidence. Samples from our BQ surrogate model are re-selected to give a sparse set of samples, via a kernel recombination algorithm, requiring negligible additional time to increase the batch size. Empirically, we find that our approach significantly outperforms the sampling efficiency of both state-of-the-art BQ techniques and Nested Sampling in various real-world datasets, including lithium-ion battery analytics.  ( 2 min )
    Integrating Random Effects in Deep Neural Networks. (arXiv:2206.03314v3 [stat.ML] UPDATED)
    Modern approaches to supervised learning like deep neural networks (DNNs) typically implicitly assume that observed responses are statistically independent. In contrast, correlated data are prevalent in real-life large-scale applications, with typical sources of correlation including spatial, temporal and clustering structures. These correlations are either ignored by DNNs, or ad-hoc solutions are developed for specific use cases. We propose to use the mixed models framework to handle correlated data in DNNs. By treating the effects underlying the correlation structure as random effects, mixed models are able to avoid overfitted parameter estimates and ultimately yield better predictive performance. The key to combining mixed models and DNNs is using the Gaussian negative log-likelihood (NLL) as a natural loss function that is minimized with DNN machinery including stochastic gradient descent (SGD). Since NLL does not decompose like standard DNN loss functions, the use of SGD with NLL presents some theoretical and implementation challenges, which we address. Our approach which we call LMMNN is demonstrated to improve performance over natural competitors in various correlation scenarios on diverse simulated and real datasets. Our focus is on a regression setting and tabular datasets, but we also show some results for classification. Our code is available at https://github.com/gsimchoni/lmmnn.  ( 2 min )
    CROWDLAB: Supervised learning to infer consensus labels and quality scores for data with multiple annotators. (arXiv:2210.06812v2 [cs.LG] UPDATED)
    Real-world data for classification is often labeled by multiple annotators. For analyzing such data, we introduce CROWDLAB, a straightforward approach to utilize any trained classifier to estimate: (1) A consensus label for each example that aggregates the available annotations; (2) A confidence score for how likely each consensus label is correct; (3) A rating for each annotator quantifying the overall correctness of their labels. Existing algorithms to estimate related quantities in crowdsourcing often rely on sophisticated generative models with iterative inference. CROWDLAB instead uses a straightforward weighted ensemble. Existing algorithms often rely solely on annotator statistics, ignoring the features of the examples from which the annotations derive. CROWDLAB utilizes any classifier model trained on these features, and can thus better generalize between examples with similar features. On real-world multi-annotator image data, our proposed method provides superior estimates for (1)-(3) than existing algorithms like Dawid-Skene/GLAD.  ( 2 min )
    MACE: Higher Order Equivariant Message Passing Neural Networks for Fast and Accurate Force Fields. (arXiv:2206.07697v2 [stat.ML] UPDATED)
    Creating fast and accurate force fields is a long-standing challenge in computational chemistry and materials science. Recently, several equivariant message passing neural networks (MPNNs) have been shown to outperform models built using other approaches in terms of accuracy. However, most MPNNs suffer from high computational cost and poor scalability. We propose that these limitations arise because MPNNs only pass two-body messages leading to a direct relationship between the number of layers and the expressivity of the network. In this work, we introduce MACE, a new equivariant MPNN model that uses higher body order messages. In particular, we show that using four-body messages reduces the required number of message passing iterations to just two, resulting in a fast and highly parallelizable model, reaching or exceeding state-of-the-art accuracy on the rMD17, 3BPA, and AcAc benchmark tasks. We also demonstrate that using higher order messages leads to an improved steepness of the learning curves.  ( 2 min )
    Generalizability of Adversarial Robustness Under Distribution Shifts. (arXiv:2209.15042v2 [cs.LG] UPDATED)
    Recent progress in empirical and certified robustness promises to deliver reliable and deployable Deep Neural Networks (DNNs). Despite that success, most existing evaluations of DNN robustness have been done on images sampled from the same distribution on which the model was trained. However, in the real world, DNNs may be deployed in dynamic environments that exhibit significant distribution shifts. In this work, we take a first step towards thoroughly investigating the interplay between empirical and certified adversarial robustness on one hand and domain generalization on another. To do so, we train robust models on multiple domains and evaluate their accuracy and robustness on an unseen domain. We observe that: (1) both empirical and certified robustness generalize to unseen domains, and (2) the level of generalizability does not correlate well with input visual similarity, measured by the FID between source and target domains. We also extend our study to cover a real-world medical application, in which adversarial augmentation significantly boosts the generalization of robustness with minimal effect on clean data accuracy.  ( 2 min )
    Commonsense Knowledge Salience Evaluation with a Benchmark Dataset in E-commerce. (arXiv:2205.10843v2 [cs.CL] CROSS LISTED)
    In e-commerce, the salience of commonsense knowledge (CSK) is beneficial for widespread applications such as product search and recommendation. For example, when users search for ``running'' in e-commerce, they would like to find products highly related to running, such as ``running shoes'' rather than ``shoes''. Nevertheless, many existing CSK collections rank statements solely by confidence scores, and there is no information about which ones are salient from a human perspective. In this work, we define the task of supervised salience evaluation, where given a CSK triple, the model is required to learn whether the triple is salient or not. In addition to formulating the new task, we also release a new Benchmark dataset of Salience Evaluation in E-commerce (BSEE) and hope to promote related research on commonsense knowledge salience evaluation. We conduct experiments in the dataset with several representative baseline models. The experimental results show that salience evaluation is a challenging task where models perform poorly on our evaluation set. We further propose a simple but effective approach, PMI-tuning, which shows promise for solving this novel problem. Code is available in \url{https://github.com/OpenBGBenchmark/OpenBG-CSK.  ( 2 min )
    Task-Agnostic Graph Neural Network Evaluation via Adversarial Collaboration. (arXiv:2301.11517v1 [cs.LG])
    It has been increasingly demanding to develop reliable Graph Neural Network (GNN) evaluation methods to quantify the progress of the rapidly expanding GNN research. Existing GNN benchmarking methods focus on comparing the GNNs with respect to their performances on some node/graph classification/regression tasks in certain datasets. There lacks a principled, task-agnostic method to directly compare two GNNs. Moreover, most of the existing graph self-supervised learning (SSL) works incorporate handcrafted augmentations to the graph, which has several severe difficulties due to the unique characteristics of graph-structured data. To address the aforementioned issues, we propose GraphAC (Graph Adversarial Collaboration) -- a conceptually novel, principled, task-agnostic, and stable framework for evaluating GNNs through contrastive self-supervision. GraphAC succeeds in distinguishing GNNs of different expressiveness across various aspects, and has been proven to be a principled and reliable GNN evaluation method, eliminating the need for handcrafted augmentations for stable SSL.  ( 2 min )
    Learning and generalization of one-hidden-layer neural networks, going beyond standard Gaussian data. (arXiv:2207.03615v2 [cs.LG] UPDATED)
    This paper analyzes the convergence and generalization of training a one-hidden-layer neural network when the input features follow the Gaussian mixture model consisting of a finite number of Gaussian distributions. Assuming the labels are generated from a teacher model with an unknown ground truth weight, the learning problem is to estimate the underlying teacher model by minimizing a non-convex risk function over a student neural network. With a finite number of training samples, referred to the sample complexity, the iterations are proved to converge linearly to a critical point with guaranteed generalization error. In addition, for the first time, this paper characterizes the impact of the input distributions on the sample complexity and the learning rate.  ( 2 min )
    GATE: Gated Additive Tree Ensemble for Tabular Classification and Regression. (arXiv:2207.08548v4 [cs.LG] UPDATED)
    We propose a novel high-performance, parameter and computationally efficient deep learning architecture for tabular data, Gated Additive Tree Ensemble(GATE). GATE uses a gating mechanism, inspired from GRU, as a feature representation learning unit with an in-built feature selection mechanism. We combine it with an ensemble of differentiable, non-linear decision trees, re-weighted with simple self-attention to predict our desired output. We demonstrate that GATE is a competitive alternative to SOTA approaches like GBDTs, NODE, FT Transformers, etc. by experiments on several public datasets (both classification and regression). We have made available the code at https://github.com/manujosephv/GATE under MIT License.  ( 2 min )
    Explaining Patterns in Data with Language Models via Interpretable Autoprompting. (arXiv:2210.01848v2 [cs.LG] UPDATED)
    Large language models (LLMs) have displayed an impressive ability to harness natural language to perform complex tasks. In this work, we explore whether we can leverage this learned ability to find and explain patterns in data. Specifically, given a pre-trained LLM and data examples, we introduce interpretable autoprompting (iPrompt), an algorithm that generates a natural-language string explaining the data. iPrompt iteratively alternates between generating explanations with an LLM and reranking them based on their performance when used as a prompt. Experiments on a wide range of datasets, from synthetic mathematics to natural-language understanding, show that iPrompt can yield meaningful insights by accurately finding groundtruth dataset descriptions. Moreover, the prompts produced by iPrompt are simultaneously human-interpretable and highly effective for generalization: on real-world sentiment classification datasets, iPrompt produces prompts that match or even improve upon human-written prompts for GPT-3. Finally, experiments with an fMRI dataset show the potential for iPrompt to aid in scientific discovery. All code for using the methods and data here is made available on Github.  ( 2 min )
    Robust Multi-Agent Bandits Over Undirected Graphs. (arXiv:2203.00076v2 [cs.LG] UPDATED)
    We consider a multi-agent multi-armed bandit setting in which $n$ honest agents collaborate over a network to minimize regret but $m$ malicious agents can disrupt learning arbitrarily. Assuming the network is the complete graph, existing algorithms incur $O( (m + K/n) \log (T) / \Delta )$ regret in this setting, where $K$ is the number of arms and $\Delta$ is the arm gap. For $m \ll K$, this improves over the single-agent baseline regret of $O(K\log(T)/\Delta)$. In this work, we show the situation is murkier beyond the case of a complete graph. In particular, we prove that if the state-of-the-art algorithm is used on the undirected line graph, honest agents can suffer (nearly) linear regret until time is doubly exponential in $K$ and $n$. In light of this negative result, we propose a new algorithm for which the $i$-th agent has regret $O( ( d_{\text{mal}}(i) + K/n) \log(T)/\Delta)$ on any connected and undirected graph, where $d_{\text{mal}}(i)$ is the number of $i$'s neighbors who are malicious. Thus, we generalize existing regret bounds beyond the complete graph (where $d_{\text{mal}}(i) = m$), and show the effect of malicious agents is entirely local (in the sense that only the $d_{\text{mal}}(i)$ malicious agents directly connected to $i$ affect its long-term regret).  ( 2 min )
    Statistical Inference for the Dynamic Time Warping Distance, with Application to Abnormal Time-Series Detection. (arXiv:2202.06593v2 [stat.ML] UPDATED)
    We study statistical inference on the similarity/distance between two time-series under uncertain environment by considering a statistical hypothesis test on the distance obtained from Dynamic Time Warping (DTW) algorithm. The sampling distribution of the DTW distance is too difficult to derive because it is obtained based on the solution of the DTW algorithm, which is complicated. To circumvent this difficulty, we propose to employ the conditional selective inference framework, which enables us to derive a valid inference method on the DTW distance. To our knowledge, this is the first method that can provide a valid p-value to quantify the statistical significance of the DTW distance, which is helpful for high-stake decision making such as abnormal time-series detection problems. We evaluate the performance of the proposed inference method on both synthetic and real-world datasets.  ( 2 min )
    Constrained Parameter Inference as a Principle for Learning. (arXiv:2203.13203v5 [cs.NE] UPDATED)
    Learning in neural networks is often framed as a problem in which targeted error signals are directly propagated to parameters and used to produce updates that induce more optimal network behaviour. Backpropagation of error (BP) is an example of such an approach and has proven to be a highly successful application of stochastic gradient descent to deep neural networks. We propose constrained parameter inference (COPI) as a new principle for learning. The COPI approach assumes that learning can be set up in a manner where parameters infer their own values based upon observations of their local neuron activities. We find that this estimation of network parameters is possible under the constraints of decorrelated neural inputs and top-down perturbations of neural states for credit assignment. We show that the decorrelation required for COPI allows learning at extremely high learning rates, competitive with that of adaptive optimizers, as used by BP. We further demonstrate that COPI affords a new approach to feature analysis and network compression. Finally, we argue that COPI may shed new light on learning in biological networks given the evidence for decorrelation in the brain.
    Demystifying Reinforcement Learning in Time-Varying Systems. (arXiv:2201.05560v2 [cs.LG] UPDATED)
    Recent research has turned to Reinforcement Learning (RL) to solve challenging decision problems, as an alternative to hand-tuned heuristics. RL can learn good policies without the need for modeling the environment's dynamics. Despite this promise, RL remains an impractical solution for many real-world systems problems. A particularly challenging case occurs when the environment changes over time, i.e. it exhibits non-stationarity. In this work, we characterize the challenges introduced by non-stationarity, shed light on the range of approaches to them and develop a robust framework for addressing them to train RL agents in live systems. Such agents must explore and learn new environments, without hurting the system's performance, and remember them over time. To this end, our framework (i) identifies different environments encountered by the live system, (ii) triggers exploration when necessary, (iii) takes precautions to retain knowledge from prior environments, and (iv) employs safeguards to protect the system's performance when the RL agent makes mistakes. We apply our framework to two systems problems, straggler mitigation and adaptive video streaming, and evaluate it against a variety of alternative approaches using real-world and synthetic data. We show that all components of the framework are necessary to cope with non-stationarity and provide guidance on alternative design choices for each component.
    AdaBoost is not an Optimal Weak to Strong Learner. (arXiv:2301.11571v1 [cs.LG])
    AdaBoost is a classic boosting algorithm for combining multiple inaccurate classifiers produced by a weak learner, to produce a strong learner with arbitrarily high accuracy when given enough training data. Determining the optimal number of samples necessary to obtain a given accuracy of the strong learner, is a basic learning theoretic question. Larsen and Ritzert (NeurIPS'22) recently presented the first provably optimal weak-to-strong learner. However, their algorithm is somewhat complicated and it remains an intriguing question whether the prototypical boosting algorithm AdaBoost also makes optimal use of training samples. In this work, we answer this question in the negative. Concretely, we show that the sample complexity of AdaBoost, and other classic variations thereof, are sub-optimal by at least one logarithmic factor in the desired accuracy of the strong learner.  ( 2 min )
    A critical look at deep neural network for dynamic system modeling. (arXiv:2301.11604v1 [cs.LG])
    Neural network models become increasingly popular as dynamic modeling tools in the control community. They have many appealing features including nonlinear structures, being able to approximate any functions. While most researchers hold optimistic attitudes towards such models, this paper questions the capability of (deep) neural networks for the modeling of dynamic systems using input-output data. For the identification of linear time-invariant (LTI) dynamic systems, two representative neural network models, Long Short-Term Memory (LSTM) and Cascade Foward Neural Network (CFNN) are compared to the standard Prediction Error Method (PEM) of system identification. In the comparison, four essential aspects of system identification are considered, then several possible defects and neglected issues of neural network based modeling are pointed out. Detailed simulation studies are performed to verify these defects: for the LTI system, both LSTM and CFNN fail to deliver consistent models even in noise-free cases; and they give worse results than PEM in noisy cases.  ( 2 min )
    Neural Episodic Control with State Abstraction. (arXiv:2301.11490v1 [cs.LG])
    Existing Deep Reinforcement Learning (DRL) algorithms suffer from sample inefficiency. Generally, episodic control-based approaches are solutions that leverage highly-rewarded past experiences to improve sample efficiency of DRL algorithms. However, previous episodic control-based approaches fail to utilize the latent information from the historical behaviors (e.g., state transitions, topological similarities, etc.) and lack scalability during DRL training. This work introduces Neural Episodic Control with State Abstraction (NECSA), a simple but effective state abstraction-based episodic control containing a more comprehensive episodic memory, a novel state evaluation, and a multi-step state analysis. We evaluate our approach to the MuJoCo and Atari tasks in OpenAI gym domains. The experimental results indicate that NECSA achieves higher sample efficiency than the state-of-the-art episodic control-based approaches. Our data and code are available at the project website\footnote{\url{https://sites.google.com/view/drl-necsa}}.  ( 2 min )
    Cellular Network Capacity and Coverage Enhancement with MDT Data and Deep Reinforcement Learning. (arXiv:2202.10968v1 [cs.NI] CROSS LISTED)
    Recent years witnessed a remarkable increase in the availability of data and computing resources in communication networks. This contributed to the rise of data-driven over model-driven algorithms for network automation. This paper investigates a Minimization of Drive Tests (MDT)-driven Deep Reinforcement Learning (DRL) algorithm to optimize coverage and capacity by tuning antennas tilts on a cluster of cells from TIM's cellular network. We jointly utilize MDT data, electromagnetic simulations, and network Key Performance indicators (KPIs) to define a simulated network environment for the training of a Deep Q-Network (DQN) agent. Some tweaks have been introduced to the classical DQN formulation to improve the agent's sample efficiency, stability, and performance. In particular, a custom exploration policy is designed to introduce soft constraints at training time. Results show that the proposed algorithm outperforms baseline approaches like DQN and best-fist search in terms of long-term reward and sample efficiency. Our results indicate that MDT-driven approaches constitute a valuable tool for autonomous coverage and capacity optimization of mobile radio networks.
    Fine-tuning Neural-Operator architectures for training and generalization. (arXiv:2301.11509v1 [cs.LG])
    In this work, we present an analysis of the generalization of Neural Operators (NOs) and derived architectures. We proposed a family of networks, which we name (${\textit{s}}{\text{NO}}+\varepsilon$), where we modify the layout of NOs towards an architecture resembling a Transformer; mainly, we substitute the Attention module with the Integral Operator part of NOs. The resulting network preserves universality, has a better generalization to unseen data, and similar number of parameters as NOs. On the one hand, we study numerically the generalization by gradually transforming NOs into ${\textit{s}}{\text{NO}}+\varepsilon$ and verifying a reduction of the test loss considering a time-harmonic wave dataset with different frequencies. We perform the following changes in NOs: (a) we split the Integral Operator (non-local) and the (local) feed-forward network (MLP) into different layers, generating a {\it sequential} structure which we call sequential Neural Operator (${\textit{s}}{\text{NO}}$), (b) we add the skip connection, and layer normalization in ${\textit{s}}{\text{NO}}$, and (c) we incorporate dropout and stochastic depth that allows us to generate deep networks. In each case, we observe a decrease in the test loss in a wide variety of initialization, indicating that our changes outperform the NO. On the other hand, building on infinite-dimensional Statistics, and in particular the Dudley Theorem, we provide bounds of the Rademacher complexity of NOs and ${\textit{s}}{\text{NO}}$, and we find the following relationship: the upper bound of the Rademacher complexity of the ${\textit{s}}{\text{NO}}$ is a lower-bound of the NOs, thereby, the generalization error bound of ${\textit{s}}{\text{NO}}$ is smaller than NO, which further strengthens our numerical results.
    Distributionally Robust Offline Reinforcement Learning with Linear Function Approximation. (arXiv:2209.06620v3 [cs.LG] UPDATED)
    Among the reasons hindering reinforcement learning (RL) applications to real-world problems, two factors are critical: limited data and the mismatch between the testing environment (real environment in which the policy is deployed) and the training environment (e.g., a simulator). This paper attempts to address these issues simultaneously with distributionally robust offline RL, where we learn a distributionally robust policy using historical data obtained from the source environment by optimizing against a worst-case perturbation thereof. In particular, we move beyond tabular settings and consider linear function approximation. More specifically, we consider two settings, one where the dataset is well-explored and the other where the dataset has sufficient coverage of the optimal policy. We propose two algorithms~-- one for each of the two settings~-- that achieve error bounds $\tilde{O}(d^{1/2}/N^{1/2})$ and $\tilde{O}(d^{3/2}/N^{1/2})$ respectively, where $d$ is the dimension in the linear function approximation and $N$ is the number of trajectories in the dataset. To the best of our knowledge, they provide the first non-asymptotic results of the sample complexity in this setting. Diverse experiments are conducted to demonstrate our theoretical findings, showing the superiority of our algorithm against the non-robust one.
    Algorithmic Stability of Heavy-Tailed SGD with General Loss Functions. (arXiv:2301.11885v1 [stat.ML])
    Heavy-tail phenomena in stochastic gradient descent (SGD) have been reported in several empirical studies. Experimental evidence in previous works suggests a strong interplay between the heaviness of the tails and generalization behavior of SGD. To address this empirical phenomena theoretically, several works have made strong topological and statistical assumptions to link the generalization error to heavy tails. Very recently, new generalization bounds have been proven, indicating a non-monotonic relationship between the generalization error and heavy tails, which is more pertinent to the reported empirical observations. While these bounds do not require additional topological assumptions given that SGD can be modeled using a heavy-tailed stochastic differential equation (SDE), they can only apply to simple quadratic problems. In this paper, we build on this line of research and develop generalization bounds for a more general class of objective functions, which includes non-convex functions as well. Our approach is based on developing Wasserstein stability bounds for heavy-tailed SDEs and their discretizations, which we then convert to generalization bounds. Our results do not require any nontrivial assumptions; yet, they shed more light to the empirical observations, thanks to the generality of the loss functions.
    ExplainableFold: Understanding AlphaFold Prediction with Explainable AI. (arXiv:2301.11765v1 [cs.AI])
    This paper presents ExplainableFold, an explainable AI framework for protein structure prediction. Despite the success of AI-based methods such as AlphaFold in this field, the underlying reasons for their predictions remain unclear due to the black-box nature of deep learning models. To address this, we propose a counterfactual learning framework inspired by biological principles to generate counterfactual explanations for protein structure prediction, enabling a dry-lab experimentation approach. Our experimental results demonstrate the ability of ExplainableFold to generate high-quality explanations for AlphaFold's predictions, providing near-experimental understanding of the effects of amino acids on 3D protein structure. This framework has the potential to facilitate a deeper understanding of protein structures.
    Gene Teams are on the Field: Evaluation of Variants in Gene-Networks Using High Dimensional Modelling. (arXiv:2301.11763v1 [cs.LG])
    In medical genetics, each genetic variant is evaluated as an independent entity regarding its clinical importance. However, in most complex diseases, variant combinations in specific gene networks, rather than the presence of a particular single variant, predominates. In the case of complex diseases, disease status can be evaluated by considering the success level of a team of specific variants. We propose a high dimensional modelling based method to analyse all the variants in a gene network together. To evaluate our method, we selected two gene networks, mTOR and TGF-Beta. For each pathway, we generated 400 control and 400 patient group samples. mTOR and TGF-? pathways contain 31 and 93 genes of varying sizes, respectively. We produced Chaos Game Representation images for each gene sequence to obtain 2-D binary patterns. These patterns were arranged in succession, and a 3-D tensor structure was achieved for each gene network. Features for each data sample were acquired by exploiting Enhanced Multivariance Products Representation to 3-D data. Features were split as training and testing vectors. Training vectors were employed to train a Support Vector Machines classification model. We achieved more than 96% and 99% classification accuracies for mTOR and TGF-Beta networks, respectively, using a limited amount of training samples.
    Invariant Meta Learning for Out-of-Distribution Generalization. (arXiv:2301.11779v1 [cs.LG])
    Modern deep learning techniques have illustrated their excellent capabilities in many areas, but relies on large training data. Optimization-based meta-learning train a model on a variety tasks, such that it can solve new learning tasks using only a small number of training samples.However, these methods assumes that training and test dataare identically and independently distributed. To overcome such limitation, in this paper, we propose invariant meta learning for out-of-distribution tasks. Specifically, invariant meta learning find invariant optimal meta-initialization,and fast adapt to out-of-distribution tasks with regularization penalty. Extensive experiments demonstrate the effectiveness of our proposed invariant meta learning on out-of-distribution few-shot tasks.
    Detecting Pump&Dump Stock Market Manipulation from Online Forums. (arXiv:2301.11403v1 [cs.SI])
    The intersection of social media, low-cost trading platforms, and naive investors has created an ideal situation for information-based market manipulations, especially pump&dumps. Manipulators accumulate small-cap stocks, disseminate false information on social media to inflate their price, and sell at the peak. We collect a dataset of stocks whose price and volume profiles have the characteristic shape of a pump&dump, and social media posts for those same stocks that match the timing of the initial price rises. From these we build predictive models for pump&dump events based on the language used in the social media posts. There are multiple difficulties: not every post will cause the intended market reaction, some pump&dump events may be triggered by posts in other forums, and there may be accidental confluences of post timing and market movements. Nevertheless, our best model achieves a prediction accuracy of 85% and an F1-score of 62%. Such a tool can provide early warning to investors and regulators that a pump&dump may be underway.
    Deep Residual Compensation Convolutional Network without Backpropagation. (arXiv:2301.11663v1 [cs.CV])
    PCANet and its variants provided good accuracy results for classification tasks. However, despite the importance of network depth in achieving good classification accuracy, these networks were trained with a maximum of nine layers. In this paper, we introduce a residual compensation convolutional network, which is the first PCANet-like network trained with hundreds of layers while improving classification accuracy. The design of the proposed network consists of several convolutional layers, each followed by post-processing steps and a classifier. To correct the classification errors and significantly increase the network's depth, we train each layer with new labels derived from the residual information of all its preceding layers. This learning mechanism is accomplished by traversing the network's layers in a single forward pass without backpropagation or gradient computations. Our experiments on four distinct classification benchmarks (MNIST, CIFAR-10, CIFAR-100, and TinyImageNet) show that our deep network outperforms all existing PCANet-like networks and is competitive with several traditional gradient-based models.  ( 2 min )
    Convolutional neural networks for valid and efficient causal inference. (arXiv:2301.11732v1 [stat.ML])
    Convolutional neural networks (CNN) have been successful in machine learning applications. Their success relies on their ability to consider space invariant local features. We consider the use of CNN to fit nuisance models in semiparametric estimation of the average causal effect of a treatment. In this setting, nuisance models are functions of pre-treatment covariates that need to be controlled for. In an application where we want to estimate the effect of early retirement on a health outcome, we propose to use CNN to control for time-structured covariates. Thus, CNN is used when fitting nuisance models explaining the treatment and the outcome. These fits are then combined into an augmented inverse probability weighting estimator yielding efficient and uniformly valid inference. Theoretically, we contribute by providing rates of convergence for CNN equipped with the rectified linear unit activation function and compare it to an existing result for feedforward neural networks. We also show when those rates guarantee uniformly valid inference. A Monte Carlo study is provided where the performance of the proposed estimator is evaluated and compared with other strategies. Finally, we give results on a study of the effect of early retirement on hospitalization using data covering the whole Swedish population.  ( 2 min )
    Learning to Unlearn: Instance-wise Unlearning for Pre-trained Classifiers. (arXiv:2301.11578v1 [cs.LG])
    Since the recent advent of regulations for data protection (e.g., the General Data Protection Regulation), there has been increasing demand in deleting information learned from sensitive data in pre-trained models without retraining from scratch. The inherent vulnerability of neural networks towards adversarial attacks and unfairness also calls for a robust method to remove or correct information in an instance-wise fashion, while retaining the predictive performance across remaining data. To this end, we define instance-wise unlearning, of which the goal is to delete information on a set of instances from a pre-trained model, by either misclassifying each instance away from its original prediction or relabeling the instance to a different label. We also propose two methods that reduce forgetting on the remaining data: 1) utilizing adversarial examples to overcome forgetting at the representation-level and 2) leveraging weight importance metrics to pinpoint network parameters guilty of propagating unwanted information. Both methods only require the pre-trained model and data instances to forget, allowing painless application to real-life settings where the entire training set is unavailable. Through extensive experimentation on various image classification benchmarks, we show that our approach effectively preserves knowledge of remaining data while unlearning given instances in both single-task and continual unlearning scenarios.  ( 2 min )
    CAPoW: Context-Aware AI-Assisted Proof of Work based DDoS Defense. (arXiv:2301.11767v1 [cs.CR])
    Critical servers can be secured against distributed denial of service (DDoS) attacks using proof of work (PoW) systems assisted by an Artificial Intelligence (AI) that learns contextual network request patterns. In this work, we introduce CAPoW, a context-aware anti-DDoS framework that injects latency adaptively during communication by utilizing context-aware PoW puzzles. In CAPoW, a security professional can define relevant request context attributes which can be learned by the AI system. These contextual attributes can include information about the user request, such as IP address, time, flow-level information, etc., and are utilized to generate a contextual score for incoming requests that influence the hardness of a PoW puzzle. These puzzles need to be solved by a user before the server begins to process their request. Solving puzzles slow down the volume of incoming adversarial requests. Additionally, the framework compels the adversary to incur a cost per request, hence making it expensive for an adversary to prolong a DDoS attack. We include the theoretical foundations of the CAPoW framework along with a description of its implementation and evaluation.
    A denoting diffusion model for fluid flow prediction. (arXiv:2301.11661v1 [cs.LG])
    We propose a novel denoising diffusion generative model for predicting nonlinear fluid fields named FluidDiff. By performing a diffusion process, the model is able to learn a complex representation of the high-dimensional dynamic system, and then Langevin sampling is used to generate predictions for the flow state under specified initial conditions. The model is trained with finite, discrete fluid simulation data. We demonstrate that our model has the capacity to model the distribution of simulated training data and that it gives accurate predictions on the test data. Without encoded prior knowledge of the underlying physical system, it shares competitive performance with other deep learning models for fluid prediction, which is promising for investigation on new computational fluid dynamics methods.  ( 2 min )
    TransNet: Transferable Neural Networks for Partial Differential Equations. (arXiv:2301.11701v1 [math.NA])
    Transfer learning for partial differential equations (PDEs) is to develop a pre-trained neural network that can be used to solve a wide class of PDEs. Existing transfer learning approaches require much information of the target PDEs such as its formulation and/or data of its solution for pre-training. In this work, we propose to construct transferable neural feature spaces from purely function approximation perspectives without using PDE information. The construction of the feature space involves re-parameterization of the hidden neurons and uses auxiliary functions to tune the resulting feature space. Theoretical analysis shows the high quality of the produced feature space, i.e., uniformly distributed neurons. Extensive numerical experiments verify the outstanding performance of our method, including significantly improved transferability, e.g., using the same feature space for various PDEs with different domains and boundary conditions, and the superior accuracy, e.g., several orders of magnitude smaller mean squared error than the state of the art methods.  ( 2 min )
    Mixed Attention Network for Hyperspectral Image Denoising. (arXiv:2301.11525v1 [cs.CV])
    Hyperspectral image denoising is unique for the highly similar and correlated spectral information that should be properly considered. However, existing methods show limitations in exploring the spectral correlations across different bands and feature interactions within each band. Besides, the low- and high-level features usually exhibit different importance for different spatial-spectral regions, which is not fully explored for current algorithms as well. In this paper, we present a Mixed Attention Network (MAN) that simultaneously considers the inter- and intra-spectral correlations as well as the interactions between low- and high-level spatial-spectral meaningful features. Specifically, we introduce a multi-head recurrent spectral attention that efficiently integrates the inter-spectral features across all the spectral bands. These features are further enhanced with a progressive spectral channel attention by exploring the intra-spectral relationships. Moreover, we propose an attentive skip-connection that adaptively controls the proportion of the low- and high-level spatial-spectral features from the encoder and decoder to better enhance the aggregated features. Extensive experiments show that our MAN outperforms existing state-of-the-art methods on simulated and real noise settings while maintaining a low cost of parameters and running time.
    FedHP: Heterogeneous Federated Learning with Privacy-preserving. (arXiv:2301.11705v1 [cs.LG])
    Federated Learning is a distributed machine learning environment, which ensures that clients complete collaborative training without sharing private data, only by exchanging parameters. However, the data does not satisfy the same distribution and the computing resources of clients are different, which brings challenges to the related research. To better solve the above heterogeneous problems, we designed a novel federated learning method. The local model consists of the pre-trained model as the backbone and fully connected layers as the head. The backbone can extract features for the head, and the embedding vector of classes is shared between clients to optimize the head so that the local model can perform better. By sharing the embedding vector of classes, instead of parameters based on gradient space, clients can better adapt to private data, and it is more efficient in the communication between the server and clients. To better protect privacy, we proposed a privacy-preserving hybrid method to add noise to the embedding vector of classes, which has less impact on the local model performance under the premise of satisfying differential privacy. We conduct a comprehensive evaluation with other federated learning methods on the self-built vehicle dataset under non-independent identically distributed(Non-IID)  ( 2 min )
    Variance, Self-Consistency, and Arbitrariness in Fair Classification. (arXiv:2301.11562v1 [cs.LG])
    In fair classification, it is common to train a model, and to compare and correct subgroup-specific error rates for disparities. However, even if a model's classification decisions satisfy a fairness metric, it is not necessarily the case that these decisions are equally confident. This becomes clear if we measure variance: We can fix everything in the learning process except the subset of training data, train multiple models, measure (dis)agreement in predictions for each test example, and interpret disagreement to mean that the learning process is more unstable with respect to its classification decision. Empirically, some decisions can in fact be so unstable that they are effectively arbitrary. To reduce this arbitrariness, we formalize a notion of self-consistency of a learning process, develop an ensembling algorithm that provably increases self-consistency, and empirically demonstrate its utility to often improve both fairness and accuracy. Further, our evaluation reveals a startling observation: Applying ensembling to common fair classification benchmarks can significantly reduce subgroup error rate disparities, without employing common pre-, in-, or post-processing fairness interventions. Taken together, our results indicate that variance, particularly on small datasets, can muddle the reliability of conclusions about fairness. One solution is to develop larger benchmark tasks. To this end, we release a toolkit that makes the Home Mortgage Disclosure Act datasets easily usable for future research.  ( 2 min )
    Incorporating Knowledge into Document Summarization: an Application of Prefix-Tuning on GPT-2. (arXiv:2301.11719v1 [cs.CL])
    Despite the great development of document summarization techniques nowadays, factual inconsistencies between the generated summaries and the original text still occur from time to time. This paper proposes a prefix-tuning-based approach that uses a set of trainable continuous prefix prompt together with discrete prompts to aid model generation, which makes a significant impact on both CNN/Daily Mail and XSum summaries generated using GPT-2. The improvements on fact preservation in the generated summaries indicates the effectiveness of adopting this prefix-tuning-based method in knowledge-enhanced document summarization, and also shows a great potential on other natural language processing tasks.  ( 2 min )
    Behaviour Discriminator: A Simple Data Filtering Method to Improve Offline Policy Learning. (arXiv:2301.11734v1 [cs.LG])
    This paper studies the problem of learning a control policy without the need for interactions with the environment; instead, learning purely from an existing dataset. Prior work has demonstrated that offline learning algorithms (e.g., behavioural cloning and offline reinforcement learning) are more likely to discover a satisfactory policy when trained using high-quality expert data. However, many real-world/practical datasets can contain significant proportions of examples generated using low-skilled agents. Therefore, we propose a behaviour discriminator (BD) concept, a novel and simple data filtering approach based on semi-supervised learning, which can accurately discern expert data from a mixed-quality dataset. Our BD approach was used to pre-process the mixed-skill-level datasets from the Real Robot Challenge (RRC) III, an open competition requiring participants to solve several dexterous robotic manipulation tasks using offline learning methods; the new BD method allowed a standard behavioural cloning algorithm to outperform other more sophisticated offline learning algorithms. Moreover, we demonstrate that the new BD pre-processing method can be applied to a number of D4RL benchmark problems, improving the performance of multiple state-of-the-art offline reinforcement learning algorithms.  ( 2 min )
    Large-Scale Traffic Data Imputation with Spatiotemporal Semantic Understanding. (arXiv:2301.11691v1 [cs.LG])
    Large-scale data missing is a challenging problem in Intelligent Transportation Systems (ITS). Many studies have been carried out to impute large-scale traffic data by considering their spatiotemporal correlations at a network level. In existing traffic data imputations, however, rich semantic information of a road network has been largely ignored when capturing network-wide spatiotemporal correlations. This study proposes a Graph Transformer for Traffic Data Imputation (GT-TDI) model to impute large-scale traffic data with spatiotemporal semantic understanding of a road network. Specifically, the proposed model introduces semantic descriptions consisting of network-wide spatial and temporal information of traffic data to help the GT-TDI model capture spatiotemporal correlations at a network level. The proposed model takes incomplete data, the social connectivity of sensors, and semantic descriptions as input to perform imputation tasks with the help of Graph Neural Networks (GNN) and Transformer. On the PeMS freeway dataset, extensive experiments are conducted to compare the proposed GT-TDI model with conventional methods, tensor factorization methods, and deep learning-based methods. The results show that the proposed GT-TDI outperforms existing methods in complex missing patterns and diverse missing rates. The code of the GT-TDI model will be available at https://github.com/KP-Zhang/GT-TDI.  ( 2 min )
    A Green(er) World for A.I. (arXiv:2301.11581v1 [cs.AI])
    As research and practice in artificial intelligence (A.I.) grow in leaps and bounds, the resources necessary to sustain and support their operations also grow at an increasing pace. While innovations and applications from A.I. have brought significant advances, from applications to vision and natural language to improvements to fields like medical imaging and materials engineering, their costs should not be neglected. As we embrace a world with ever-increasing amounts of data as well as research and development of A.I. applications, we are sure to face an ever-mounting energy footprint to sustain these computational budgets, data storage needs, and more. But, is this sustainable and, more importantly, what kind of setting is best positioned to nurture such sustainable A.I. in both research and practice? In this paper, we outline our outlook for Green A.I. -- a more sustainable, energy-efficient and energy-aware ecosystem for developing A.I. across the research, computing, and practitioner communities alike -- and the steps required to arrive there. We present a bird's eye view of various areas for potential changes and improvements from the ground floor of AI's operational and hardware optimizations for datacenters/HPCs to the current incentive structures in the world of A.I. research and practice, and more. We hope these points will spur further discussion, and action, on some of these issues and their potential solutions.  ( 2 min )
    Solving Constrained Reinforcement Learning through Augmented State and Reward Penalties. (arXiv:2301.11592v1 [cs.LG])
    Constrained Reinforcement Learning has been employed to enforce safety constraints on policy through the use of expected cost constraints. The key challenge is in handling expected cost accumulated using the policy and not just in a single step. Existing methods have developed innovative ways of converting this cost constraint over entire policy to constraints over local decisions (at each time step). While such approaches have provided good solutions with regards to objective, they can either be overly aggressive or conservative with respect to costs. This is owing to use of estimates for "future" or "backward" costs in local cost constraints. To that end, we provide an equivalent unconstrained formulation to constrained RL that has an augmented state space and reward penalties. This intuitive formulation is general and has interesting theoretical properties. More importantly, this provides a new paradigm for solving constrained RL problems effectively. As we show in our experimental results, we are able to outperform leading approaches on multiple benchmark problems from literature.
    Learning to Generate All Feasible Actions. (arXiv:2301.11461v1 [cs.LG])
    Several machine learning (ML) applications are characterized by searching for an optimal solution to a complex task. The search space for this optimal solution is often very large, so large in fact that this optimal solution is often not computable. Part of the problem is that many candidate solutions found via ML are actually infeasible and have to be discarded. Restricting the search space to only the feasible solution candidates simplifies finding an optimal solution for the tasks. Further, the set of feasible solutions could be re-used in multiple problems characterized by different tasks. In particular, we observe that complex tasks can be decomposed into subtasks and corresponding skills. We propose to learn a reusable and transferable skill by training an actor to generate all feasible actions. The trained actor can then propose feasible actions, among which an optimal one can be chosen according to a specific task. The actor is trained by interpreting the feasibility of each action as a target distribution. The training procedure minimizes a divergence of the actor's output distribution to this target. We derive the general optimization target for arbitrary f-divergences using a combination of kernel density estimates, resampling, and importance sampling. We further utilize an auxiliary critic to reduce the interactions with the environment. A preliminary comparison to related strategies shows that our approach learns to visit all the modes in the feasible action space, demonstrating the framework's potential for learning skills that can be used in various downstream tasks.
    Down the Rabbit Hole: Detecting Online Extremism, Radicalisation, and Politicised Hate Speech. (arXiv:2301.11579v1 [cs.SI])
    Social media is a modern person's digital voice to project and engage with new ideas and mobilise communities $\unicode{x2013}$ a power shared with extremists. Given the societal risks of unvetted content-moderating algorithms for Extremism, Radicalisation, and Hate speech (ERH) detection, responsible software engineering must understand the who, what, when, where, and why such models are necessary to protect user safety and free expression. Hence, we propose and examine the unique research field of ERH context mining to unify disjoint studies. Specifically, we evaluate the start-to-finish design process from socio-technical definition-building and dataset collection strategies to technical algorithm design and performance. Our 2015-2021 51-study Systematic Literature Review (SLR) provides the first cross-examination of textual, network, and visual approaches to detecting extremist affiliation, hateful content, and radicalisation towards groups and movements. We identify consensus-driven ERH definitions and propose solutions to existing ideological and geographic biases, particularly due to the lack of research in Oceania/Australasia. Our hybridised investigation on Natural Language Processing, Community Detection, and visual-text models demonstrates the dominating performance of textual transformer-based algorithms. We conclude with vital recommendations for ERH context mining researchers and propose an uptake roadmap with guidelines for researchers, industries, and governments to enable a safer cyberspace.  ( 2 min )
    Adversarial Learning for Implicit Semantic-Aware Communications. (arXiv:2301.11589v1 [cs.LG])
    Semantic communication is a novel communication paradigm that focuses on recognizing and delivering the desired meaning of messages to the destination users. Most existing works in this area focus on delivering explicit semantics, labels or signal features that can be directly identified from the source signals. In this paper, we consider the implicit semantic communication problem in which hidden relations and closely related semantic terms that cannot be recognized from the source signals need to also be delivered to the destination user. We develop a novel adversarial learning-based implicit semantic-aware communication (iSAC) architecture in which the source user, instead of maximizing the total amount of information transmitted to the channel, aims to help the recipient learn an inference rule that can automatically generate implicit semantics based on limited clue information. We prove that by applying iSAC, the destination user can always learn an inference rule that matches the true inference rule of the source messages. Experimental results show that the proposed iSAC can offer up to a 19.69 dB improvement over existing non-inferential communication solutions, in terms of symbol error rate at the destination user.  ( 2 min )
    Can We Use Probing to Better Understand Fine-tuning and Knowledge Distillation of the BERT NLU?. (arXiv:2301.11688v1 [cs.CL])
    In this article, we use probing to investigate phenomena that occur during fine-tuning and knowledge distillation of a BERT-based natural language understanding (NLU) model. Our ultimate purpose was to use probing to better understand practical production problems and consequently to build better NLU models. We designed experiments to see how fine-tuning changes the linguistic capabilities of BERT, what the optimal size of the fine-tuning dataset is, and what amount of information is contained in a distilled NLU based on a tiny Transformer. The results of the experiments show that the probing paradigm in its current form is not well suited to answer such questions. Structural, Edge and Conditional probes do not take into account how easy it is to decode probed information. Consequently, we conclude that quantification of information decodability is critical for many practical applications of the probing paradigm.  ( 2 min )
    Online Learning in Stackelberg Games with an Omniscient Follower. (arXiv:2301.11518v1 [cs.LG])
    We study the problem of online learning in a two-player decentralized cooperative Stackelberg game. In each round, the leader first takes an action, followed by the follower who takes their action after observing the leader's move. The goal of the leader is to learn to minimize the cumulative regret based on the history of interactions. Differing from the traditional formulation of repeated Stackelberg games, we assume the follower is omniscient, with full knowledge of the true reward, and that they always best-respond to the leader's actions. We analyze the sample complexity of regret minimization in this repeated Stackelberg game. We show that depending on the reward structure, the existence of the omniscient follower may change the sample complexity drastically, from constant to exponential, even for linear cooperative Stackelberg games. This poses unique challenges for the learning process of the leader and the subsequent regret analysis.
    Neural Wasserstein Gradient Flows for Maximum Mean Discrepancies with Riesz Kernels. (arXiv:2301.11624v1 [cs.LG])
    Wasserstein gradient flows of maximum mean discrepancy (MMD) functionals with non-smooth Riesz kernels show a rich structure as singular measures can become absolutely continuous ones and conversely. In this paper we contribute to the understanding of such flows. We propose to approximate the backward scheme of Jordan, Kinderlehrer and Otto for computing such Wasserstein gradient flows as well as a forward scheme for so-called Wasserstein steepest descent flows by neural networks (NNs). Since we cannot restrict ourselves to absolutely continuous measures, we have to deal with transport plans and velocity plans instead of usual transport maps and velocity fields. Indeed, we approximate the disintegration of both plans by generative NNs which are learned with respect to appropriate loss functions. In order to evaluate the quality of both neural schemes, we benchmark them on the interaction energy. Here we provide analytic formulas for Wasserstein schemes starting at a Dirac measure and show their convergence as the time step size tends to zero. Finally, we illustrate our neural MMD flows by numerical examples.
    SNeRL: Semantic-aware Neural Radiance Fields for Reinforcement Learning. (arXiv:2301.11520v1 [cs.LG])
    As previous representations for reinforcement learning cannot effectively incorporate a human-intuitive understanding of the 3D environment, they usually suffer from sub-optimal performances. In this paper, we present Semantic-aware Neural Radiance Fields for Reinforcement Learning (SNeRL), which jointly optimizes semantic-aware neural radiance fields (NeRF) with a convolutional encoder to learn 3D-aware neural implicit representation from multi-view images. We introduce 3D semantic and distilled feature fields in parallel to the RGB radiance fields in NeRF to learn semantic and object-centric representation for reinforcement learning. SNeRL outperforms not only previous pixel-based representations but also recent 3D-aware representations both in model-free and model-based reinforcement learning.  ( 2 min )
    PLay: Parametrically Conditioned Layout Generation using Latent Diffusion. (arXiv:2301.11529v1 [cs.LG])
    Layout design is an important task in various design fields, including user interfaces, document, and graphic design. As this task requires tedious manual effort by designers, prior works have attempted to automate this process using generative models, but commonly fell short of providing intuitive user controls and achieving design objectives. In this paper, we build a conditional latent diffusion model, PLay, that generates parametrically conditioned layouts in vector graphic space from user-specified guidelines, which are commonly used by designers for representing their design intents in current practices. Our method outperforms prior works across three datasets on metrics including FID and FD-VG, and in user test. Moreover, it brings a novel and interactive experience to professional layout design processes.  ( 2 min )
    Improving deep learning precipitation nowcasting by using prior knowledge. (arXiv:2301.11707v1 [cs.LG])
    Deep learning methods dominate short-term high-resolution precipitation nowcasting in terms of prediction error. However, their operational usability is limited by difficulties explaining dynamics behind the predictions, which are smoothed out and missing the high-frequency features due to optimizing for mean error loss functions. We experiment with hand-engineering of the advection-diffusion differential equation into a PhyCell to introduce more accurate physical prior to a PhyDNet model that disentangles physical and residual dynamics. Results indicate that while PhyCell can learn the intended dynamics, training of PhyDNet remains driven by loss optimization, resulting in a model with the same prediction capabilities.
    Class-Incremental Learning with Repetition. (arXiv:2301.11396v1 [cs.LG])
    Real-world data streams naturally include the repetition of previous concepts. From a Continual Learning (CL) perspective, repetition is a property of the environment and, unlike replay, cannot be controlled by the user. Nowadays, Class-Incremental scenarios represent the leading test-bed for assessing and comparing CL strategies. This family of scenarios is very easy to use, but it never allows revisiting previously seen classes, thus completely disregarding the role of repetition. We focus on the family of Class-Incremental with Repetition (CIR) scenarios, where repetition is embedded in the definition of the stream. We propose two stochastic scenario generators that produce a wide range of CIR scenarios starting from a single dataset and a few control parameters. We conduct the first comprehensive evaluation of repetition in CL by studying the behavior of existing CL strategies under different CIR scenarios. We then present a novel replay strategy that exploits repetition and counteracts the natural imbalance present in the stream. On both CIFAR100 and TinyImageNet, our strategy outperforms other replay approaches, which are not designed for environments with repetition.  ( 2 min )
    PhysGraph: Physics-Based Integration Using Graph Neural Networks. (arXiv:2301.11841v1 [cs.GR])
    Physics-based simulation of mesh based domains remains a challenging task. State-of-the-art techniques can produce realistic results but require expert knowledge. A major bottleneck in many approaches is the step of integrating a potential energy in order to compute velocities or displacements. Recently, learning based method for physics-based simulation have sparked interest with graph based approaches being a promising research direction. One of the challenges for these methods is to generate models that are mesh independent and generalize to different material properties. Moreover, the model should also be able to react to unforeseen external forces like ubiquitous collisions. Our contribution is based on a simple observation: evaluating forces is computationally relatively cheap for traditional simulation methods and can be computed in parallel in contrast to their integration. If we learn how a system reacts to forces in general, irrespective of their origin, we can learn an integrator that can predict state changes due to the total forces with high generalization power. We effectively factor out the physical model behind resulting forces by relying on an opaque force module. We demonstrate that this idea leads to a learnable module that can be trained on basic internal forces of small mesh patches and generalizes to different mesh typologies, resolutions, material parameters and unseen forces like collisions at inference time. Our proposed paradigm is general and can be used to model a variety of physical phenomena. We focus our exposition on the detail enhancement of coarse clothing geometry which has many applications including computer games, virtual reality and virtual try-on.  ( 2 min )
    BOMP-NAS: Bayesian Optimization Mixed Precision NAS. (arXiv:2301.11810v1 [cs.LG])
    Bayesian Optimization Mixed-Precision Neural Architecture Search (BOMP-NAS) is an approach to quantization-aware neural architecture search (QA-NAS) that leverages both Bayesian optimization (BO) and mixed-precision quantization (MP) to efficiently search for compact, high performance deep neural networks. The results show that integrating quantization-aware fine-tuning (QAFT) into the NAS loop is a necessary step to find networks that perform well under low-precision quantization: integrating it allows a model size reduction of nearly 50\% on the CIFAR-10 dataset. BOMP-NAS is able to find neural networks that achieve state of the art performance at much lower design costs. This study shows that BOMP-NAS can find these neural networks at a 6x shorter search time compared to the closest related work.  ( 2 min )
    Feature space exploration as an alternative for design space exploration beyond the parametric space. (arXiv:2301.11416v1 [cs.LG])
    This paper compares the parametric design space with a feature space generated by the extraction of design features using deep learning (DL) as an alternative way for design space exploration. In this comparison, the parametric design space is constructed by creating a synthetic dataset of 15.000 elements using a parametric algorithm and reducing its dimensions for visualization. The feature space - reduced-dimensionality vector space of embedded data features - is constructed by training a DL model on the same dataset. We analyze and compare the extracted design features by reducing their dimension and visualizing the results. We demonstrate that parametric design space is narrow in how it describes the design solutions because it is based on the combination of individual parameters. In comparison, we observed that the feature design space can intuitively represent design solutions according to complex parameter relationships. Based on our results, we discuss the potential of translating the features learned by DL models to provide a mechanism for intuitive design exploration space and visualization of possible design solutions.
    Learning the Dynamics of Sparsely Observed Interacting Systems. (arXiv:2301.11647v1 [stat.ML])
    We address the problem of learning the dynamics of an unknown non-parametric system linking a target and a feature time series. The feature time series is measured on a sparse and irregular grid, while we have access to only a few points of the target time series. Once learned, we can use these dynamics to predict values of the target from the previous values of the feature time series. We frame this task as learning the solution map of a controlled differential equation (CDE). By leveraging the rich theory of signatures, we are able to cast this non-linear problem as a high-dimensional linear regression. We provide an oracle bound on the prediction error which exhibits explicit dependencies on the individual-specific sampling schemes. Our theoretical results are illustrated by simulations which show that our method outperforms existing algorithms for recovering the full time series while being computationally cheap. We conclude by demonstrating its potential on real-world epidemiological data.  ( 2 min )
    DBGDGM: Dynamic Brain Graph Deep Generative Model. (arXiv:2301.11408v1 [cs.LG])
    Graphs are a natural representation of brain activity derived from functional magnetic imaging (fMRI) data. It is well known that clusters of anatomical brain regions, known as functional connectivity networks (FCNs), encode temporal relationships which can serve as useful biomarkers for understanding brain function and dysfunction. Previous works, however, ignore the temporal dynamics of the brain and focus on static graphs. In this paper, we propose a dynamic brain graph deep generative model (DBGDGM) which simultaneously clusters brain regions into temporally evolving communities and learns dynamic unsupervised node embeddings. Specifically, DBGDGM represents brain graph nodes as embeddings sampled from a distribution over communities that evolve over time. We parameterise this community distribution using neural networks that learn from subject and node embeddings as well as past community assignments. Experiments demonstrate DBGDGM outperforms baselines in graph generation, dynamic link prediction, and is comparable for graph classification. Finally, an analysis of the learnt community distributions reveals overlap with known FCNs reported in neuroscience literature.  ( 2 min )
    Nik Defense: An Artificial Intelligence Based Defense Mechanism against Selfish Mining in Bitcoin. (arXiv:2301.11463v1 [cs.CR])
    The Bitcoin cryptocurrency has received much attention recently. In the network of Bitcoin, transactions are recorded in a ledger. In this network, the process of recording transactions depends on some nodes called miners that execute a protocol known as mining protocol. One of the significant aspects of mining protocol is incentive compatibility. However, literature has shown that Bitcoin mining's protocol is not incentive-compatible. Some nodes with high computational power can obtain more revenue than their fair share by adopting a type of attack called the selfish mining attack. In this paper, we propose an artificial intelligence-based defense against selfish mining attacks by applying the theory of learning automata. The proposed defense mechanism ignores private blocks by assigning weight based on block discovery time and changes current Bitcoin's fork resolving policy by evaluating branches' height difference in a self-adaptive manner utilizing learning automata. To the best of our knowledge, the proposed protocol is the literature's first learning-based defense mechanism. Simulation results have shown the superiority of the proposed mechanism against tie-breaking mechanism, which is a well-known defense. The simulation results have shown that the suggested defense mechanism increases the profit threshold up to 40\% and decreases the revenue of selfish attackers.  ( 2 min )
    Limitless stability for Graph Convolutional Networks. (arXiv:2301.11443v1 [cs.LG])
    This work establishes rigorous, novel and widely applicable stability guarantees and transferability bounds for graph convolutional networks -- without reference to any underlying limit object or statistical distribution. Crucially, utilized graph-shift operators (GSOs) are not necessarily assumed to be normal, allowing for the treatment of networks on both directed- and for the first time also undirected graphs. Stability to node-level perturbations is related to an 'adequate (spectral) covering' property of the filters in each layer. Stability to edge-level perturbations is related to Lipschitz constants and newly introduced semi-norms of filters. Results on stability to topological perturbations are obtained through recently developed mathematical-physics based tools. As an important and novel example, it is showcased that graph convolutional networks are stable under graph-coarse-graining procedures (replacing strongly-connected sub-graphs by single nodes) precisely if the GSO is the graph Laplacian and filters are regular at infinity. These new theoretical results are supported by corresponding numerical investigations.  ( 2 min )
    Learning Informative Representation for Fairness-aware Multivariate Time-series Forecasting: A Group-based Perspective. (arXiv:2301.11535v1 [cs.LG])
    Multivariate time series (MTS) forecasting has penetrated and benefited our daily life. However, the unfair forecasting of MTSs not only degrades their practical benefit but even brings about serious potential risk. Such unfair MTS forecasting may be attributed to variable disparity leading to advantaged and disadvantaged variables. This issue has rarely been studied in the existing MTS forecasting models. To address this significant gap, we formulate the MTS fairness modeling problem as learning informative representations attending to both advantaged and disadvantaged variables. Accordingly, we propose a novel framework, named FairFor, for fairness-aware MTS forecasting. FairFor is based on adversarial learning to generate both group-irrelevant and -relevant representations for the downstream forecasting. FairFor first adopts the recurrent graph convolution to capture spatio-temporal variable correlations and to group variables by leveraging a spectral relaxation of the K-means objective. Then, it utilizes a novel filtering & fusion module to filter the group-relevant information and generate group-irrelevant representations by orthogonality regularization. The group-irrelevant and -relevant representations form highly informative representations, facilitating to share the knowledge from advantaged variables to disadvantaged variables and guarantee fairness. Extensive experiments on four public datasets demonstrate the FairFor effectiveness for fair forecasting and significant performance improvement.  ( 2 min )
    Policy Optimization with Robustness Certificates. (arXiv:2301.11374v1 [cs.LG])
    We present a policy optimization framework in which the learned policy comes with a machine-checkable certificate of adversarial robustness. Our approach, called CAROL, learns a model of the environment. In each learning iteration, it uses the current version of this model and an external abstract interpreter to construct a differentiable signal for provable robustness. This signal is used to guide policy learning, and the abstract interpretation used to construct it directly leads to the robustness certificate returned at convergence. We give a theoretical analysis that bounds the worst-case accumulative reward of CAROL. We also experimentally evaluate CAROL on four MuJoCo environments. On these tasks, which involve continuous state and action spaces, CAROL learns certified policies that have performance comparable to the (non-certified) policies learned using state-of-the-art robust RL methods.  ( 2 min )
    Diffusion Denoising for Low-Dose-CT Model. (arXiv:2301.11482v1 [eess.IV])
    Low-dose Computed Tomography (LDCT) reconstruction is an important task in medical image analysis. Recent years have seen many deep learning based methods, proved to be effective in this area. However, these methods mostly follow a supervised architecture, which needs paired CT image of full dose and quarter dose, and the solution is highly dependent on specific measurements. In this work, we introduce Denoising Diffusion LDCT Model, dubbed as DDLM, generating noise-free CT image using conditioned sampling. DDLM uses pretrained model, and need no training nor tuning process, thus our proposal is in unsupervised manner. Experiments on LDCT images have shown comparable performance of DDLM using less inference time, surpassing other state-of-the-art methods, proving both accurate and efficient. Implementation code will be set to public soon.  ( 2 min )
    Rigid body flows for sampling molecular crystal structures. (arXiv:2301.11355v1 [cs.LG])
    Normalizing flows (NF) are a class of powerful generative models that have gained popularity in recent years due to their ability to model complex distributions with high flexibility and expressiveness. In this work, we introduce a new type of normalizing flow that is tailored for modeling positions and orientations of multiple objects in three-dimensional space, such as molecules in a crystal. Our approach is based on two key ideas: first, we define smooth and expressive flows on the group of unit quaternions, which allows us to capture the continuous rotational motion of rigid bodies; second, we use the double cover property of unit quaternions to define a proper density on the rotation group. This ensures that our model can be trained using standard likelihood-based methods or variational inference with respect to a thermodynamic target density. We evaluate the method by training Boltzmann generators for two molecular examples, namely the multi-modal density of a tetrahedral system in an external field and the ice XI phase in the TIP4P-Ew water model. Our flows can be combined with flows operating on the internal degrees of freedom of molecules, and constitute an important step towards the modeling of distributions of many interacting molecules.  ( 2 min )
    Alien Coding. (arXiv:2301.11479v1 [cs.AI])
    We introduce a self-learning algorithm for synthesizing programs for OEIS sequences. The algorithm starts from scratch initially generating programs at random. Then it runs many iterations of a self-learning loop that interleaves (i) training neural machine translation to learn the correspondence between sequences and the programs discovered so far, and (ii) proposing many new programs for each OEIS sequence by the trained neural machine translator. The algorithm discovers on its own programs for more than 78000 OEIS sequences, sometimes developing unusual programming methods. We analyze its behavior and the invented programs in several experiments.  ( 2 min )
    Exploring Deep Reinforcement Learning for Holistic Smart Building Control. (arXiv:2301.11510v1 [cs.LG])
    In this paper, we take a holistic approach to deal with the tradeoffs between energy use and comfort in commercial buildings. We developed a system called OCTOPUS, which employs a novel deep reinforcement learning (DRL) framework that uses a data-driven approach to find the optimal control sequences of all building's subsystems, including HVAC, lighting, blind and window systems. The DRL architecture includes a novel reward function that allows the framework to explore the tradeoffs between energy use and users' comfort, while at the same time enabling the solution of the high-dimensional control problem due to the interactions of four different building subsystems. In order to cope with OCTOPUS's data training requirements, we argue that calibrated simulations that match the target building operational points are the vehicle to generate enough data to be able to train our DRL framework to find the control solution for the target building. In our work, we trained OCTOPUS with 10-year weather data and a building model that is implemented in the EnergyPlus building simulator, which was calibrated using data from a real production building. Through extensive simulations, we demonstrate that OCTOPUS can achieve 14.26% and 8.1% energy savings compared with the state-of-the-art rule-based method in a LEED Gold Certified building and the latest DRL-based method available in the literature respectively, while maintaining human comfort within a desired range.  ( 2 min )
    Machine Learning Approach and Extreme Value Theory to Correlated Stochastic Time Series with Application to Tree Ring Data. (arXiv:2301.11488v1 [stat.ML])
    The main goal of machine learning (ML) is to study and improve mathematical models which can be trained with data provided by the environment to infer the future and to make decisions without necessarily having complete knowledge of all influencing elements. In this work, we describe how ML can be a powerful tool in studying climate modeling. Tree ring growth was used as an implementation in different aspects, for example, studying the history of buildings and environment. By growing and via the time, a new layer of wood to beneath its bark by the tree. After years of growing, time series can be applied via a sequence of tree ring widths. The purpose of this paper is to use ML algorithms and Extreme Value Theory in order to analyse a set of tree ring widths data from nine trees growing in Nottinghamshire. Initially, we start by exploring the data through a variety of descriptive statistical approaches. Transforming data is important at this stage to find out any problem in modelling algorithm. We then use algorithm tuning and ensemble methods to improve the k-nearest neighbors (KNN) algorithm. A comparison between the developed method in this study ad other methods are applied. Also, extreme value of the dataset will be more investigated. The results of the analysis study show that the ML algorithms in the Random Forest method would give accurate results in the analysis of tree ring widths data from nine trees growing in Nottinghamshire with the lowest Root Mean Square Error value. Also, we notice that as the assumed ARMA model parameters increased, the probability of selecting the true model also increased. In terms of the Extreme Value Theory, the Weibull distribution would be a good choice to model tree ring data.  ( 2 min )
    Neural networks learn to magnify areas near decision boundaries. (arXiv:2301.11375v1 [cs.LG])
    We study how training molds the Riemannian geometry induced by neural network feature maps. At infinite width, neural networks with random parameters induce highly symmetric metrics on input space. Feature learning in networks trained to perform classification tasks magnifies local areas along decision boundaries. These changes are consistent with previously proposed geometric approaches for hand-tuning of kernel methods to improve generalization.  ( 2 min )
    Voting from Nearest Tasks: Meta-Vote Pruning of Pre-trained Models for Downstream Tasks. (arXiv:2301.11560v1 [cs.LG])
    As a few large-scale pre-trained models become the major choices of various applications, new challenges arise for model pruning, e.g., can we avoid pruning the same model from scratch for every downstream task? How to reuse the pruning results of previous tasks to accelerate the pruning for a new task? To address these challenges, we create a small model for a new task from the pruned models of similar tasks. We show that a few fine-tuning steps on this model suffice to produce a promising pruned-model for the new task. We study this ''meta-pruning'' from nearest tasks on two major classes of pre-trained models, convolutional neural network (CNN) and vision transformer (ViT), under a limited budget of pruning iterations. Our study begins by investigating the overlap of pruned models for similar tasks and how the overlap changes over different layers and blocks. Inspired by these discoveries, we develop a simple but effective ''Meta-Vote Pruning (MVP)'' method that significantly reduces the pruning iterations for a new task by initializing a sub-network from the pruned models of its nearest tasks. In experiments, we demonstrate MVP's advantages in accuracy, efficiency, and generalization through extensive empirical studies and comparisons with popular pruning methods over several datasets.  ( 2 min )
    Model-based Offline Reinforcement Learning with Local Misspecification. (arXiv:2301.11426v1 [cs.LG])
    We present a model-based offline reinforcement learning policy performance lower bound that explicitly captures dynamics model misspecification and distribution mismatch and we propose an empirical algorithm for optimal offline policy selection. Theoretically, we prove a novel safe policy improvement theorem by establishing pessimism approximations to the value function. Our key insight is to jointly consider selecting over dynamics models and policies: as long as a dynamics model can accurately represent the dynamics of the state-action pairs visited by a given policy, it is possible to approximate the value of that particular policy. We analyze our lower bound in the LQR setting and also show competitive performance to previous lower bounds on policy selection across a set of D4RL tasks.  ( 2 min )
    Direct Parameterization of Lipschitz-Bounded Deep Networks. (arXiv:2301.11526v1 [cs.LG])
    This paper introduces a new parameterization of deep neural networks (both fully-connected and convolutional) with guaranteed Lipschitz bounds, i.e. limited sensitivity to perturbations. The Lipschitz guarantees are equivalent to the tightest-known bounds based on certification via a semidefinite program (SDP), which does not scale to large models. In contrast to the SDP approach, we provide a ``direct'' parameterization, i.e. a smooth mapping from $\mathbb R^N$ onto the set of weights of Lipschitz-bounded networks. This enables training via standard gradient methods, without any computationally intensive projections or barrier terms. The new parameterization can equivalently be thought of as either a new layer type (the \textit{sandwich layer}), or a novel parameterization of standard feedforward networks with parameter sharing between neighbouring layers. We illustrate the method with some applications in image classification (MNIST and CIFAR-10).  ( 2 min )
    Graph Scattering beyond Wavelet Shackles. (arXiv:2301.11456v1 [cs.LG])
    This work develops a flexible and mathematically sound framework for the design and analysis of graph scattering networks with variable branching ratios and generic functional calculus filters. Spectrally-agnostic stability guarantees for node- and graph-level perturbations are derived; the vertex-set non-preserving case is treated by utilizing recently developed mathematical-physics based tools. Energy propagation through the network layers is investigated and related to truncation stability. New methods of graph-level feature aggregation are introduced and stability of the resulting composite scattering architectures is established. Finally, scattering transforms are extended to edge- and higher order tensorial input. Theoretical results are complemented by numerical investigations: Suitably chosen cattering networks conforming to the developed theory perform better than traditional graph-wavelet based scattering approaches in social network graph classification tasks and significantly outperform other graph-based learning approaches to regression of quantum-chemical energies on QM7.  ( 2 min )
    MG-GNN: Multigrid Graph Neural Networks for Learning Multilevel Domain Decomposition Methods. (arXiv:2301.11378v1 [cs.LG])
    Domain decomposition methods (DDMs) are popular solvers for discretized systems of partial differential equations (PDEs), with one-level and multilevel variants. These solvers rely on several algorithmic and mathematical parameters, prescribing overlap, subdomain boundary conditions, and other properties of the DDM. While some work has been done on optimizing these parameters, it has mostly focused on the one-level setting or special cases such as structured-grid discretizations with regular subdomain construction. In this paper, we propose multigrid graph neural networks (MG-GNN), a novel GNN architecture for learning optimized parameters in two-level DDMs\@. We train MG-GNN using a new unsupervised loss function, enabling effective training on small problems that yields robust performance on unstructured grids that are orders of magnitude larger than those in the training set. We show that MG-GNN outperforms popular hierarchical graph network architectures for this optimization and that our proposed loss function is critical to achieving this improved performance.  ( 2 min )
    Rethinking 1x1 Convolutions: Can we train CNNs with Frozen Random Filters?. (arXiv:2301.11360v1 [cs.CV])
    Modern CNNs are learning the weights of vast numbers of convolutional operators. In this paper, we raise the fundamental question if this is actually necessary. We show that even in the extreme case of only randomly initializing and never updating spatial filters, certain CNN architectures can be trained to surpass the accuracy of standard training. By reinterpreting the notion of pointwise ($1\times 1$) convolutions as an operator to learn linear combinations (LC) of frozen (random) spatial filters, we are able to analyze these effects and propose a generic LC convolution block that allows tuning of the linear combination rate. Empirically, we show that this approach not only allows us to reach high test accuracies on CIFAR and ImageNet but also has favorable properties regarding model robustness, generalization, sparsity, and the total number of necessary weights. Additionally, we propose a novel weight sharing mechanism, which allows sharing of a single weight tensor between all spatial convolution layers to massively reduce the number of weights.  ( 2 min )
    Estimating Causal Effects using a Multi-task Deep Ensemble. (arXiv:2301.11351v1 [cs.LG])
    Over the past few decades, a number of methods have been proposed for causal effect estimation, yet few have been demonstrated to be effective in handling data with complex structures, such as images. To fill this gap, we propose a Causal Multi-task Deep Ensemble (CMDE) framework to learn both shared and group-specific information from the study population and prove its equivalence to a multi-task Gaussian process (GP) with coregionalization kernel a priori. Compared to multi-task GP, CMDE efficiently handles high-dimensional and multi-modal covariates and provides pointwise uncertainty estimates of causal effects. We evaluate our method across various types of datasets and tasks and find that CMDE outperforms state-of-the-art methods on a majority of these tasks.  ( 2 min )
    Learning Vortex Dynamics for Fluid Inference and Prediction. (arXiv:2301.11494v1 [cs.LG])
    We propose a novel machine learning method based on differentiable vortex particles to infer and predict fluid dynamics from a single video. The key design of our system is a particle-based latent space to encapsulate the hidden, Lagrangian vortical evolution underpinning the observable, Eulerian flow phenomena. We devise a novel differentiable vortex particle system in conjunction with their learnable, vortex-to-velocity dynamics mapping to effectively capture and represent the complex flow features in a reduced space. We further design an end-to-end training pipeline to directly learn and synthesize simulators from data, that can reliably deliver future video rollouts based on limited observation. The value of our method is twofold: first, our learned simulator enables the inference of hidden physics quantities (e.g. velocity field) purely from visual observation, to be used for motion analysis; secondly, it also supports future prediction, constructing the input video's sequel along with its future dynamics evolution. We demonstrate our method's efficacy by comparing quantitatively and qualitatively with a range of existing methods on both synthetic and real-world videos, displaying improved data correspondence, visual plausibility, and physical integrity.  ( 2 min )
    Revisiting Discriminative Entropy Clustering and its relation to K-means. (arXiv:2301.11405v1 [cs.LG])
    Maximization of mutual information between the model's input and output is formally related to "decisiveness" and "fairness" of the softmax predictions, motivating such unsupervised entropy-based losses for discriminative neural networks. Recent self-labeling methods based on such losses represent the state of the art in deep clustering. However, some important properties of entropy clustering are not well-known, or even misunderstood. For example, we provide a counterexample to prior claims about equivalence to variance clustering (K-means) and point out technical mistakes in such theories. We discuss the fundamental differences between these discriminative and generative clustering approaches. Moreover, we show the susceptibility of standard entropy clustering to narrow margins and motivate an explicit margin maximization term. We also propose an improved self-labeling loss; it is robust to pseudo-labeling errors and enforces stronger fairness. We develop an EM algorithm for our loss that is significantly faster than the standard alternatives. Our results improve the state-of-the-art on standard benchmarks.  ( 2 min )
    Multi-limb Split Learning for Tumor Classification on Vertically Distributed Data. (arXiv:2301.11468v1 [eess.IV])
    Brain tumors are one of the life-threatening forms of cancer. Previous studies have classified brain tumors using deep neural networks. In this paper, we perform the later task using a collaborative deep learning technique, more specifically split learning. Split learning allows collaborative learning via neural networks splitting into two (or more) parts, a client-side network and a server-side network. The client-side is trained to a certain layer called the cut layer. Then, the rest of the training is resumed on the server-side network. Vertical distribution, a method for distributing data among organizations, was implemented where several hospitals hold different attributes of information for the same set of patients. To the best of our knowledge this paper will be the first paper to implement both split learning and vertical distribution for brain tumor classification. Using both techniques, we were able to achieve train and test accuracy greater than 90\% and 70\%, respectively.  ( 2 min )
    Are Equivariant Equilibrium Approximators Beneficial?. (arXiv:2301.11481v1 [cs.GT])
    Recently, remarkable progress has been made by approximating Nash equilibrium (NE), correlated equilibrium (CE), and coarse correlated equilibrium (CCE) through function approximation that trains a neural network to predict equilibria from game representations. Furthermore, equivariant architectures are widely adopted in designing such equilibrium approximators in normal-form games. In this paper, we theoretically characterize benefits and limitations of equivariant equilibrium approximators. For the benefits, we show that they enjoy better generalizability than general ones and can achieve better approximations when the payoff distribution is permutation-invariant. For the limitations, we discuss their drawbacks in terms of equilibrium selection and social welfare. Together, our results help to understand the role of equivariance in equilibrium approximators.  ( 2 min )
    Understanding Incremental Learning of Gradient Descent: A Fine-grained Analysis of Matrix Sensing. (arXiv:2301.11500v1 [cs.LG])
    It is believed that Gradient Descent (GD) induces an implicit bias towards good generalization in training machine learning models. This paper provides a fine-grained analysis of the dynamics of GD for the matrix sensing problem, whose goal is to recover a low-rank ground-truth matrix from near-isotropic linear measurements. It is shown that GD with small initialization behaves similarly to the greedy low-rank learning heuristics (Li et al., 2020) and follows an incremental learning procedure (Gissin et al., 2019): GD sequentially learns solutions with increasing ranks until it recovers the ground truth matrix. Compared to existing works which only analyze the first learning phase for rank-1 solutions, our result provides characterizations for the whole learning process. Moreover, besides the over-parameterized regime that many prior works focused on, our analysis of the incremental learning procedure also applies to the under-parameterized regime. Finally, we conduct numerical experiments to confirm our theoretical findings.  ( 2 min )
    Learning Modulo Theories. (arXiv:2301.11435v1 [cs.LG])
    Recent techniques that integrate \emph{solver layers} into Deep Neural Networks (DNNs) have shown promise in bridging a long-standing gap between inductive learning and symbolic reasoning techniques. In this paper we present a set of techniques for integrating \emph{Satisfiability Modulo Theories} (SMT) solvers into the forward and backward passes of a deep network layer, called SMTLayer. Using this approach, one can encode rich domain knowledge into the network in the form of mathematical formulas. In the forward pass, the solver uses symbols produced by prior layers, along with these formulas, to construct inferences; in the backward pass, the solver informs updates to the network, driving it towards representations that are compatible with the solver's theory. Notably, the solver need not be differentiable. We implement \layername as a Pytorch module, and our empirical results show that it leads to models that \emph{1)} require fewer training samples than conventional models, \emph{2)} that are robust to certain types of covariate shift, and \emph{3)} that ultimately learn representations that are consistent with symbolic knowledge, and thus naturally interpretable.  ( 2 min )
    Causal Structural Learning from Time Series: A Convex Optimization Approach. (arXiv:2301.11336v1 [cs.LG])
    Structural learning, which aims to learn directed acyclic graphs (DAGs) from observational data, is foundational to causal reasoning and scientific discovery. Recent advancements formulate structural learning into a continuous optimization problem; however, DAG learning remains a highly non-convex problem, and there has not been much work on leveraging well-developed convex optimization techniques for causal structural learning. We fill this gap by proposing a data-adaptive linear approach for causal structural learning from time series data, which can be conveniently cast into a convex optimization problem using a recently developed monotone operator variational inequality (VI) formulation. Furthermore, we establish non-asymptotic recovery guarantee of the VI-based approach and show the superior performance of our proposed method on structure recovery over existing methods via extensive numerical experiments.  ( 2 min )
    A Simple Algorithm For Scaling Up Kernel Methods. (arXiv:2301.11414v1 [cs.LG])
    The recent discovery of the equivalence between infinitely wide neural networks (NNs) in the lazy training regime and Neural Tangent Kernels (NTKs) (Jacot et al., 2018) has revived interest in kernel methods. However, conventional wisdom suggests kernel methods are unsuitable for large samples due to their computational complexity and memory requirements. We introduce a novel random feature regression algorithm that allows us (when necessary) to scale to virtually infinite numbers of random features. We illustrate the performance of our method on the CIFAR-10 dataset.  ( 2 min )
    Coincident Learning for Unsupervised Anomaly Detection. (arXiv:2301.11368v1 [cs.LG])
    Anomaly detection is an important task for complex systems (e.g., industrial facilities, manufacturing, large-scale science experiments), where failures in a sub-system can lead to low yield, faulty products, or even damage to components. While complex systems often have a wealth of data, labeled anomalies are typically rare (or even nonexistent) and expensive to acquire. In this paper, we introduce a new method, called CoAD, for training anomaly detection models on unlabeled data, based on the expectation that anomalous behavior in one sub-system will produce coincident anomalies in downstream sub-systems and products. Given data split into two streams $s$ and $q$ (i.e., subsystem diagnostics and final product quality), we define an unsupervised metric, $\hat{F}_\beta$, out of analogy to the supervised classification $F_\beta$ statistic, which quantifies the performance of the independent anomaly detection algorithms on s and q based on their coincidence rate. We demonstrate our method in four cases: a synthetic time-series data set, a synthetic imaging data set generated from MNIST, a metal milling data set, and a data set taken from a particle accelerator.  ( 2 min )
    Causal Bandits without Graph Learning. (arXiv:2301.11401v1 [stat.ML])
    We study the causal bandit problem when the causal graph is unknown and develop an efficient algorithm for finding the parent node of the reward node using atomic interventions. We derive the exact equation for the expected number of interventions performed by the algorithm and show that under certain graphical conditions it could perform either logarithmically fast or, under more general assumptions, slower but still sublinearly in the number of variables. We formally show that our algorithm is optimal as it meets the universal lower bound we establish for any algorithm that performs atomic interventions. Finally, we extend our algorithm to the case when the reward node has multiple parents. Using this algorithm together with a standard algorithm from bandit literature leads to improved regret bounds.  ( 2 min )
    A Hybrid Deep Neural Operator/Finite Element Method for Ice-Sheet Modeling. (arXiv:2301.11402v1 [physics.comp-ph])
    One of the most challenging and consequential problems in climate modeling is to provide probabilistic projections of sea level rise. A large part of the uncertainty of sea level projections is due to uncertainty in ice sheet dynamics. At the moment, accurate quantification of the uncertainty is hindered by the cost of ice sheet computational models. In this work, we develop a hybrid approach to approximate existing ice sheet computational models at a fraction of their cost. Our approach consists of replacing the finite element model for the momentum equations for the ice velocity, the most expensive part of an ice sheet model, with a Deep Operator Network, while retaining a classic finite element discretization for the evolution of the ice thickness. We show that the resulting hybrid model is very accurate and it is an order of magnitude faster than the traditional finite element model. Further, a distinctive feature of the proposed model compared to other neural network approaches, is that it can handle high-dimensional parameter spaces (parameter fields) such as the basal friction at the bed of the glacier, and can therefore be used for generating samples for uncertainty quantification. We study the impact of hyper-parameters, number of unknowns and correlation length of the parameter distribution on the training and accuracy of the Deep Operator Network on a synthetic ice sheet model. We then target the evolution of the Humboldt glacier in Greenland and show that our hybrid model can provide accurate statistics of the glacier mass loss and can be effectively used to accelerate the quantification of uncertainty.  ( 2 min )
  • Open

    Differential Privacy has Bounded Impact on Fairness in Classification. (arXiv:2210.16242v2 [cs.LG] UPDATED)
    We theoretically study the impact of differential privacy on fairness in classification. We prove that, given a class of models, popular group fairness measures are pointwise Lipschitz-continuous with respect to the parameters of the model. This result is a consequence of a more general statement on accuracy conditioned on an arbitrary event (such as membership to a sensitive group), which may be of independent interest. We use the aforementioned Lipschitz property to prove a high probability bound showing that, given enough examples, the fairness level of private models is close to the one of their non-private counterparts.
    Achieving Risk Control in Online Learning Settings. (arXiv:2205.09095v7 [cs.LG] UPDATED)
    To provide rigorous uncertainty quantification for online learning models, we develop a framework for constructing uncertainty sets that provably control risk -- such as coverage of confidence intervals, false negative rate, or F1 score -- in the online setting. This extends conformal prediction to apply to a larger class of online learning problems. Our method guarantees risk control at any user-specified level even when the underlying data distribution shifts drastically, even adversarially, over time in an unknown fashion. The technique we propose is highly flexible as it can be applied with any base online learning algorithm (e.g., a deep neural network trained online), requiring minimal implementation effort and essentially zero additional computational cost. We further extend our approach to control multiple risks simultaneously, so the prediction sets we generate are valid for all given risks. To demonstrate the utility of our method, we conduct experiments on real-world tabular time-series data sets showing that the proposed method rigorously controls various natural risks. Furthermore, we show how to construct valid intervals for an online image-depth estimation problem that previous sequential calibration schemes cannot handle.  ( 2 min )
    The Stochastic Proximal Distance Algorithm. (arXiv:2210.12277v3 [stat.ML] UPDATED)
    Stochastic versions of proximal methods have gained much attention in statistics and machine learning. These algorithms tend to admit simple, scalable forms, and enjoy numerical stability via implicit updates. In this work, we propose and analyze a stochastic version of the recently proposed proximal distance algorithm, a class of iterative optimization methods that recover a desired constrained estimation problem as a penalty parameter $\rho \rightarrow \infty$. By uncovering connections to related stochastic proximal methods and interpreting the penalty parameter as the learning rate, we justify heuristics used in practical manifestations of the proximal distance method, establishing their convergence guarantees for the first time. Moreover, we extend recent theoretical devices to establish finite error bounds and a complete characterization of convergence rates regimes. We validate our analysis via a thorough empirical study, also showing that unsurprisingly, the proposed method outpaces batch versions on popular learning tasks.
    Big portfolio selection by graph-based conditional moments method. (arXiv:2301.11697v1 [stat.ML])
    How to do big portfolio selection is very important but challenging for both researchers and practitioners. In this paper, we propose a new graph-based conditional moments (GRACE) method to do portfolio selection based on thousands of stocks or more. The GRACE method first learns the conditional quantiles and mean of stock returns via a factor-augmented temporal graph convolutional network, which guides the learning procedure through a factor-hypergraph built by the set of stock-to-stock relations from the domain knowledge as well as the set of factor-to-stock relations from the asset pricing knowledge. Next, the GRACE method learns the conditional variance, skewness, and kurtosis of stock returns from the learned conditional quantiles by using the quantiled conditional moment (QCM) method. The QCM method is a supervised learning procedure to learn these conditional higher-order moments, so it largely overcomes the computational difficulty from the classical high-dimensional GARCH-type methods. Moreover, the QCM method allows the mis-specification in modeling conditional quantiles to some extent, due to its regression-based nature. Finally, the GRACE method uses the learned conditional mean, variance, skewness, and kurtosis to construct several performance measures, which are criteria to sort the stocks to proceed the portfolio selection in the well-known 10-decile framework. An application to NASDAQ and NYSE stock markets shows that the GRACE method performs much better than its competitors, particularly when the performance measures are comprised of conditional variance, skewness, and kurtosis.  ( 2 min )
    Personalised Federated Learning On Heterogeneous Feature Spaces. (arXiv:2301.11447v1 [cs.LG])
    Most personalised federated learning (FL) approaches assume that raw data of all clients are defined in a common subspace i.e. all clients store their data according to the same schema. For real-world applications, this assumption is restrictive as clients, having their own systems to collect and then store data, may use heterogeneous data representations. We aim at filling this gap. To this end, we propose a general framework coined FLIC that maps client's data onto a common feature space via local embedding functions. The common feature space is learnt in a federated manner using Wasserstein barycenters while the local embedding functions are trained on each client via distribution alignment. We integrate this distribution alignement mechanism into a federated learning approach and provide the algorithmics of FLIC. We compare its performances against FL benchmarks involving heterogeneous input features spaces. In addition, we provide theoretical insights supporting the relevance of our methodology.  ( 2 min )
    Fine-tuning Neural-Operator architectures for training and generalization. (arXiv:2301.11509v1 [cs.LG])
    In this work, we present an analysis of the generalization of Neural Operators (NOs) and derived architectures. We proposed a family of networks, which we name (${\textit{s}}{\text{NO}}+\varepsilon$), where we modify the layout of NOs towards an architecture resembling a Transformer; mainly, we substitute the Attention module with the Integral Operator part of NOs. The resulting network preserves universality, has a better generalization to unseen data, and similar number of parameters as NOs. On the one hand, we study numerically the generalization by gradually transforming NOs into ${\textit{s}}{\text{NO}}+\varepsilon$ and verifying a reduction of the test loss considering a time-harmonic wave dataset with different frequencies. We perform the following changes in NOs: (a) we split the Integral Operator (non-local) and the (local) feed-forward network (MLP) into different layers, generating a {\it sequential} structure which we call sequential Neural Operator (${\textit{s}}{\text{NO}}$), (b) we add the skip connection, and layer normalization in ${\textit{s}}{\text{NO}}$, and (c) we incorporate dropout and stochastic depth that allows us to generate deep networks. In each case, we observe a decrease in the test loss in a wide variety of initialization, indicating that our changes outperform the NO. On the other hand, building on infinite-dimensional Statistics, and in particular the Dudley Theorem, we provide bounds of the Rademacher complexity of NOs and ${\textit{s}}{\text{NO}}$, and we find the following relationship: the upper bound of the Rademacher complexity of the ${\textit{s}}{\text{NO}}$ is a lower-bound of the NOs, thereby, the generalization error bound of ${\textit{s}}{\text{NO}}$ is smaller than NO, which further strengthens our numerical results.  ( 2 min )
    On the Relationship Between Explanation and Prediction: A Causal View. (arXiv:2212.06925v3 [cs.LG] UPDATED)
    Explainability has become a central requirement for the development, deployment, and adoption of machine learning (ML) models and we are yet to understand what explanation methods can and cannot do. Several factors such as data, model prediction, hyperparameters used in training the model, and random initialization can all influence downstream explanations. While previous work empirically hinted that explanations (E) may have little relationship with the prediction (Y), there is a lack of conclusive study to quantify this relationship. Our work borrows tools from causal inference to systematically assay this relationship. More specifically, we measure the relationship between E and Y by measuring the treatment effect when intervening on their causal ancestors (hyperparameters) (inputs to generate saliency-based Es or Ys). We discover that Y's relative direct influence on E follows an odd pattern; the influence is higher in the lowest-performing models than in mid-performing models, and it then decreases in the top-performing models. We believe our work is a promising first step towards providing better guidance for practitioners who can make more informed decisions in utilizing these explanations by knowing what factors are at play and how they relate to their end task.  ( 2 min )
    SOBER: Scalable Batch Bayesian Optimization and Quadrature using Recombination Constraints. (arXiv:2301.11832v1 [cs.LG])
    Batch Bayesian optimisation (BO) has shown to be a sample-efficient method of performing optimisation where expensive-to-evaluate objective functions can be queried in parallel. However, current methods do not scale to large batch sizes -- a frequent desideratum in practice (e.g. drug discovery or simulation-based inference). We present a novel algorithm, SOBER, which permits scalable and diversified batch BO with arbitrary acquisition functions, arbitrary input spaces (e.g. graph), and arbitrary kernels. The key to our approach is to reformulate batch selection for BO as a Bayesian quadrature (BQ) problem, which offers computational advantages. This reformulation is beneficial in solving BQ tasks reciprocally, which introduces the exploitative functionality of BO to BQ. We show that SOBER offers substantive performance gains in synthetic and real-world tasks, including drug discovery and simulation-based inference.  ( 2 min )
    Integrating Random Effects in Deep Neural Networks. (arXiv:2206.03314v3 [stat.ML] UPDATED)
    Modern approaches to supervised learning like deep neural networks (DNNs) typically implicitly assume that observed responses are statistically independent. In contrast, correlated data are prevalent in real-life large-scale applications, with typical sources of correlation including spatial, temporal and clustering structures. These correlations are either ignored by DNNs, or ad-hoc solutions are developed for specific use cases. We propose to use the mixed models framework to handle correlated data in DNNs. By treating the effects underlying the correlation structure as random effects, mixed models are able to avoid overfitted parameter estimates and ultimately yield better predictive performance. The key to combining mixed models and DNNs is using the Gaussian negative log-likelihood (NLL) as a natural loss function that is minimized with DNN machinery including stochastic gradient descent (SGD). Since NLL does not decompose like standard DNN loss functions, the use of SGD with NLL presents some theoretical and implementation challenges, which we address. Our approach which we call LMMNN is demonstrated to improve performance over natural competitors in various correlation scenarios on diverse simulated and real datasets. Our focus is on a regression setting and tabular datasets, but we also show some results for classification. Our code is available at https://github.com/gsimchoni/lmmnn.  ( 2 min )
    Rethinking Assumptions in Deep Anomaly Detection. (arXiv:2006.00339v3 [cs.LG] UPDATED)
    Though anomaly detection (AD) can be viewed as a classification problem (nominal vs. anomalous) it is usually treated in an unsupervised manner since one typically does not have access to, or it is infeasible to utilize, a dataset that sufficiently characterizes what it means to be "anomalous." In this paper we present results demonstrating that this intuition surprisingly seems not to extend to deep AD on images. For a recent AD benchmark on ImageNet, classifiers trained to discern between normal samples and just a few (64) random natural images are able to outperform the current state of the art in deep AD. Experimentally we discover that the multiscale structure of image data makes example anomalies exceptionally informative.  ( 2 min )
    Finite-time analysis of single-timescale actor-critic. (arXiv:2210.09921v2 [cs.LG] UPDATED)
    Actor-critic methods have achieved significant success in many challenging applications. However, its finite-time convergence is still poorly understood in its most practical form. Existing works on analyzing single-timescale actor-critic only focus on the i.i.d. sampling or tabular setting for simplicity. We consider the more practical online single-timescale actor-critic algorithm on continuous state space, where the critic is updated with a single Markovian sample per actor step. Existing analysis cannot conclude the convergence for such a challenging case. We prove that the online single-timescale actor-critic method is guaranteed to find an $\epsilon$-approximate stationary point with $\widetilde{\mathcal{O}}(\epsilon^{-2})$ sample complexity under standard assumptions, which can be further improved to $\mathcal{O}(\epsilon^{-2})$ under the i.i.d. sampling. We develop a novel framework that evaluates and controls the error propagation between actor and critic systematically. To our knowledge, this is the first finite-time analysis for the online single-timescale actor-critic method. Our results compare favorably to the existing literature in terms of considering the most practical yet challenging settings and requiring weaker assumptions.  ( 2 min )
    Algorithmic Stability of Heavy-Tailed SGD with General Loss Functions. (arXiv:2301.11885v1 [stat.ML])
    Heavy-tail phenomena in stochastic gradient descent (SGD) have been reported in several empirical studies. Experimental evidence in previous works suggests a strong interplay between the heaviness of the tails and generalization behavior of SGD. To address this empirical phenomena theoretically, several works have made strong topological and statistical assumptions to link the generalization error to heavy tails. Very recently, new generalization bounds have been proven, indicating a non-monotonic relationship between the generalization error and heavy tails, which is more pertinent to the reported empirical observations. While these bounds do not require additional topological assumptions given that SGD can be modeled using a heavy-tailed stochastic differential equation (SDE), they can only apply to simple quadratic problems. In this paper, we build on this line of research and develop generalization bounds for a more general class of objective functions, which includes non-convex functions as well. Our approach is based on developing Wasserstein stability bounds for heavy-tailed SDEs and their discretizations, which we then convert to generalization bounds. Our results do not require any nontrivial assumptions; yet, they shed more light to the empirical observations, thanks to the generality of the loss functions.  ( 2 min )
    Conformal inference is (almost) free for neural networks trained with early stopping. (arXiv:2301.11556v1 [stat.ML])
    Early stopping based on hold-out data is a popular regularization technique designed to mitigate overfitting and increase the predictive accuracy of neural networks. Models trained with early stopping often provide relatively accurate predictions, but they generally still lack precise statistical guarantees unless they are further calibrated using independent hold-out data. This paper addresses the above limitation with conformalized early stopping: a novel method that combines early stopping with conformal calibration while efficiently recycling the same hold-out data. This leads to models that are both accurate and able to provide exact predictive inferences without multiple data splits nor overly conservative adjustments. Practical implementations are developed for different learning tasks -- outlier detection, multi-class classification, regression -- and their competitive performance is demonstrated on real data.  ( 2 min )
    Robust Multi-Agent Bandits Over Undirected Graphs. (arXiv:2203.00076v2 [cs.LG] UPDATED)
    We consider a multi-agent multi-armed bandit setting in which $n$ honest agents collaborate over a network to minimize regret but $m$ malicious agents can disrupt learning arbitrarily. Assuming the network is the complete graph, existing algorithms incur $O( (m + K/n) \log (T) / \Delta )$ regret in this setting, where $K$ is the number of arms and $\Delta$ is the arm gap. For $m \ll K$, this improves over the single-agent baseline regret of $O(K\log(T)/\Delta)$. In this work, we show the situation is murkier beyond the case of a complete graph. In particular, we prove that if the state-of-the-art algorithm is used on the undirected line graph, honest agents can suffer (nearly) linear regret until time is doubly exponential in $K$ and $n$. In light of this negative result, we propose a new algorithm for which the $i$-th agent has regret $O( ( d_{\text{mal}}(i) + K/n) \log(T)/\Delta)$ on any connected and undirected graph, where $d_{\text{mal}}(i)$ is the number of $i$'s neighbors who are malicious. Thus, we generalize existing regret bounds beyond the complete graph (where $d_{\text{mal}}(i) = m$), and show the effect of malicious agents is entirely local (in the sense that only the $d_{\text{mal}}(i)$ malicious agents directly connected to $i$ affect its long-term regret).  ( 2 min )
    Distributionally Robust Offline Reinforcement Learning with Linear Function Approximation. (arXiv:2209.06620v3 [cs.LG] UPDATED)
    Among the reasons hindering reinforcement learning (RL) applications to real-world problems, two factors are critical: limited data and the mismatch between the testing environment (real environment in which the policy is deployed) and the training environment (e.g., a simulator). This paper attempts to address these issues simultaneously with distributionally robust offline RL, where we learn a distributionally robust policy using historical data obtained from the source environment by optimizing against a worst-case perturbation thereof. In particular, we move beyond tabular settings and consider linear function approximation. More specifically, we consider two settings, one where the dataset is well-explored and the other where the dataset has sufficient coverage of the optimal policy. We propose two algorithms~-- one for each of the two settings~-- that achieve error bounds $\tilde{O}(d^{1/2}/N^{1/2})$ and $\tilde{O}(d^{3/2}/N^{1/2})$ respectively, where $d$ is the dimension in the linear function approximation and $N$ is the number of trajectories in the dataset. To the best of our knowledge, they provide the first non-asymptotic results of the sample complexity in this setting. Diverse experiments are conducted to demonstrate our theoretical findings, showing the superiority of our algorithm against the non-robust one.  ( 2 min )
    Statistical Inference for the Dynamic Time Warping Distance, with Application to Abnormal Time-Series Detection. (arXiv:2202.06593v2 [stat.ML] UPDATED)
    We study statistical inference on the similarity/distance between two time-series under uncertain environment by considering a statistical hypothesis test on the distance obtained from Dynamic Time Warping (DTW) algorithm. The sampling distribution of the DTW distance is too difficult to derive because it is obtained based on the solution of the DTW algorithm, which is complicated. To circumvent this difficulty, we propose to employ the conditional selective inference framework, which enables us to derive a valid inference method on the DTW distance. To our knowledge, this is the first method that can provide a valid p-value to quantify the statistical significance of the DTW distance, which is helpful for high-stake decision making such as abnormal time-series detection problems. We evaluate the performance of the proposed inference method on both synthetic and real-world datasets.  ( 2 min )
    Learning the Dynamics of Sparsely Observed Interacting Systems. (arXiv:2301.11647v1 [stat.ML])
    We address the problem of learning the dynamics of an unknown non-parametric system linking a target and a feature time series. The feature time series is measured on a sparse and irregular grid, while we have access to only a few points of the target time series. Once learned, we can use these dynamics to predict values of the target from the previous values of the feature time series. We frame this task as learning the solution map of a controlled differential equation (CDE). By leveraging the rich theory of signatures, we are able to cast this non-linear problem as a high-dimensional linear regression. We provide an oracle bound on the prediction error which exhibits explicit dependencies on the individual-specific sampling schemes. Our theoretical results are illustrated by simulations which show that our method outperforms existing algorithms for recovering the full time series while being computationally cheap. We conclude by demonstrating its potential on real-world epidemiological data.  ( 2 min )
    Feasibility and Transferability of Transfer Learning: A Mathematical Framework. (arXiv:2301.11542v1 [cs.LG])
    Transfer learning is an emerging and popular paradigm for utilizing existing knowledge from previous learning tasks to improve the performance of new ones. Despite its numerous empirical successes, theoretical analysis for transfer learning is limited. In this paper we build for the first time, to the best of our knowledge, a mathematical framework for the general procedure of transfer learning. Our unique reformulation of transfer learning as an optimization problem allows for the first time, analysis of its feasibility. Additionally, we propose a novel concept of transfer risk to evaluate transferability of transfer learning. Our numerical studies using the Office-31 dataset demonstrate the potential and benefits of incorporating transfer risk in the evaluation of transfer learning performance.  ( 2 min )
    Single-Trajectory Distributionally Robust Reinforcement Learning. (arXiv:2301.11721v1 [stat.ML])
    As a framework for sequential decision-making, Reinforcement Learning (RL) has been regarded as an essential component leading to Artificial General Intelligence (AGI). However, RL is often criticized for having the same training environment as the test one, which also hinders its application in the real world. To mitigate this problem, Distributionally Robust RL (DRRL) is proposed to improve the worst performance in a set of environments that may contain the unknown test environment. Due to the nonlinearity of the robustness goal, most of the previous work resort to the model-based approach, learning with either an empirical distribution learned from the data or a simulator that can be sampled infinitely, which limits their applications in simple dynamics environments. In contrast, we attempt to design a DRRL algorithm that can be trained along a single trajectory, i.e., no repeated sampling from a state. Based on the standard Q-learning, we propose distributionally robust Q-learning with the single trajectory (DRQ) and its average-reward variant named differential DRQ. We provide asymptotic convergence guarantees and experiments for both settings, demonstrating their superiority in the perturbed environments against the non-robust ones.  ( 2 min )
    I Prefer not to Say: Are Users Penalized for Protecting Personal Data?. (arXiv:2210.13954v3 [cs.LG] UPDATED)
    We examine the problem of obtaining fair outcomes for individuals who choose to share optional information with machine-learned models and those who do not consent and keep their data undisclosed. We find that these non-consenting users receive significantly lower prediction outcomes than justified by their provided information alone. This observation gives rise to the overlooked problem of how to ensure that users, who protect their personal data, are not penalized. While statistical fairness notions focus on fair outcomes between advantaged and disadvantaged groups, these fairness notions fail to protect the non-consenting users. To address this problem, we formalize protection requirements for models which (i) allow users to benefit from sharing optional information and (ii) do not penalize them if they keep their data undisclosed. We offer the first solution to this problem by proposing the notion of Optional Feature Fairness (OFF), which we prove to be loss-optimal under our protection requirements (i) and (ii). To learn OFF-compliant models, we devise a model-agnostic data augmentation strategy with finite sample convergence guarantees. Finally, we extensively analyze OFF on a variety of challenging real-world tasks, models, and data sets with multiple optional features.  ( 2 min )
    Bi-stochastically normalized graph Laplacian: convergence to manifold Laplacian and robustness to outlier noise. (arXiv:2206.11386v2 [math.ST] UPDATED)
    Bi-stochastic normalization provides an alternative normalization of graph Laplacians in graph-based data analysis and can be computed efficiently by Sinkhorn-Knopp (SK) iterations. This paper proves the convergence of bi-stochastically normalized graph Laplacian to manifold (weighted-)Laplacian with rates, when $n$ data points are i.i.d. sampled from a general $d$-dimensional manifold embedded in a possibly high-dimensional space. Under certain joint limit of $n \to \infty$ and kernel bandwidth $\epsilon \to 0$, the point-wise convergence rate of the graph Laplacian operator (under 2-norm) is proved to be $ O( n^{-1/(d/2+3)})$ at finite large $n$ up to log factors, achieved at the scaling of $\epsilon \sim n^{-1/(d/2+3)} $. When the manifold data are corrupted by outlier noise, we theoretically prove the graph Laplacian point-wise consistency which matches the rate for clean manifold data plus an additional term proportional to the boundedness of the inner-products of the noise vectors among themselves and with data vectors. Motivated by our analysis, which suggests that not exact bi-stochastic normalization but an approximate one will achieve the same consistency rate, we propose an approximate and constrained matrix scaling problem that can be solved by SK iterations with early termination. Numerical experiments support our theoretical results and show the robustness of bi-stochastically normalized graph Laplacian to high-dimensional outlier noise.  ( 2 min )
    Embrace the Gap: VAEs Perform Independent Mechanism Analysis. (arXiv:2206.02416v3 [stat.ML] UPDATED)
    Variational autoencoders (VAEs) are a popular framework for modeling complex data distributions; they can be efficiently trained via variational inference by maximizing the evidence lower bound (ELBO), at the expense of a gap to the exact (log-)marginal likelihood. While VAEs are commonly used for representation learning, it is unclear why ELBO maximization would yield useful representations, since unregularized maximum likelihood estimation cannot invert the data-generating process. Yet, VAEs often succeed at this task. We seek to elucidate this apparent paradox by studying nonlinear VAEs in the limit of near-deterministic decoders. We first prove that, in this regime, the optimal encoder approximately inverts the decoder -- a commonly used but unproven conjecture -- which we refer to as {\em self-consistency}. Leveraging self-consistency, we show that the ELBO converges to a regularized log-likelihood. This allows VAEs to perform what has recently been termed independent mechanism analysis (IMA): it adds an inductive bias towards decoders with column-orthogonal Jacobians, which helps recovering the true latent factors. The gap between ELBO and log-likelihood is therefore welcome, since it bears unanticipated benefits for nonlinear representation learning. In experiments on synthetic and image data, we show that VAEs uncover the true latent factors when the data generating process satisfies the IMA assumption.  ( 2 min )
    Convergence of Batch Updating Methods with Approximate Gradients and/or Noisy Measurements: Theory and Computational Results. (arXiv:2209.05372v2 [math.OC] UPDATED)
    In this paper, we present a unified and general framework for analyzing the batch updating approach to nonlinear, high-dimensional optimization. The framework encompasses all the currently used batch updating approaches, and is applicable to nonconvex as well as convex functions. Moreover, the framework permits the use of noise-corrupted gradients, as well as first-order approximations to the gradient (sometimes referred to as "gradient-free" approaches). By viewing the analysis of the iterations as a problem in the convergence of stochastic processes, we are able to establish a very general theorem, which includes most known convergence results for zeroth-order and first-order methods. The analysis of "second-order" or momentum-based methods is not a part of this paper, and will be studied elsewhere. However, numerical experiments indicate that momentum-based methods can fail if the true gradient is replaced by its first-order approximation. This requires further theoretical analysis.  ( 2 min )
    Explaining Patterns in Data with Language Models via Interpretable Autoprompting. (arXiv:2210.01848v2 [cs.LG] UPDATED)
    Large language models (LLMs) have displayed an impressive ability to harness natural language to perform complex tasks. In this work, we explore whether we can leverage this learned ability to find and explain patterns in data. Specifically, given a pre-trained LLM and data examples, we introduce interpretable autoprompting (iPrompt), an algorithm that generates a natural-language string explaining the data. iPrompt iteratively alternates between generating explanations with an LLM and reranking them based on their performance when used as a prompt. Experiments on a wide range of datasets, from synthetic mathematics to natural-language understanding, show that iPrompt can yield meaningful insights by accurately finding groundtruth dataset descriptions. Moreover, the prompts produced by iPrompt are simultaneously human-interpretable and highly effective for generalization: on real-world sentiment classification datasets, iPrompt produces prompts that match or even improve upon human-written prompts for GPT-3. Finally, experiments with an fMRI dataset show the potential for iPrompt to aid in scientific discovery. All code for using the methods and data here is made available on Github.  ( 2 min )
    A Deep Learning Method for Comparing Bayesian Hierarchical Models. (arXiv:2301.11873v1 [stat.ML])
    Bayesian model comparison (BMC) offers a principled approach for assessing the relative merits of competing computational models and propagating uncertainty into model selection decisions. However, BMC is often intractable for the popular class of hierarchical models due to their high-dimensional nested parameter structure. To address this intractability, we propose a deep learning method for performing BMC on any set of hierarchical models which can be instantiated as probabilistic programs. Since our method enables amortized inference, it allows efficient re-estimation of posterior model probabilities and fast performance validation prior to any real-data application. In a series of extensive validation studies, we benchmark the performance of our method against the state-of-the-art bridge sampling method and demonstrate excellent amortized inference across all BMC settings. We then use our method to compare four hierarchical evidence accumulation models that have previously been deemed intractable for BMC due to partly implicit likelihoods. In this application, we corroborate evidence for the recently proposed L\'evy flight model of decision-making and show how transfer learning can be leveraged to enhance training efficiency. Reproducible code for all analyses is provided.  ( 2 min )
    CROWDLAB: Supervised learning to infer consensus labels and quality scores for data with multiple annotators. (arXiv:2210.06812v2 [cs.LG] UPDATED)
    Real-world data for classification is often labeled by multiple annotators. For analyzing such data, we introduce CROWDLAB, a straightforward approach to utilize any trained classifier to estimate: (1) A consensus label for each example that aggregates the available annotations; (2) A confidence score for how likely each consensus label is correct; (3) A rating for each annotator quantifying the overall correctness of their labels. Existing algorithms to estimate related quantities in crowdsourcing often rely on sophisticated generative models with iterative inference. CROWDLAB instead uses a straightforward weighted ensemble. Existing algorithms often rely solely on annotator statistics, ignoring the features of the examples from which the annotations derive. CROWDLAB utilizes any classifier model trained on these features, and can thus better generalize between examples with similar features. On real-world multi-annotator image data, our proposed method provides superior estimates for (1)-(3) than existing algorithms like Dawid-Skene/GLAD.  ( 2 min )
    Synthetic A/B Testing using Synthetic Interventions. (arXiv:2006.07691v5 [econ.EM] UPDATED)
    Suppose there are $N$ units and $D$ interventions. We aim to learn the average potential outcome associated with every unit-intervention pair, i.e., $N \times D$ causal parameters. While running $N \times D$ experiments is conceivable, it can be expensive or infeasible. This work introduces an experiment design, synthetic A/B testing, and the synthetic interventions (SI) estimator to recover all $N \times D$ causal parameters while observing each unit under at most two interventions, independent of $D$. Under a novel tensor factor model for potential outcomes across units, measurements, and interventions, we establish the identification of each parameter. Further, we show the SI estimator is finite-sample consistent and asymptotically normal. Collectively, these also lead to novel results for panel data settings, particularly for synthetic controls. We empirically validate our experiment design using real e-commerce data from a large-scale A/B test.  ( 2 min )
    A kernel Stein test of goodness of fit for sequential models. (arXiv:2210.10741v2 [stat.ML] UPDATED)
    We propose a goodness-of-fit measure for probability densities modeling observations with varying dimensionality, such as text documents of differing lengths or variable-length sequences. The proposed measure is an instance of the kernel Stein discrepancy (KSD), which has been used to construct goodness-of-fit tests for unnormalized densities. The KSD is defined by its Stein operator: current operators used in testing apply to fixed-dimensional spaces. As our main contribution, we extend the KSD to the variable-dimension setting by identifying appropriate Stein operators, and propose a novel KSD goodness-of-fit test. As with the previous variants, the proposed KSD does not require the density to be normalized, allowing the evaluation of a large class of models. Our test is shown to perform well in practice on discrete sequential data benchmarks.  ( 2 min )
    FedPop: A Bayesian Approach for Personalised Federated Learning. (arXiv:2206.03611v2 [cs.LG] UPDATED)
    Personalised federated learning (FL) aims at collaboratively learning a machine learning model taylored for each client. Albeit promising advances have been made in this direction, most of existing approaches works do not allow for uncertainty quantification which is crucial in many applications. In addition, personalisation in the cross-device setting still involves important issues, especially for new clients or those having small number of observations. This paper aims at filling these gaps. To this end, we propose a novel methodology coined FedPop by recasting personalised FL into the population modeling paradigm where clients' models involve fixed common population parameters and random effects, aiming at explaining data heterogeneity. To derive convergence guarantees for our scheme, we introduce a new class of federated stochastic optimisation algorithms which relies on Markov chain Monte Carlo methods. Compared to existing personalised FL methods, the proposed methodology has important benefits: it is robust to client drift, practical for inference on new clients, and above all, enables uncertainty quantification under mild computational and memory overheads. We provide non-asymptotic convergence guarantees for the proposed algorithms and illustrate their performances on various personalised federated learning tasks.  ( 2 min )
    Constrained Clustering: General Pairwise and Cardinality Constraints. (arXiv:1907.10410v2 [cs.LG] UPDATED)
    We study constrained clustering, where constraints guide the clustering process. In existing works, two categories of constraints have been widely explored, namely pairwise and cardinality constraints. Pairwise constraints enforce the cluster labels of two instances to be the same (must-link constraints) or different (cannot-link constraints). Cardinality constraints encourage cluster sizes to satisfy a user-specified distribution. Most existing constrained clustering models can only utilize one category of constraints at a time. We enforce the above two categories into a unified clustering model starting with the integer program formulation of the standard K-means. As the two categories provide different useful information, utilizing both allow for better clustering performance. However, the optimization is difficult due to the binary and quadratic constraints in the unified formulation. To solve this, we utilize two techniques: equivalently replacing the binary constraints by the intersection of two continuous constraints; the other is transforming the quadratic constraints into bi-linear constraints by introducing extra variables. We derive an equivalent continuous reformulation with simple constraints, which can be efficiently solved by Alternating Direction Method of Multipliers. Extensive experiments on both synthetic and real data demonstrate when: (1) utilizing a single category of constraint, the proposed model is superior to or competitive with SOTA constrained clustering models, and (2) utilizing both categories of constraints jointly, the proposed model shows better performance than the case of the single category. The experiments show that the proposed method exploits the constraints to achieve perfect clustering performance with improved clustering to 2%-5% in classical clustering metrics, e.g. Adjusted Random, Mirkin's, and Huber's, indices outerperfomring other methods.  ( 2 min )
    Fast Bayesian Inference with Batch Bayesian Quadrature via Kernel Recombination. (arXiv:2206.04734v4 [cs.LG] UPDATED)
    Calculation of Bayesian posteriors and model evidences typically requires numerical integration. Bayesian quadrature (BQ), a surrogate-model-based approach to numerical integration, is capable of superb sample efficiency, but its lack of parallelisation has hindered its practical applications. In this work, we propose a parallelised (batch) BQ method, employing techniques from kernel quadrature, that possesses an empirically exponential convergence rate. Additionally, just as with Nested Sampling, our method permits simultaneous inference of both posteriors and model evidence. Samples from our BQ surrogate model are re-selected to give a sparse set of samples, via a kernel recombination algorithm, requiring negligible additional time to increase the batch size. Empirically, we find that our approach significantly outperforms the sampling efficiency of both state-of-the-art BQ techniques and Nested Sampling in various real-world datasets, including lithium-ion battery analytics.  ( 2 min )
    Lifelong Reinforcement Learning with Modulating Masks. (arXiv:2212.11110v2 [cs.LG] UPDATED)
    Lifelong learning aims to create AI systems that continuously and incrementally learn during a lifetime, similar to biological learning. Attempts so far have met problems, including catastrophic forgetting, interference among tasks, and the inability to exploit previous knowledge. While considerable research has focused on learning multiple input distributions, typically in classification, lifelong reinforcement learning (LRL) must also deal with variations in the state and transition distributions, and in the reward functions. Modulating masks, recently developed for classification, are particularly suitable to deal with such a large spectrum of task variations. In this paper, we adapted modulating masks to work with deep LRL, specifically PPO and IMPALA agents. The comparison with LRL baselines in both discrete and continuous RL tasks shows superior performance. We further investigated the use of a linear combination of previously learned masks to exploit previous knowledge when learning new tasks: not only is learning faster, the algorithm solves tasks that we could not otherwise solve from scratch due to extremely sparse rewards. The results suggest that RL with modulating masks is a promising approach to lifelong learning, to the composition of knowledge to learn increasingly complex tasks, and to knowledge reuse for efficient and faster learning.  ( 2 min )
    DAG Learning on the Permutahedron. (arXiv:2301.11898v1 [cs.LG])
    We propose a continuous optimization framework for discovering a latent directed acyclic graph (DAG) from observational data. Our approach optimizes over the polytope of permutation vectors, the so-called Permutahedron, to learn a topological ordering. Edges can be optimized jointly, or learned conditional on the ordering via a non-differentiable subroutine. Compared to existing continuous optimization approaches our formulation has a number of advantages including: 1. validity: optimizes over exact DAGs as opposed to other relaxations optimizing approximate DAGs; 2. modularity: accommodates any edge-optimization procedure, edge structural parameterization, and optimization loss; 3. end-to-end: either alternately iterates between node-ordering and edge-optimization, or optimizes them jointly. We demonstrate, on real-world data problems in protein-signaling and transcriptional network discovery, that our approach lies on the Pareto frontier of two key metrics, the SID and SHD.  ( 2 min )
    Aleatoric and Epistemic Discrimination in Classification. (arXiv:2301.11781v1 [cs.LG])
    Machine learning (ML) models can underperform on certain population groups due to choices made during model development and bias inherent in the data. We categorize sources of discrimination in the ML pipeline into two classes: aleatoric discrimination, which is inherent in the data distribution, and epistemic discrimination, which is due to decisions during model development. We quantify aleatoric discrimination by determining the performance limits of a model under fairness constraints, assuming perfect knowledge of the data distribution. We demonstrate how to characterize aleatoric discrimination by applying Blackwell's results on comparing statistical experiments. We then quantify epistemic discrimination as the gap between a model's accuracy given fairness constraints and the limit posed by aleatoric discrimination. We apply this approach to benchmark existing interventions and investigate fairness risks in data with missing values. Our results indicate that state-of-the-art fairness interventions are effective at removing epistemic discrimination. However, when data has missing values, there is still significant room for improvement in handling aleatoric discrimination.  ( 2 min )
    From Classification Accuracy to Proper Scoring Rules: Elicitability of Probabilistic Top List Predictions. (arXiv:2301.11797v1 [stat.ME])
    In the face of uncertainty, the need for probabilistic assessments has long been recognized in the literature on forecasting. In classification, however, comparative evaluation of classifiers often focuses on predictions specifying a single class through the use of simple accuracy measures, which disregard any probabilistic uncertainty quantification. I propose probabilistic top lists as a novel type of prediction in classification, which bridges the gap between single-class predictions and predictive distributions. The probabilistic top list functional is elicitable through the use of strictly consistent evaluation metrics. The proposed evaluation metrics are based on symmetric proper scoring rules and admit comparison of various types of predictions ranging from single-class point predictions to fully specified predictive distributions. The Brier score yields a metric that is particularly well suited for this kind of comparison.  ( 2 min )
    Understanding Incremental Learning of Gradient Descent: A Fine-grained Analysis of Matrix Sensing. (arXiv:2301.11500v1 [cs.LG])
    It is believed that Gradient Descent (GD) induces an implicit bias towards good generalization in training machine learning models. This paper provides a fine-grained analysis of the dynamics of GD for the matrix sensing problem, whose goal is to recover a low-rank ground-truth matrix from near-isotropic linear measurements. It is shown that GD with small initialization behaves similarly to the greedy low-rank learning heuristics (Li et al., 2020) and follows an incremental learning procedure (Gissin et al., 2019): GD sequentially learns solutions with increasing ranks until it recovers the ground truth matrix. Compared to existing works which only analyze the first learning phase for rank-1 solutions, our result provides characterizations for the whole learning process. Moreover, besides the over-parameterized regime that many prior works focused on, our analysis of the incremental learning procedure also applies to the under-parameterized regime. Finally, we conduct numerical experiments to confirm our theoretical findings.  ( 2 min )
    DBGSL: Dynamic Brain Graph Structure Learning. (arXiv:2209.13513v2 [cs.LG] UPDATED)
    Recently, graph neural networks (GNNs) have shown success at learning representations of brain graphs derived from functional magnetic resonance imaging (fMRI) data. The majority of existing GNN methods, however, assume brain graphs are static over time and the graph adjacency matrix is known prior to model training. These assumptions are at odds with neuroscientific evidence that brain graphs are time-varying with a connectivity structure that depends on the choice of functional connectivity measure. Noisy brain graphs that do not truly represent the underling fMRI data can have a detrimental impact on the performance of GNNs. As a solution, we propose Dynamic Brain Graph Structure Learning (DBGSL), a novel method for learning the optimal time-varying dependency structure of fMRI data induced by a downstream prediction task. Experiments demonstrate DBGSL achieves state-of-the-art performance for sex classification using real-world resting-state and task fMRI data. Moreover, analysis of the learnt dynamic graphs highlights prediction-related brain regions which align with existing neuroscience literature.  ( 2 min )
    Neural Additive Models for Location Scale and Shape: A Framework for Interpretable Neural Regression Beyond the Mean. (arXiv:2301.11862v1 [stat.ML])
    Deep neural networks (DNNs) have proven to be highly effective in a variety of tasks, making them the go-to method for problems requiring high-level predictive power. Despite this success, the inner workings of DNNs are often not transparent, making them difficult to interpret or understand. This lack of interpretability has led to increased research on inherently interpretable neural networks in recent years. Models such as Neural Additive Models (NAMs) achieve visual interpretability through the combination of classical statistical methods with DNNs. However, these approaches only concentrate on mean response predictions, leaving out other properties of the response distribution of the underlying data. We propose Neural Additive Models for Location Scale and Shape (NAMLSS), a modelling framework that combines the predictive power of classical deep learning models with the inherent advantages of distributional regression while maintaining the interpretability of additive models.  ( 2 min )
    Myriad: a real-world testbed to bridge trajectory optimization and deep learning. (arXiv:2202.10600v2 [cs.LG] UPDATED)
    We present Myriad, a testbed written in JAX for learning and planning in real-world continuous environments. The primary contributions of Myriad are threefold. First, Myriad provides machine learning practitioners access to trajectory optimization techniques for application within a typical automatic differentiation workflow. Second, Myriad presents many real-world optimal control problems, ranging from biology to medicine to engineering, for use by the machine learning community. Formulated in continuous space and time, these environments retain some of the complexity of real-world systems often abstracted away by standard benchmarks. As such, Myriad strives to serve as a stepping stone towards application of modern machine learning techniques for impactful real-world tasks. Finally, we use the Myriad repository to showcase a novel approach for learning and control tasks. Trained in a fully end-to-end fashion, our model leverages an implicit planning module over neural ordinary differential equations, enabling simultaneous learning and planning with complex environment dynamics.  ( 2 min )
    Artificial Replay: A Meta-Algorithm for Harnessing Historical Data in Bandits. (arXiv:2210.00025v2 [cs.LG] UPDATED)
    How best to incorporate historical data to "warm start" bandit algorithms is an open question: naively initializing reward estimates using all historical samples can suffer from spurious data and imbalanced data coverage, leading to computational and storage issues $\unicode{x2014}$ particularly salient in continuous action spaces. We propose Artificial Replay, a meta-algorithm for incorporating historical data into any arbitrary base bandit algorithm. Artificial Replay uses only a fraction of the historical data compared to a full warm-start approach, while still achieving identical regret for base algorithms that satisfy independence of irrelevant data (IIData), a novel and broadly applicable property that we introduce. We complement these theoretical results with experiments on $K$-armed and continuous combinatorial bandit algorithms, including a green security domain using real poaching data. We show the practical benefits of Artificial Replay, including for base algorithms that do not satisfy IIData.  ( 2 min )
    Overparameterized Linear Regression under Adversarial Attacks. (arXiv:2204.06274v2 [stat.ML] UPDATED)
    We study the error of linear regression in the face of adversarial attacks. In this framework, an adversary changes the input to the regression model in order to maximize the prediction error. We provide bounds on the prediction error in the presence of an adversary as a function of the parameter norm and the error in the absence of such an adversary. We show how these bounds make it possible to study the adversarial error using analysis from non-adversarial setups. The obtained results shed light on the robustness of overparameterized linear models to adversarial attacks. Adding features might be either a source of additional robustness or brittleness. On the one hand, we use asymptotic results to illustrate how double-descent curves can be obtained for the adversarial error. On the other hand, we derive conditions under which the adversarial error can grow to infinity as more features are added, while at the same time, the test error goes to zero. We show this behavior is caused by the fact that the norm of the parameter vector grows with the number of features. It is also established that $\ell_\infty$ and $\ell_2$-adversarial attacks might behave fundamentally differently due to how the $\ell_1$ and $\ell_2$-norms of random projections concentrate. We also show how our reformulation allows for solving adversarial training as a convex optimization problem. This fact is then exploited to establish similarities between adversarial training and parameter-shrinking methods and to study how the training might affect the robustness of the estimated models.  ( 2 min )
    ActiveLab: Active Learning with Re-Labeling by Multiple Annotators. (arXiv:2301.11856v1 [cs.LG])
    In real-world data labeling applications, annotators often provide imperfect labels. It is thus common to employ multiple annotators to label data with some overlap between their examples. We study active learning in such settings, aiming to train an accurate classifier by collecting a dataset with the fewest total annotations. Here we propose ActiveLab, a practical method to decide what to label next that works with any classifier model and can be used in pool-based batch active learning with one or multiple annotators. ActiveLab automatically estimates when it is more informative to re-label examples vs. labeling entirely new ones. This is a key aspect of producing high quality labels and trained models within a limited annotation budget. In experiments on image and tabular data, ActiveLab reliably trains more accurate classifiers with far fewer annotations than a wide variety of popular active learning methods.  ( 2 min )
    Exponential tail bounds and Large Deviation Principle for Heavy-Tailed U-Statistics. (arXiv:2301.11563v1 [math.PR])
    We study deviation of U-statistics when samples have heavy-tailed distribution so the kernel of the U-statistic does not have bounded exponential moments at any positive point. We obtain an exponential upper bound for the tail of the U-statistics which clearly denotes two regions of tail decay, the first is a Gaussian decay and the second behaves like the tail of the kernel. For several common U-statistics, we also show the upper bound has the right rate of decay as well as sharp constants by obtaining rough logarithmic limits which in turn can be used to develop LDP for U-statistics. In spite of usual LDP results in the literature, processes we consider in this work have LDP speed slower than their sample size $n$.  ( 2 min )
    Leveraging the Third Dimension in Contrastive Learning. (arXiv:2301.11790v1 [cs.CV])
    Self-Supervised Learning (SSL) methods operate on unlabeled data to learn robust representations useful for downstream tasks. Most SSL methods rely on augmentations obtained by transforming the 2D image pixel map. These augmentations ignore the fact that biological vision takes place in an immersive three-dimensional, temporally contiguous environment, and that low-level biological vision relies heavily on depth cues. Using a signal provided by a pretrained state-of-the-art monocular RGB-to-depth model (the \emph{Depth Prediction Transformer}, Ranftl et al., 2021), we explore two distinct approaches to incorporating depth signals into the SSL framework. First, we evaluate contrastive learning using an RGB+depth input representation. Second, we use the depth signal to generate novel views from slightly different camera positions, thereby producing a 3D augmentation for contrastive learning. We evaluate these two approaches on three different SSL methods -- BYOL, SimSiam, and SwAV -- using ImageNette (10 class subset of ImageNet), ImageNet-100 and ImageNet-1k datasets. We find that both approaches to incorporating depth signals improve the robustness and generalization of the baseline SSL methods, though the first approach (with depth-channel concatenation) is superior. For instance, BYOL with the additional depth channel leads to an increase in downstream classification accuracy from 85.3\% to 88.0\% on ImageNette and 84.1\% to 87.0\% on ImageNet-C.  ( 2 min )
    Variance, Self-Consistency, and Arbitrariness in Fair Classification. (arXiv:2301.11562v1 [cs.LG])
    In fair classification, it is common to train a model, and to compare and correct subgroup-specific error rates for disparities. However, even if a model's classification decisions satisfy a fairness metric, it is not necessarily the case that these decisions are equally confident. This becomes clear if we measure variance: We can fix everything in the learning process except the subset of training data, train multiple models, measure (dis)agreement in predictions for each test example, and interpret disagreement to mean that the learning process is more unstable with respect to its classification decision. Empirically, some decisions can in fact be so unstable that they are effectively arbitrary. To reduce this arbitrariness, we formalize a notion of self-consistency of a learning process, develop an ensembling algorithm that provably increases self-consistency, and empirically demonstrate its utility to often improve both fairness and accuracy. Further, our evaluation reveals a startling observation: Applying ensembling to common fair classification benchmarks can significantly reduce subgroup error rate disparities, without employing common pre-, in-, or post-processing fairness interventions. Taken together, our results indicate that variance, particularly on small datasets, can muddle the reliability of conclusions about fairness. One solution is to develop larger benchmark tasks. To this end, we release a toolkit that makes the Home Mortgage Disclosure Act datasets easily usable for future research.  ( 2 min )
    Causal Bandits without Graph Learning. (arXiv:2301.11401v1 [stat.ML])
    We study the causal bandit problem when the causal graph is unknown and develop an efficient algorithm for finding the parent node of the reward node using atomic interventions. We derive the exact equation for the expected number of interventions performed by the algorithm and show that under certain graphical conditions it could perform either logarithmically fast or, under more general assumptions, slower but still sublinearly in the number of variables. We formally show that our algorithm is optimal as it meets the universal lower bound we establish for any algorithm that performs atomic interventions. Finally, we extend our algorithm to the case when the reward node has multiple parents. Using this algorithm together with a standard algorithm from bandit literature leads to improved regret bounds.  ( 2 min )
    When Do Flat Minima Optimizers Work?. (arXiv:2202.00661v5 [cs.LG] UPDATED)
    Recently, flat-minima optimizers, which seek to find parameters in low-loss neighborhoods, have been shown to improve a neural network's generalization performance over stochastic and adaptive gradient-based optimizers. Two methods have received significant attention due to their scalability: 1. Stochastic Weight Averaging (SWA), and 2. Sharpness-Aware Minimization (SAM). However, there has been limited investigation into their properties and no systematic benchmarking of them across different domains. We fill this gap here by comparing the loss surfaces of the models trained with each method and through broad benchmarking across computer vision, natural language processing, and graph representation learning tasks. We discover several surprising findings from these results, which we hope will help researchers further improve deep learning optimizers, and practitioners identify the right optimizer for their problem.  ( 2 min )
    Multi-dimensional concept discovery (MCD): A unifying framework with completeness guarantees. (arXiv:2301.11911v1 [cs.LG])
    The completeness axiom renders the explanation of a post-hoc XAI method only locally faithful to the model, i.e. for a single decision. For the trustworthy application of XAI, in particular for high-stake decisions, a more global model understanding is required. Recently, concept-based methods have been proposed, which are however not guaranteed to be bound to the actual model reasoning. To circumvent this problem, we propose Multi-dimensional Concept Discovery (MCD) as an extension of previous approaches that fulfills a completeness relation on the level of concepts. Our method starts from general linear subspaces as concepts and does neither require reinforcing concept interpretability nor re-training of model parts. We propose sparse subspace clustering to discover improved concepts and fully leverage the potential of multi-dimensional subspaces. MCD offers two complementary analysis tools for concepts in input space: (1) concept activation maps, that show where a concept is expressed within a sample, allowing for concept characterization through prototypical samples, and (2) concept relevance heatmaps, that decompose the model decision into concept contributions. Both tools together enable a detailed understanding of the model reasoning, which is guaranteed to relate to the model via a completeness relation. This paves the way towards more trustworthy concept-based XAI. We empirically demonstrate the superiority of MCD against more constrained concept definitions.  ( 2 min )
    Convolutional neural networks for valid and efficient causal inference. (arXiv:2301.11732v1 [stat.ML])
    Convolutional neural networks (CNN) have been successful in machine learning applications. Their success relies on their ability to consider space invariant local features. We consider the use of CNN to fit nuisance models in semiparametric estimation of the average causal effect of a treatment. In this setting, nuisance models are functions of pre-treatment covariates that need to be controlled for. In an application where we want to estimate the effect of early retirement on a health outcome, we propose to use CNN to control for time-structured covariates. Thus, CNN is used when fitting nuisance models explaining the treatment and the outcome. These fits are then combined into an augmented inverse probability weighting estimator yielding efficient and uniformly valid inference. Theoretically, we contribute by providing rates of convergence for CNN equipped with the rectified linear unit activation function and compare it to an existing result for feedforward neural networks. We also show when those rates guarantee uniformly valid inference. A Monte Carlo study is provided where the performance of the proposed estimator is evaluated and compared with other strategies. Finally, we give results on a study of the effect of early retirement on hospitalization using data covering the whole Swedish population.  ( 2 min )
    MACE: Higher Order Equivariant Message Passing Neural Networks for Fast and Accurate Force Fields. (arXiv:2206.07697v2 [stat.ML] UPDATED)
    Creating fast and accurate force fields is a long-standing challenge in computational chemistry and materials science. Recently, several equivariant message passing neural networks (MPNNs) have been shown to outperform models built using other approaches in terms of accuracy. However, most MPNNs suffer from high computational cost and poor scalability. We propose that these limitations arise because MPNNs only pass two-body messages leading to a direct relationship between the number of layers and the expressivity of the network. In this work, we introduce MACE, a new equivariant MPNN model that uses higher body order messages. In particular, we show that using four-body messages reduces the required number of message passing iterations to just two, resulting in a fast and highly parallelizable model, reaching or exceeding state-of-the-art accuracy on the rMD17, 3BPA, and AcAc benchmark tasks. We also demonstrate that using higher order messages leads to an improved steepness of the learning curves.  ( 2 min )
    Distributionally Robust Multi-objective Bayesian Optimization under Uncertain Environments. (arXiv:2301.11588v1 [stat.ML])
    In this study, we address the problem of optimizing multi-output black-box functions under uncertain environments. We formulate this problem as the estimation of the uncertain Pareto-frontier (PF) of a multi-output Bayesian surrogate model with two types of variables: design variables and environmental variables. We consider this problem within the context of Bayesian optimization (BO) under uncertain environments, where the design variables are controllable, whereas the environmental variables are assumed to be random and not controllable. The challenge of this problem is to robustly estimate the PF when the distribution of the environmental variables is unknown, that is, to estimate the PF when the environmental variables are generated from the worst possible distribution. We propose a method for solving the BO problem by appropriately incorporating the uncertainties of the environmental variables and their probability distribution. We demonstrate that the proposed method can find an arbitrarily accurate PF with high probability in a finite number of iterations. We also evaluate the performance of the proposed method through numerical experiments.  ( 2 min )
    Distilling Importance Sampling for Likelihood Free Inference. (arXiv:1910.03632v6 [stat.CO] UPDATED)
    Likelihood-free inference involves inferring parameter values given observed data and a simulator model. The simulator is computer code which takes parameters, performs stochastic calculations, and outputs simulated data. In this work, we view the simulator as a function whose inputs are (1) the parameters and (2) a vector of pseudo-random draws. We attempt to infer all these inputs conditional on the observations. This is challenging as the resulting posterior can be high dimensional and involve strong dependence. We approximate the posterior using normalizing flows, a flexible parametric family of densities. Training data is generated by likelihood-free importance sampling with a large bandwidth value epsilon, which makes the target similar to the prior. The training data is "distilled" by using it to train an updated normalizing flow. The process is iterated, using the updated flow as the importance sampling proposal, and slowly reducing epsilon so the target becomes closer to the posterior. Unlike most other likelihood-free methods, we avoid the need to reduce data to low dimensional summary statistics, and hence can achieve more accurate results. We illustrate our method in two challenging examples, on queuing and epidemiology.  ( 2 min )
    Multilayer hypergraph clustering using the aggregate similarity matrix. (arXiv:2301.11657v1 [math.ST])
    We consider the community recovery problem on a multilayer variant of the hypergraph stochastic block model (HSBM). Each layer is associated with an independent realization of a d-uniform HSBM on N vertices. Given the aggregated number of hyperedges incident to each pair of vertices, represented using a similarity matrix, the goal is to obtain a partition of the N vertices into disjoint communities. In this work, we investigate a semidefinite programming (SDP) approach and obtain information-theoretic conditions on the model parameters that guarantee exact recovery both in the assortative and the disassortative cases.  ( 2 min )
    Machine Learning Approach and Extreme Value Theory to Correlated Stochastic Time Series with Application to Tree Ring Data. (arXiv:2301.11488v1 [stat.ML])
    The main goal of machine learning (ML) is to study and improve mathematical models which can be trained with data provided by the environment to infer the future and to make decisions without necessarily having complete knowledge of all influencing elements. In this work, we describe how ML can be a powerful tool in studying climate modeling. Tree ring growth was used as an implementation in different aspects, for example, studying the history of buildings and environment. By growing and via the time, a new layer of wood to beneath its bark by the tree. After years of growing, time series can be applied via a sequence of tree ring widths. The purpose of this paper is to use ML algorithms and Extreme Value Theory in order to analyse a set of tree ring widths data from nine trees growing in Nottinghamshire. Initially, we start by exploring the data through a variety of descriptive statistical approaches. Transforming data is important at this stage to find out any problem in modelling algorithm. We then use algorithm tuning and ensemble methods to improve the k-nearest neighbors (KNN) algorithm. A comparison between the developed method in this study ad other methods are applied. Also, extreme value of the dataset will be more investigated. The results of the analysis study show that the ML algorithms in the Random Forest method would give accurate results in the analysis of tree ring widths data from nine trees growing in Nottinghamshire with the lowest Root Mean Square Error value. Also, we notice that as the assumed ARMA model parameters increased, the probability of selecting the true model also increased. In terms of the Extreme Value Theory, the Weibull distribution would be a good choice to model tree ring data.  ( 2 min )
    LegendreTron: Uprising Proper Multiclass Loss Learning. (arXiv:2301.11695v1 [stat.ML])
    Loss functions serve as the foundation of supervised learning and are often chosen prior to model development. To avoid potentially ad hoc choices of losses, statistical decision theory describes a desirable property for losses known as \emph{properness}, which asserts that Bayes' rule is optimal. Recent works have sought to \emph{learn losses} and models jointly. Existing methods do this by fitting an inverse canonical link function which monotonically maps $\mathbb{R}$ to $[0,1]$ to estimate probabilities for binary problems. In this paper, we extend monotonicity to maps between $\mathbb{R}^{C-1}$ and the projected probability simplex $\tilde{\Delta}^{C-1}$ by using monotonicity of gradients of convex functions. We present {\sc LegendreTron} as a novel and practical method that jointly learns \emph{proper canonical losses} and probabilities for multiclass problems. Tested on a benchmark of domains with up to 1,000 classes, our experimental results show that our method consistently outperforms the natural multiclass baseline under a $t$-test at 99% significance on all datasets with greater than 10 classes.  ( 2 min )
    Rigid body flows for sampling molecular crystal structures. (arXiv:2301.11355v1 [cs.LG])
    Normalizing flows (NF) are a class of powerful generative models that have gained popularity in recent years due to their ability to model complex distributions with high flexibility and expressiveness. In this work, we introduce a new type of normalizing flow that is tailored for modeling positions and orientations of multiple objects in three-dimensional space, such as molecules in a crystal. Our approach is based on two key ideas: first, we define smooth and expressive flows on the group of unit quaternions, which allows us to capture the continuous rotational motion of rigid bodies; second, we use the double cover property of unit quaternions to define a proper density on the rotation group. This ensures that our model can be trained using standard likelihood-based methods or variational inference with respect to a thermodynamic target density. We evaluate the method by training Boltzmann generators for two molecular examples, namely the multi-modal density of a tetrahedral system in an external field and the ice XI phase in the TIP4P-Ew water model. Our flows can be combined with flows operating on the internal degrees of freedom of molecules, and constitute an important step towards the modeling of distributions of many interacting molecules.  ( 2 min )
    Optimally-Weighted Estimators of the Maximum Mean Discrepancy for Likelihood-Free Inference. (arXiv:2301.11674v1 [stat.ME])
    Likelihood-free inference methods typically make use of a distance between simulated and real data. A common example is the maximum mean discrepancy (MMD), which has previously been used for approximate Bayesian computation, minimum distance estimation, generalised Bayesian inference, and within the nonparametric learning framework. The MMD is commonly estimated at a root-$m$ rate, where $m$ is the number of simulated samples. This can lead to significant computational challenges since a large $m$ is required to obtain an accurate estimate, which is crucial for parameter estimation. In this paper, we propose a novel estimator for the MMD with significantly improved sample complexity. The estimator is particularly well suited for computationally expensive smooth simulators with low- to mid-dimensional inputs. This claim is supported through both theoretical results and an extensive simulation study on benchmark simulators.  ( 2 min )
    Neural networks learn to magnify areas near decision boundaries. (arXiv:2301.11375v1 [cs.LG])
    We study how training molds the Riemannian geometry induced by neural network feature maps. At infinite width, neural networks with random parameters induce highly symmetric metrics on input space. Feature learning in networks trained to perform classification tasks magnifies local areas along decision boundaries. These changes are consistent with previously proposed geometric approaches for hand-tuning of kernel methods to improve generalization.  ( 2 min )
    DBGDGM: Dynamic Brain Graph Deep Generative Model. (arXiv:2301.11408v1 [cs.LG])
    Graphs are a natural representation of brain activity derived from functional magnetic imaging (fMRI) data. It is well known that clusters of anatomical brain regions, known as functional connectivity networks (FCNs), encode temporal relationships which can serve as useful biomarkers for understanding brain function and dysfunction. Previous works, however, ignore the temporal dynamics of the brain and focus on static graphs. In this paper, we propose a dynamic brain graph deep generative model (DBGDGM) which simultaneously clusters brain regions into temporally evolving communities and learns dynamic unsupervised node embeddings. Specifically, DBGDGM represents brain graph nodes as embeddings sampled from a distribution over communities that evolve over time. We parameterise this community distribution using neural networks that learn from subject and node embeddings as well as past community assignments. Experiments demonstrate DBGDGM outperforms baselines in graph generation, dynamic link prediction, and is comparable for graph classification. Finally, an analysis of the learnt community distributions reveals overlap with known FCNs reported in neuroscience literature.  ( 2 min )
    Estimating Causal Effects using a Multi-task Deep Ensemble. (arXiv:2301.11351v1 [cs.LG])
    Over the past few decades, a number of methods have been proposed for causal effect estimation, yet few have been demonstrated to be effective in handling data with complex structures, such as images. To fill this gap, we propose a Causal Multi-task Deep Ensemble (CMDE) framework to learn both shared and group-specific information from the study population and prove its equivalence to a multi-task Gaussian process (GP) with coregionalization kernel a priori. Compared to multi-task GP, CMDE efficiently handles high-dimensional and multi-modal covariates and provides pointwise uncertainty estimates of causal effects. We evaluate our method across various types of datasets and tasks and find that CMDE outperforms state-of-the-art methods on a majority of these tasks.  ( 2 min )
    Robust variance-regularized risk minimization with concomitant scaling. (arXiv:2301.11584v1 [stat.ML])
    Under losses which are potentially heavy-tailed, we consider the task of minimizing sums of the loss mean and standard deviation, without trying to accurately estimate the variance. By modifying a technique for variance-free robust mean estimation to fit our problem setting, we derive a simple learning procedure which can be easily combined with standard gradient-based solvers to be used in traditional machine learning workflows. Empirically, we verify that our proposed approach, despite its simplicity, performs as well or better than even the best-performing candidates derived from alternative criteria such as CVaR or DRO risks on a variety of datasets.  ( 2 min )

  • Open

    College classes you guys recommend, to help me build and use neural networks.
    For some context, I plan on eventually majoring in neuroscience. I am enrolling in a local state college soon. I want to get my associates out of the way, and then transferring to a college with a good neuroscience program. It seems like neural networks, are an exciting avenue for research. There's a lot more neuroscience related research being published using neural networks. What courses would you guys recommend me enrolling in? I feel like the obvious one is software development, but maybe there are some others as well. I'd appreciate any insight, thanks! submitted by /u/daddydilly694-20 [link] [comments]  ( 41 min )
  • Open

    Weekly China AI News: China's Master Plan for Robots; Robots Transform into Liquid to Escape Jail; Li Auto's Goal to Become AI Leader by 2030
    submitted by /u/trcytony [link] [comments]  ( 40 min )
    A Video Made using CHATGPT + MidJourney and Text to Speech
    Hey All! I wrote this story and would appreciate your support on my channel and reviews. You can also give me stories for the channel. Thank You for your time :) https://youtu.be/-aM7cSbFFFY The sky was a deep shade of black as the group of friends made their way to the cabin, the only light coming from the beams of their headlights piercing through the heavy rain. The cabin, located deep in the woods, was said to be haunted, but the friends brushed off the rumors as nothing more than tales meant to scare. As they settled in for the night, they laughed and joked, telling ghost stories and watching horror movies. They passed jokes about how they would run if they saw a ghost, unaware of the danger lurking outside. But their laughter was cut short when Dave stumbled across a letter in the…  ( 46 min )
    AI Dream 150 - MINDBLOW MONDAY - AI Video - FINAL MASTERPIECE
    submitted by /u/LordPewPew777 [link] [comments]  ( 40 min )
    The Year of AI Breakthroughs 2022
    submitted by /u/pmz [link] [comments]  ( 40 min )
    What media format will ChatGPT and AI bring back that was previously obsolete?
    I'v been thinking a lot about Marshall McLuhan and his 4 laws of media. Specifically, the one that states that all new forms of media cause something to be retrieved from the past. What will ChatGPT and AI revive and retrieve? I put some more thoughts in my blog. Would love to hear your thoughts on it. https://bobhutchins.substack.com/p/what-media-format-will-chatgpt-and submitted by /u/Interesting_Status64 [link] [comments]  ( 41 min )
    Easy-To-Use Voice Models for Fun Round of Trivia
    I'm putting together a trivia night, and one idea I have for a round is "The Pen*s Game" where I replace a word from a famous movie quote with the word "pen*s". I could simply replace it with my own voice, but I don't think it would sound as good/be as funny, so I think it would be amazing to instead use a tool (presumably AI-powered) to achieve this. Is there a reasonably user-friendly way of doing this?Ideally it would be some magical tool that has a nice GUI and celebrity voices already trained, though I doubt this exists... I am well-versed in Python, and am always looking to learn new tools, but I don't have the time to scrape voice data for each actor and train a model (not that I even know where I'd start on that front). As an example, imagine this clip from Shrek except "parfait" is replaced with "pen*s". https://youtu.be/-FtCTW2rVFM?t=86 I don't know this threads rules, so I censor "pen*s" just in case. Any suggestions would be appreciated! Thank you. submitted by /u/wendeborn8 [link] [comments]  ( 41 min )
    AI APIs Implementation - Stack and recommendations
    Hi everyone ! I had this weird idea lately, with the rise of AI to the mainstream, to implement some kind of "Simulator", where one could input a webcam picture, some basic info through a form, & get an alternate small narrated video narrated of one's future. I'm looking for suggestions for open source AI APIs I could use for it. My idea, very simplistically put, is to combine them through input, outputs and some basic logic manipulation. I'm also looking for recommendations for the stack I could use. I've worked with NodeJS and Java w/ Spring Boot on backend, & Angular for frontend. Any recommendations based in the APIs? submitted by /u/WhereIsBryan [link] [comments]  ( 41 min )
    RecolorNeRF is like a basic Photoshop for NeRFs
    submitted by /u/Number_5_alive [link] [comments]  ( 40 min )
    Google's AI Tool "MusicLM" creates Music based on text descriptions
    submitted by /u/qptbook [link] [comments]  ( 40 min )
    Which AI chat apps can have phone conversations with me?
    Hi. I tried to voice chat with Kuki on Telegram but she still doesn't have this ability. Replika has voice calls. Do you know other AI apps that can already speak using voice? submitted by /u/Trainer_Red99 [link] [comments]  ( 40 min )
    You against the machine: Can you spot which art was created by A.I.?
    submitted by /u/robbinpetertopaypaul [link] [comments]  ( 41 min )
    📌[Searchcolab] "8K Nature" Links in comment
    submitted by /u/Maleficent_Suit1591 [link] [comments]  ( 40 min )
    With all of this heated arguing over AI, It's time for a more realistic, balanced analysis. As a filmmaker/writer and tech entrepreneur, here's a perspective that hardly anyone is considering and perhaps one that could help bridge the gap between the haters and lovers of AI.
    submitted by /u/CyborgWriter [link] [comments]  ( 41 min )
    📌[Searchcolab] Generative AI is climbing the *Dimensional Ladder*. I made a figure to show the milestones!
    1D: MusicLM, VALL-E 2D: Stable Diffusion, DALL-E, MidJourney 3D (or 2+1D): Imagen-video, Phenaki 3D: Magic3D, DreamFusion, Point-E 4D (or 3+1D): Make-A-Video-3D [Searchcolab] What’s next? 🤔 https://reddit.com/link/10p6vw9/video/gqbnrsaxh7fa1/player submitted by /u/Maleficent_Suit1591 [link] [comments]  ( 41 min )
    Is there a AI that you can feed books into, so it learns, then communicates using that info?
    Similar to chatgpt, etc. but it doesn't have all the knowledge, but just from a very special field. E.g. if you would like to create an AI for kids, you feed 1000 kids books into it, and it would respond exclusively in a child-friendly or goofy way. It's simply not possible that it responds in any other way, because it doesn't know about the rest of the world. submitted by /u/EndlessSenseless [link] [comments]  ( 41 min )
    AI Has Successfully Imitated Human Evolution—and Might Do It Even Better
    submitted by /u/Itchy0101 [link] [comments]  ( 40 min )
    Google AI music samples/copywrite problems.
    https://ainewsbase.com/google-musiclm-copyright-issues-not-releasing/ The samples they do show might just sound weird because of the stored file or whatever but the sound definitely sounds kinda weird. submitted by /u/SPEEDYFISHY2000 [link] [comments]  ( 40 min )
    3 Books About Artificial Intelligence Everyone Should Read
    I spend about 5 hours a day reading about AI and the news is a lot to keep up with - however there's a lot of big picture AI information that is a good foundation for understanding the future implications of AI. I think these three books offer a good variety of perspective from the philosophical, business, and strategy side of AI. Philosophy: Superintelligence by Nick Bostrom - This S-tier book explores the potential risks and benefits of creating artificial intelligence beyond human levels. The book argues that if the development of superintelligence proceeds rapidly, it could pose an existential threat to humanity. Bostrom paints an incredibly articulate perspective on why we must consider the possibility of creating AI in a way that aligns with human values, and that we must prepare …  ( 44 min )
    Report Says AI Could Potentially Replace 85 Million Jobs Worldwide By 2025 — Are Interns On The List?
    submitted by /u/Mental_Character7367 [link] [comments]  ( 41 min )
    New Artificial Intelligence Subreddit?
    I suggest the need for a new subreddit, where people can announce their creations. You know: "I've created an AI to make s'mores..." "I've created an AI to sort my comic books..." "I've created an AI to find me a girlfriend..." ...ad nauseum. submitted by /u/PredictorX1 [link] [comments]  ( 40 min )
    AI and construction industry
    Hello guys, Have you found any application of AI in the construction industry (notwithstanding the design/modelization)? I've been following the industry for ages and I believe there is so much to be done particularly with AI as it can manage dependancies.. I've quite a few ideas too; get in touch if you are interested to discuss over this :) submitted by /u/MexsEU [link] [comments]  ( 41 min )
    “Nothing Forever”, — AI-generated, always streaming parody of ‘90s sitcoms
    submitted by /u/tinylobsta [link] [comments]  ( 41 min )
    ChatGPT Surpasses Instagram With 10 Million Daily Users In Just 40 Days
    submitted by /u/liquidocelotYT [link] [comments]  ( 42 min )
    Vector animals bundle
    submitted by /u/annal201 [link] [comments]  ( 40 min )
    AI is becoming a commodity
    submitted by /u/_utisz_ [link] [comments]  ( 40 min )
  • Open

    [D] Is the YoloR paper worth looking into?
    Doing a survey of object detection papers with plausible application to pose-estimation tasks. Came across the paper "You Only Learn One Representation" and, while the theory seems interesting, I want to hear people's opinions before doing a deep dive into the theory. submitted by /u/answersareallyouneed [link] [comments]  ( 42 min )
    [D] Towards A Token-Free Future In NLP
    https://peltarion.com/blog/data-science/towards-a-token-free-future-in-nlp submitted by /u/EducationalCicada [link] [comments]  ( 42 min )
    [P] I launched “CatchGPT”, a supervised model trained with millions of text examples, to detect GPT created content
    I’m an ML Engineer at Hive AI and I’ve been working on a ChatGPT Detector. Here is a free demo we have up: https://hivemoderation.com/ai-generated-content-detection From our benchmarks it’s significantly better than similar solutions like GPTZero and OpenAI’s GPT2 Output Detector. On our internal datasets, we’re seeing balanced accuracies of >99% for our own model compared to around 60% for GPTZero and 84% for OpenAI’s GPT2 Detector. Feel free to try it out and let us know if you have any feedback! submitted by /u/qthai912 [link] [comments]  ( 56 min )
    [D] Are there neural net plugins to assist audio editing of Youtube screencasts?
    In order to improve my talking skills, I am doing a little series on how to setup Stable Diffusion on Paperspace, and I am astounded how much time it takes to do the audio editing. Well, part of the reason is that I've only been doing this for 3 days and my process is very inefficient, but it feels that in the current time, neural nets should be able to do things like remove uhms, lip smacking and breath intakes. I've looked around, and this post from 9 years ago says the only choice is to edit it by hand. Is that still true? submitted by /u/abstractcontrol [link] [comments]  ( 43 min )
    [D] What's stopping you from working on speech and voice?
    I've been working in the speech and voice space for a while now and am now building out some tooling in the space to make it easier for researchers/engineers/developers to build speech processing systems and features; I'd love to hear what people in ML struggle with when you're trying to build or work with speech processing for your projects/products (beyond speech-to-text APIs) submitted by /u/jiamengial [link] [comments]  ( 47 min )
    [D] DL university research PC suggestions?
    I am a researcher at a US university and have a budget of 25k to build a PC for training various ML algorithms (e.g. DRL, neuromorphic computing, VAE, etc). I'm trying to decide between going for prebuilds (like https://lambdalabs.com/gpu-workstations/vector) or building with consumer cards like 4090s. Any advice on which is the most bang for the price? Im not sure how much Im giving up by going for consumer 24g cards vs a6000, 6000 ada but prebuild prices go up quick. Warrantee vs building it myself isn't an issue submitted by /u/seanrescs [link] [comments]  ( 44 min )
    [R] Parsel: A (De-)compositional Framework for Algorithmic Reasoning with Language Models - Stanford University Eric Zelikman et al - Beats prior code generation sota by over 75%!
    Paper: https://arxiv.org/abs/2212.10561 Github: https://github.com/ezelikman/parsel Twitter: https://twitter.com/ericzelikman/status/1618426056163356675?s=20 Website: https://zelikman.me/parselpaper/ Code Generation on APPS Leaderboard: https://paperswithcode.com/sota/code-generation-on-apps Abstract: Despite recent success in large language model (LLM) reasoning, LLMs struggle with hierarchical multi-step reasoning tasks like generating complex programs. For these tasks, humans often start with a high-level algorithmic design and implement each part gradually. We introduce Parsel, a framework enabling automatic implementation and validation of complex algorithms with code LLMs, taking hierarchical function descriptions in natural language as input. We show that Parsel can be …  ( 45 min )
    [P] Keras model production deployment
    Hi guys. It's been some time since I started developing my Keras models, but now is the first time I am trying to push it to production. My Keras model looks like this: model = Sequential() model.add(Bidirectional(LSTM(256, return_sequences=True))) model.add(Bidirectional(LSTM(256, return_sequences=True))) model.add(TimeDistributed(Dense(1, activation='sigmoid'))) model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy']) My problem is I need to run through about 25 of these for every written sentence. There is going to be an online editor, where users can paste text for my analysis. That means up to about 300 words or about 20 sentences at once. With the current time to run each network (about 0.2s), that means 25 * 0,2 * 20 or about 100s per user input. I am going for 30 seconds at most with potentially dozens of users at once. Ideally on a Raspberry Pi 4. The internet is surely gonna back me up I thought to myself and started googling. If only I know what kind of a rabbit hole I was about to fall into. First I converted my Keras model into a TensorFlow frozen graph model. 10x time improvement on CPU, but still at 0.2s on average. Another thing I think may boost the performance is retraining the models for variable input shape (currently I always feed in 50 values). With the average sentence size of 16 words this may, from what I understand, lead to a 3 times boost? My question is: now what? What can I do to make it faster? Is it even possible to run it on a Raspberry Pi 4 and get reasonable response times? If not, what is my best option on a tight budget? submitted by /u/ProfessionalOverall8 [link] [comments]  ( 45 min )
    [D] I want to understand the broad steps for building something like Adept.AI
    From the given link!, I gather that it is a large-scale Transformer trained to use digital tools like a web browser. Right now, it’s hooked up to a Chrome extension which allows it to observe what’s happening in the browser and take certain actions, like clicking, typing, and scrolling, etc. I am interested in knowing the broad steps involved in building something like this. submitted by /u/smred123 [link] [comments]  ( 43 min )
    [Discussion] ChatGPT and language understanding benchmarks
    The general consensus seems to be that large language models, and ChatGPT in particular, have a problem with accuracy and hallucination. As compared to what, is often unclear, but let's say as compared to other NLP methods of question answering, language understanding or as compared to Google Search. I haven't really been able to find any reliable sources documenting this accuracy problem, though. The SuperGLUE benchmark has GPT-3 ranked #24, not terrible, but outperformed by old models like T5, which seems odd. GLUE nothing. SQUAD nothing. So, I'm curious: Is there any benchmark or metric reflecting the seeming step-function made by ChatGPT that's got everyone so excited? I definitely feel like there's a difference between gpt-3 and chatGPT, but is it measurable or is it just vibes? Is there any metric showing ChatGPT's problem with fact hallucination and accuracy? Am I off the mark here looking at question-answering benchmarks as an assessment of LLMs? Thanks submitted by /u/mettle [link] [comments]  ( 46 min )
    [D]Are There Studies on text-davinci-003's Zero/Few-shot Performance on Various Academic Benchmarks?
    Has anyone come across studies on GPT3 text-davinci-003's zero/few-shot performance over various NLP benchmarks and how they compare to current SoTA? E.g GLUE, SuperGLUE and over more classic ones like CoNLL 2003 NER. I thought it would be pretty interesting to see how far zero/few-shot learning with LLM has progressed with RLHF and instruction tuning. Am surprised that nobody has done such a benchmark yet. submitted by /u/gamerx88 [link] [comments]  ( 42 min )
    [D] Sparse Ridge Regression
    Hi all! Given X ∈ ℝ Nx, Y ∈ ℝ Ny, β ∈ ℝ+, so W = YXT(XXT+βI)-1 (with the Moore–Penrose pseudoinverse) where A = YXT and B = XXT+βI. If we consider an arbitrary number of indices/units < Nx, and so we consider only some columns of matrix A and some columns and rows (crosses) of B. The rest of A and B are zeros. The approach above of sparsify A and B will break the ridge regression solution when W=AB-1? If yes, there are ways to avoid it? Many thanks! submitted by /u/antodima [link] [comments]  ( 43 min )
    [R] A Robust Hypothesis Test for Tree Ensemble Pruning
    I'm looking for help/feedback with this paper. Please let me know if the method is interesting and if there's ways to improve it! https://arxiv.org/abs/2301.10115 Abstract: Gradient boosted decision trees are some of the most popular algorithms in applied machine learning. They are a flexible and powerful tool that can robustly fit to any tabular dataset in a scalable and computationally efficient way. One of the most critical parameters to tune when fitting these models are the various penalty terms used to distinguish signal from noise in the current model. These penalties are effective in practice, but are lacking in robust theoretical justifications. In this paper we develop and present a novel theoretically justified hypothesis test of split quality for gradient boosted tree ensembles and demonstrate that using this method instead of the common penalty terms leads to a significant reduction in out of sample loss. Additionally, this method provides a theoretically well-justified stopping condition for the tree growing algorithm. We also present several innovative extensions to the method, opening the door for a wide variety of novel tree pruning algorithms. submitted by /u/asi_dm [link] [comments]  ( 43 min )
    [R] Train CIFAR10 in under 10 seconds on an A100 (new world record!)
    https://github.com/tysam-code/hlb-CIFAR10 submitted by /u/tysam_and_co [link] [comments]  ( 53 min )
  • Open

    Top Android App Development Trends in 2023
    The global mobile app development revenue is $526 billion, making it one of the most flourishing industries worldwide. Android controls 73% of the market share. So, if you have plans to build an Android app, there’s no better time. However, to ensure your Android app stands out, you must keep an eye on the latest… Read More »Top Android App Development Trends in 2023 The post Top Android App Development Trends in 2023 appeared first on Data Science Central.  ( 22 min )
    Enabling contextual computing in today’s enterprise information fabrics
    During the 1970s, Ethernet pioneer and 3Com Internet equipment company founder Bob Metcalfe was working on something called the “Data Reconfiguration Service” for the early Internet. “It was an effort to write a special purpose programming language to convert data formats, Metcalfe said during a 2021 OriginTrail.io panel session. “And the goal was so that… Read More »Enabling contextual computing in today’s enterprise information fabrics The post Enabling contextual computing in today’s enterprise information fabrics appeared first on Data Science Central.  ( 21 min )
  • Open

    Amazon SageMaker built-in LightGBM now offers distributed training using Dask
    Amazon SageMaker provides a suite of built-in algorithms, pre-trained models, and pre-built solution templates to help data scientists and machine learning (ML) practitioners get started on training and deploying ML models quickly. You can use these algorithms and models for both supervised and unsupervised learning. They can process various types of input data, including tabular, […]  ( 12 min )
    Build a water consumption forecasting solution for a water utility agency using Amazon Forecast
    Amazon Forecast is a fully managed service that uses machine learning (ML) to generate highly accurate forecasts, without requiring any prior ML experience. Forecast is applicable in a wide variety of use cases, including estimating supply and demand for inventory management, travel demand forecasting, workforce planning, and computing cloud infrastructure usage. You can use Forecast […]  ( 10 min )
  • Open

    Good autocomplete
    I’m not sure whether automatic text completion on a mobile device is a net good. It sometimes saves a few taps, but it seems like it’s at least as likely to cause extra work. Although I’m ambivalent about autocomplete on my phone, I like it in my text editor. The difference is that in my […] Good autocomplete first appeared on John D. Cook.  ( 6 min )
  • Open

    Best recurrent RL library?
    Does anyone know which library has the best support for recurrent RL algorithms? It seems like many implement recurrent PPO and maybe one other recurrent algorithm and that's it. Most implementations treat recurrence as an afterthought, making it buggy and hard to extend. I'd like a library with first-class recurrent support for: DQN/Rainbow SAC PPO etc. Is anyone familiar with such a library? submitted by /u/smorad [link] [comments]  ( 42 min )
    Generative Meta-Learning for Robust Quality-Diversity Ensemble under Stochastic Rewards
    Gen-Meta is a learning-to-learn method for evolutionary illumination that is competitive against SotA methods in Nevergrad, with a much superior scalability for large-scale optimization. The key to out-of-sample robustness in portfolio optimization is quality-diversity optimization, where one aims to obtain multiple diverse solutions of high quality, rather than one. Generative meta-learning is the only portfolio optimization method that performs QD optimization to obtain a robust ensemble portfolio consisting of several de-correlated sub-portfolios. In the below image, the red line is the index to be tracked, and the blue line is the sparse portfolio ensembled from a thousand behaviorally-diverse sub-portfolios co-optimized (other lines). ​ Red Line: Tracked Index, Blue Line: Sparse Ensemble, Others: Diverse Subportfolios In Gen-Meta portfolio optimization, a Monte-Carlo optimization is performed over those portfolio candidates to reward each individual separately in randomly selected historical periods. To further optimize the portfolio robustness, the portfolio weights of the candidates are heavily corrupted first by adding noise and then dropping out the vast majority of their weights. I previously open-sourced the application of Gen-Meta in sparse index-tracking. Hence, I invite you to do your ablation study to see how each technique affects the out-of-sample robustness. The following repository includes comments on those critical techniques performed to obtain a robust ensemble from behaviorally-diverse high-quality portfolios co-optimized with Gen-Meta. The codes for Gen-Meta in sparse index-tracking The comparison in-between Gen-Meta & Nevergrad submitted by /u/k_yuksel [link] [comments]  ( 42 min )

  • Open

    We built a browser extension that unlocks browser mode capabilities using ChatGPT: MULTI·ON: AI Web Co-Pilot powered by ChatGPT
    submitted by /u/DragonLord9 [link] [comments]  ( 41 min )
    AI Dream 150 - MY HEAVENLY DREAM BY AI - Part6 TEASER - AI Video vqgan ...
    submitted by /u/LordPewPew777 [link] [comments]  ( 40 min )
    Just published my new edu-tainment video on A.I. Tried to make it funny but also drop some mind-blasting facts!
    Things the video covers: What is intelligence? What is A.I.? What is the best currently available and what are the benefits? How does it work? What are the downsides? The increasing speed of human technological advancement Why A.I. actually terrifies me! (Some scenarios) I hope you enjoy it! submitted by /u/casualbob_uk [link] [comments]  ( 41 min )
    AI will drastically reduce developer jobs.
    Farmers still exist today but they exist in drastically fewer numbers than two centuries ago. The modern farming machinery and techniques did not replace farmers but made the industry much less labor intensive. Nowadays programming is a labor intensive activity with relative high salaries. AI is introducing the possibility to do this activity, that worldwide cost companies billions of dollars in programmers salaries, much more efficiently. In my opinion, this is the goal of companies like OpenAI. They know that they can’t remove humans out of the loop because current AI is not able to substitute all human cognitive capabilities that intervine in a software developer daily job; like talking with the client, figuring out what he wants and translating it to functional requirements. But nonetheless, they think they have a clear shot to make programming a non labor intensive activity like farming is today. Of course, this is a compelling multibillion business opportunity that is attracting increasing capital from the tech and the financial sectors. submitted by /u/masterile [link] [comments]  ( 43 min )
    Found this list of AI tools. It was nice discovering some ai video editing tools i have not heard before.
    submitted by /u/lshic [link] [comments]  ( 40 min )
    Elon Musk Say AI Will be Able to Simulate Conciousness, for me thats very difficult to happen but idk
    https://youtu.be/Y6gXZ61NnOE submitted by /u/sigmabruuh [link] [comments]  ( 42 min )
    Best image generators for graphic Novel?
    I'm trying to see if I can use an image generator to illustrate a story I am writing. The thing is the story itself is action-packed and dramatic so I thought I could illustrate it instead as a graphic novel generated by AI but I am worried about the consistency an quality of the images. Also, I can't really use passages in the story as a prompt because the generators don't seem to illustrate the scene well. Any suggestions? submitted by /u/swagonflyyyy [link] [comments]  ( 41 min )
    📌[Searchcolab] Site to share and find Colab Notebooks, CKPT and SafeTensors
    submitted by /u/Maleficent_Suit1591 [link] [comments]  ( 40 min )
    What ChatGPT Could Mean for the Metaverse
    submitted by /u/Zurevu [link] [comments]  ( 40 min )
    AI (GPT) where you can ask data questions in English and automatically generate the answer - as if you have your own personal automated data analyst
    submitted by /u/lfogliantis [link] [comments]  ( 46 min )
    Big Tech was moving cautiously on AI. Then came ChatGPT.
    submitted by /u/nikko_fan [link] [comments]  ( 40 min )
    AI Dream 150 - COSMIC STRUCTURES Part5 TEASER - AI Video vqgan clip
    submitted by /u/LordPewPew777 [link] [comments]  ( 40 min )
    📌[Searchcolab] Disco Diffusion v5.6. Link in comments
    submitted by /u/Maleficent_Suit1591 [link] [comments]  ( 40 min )
    Using Ai to CLONE my Voice!
    https://www.youtube.com/watch?v=4RUQ_vZ-og0 This Ai can instantly clone your voice and sound exactly like you! It's both amazing and terrifying. The Ai is called ElevenLabs - it has features like text to speech and voice cloning! submitted by /u/Peter3tv33 [link] [comments]  ( 41 min )
    Google AI can create music in any genre from a text description
    submitted by /u/Mental_Character7367 [link] [comments]  ( 40 min )
    Knight Rider game. Midjourney, ChatGPT, Figma.
    submitted by /u/sidianmsjones [link] [comments]  ( 40 min )
    AI generated music video
    submitted by /u/aquin1313 [link] [comments]  ( 40 min )
    What if Patrick Bateman was a Data Scientist? An AI-Generated Video
    submitted by /u/SupPandaHugger [link] [comments]  ( 40 min )
  • Open

    Why did the original ResNet paper not use dropout?
    The ResNet paper by Kaiming He et al. does not use dropout for the models. A lot of models prior to ResNets, such as AlexNet and VGGNet gained from using dropout. Why did the authors choose not to use dropout for ResNets ? Is it because they use L2 regularization(weight decay) and batch normalization which are forms of regularization which can substitute dropout regularization ? submitted by /u/V1bicycle [link] [comments]  ( 41 min )
    Answer this please
    so, I have been learning what DL is and how NN learns to do stuff. From what I understand is the repeated iteration will take random weights and at some point those weights will be kinda perfect for the given task (plz correct me if i'm wrong) Ok, so lets take an example of a task like path finding AI, so we make a NN and train it to go from point A to point B, now it is trained and doing nice and goes to point b perfectly, SO here the weights are set to go from point A to point B right? What if we give the point B somewhere else, How will the AI get perfect weights as the current weights are only perfect for current point B What if we put an obstacle in between point A and B, how will the NN set weights, or is it something like a range of weights which are perfect for any given task for NN ​ IDK if I explained it right, plz comment if you have question about my question, and answer also💕 submitted by /u/Severe-Improvement32 [link] [comments]  ( 42 min )
  • Open

    [R] Industrial Case Study of GNNs with PyTorch Geometric for Document Understanding
    submitted by /u/how-it-is- [link] [comments]  ( 42 min )
    [R] Incorrect Ranking of Vessel Segmentation Algorithms
    In a recent article, we reviewed dozens of image segmentation algorithms and pointed out mathematically that in many cases the reported performance scores could not be the results of the evaluation methods claimed by the authors. The scores are primary indicators of value and serve as measures of the state-of-the-art to be outperformed by new algorithms. Unfortunately, algorithm rankings turned out to be incorrect in many of 100 papers and the problem is systematic. The pressure to outperform flawed performance scores to get published keeps the trend on-going. How should the community deal with a phenomenon like this: flaws uncovered, factual, undeniable yet on-going? Is the 258th algorithm proposed for a problem more valuable than reproducing a highly cited article? Should it be mandatory to share source code? Is there a merit in developing consistency checks like the ones we did? Any comments are welcome! https://arxiv.org/abs/2111.03853 submitted by /u/AttilaFazekas [link] [comments]  ( 43 min )
    [D] Remote PhD
    Hi all, During the pandemic many software companies transitioned their workforce to "fully-remote" or "partially-remote"; therefore, I was wondering if any reputable institutions offer a remote CS PhD? For context, I know of several individuals who have sorted out remote work with their PIs on a per-person basis (typically after the first 1-2 years of study), but I am not aware of any labs or programs that advertise remote study. Thank you in advance for the responses. Cheers, Matt submitted by /u/TheRealMrMatt [link] [comments]  ( 45 min )
    [N][R] Compiling and running GLM-130B on a local machine (4x 3090s, int4 quantization) - Author: Alex J. Champandard
    Twitter link to his post: https://twitter.com/alexjc/status/1617152800571416577?s=46&t=CMQT9rK4F1Lt7g7aX2vTJA also important in that regard: The case for 4-bit precision: k-bit Inference Scaling Laws - Tim Dettmers Paper: https://arxiv.org/abs/2212.09720 https://preview.redd.it/7nn0pfhn81fa1.jpg?width=585&format=pjpg&auto=webp&s=2d05998c32fb1eacf56c45e830047381d544f51f https://preview.redd.it/0084vhhn81fa1.jpg?width=598&format=pjpg&auto=webp&s=c9512275714964faa312e8fb2d96ab8ded7dd127 submitted by /u/Singularian2501 [link] [comments]  ( 42 min )
    [P] AI Content Detector
    Have you tried ChatGPT? It's super cool but some users are also using it to create automated content submissions and resulting in an increase in AI-generated plagiarism. I have made a tool as a college project to detect content generated using AI. Go ahead and validate your content on AI Content Detector If you are an educator worried about automated content submissions or developers worried about search engine penalties, this tool will help everyone to efficiently detect content generated using AI. submitted by /u/YoutubeStruggle [link] [comments]  ( 44 min )
    [D] what is roughly the cost of human-annotation vs compute to adapt a LLM?
    Let's say I pull a pre-trained LLM off of huggingface. In broad strokes (making whatever assumptions appropriate), what is the relative cost of getting human annotation data versus actually incorporating those data in through training? I've been trying to get this stats and so far the ratio seems to be 2:1, meaning if you spent 10k dollars collecting human annotations, you should expect to spend 5k on compute (finetune, RLHF, ect) but I'd be happy if someone with more experience can chime in. submitted by /u/evanthebouncy [link] [comments]  ( 42 min )
    [D] AI Theory - Signal Processing?
    On This page of Meta AI research where they mention AI theory as a topic, they mention that they use techniques from Signal Processing. As someone with an Electrical Engineering background, and interests in Mathematics and AI, I found this very intriguing. Can someone tell me some of the ways signal processing has been used in AI theory? Some papers or some work done? submitted by /u/a_khalid1999 [link] [comments]  ( 47 min )
    [D] Simple Questions Thread
    Please post your questions here instead of creating a new thread. Encourage others who create new posts for questions to post here instead! Thread will stay alive until next one so keep posting after the date in the title. Thanks to everyone for answering questions in the previous thread! submitted by /u/AutoModerator [link] [comments]  ( 43 min )
    [P] Automating a Youtube Shorts channel with Huggingface Transformers and After Effects
    I’ll try to get into detail about the implementation and difficulties in case it is useful for anyone else trying to do something similar with an applied ML project, so there’s a TLDR at the end if you’d like the short version/result. At the end of last year I convinced myself to start 2023 by creating a side-project that I'd actually finish and deploy and perhaps earn some “passive” income (spoiler, not so passive after all :P), and after some brainstorming I settled on making an automated Youtube channel about finance news since I had just gotten into investing. Shorts seemed to be more manageable and monetization is changing in February so I went with that. My rough initial idea was to get online articles, summarize them, make a basic compilation with some combination of pymovie, open…  ( 47 min )
    [D] GPT-Index vs Langchain
    Someone I work with wrote the below for our internal team (shared with permission) and I thought some here may find it helpful. Recently, I built an app that uses GPT-Index & LangChain to provide an answer to a question based on a piece of text as context. I found GPT-Index to be much easier and straightforward to integrate, but it seems like LangChain has more features and is more powerful. Here's my experience integrating both of them. GPT-Index First thing I did was review their docs to make sure I understood what GPT-Index was, what it could do, and how I was going to use it I went back and forth a couple times figuring out how I was going to use it. Then I found the quickstart guide It seemed like the quickstart guide would work so I followed the guide and after a few tries, …  ( 45 min )
    [P] Targeted Summarization - A tool for information extraction
    submitted by /u/helliun [link] [comments]  ( 44 min )
    [D] How do people keep up with ML news that is not NLP related?
    Lately, NLP is taking up most of the public space, much of AI news is focused on LLM after Chat-GPT took the spotlight. How do non-NLP people keep up with news? I recently saw a post on reddit where tree models are still being improved. There are other topics too, like the recent trend in Model Explainability which feels to have slowed down. I'd guess this all gets into the more categorical questions which I am wrapping up with 'How do YOU get your ML news'? How does information gathering differ between those in Applied ML and AI researchers (or even further, between those in Business Analytics and those in more 'AI' fields) What sort of interesting things are out there in the world of ML now? (model or non-model related) In looking for Use Cases, does this partially come down to your field? (Finance reads finance news, pharma reads pharma news) ​ Many of the AI/ML Newsletters which I subscribed to when I was less experienced seemed to be full of variety, but as they are all converging to NLP recently maybe it is time to cleanse the subscriptions, or find some new resources. submitted by /u/shaner92 [link] [comments]  ( 44 min )
    [R] InstructPix2Pix: Learning to Follow Image Editing Instructions
    submitted by /u/Illustrious_Row_9971 [link] [comments]  ( 45 min )
    [P] We Built an ML Search Engine that can find exact timestamps for anything on Youtube using OpenAI Whisper and UKPLab's SBERT Sentence Transformers
    submitted by /u/tomiwa1a [link] [comments]  ( 44 min )
    [Discussion] Stable Diffusion Models with Subject/Keyword References
    I was looking for research that centers around Stable Diffusion models but can be trained with seed images of a specific subject, so that if someone refers to a keyword like "Me" or "I" it would then produce images relative to the keyword of interest. Something like "I am swimming in a beautiful ocean with mountains in the background and wearing a speedo". Then the person subject in the photo would be myself since I referenced "I". submitted by /u/sambrojangles [link] [comments]  ( 43 min )
  • Open

    Small-scale automation
    Saving keystrokes is overrated, but maintaining concentration is underrated. This post is going to look at automating small tasks in order to maintain concentration, not to save time. If a script lets you easily carry out some ancillary task without taking your concentration off your main task, that’s a big win. Maybe the script only […] Small-scale automation first appeared on John D. Cook.  ( 5 min )
  • Open

    how to do when the simulation fails
    Hi, When my agent succeeds to achieve the goal, it receives done = True. But, when it fails in the simulation, for example, it crashes onto the obstacle, should I end the episode and set the done to "True" or is it okay if I just give it to penalty like reward -= 5 at the specific steps? Thanks submitted by /u/sonlightinn [link] [comments]  ( 42 min )

  • Open

    InstructPix2Pix Video: One Vase, Many Artists
    submitted by /u/anitakirkovska [link] [comments]  ( 40 min )
    Make-a-Video3D: Meta generates 3D scenes from text
    submitted by /u/henlo_there_fren [link] [comments]  ( 40 min )
    AI Characters Become Self-Aware Ft. MrBeast - (Funny Moments)
    submitted by /u/SnowDustHD [link] [comments]  ( 40 min )
    AI Dream 150 - INTERSTELLAR PORTAL Part4 TEASER - AI Video vqgan clip
    submitted by /u/LordPewPew777 [link] [comments]  ( 40 min )
    AI will also help you find a job: job offers related to this area grow by 31%
    submitted by /u/nikko_fan [link] [comments]  ( 40 min )
    What are some AI models out there that are not owned by large businesses that require my personal info to use?
    submitted by /u/temporaryAMA [link] [comments]  ( 40 min )
    Explain Cinderella as if you are Donald Trump
    submitted by /u/Imagine-your-success [link] [comments]  ( 41 min )
    Should I be learning data analysis now if my goal to become an AI engineer/researcher?
    Hello, I'm currently self-studying math with the goal of learning up to liner-algebra and then returning to school for a BS in mathematics followed by a graduate and then post-graduate degree in a subjecting relating to AI research. Given this goal I was wondering if I should also be self-studying data analysis along with the math. I reason that this might be good since I'd be getting hands-on experience with data and I might even be able to do it as a job while I study. In my math studies I'm currently on Algebra 2 so if I should be learning data analysis, can I start learning it now or would it be better to start after I've learned more advanced math like calculus? submitted by /u/multioutletplug [link] [comments]  ( 42 min )
    Hello guys. Could you please share experience with learning and working with Neural networks? I have already done with some of the tutorials, but want to know which language is best suitable to create AI (Python, Java).
    submitted by /u/Your_mag [link] [comments]  ( 42 min )
    Google's MusicLM is Astoundingly Good at Making AI-Generated Music, But They're Not Releasing it Due to Copyright Concerns
    submitted by /u/HODLTID [link] [comments]  ( 45 min )
    Research
    Hello everyone, I am currently doing a college research on artificial intelligence and ethics, looking to specify in the field of Bias and fairness in AI. I am looking for someone with some knowledge in the field to conduct a brief interview with in which I ask a couple of questions related to the topic. If you are interested, kindly @ me. Thanks :) submitted by /u/AyanT_01 [link] [comments]  ( 40 min )
    ChatGPT is not all that good
    I get inaccurate information form ChatGPT when I ask questions in areas where I have a good bit of knowledge. I couple of times it has cited specific books or papers by title and author and, as far as I can tell, the books/papers don't exists. The second time it did this, I ask it why it makes this kind of odd mistakes and it said: "As a language model, my information is based on the text that I have been trained on, and I try to provide the most accurate and relevant information based on that data. However, I am not infallible and sometimes make mistakes or provide information that is inaccurate or out-of-date. I apologize for any inaccuracies in my previous responses, and for any confusion that they may have caused. It's important to know that the information provided by AI models like me should be taken as a starting point for further research and not as the final answer. It's always a good idea to consult with experts or verify information with multiple sources. Again, I apologize for any inaccuracies in my previous statements and any confusion they may have caused." But it is still weird that it would make up specific books and papers, I don't see why a language model would do that. Edit: Maybe the text that it is trained on references specific books and paper that don't exist. And I guess that would be text that are not indexed by Google and other places where I am doing my fact checking such that I cannot figure out where it is getting this stuff. submitted by /u/facinabush [link] [comments]  ( 46 min )
    Made a website to teach you how to make passive/active income using AI!
    I recently created a website called https://cashwithai.com that is dedicated to helping people learn how to make money using AI like ChatGPT. The website offers a variety of resources, including a QuickStart guide, case studies, and tips and tricks for monetizing AI-generated content. Additionally, I'm offering free 1-on-1 consultations to anyone who is looking for personalized advice and guidance on how to make money with AI. I'm not running ads or charging; I run purely off donations. Let me know if you have any questions! submitted by /u/Chadcash [link] [comments]  ( 41 min )
    New Incredible Text-To-Music Generation Model By Google
    submitted by /u/SupPandaHugger [link] [comments]  ( 40 min )
    What is the information content in a logic statement?
    I recently had a lot of fun working with my talented peers at IBM Research putting together a point of view to this fascinating question. Wanted to share with you our work and to get feedback from you. ibm.biz/logic_and_information submitted by /u/Consistent_Listen127 [link] [comments]  ( 40 min )
    AI will not replace marketers, but marketers who use AI will replace those who don’t.
    AI will not replace marketers, but marketers who use AI will replace those who don’t. submitted by /u/TheVellerShow [link] [comments]  ( 40 min )
    OpenAI Hired Developers to Train its AI to Replace Developers
    submitted by /u/lambolifeofficial [link] [comments]  ( 40 min )
    META presents MAV3D — text to 3D video
    submitted by /u/SpatialComputing [link] [comments]  ( 24 min )
    image to voice ai stable image to voice
    submitted by /u/VNKT-FOREVER [link] [comments]  ( 40 min )
    Real world applications of AI
    There has been many advancements in especially in CV, NLP etc. But, I don't see many new AI models that are solving real world problems. There has been a lot of advancements in AI in RL, DL etc but I rarely see new applications in real world. All i see are text to image models, advanced chat bots, better game playing AIs etc. These are definitely amazing, but, I wanna see new stuff which is making a good impact on the real world. Alpha-fold, driverless cars etc are sorta things I am looking for. I don't know if I am just bluntly unaware of new stuff in AI which solves practical problems, or whether not much new stuff is happening in those areas. So, I would be glad if I can know more about how AI is being used in new ways to solve real-wprld problems, or any new AI research trying to tackle a real world problem? Sry if it's a stupid question, and sry for spamming "real world" too much. submitted by /u/Unhappy_Version7565 [link] [comments]  ( 44 min )
    Want to catch up on the breakthroughs in AI the last 10 years. What should I read?
    Hi guys, I have a technical background and have studied some AI in the past. I'd like to catch up on the latest developments in AI over the last ~10 years. Do you have any recommendations on what to read? I was thinking of maybe trying to get the top most cited papers in the last 10 years and reading them. But I'm not sure where would be the best place to find that. Any suggestions? Thanks in advance. submitted by /u/AlexWD [link] [comments]  ( 41 min )
    spooky season 💅
    submitted by /u/Rich_Dragon17 [link] [comments]  ( 40 min )
    BuzzFeed Plans to 'Hire' ChatGPT as Newest Content Creator
    submitted by /u/lambolifeofficial [link] [comments]  ( 40 min )
  • Open

    [P] tiny-diffusion: a minimal PyTorch implementation of probabilistic diffusion models for 2D datasets
    submitted by /u/tanelai [link] [comments]  ( 43 min )
    [D] Running a small Bloom checkpoint on a mini pc.
    Has anyone been able to run a small LLM checkpoint in commodity hardware, like a mini PC? If so, what were your specs? submitted by /u/onedjscream [link] [comments]  ( 42 min )
    [D] could multiple-input transformers reduce the pain of the training data acquisition problem?
    so it's big pop-sci news that we are running out of quality textual training data (soft-paywalled article, but you get the idea) to produce chinchilla-optimal language models, and they appear to continue learning new abilities as data and parameter size increase. when an infant learns what a cat is, it is not only described, but the infant can see it and understand its form and behavior in a way it can then go on to describe and extrapolate from (even if they are blind, they can touch it and understand its shape and feel its fur). LLMs have to do this the hard way: their generalized understanding of the shape and behaviour of a cat comes from textual descriptions of them (and they would need quite a lot in order to understand!) most of the research i have seen into multiple input transformer models has been with the purpose of task completion (google's embodied language model robot butlers etc, which often use textual descriptions fed to a normal LLM, see https://innermonologue.github.io/ ) or image recognition and understanding (such as in CLIP) but not necessarily applying it to textual completion, which seems like it could benefit from a more visual understanding of the world so, in the medium or short term, to improve performance on text-completion tasks, what are your thoughts on using image training as well as textual to improve generalization for LLMs with fewer text tokens on a new architecture? (also, please excuse any ignorance i may posess: i'm a bit of an armchair ai enjoyer) submitted by /u/Dankmemexplorer [link] [comments]  ( 44 min )
    GAN Training Gradient questions [D]
    The main reason why this is not in the simple questions thread is the need to includee an image. Here is an image of my generator's and discriminator's gradients being logged onto wandb. As you can see, they have these weird hasms, where the distribution of the gradients becomes very close to zero. These chasms are correlated for the discriminator and generator and seem very regular. Anyone experienced anything like this and maybe has a hunch of what might be the cause? https://preview.redd.it/3qipyez6stea1.png?width=624&format=png&auto=webp&s=f5fb39803082add2fd78979e880e6e784b9e4c0c submitted by /u/Hub_Pli [link] [comments]  ( 43 min )
    [D] Goodness of fit question
    For regressions, R-squared and Adj. R-Squared are typically considered the primary goodness-of-fit measures. But in many supervised machine-learning models, RMSE is the main measure that I keep running across. For example, decision tree models that I create in R using Rpart do that. So, my question is how to compare the predictive accuracy of OLS regression models that report R-sq to equivalent Rpart regression trees that report RMSE. submitted by /u/bridgeton_man [link] [comments]  ( 43 min )
    [N] OpenAI has 1000s of contractors to fine-tune codex
    submitted by /u/yazriel0 [link] [comments]  ( 45 min )
    [P] GPT JupyterLab - JupyterLab extension to use OpenAI’s GPT models for code and text completion on your notebook cells.
    Hi everyone, I made a JupyterLab extension to use OpenAI’s GPT models for code and text completion on your notebook cells. This extension passes your current notebook cell to the GPT API and completes your code/text for you. You can customize the GPT parameters in the Advanced Settings menu. I made this extension when I couldn't find any Copilot/Codex extensions for JupyterLab. It doesn't make sense that ML folks don't have an easy way to use AI generated code in their own tools. VS Code does allow you use Copilot, but I've gotten used to Jupyter and a lot of ML/DS folks I know still prefer using Jupyter over VS code. Installation pip install gpt_jupyterlab GitHub Repo: https://github.com/henshinger/gpt-jupyterlab/ Demo GPT JupyterLab Demo Note: You will need your own OpenAI API Key to use this extension. Would love to get your feedback! submitted by /u/henshinger [link] [comments]  ( 44 min )
    [P] Launching my first ever open-source project and it might make your ChatGPT answers better
    I am building an open-source ML observability and refinement toolkit which recently got investment from YCombinator. The tool helps ML practitioners to: 1. Understand how their models are performing in production 2. Catch edge-cases and outliers to help them refine their models 3. Allow them to customise the tool according to their needs (hence, open-source) 4. Bring data-security at the forefront (hence, self hosted) You can check out the project https://github.com/uptrain-ai/uptrain and would love to hear feedback from the community submitted by /u/Vegetable-Skill-9700 [link] [comments]  ( 43 min )
    [P] Parse research papers into structured data
    ​ https://preview.redd.it/1t7spoqxdsea1.png?width=1920&format=png&auto=webp&s=b4643e418d942260b16019ee250edf56a4336b4b paperai is a semantic search and workflow application for medical/scientific papers. It can be used to take a set of research papers, parse the content and turn it into structured data. paperetl is the underlying library that parses basic metadata such as title, publication and date out of the papers. https://preview.redd.it/5jywynarfsea1.png?width=1084&format=png&auto=webp&s=437b6cdb65fbf12d6d238f84fd94ec4b85dec93b In addition to standard metadata, paperai can also run extractive queries to build additional columns. https://preview.redd.it/8rzfj016esea1.png?width=1138&format=png&auto=webp&s=ccd73c34f0f3001dfe05a0f8c480cf73451dd447 Example notebooks and Docker files can be found on GitHub. paperai | paperetl submitted by /u/davidmezzetti [link] [comments]  ( 43 min )
    [P] Annotating text with sparse human annotations and different length chunks
    I want to automate the annotation of a domain-specific text (complicated contracts) by finetuning a BERT model. However the annotated text I've been provided by the domain experts has been sparsely annotated (i.e. paragraph 40-48 has been fully annotated, while 15 other paragraphs only have certain classes annotated for certain words. Most paragraphs have nothing annotated (like 70% of the entire corpus) Another complication is that for 1 class, the entire paragraph should be annotated in this class, while for others it's a single word or a sentence. There are 7 classes in total and in the end, all tokens should be annotated to one of the 7 classes. 6 out of 7 classes are also pretty domain-specific and not something like 'location' or 'person' or POS. I've been thinking about using an annotation framework (i.e. LabelStudio or Prodigy) that supports custom models (i.e. finetuned BERT) and active learning to rapidly increase annotated texts by speeding up the work of human annotators (which are domain experts and usually don't have a lot of time for this). However, it's pretty unclear whether my use case would be support by this, especially with the issue of sparse text. I've also considered making the problem easier by using another model for the specific class that applies to an entire paragraph. And/or by using the fully annotated paragraphs for finetuning and using the sparse paragraphs for validation. A final consideration is using GPT-3, but I'm not sure how/if it is able to classify entire sentences/paragraphs with multiple classes and how the prompt should be formatted as. Any suggestions/ideas? submitted by /u/Background_Claim7907 [link] [comments]  ( 44 min )
    [P] Framework agnostic python package for running RWKV, RNN based models.
    https://pypi.org/project/rwkvstic/ Currently supports tensorflow, pytorch, jax Also has support for tensor streaming, 8bit jit-quant and multi-gpu. Run RWKV 7B on 8GB of vram or 14B on 16GB of vram. submitted by /u/hazardous1222 [link] [comments]  ( 42 min )
    [R] META presents MAV3D — text to 3D video
    submitted by /u/SpatialComputing [link] [comments]  ( 45 min )
    [D] Could forward-forward learning enable training large models with distributed computing?
    One problem with distributed learning with backprop is that the first layer can't update their weights until the computation has travelled all the way down to the last layer and then backpropagated back up. If all your layers are on different machines connected by a high-latency internet connection, this will take a long time. In forward-forward learning, learning is local - each layer operates independently and only needs to communicate with the layers above and below it. The results are almost-but-not-quite as good as backprop. But each layer can immediately update their weights based only on the information they received from the previous layer. Network latency no longer matters; the limit is just the bandwidth of the slowest machine. submitted by /u/currentscurrents [link] [comments]  ( 46 min )
  • Open

    Laptop Recommendations for RL
    I am looking to buy a laptop for my rl projects and I wanted to know what people in the industry recommended for training models locally and how significant OS, CPU and GPUs really are. submitted by /u/PleasantBase6967 [link] [comments]  ( 43 min )
    The value of RL feedback on language models: "[Character.ai] engagement rose by more than 30 percent." --Noam Shazeer
    submitted by /u/gwern [link] [comments]  ( 40 min )
  • Open

    Optimal machine to run 350M language model on?
    Trying to set up a discord bot. submitted by /u/Ah_Books [link] [comments]  ( 40 min )
  • Open

    Don't overfit the history -- Recursive time series data augmentation. (arXiv:2207.02891v2 [cs.LG] UPDATED)
    Time series observations can be seen as realizations of an underlying dynamical system governed by rules that we typically do not know. For time series learning tasks, we need to understand that we fit our model on available data, which is a unique realized history. Training on a single realization often induces severe overfitting lacking generalization. To address this issue, we introduce a general recursive framework for time series augmentation, which we call Recursive Interpolation Method, denoted as RIM. New samples are generated using a recursive interpolation function of all previous values in such a way that the enhanced samples preserve the original inherent time series dynamics. We perform theoretical analysis to characterize the proposed RIM and to guarantee its test performance. We apply RIM to diverse real world time series cases to achieve strong performance over non-augmented data on regression, classification, and reinforcement learning tasks.  ( 2 min )
    Interaction Decompositions for Tensor Network Regression. (arXiv:2208.06029v2 [cs.LG] UPDATED)
    It is well known that tensor network regression models operate on an exponentially large feature space, but questions remain as to how effectively they are able to utilize this space. Using a polynomial featurization, we propose the interaction decomposition as a tool that can assess the relative importance of different regressors as a function of their polynomial degree. We apply this decomposition to tensor ring and tree tensor network models trained on the MNIST and Fashion MNIST datasets, and find that up to 75% of interaction degrees are contributing meaningfully to these models. We also introduce a new type of tensor network model that is explicitly trained on only a small subset of interaction degrees, and find that these models are able to match or even outperform the full models using only a fraction of the exponential feature space. This suggests that standard tensor network models utilize their polynomial regressors in an inefficient manner, with the lower degree terms being vastly under-utilized.  ( 2 min )
    MusicLM: Generating Music From Text. (arXiv:2301.11325v1 [cs.SD])
    We introduce MusicLM, a model generating high-fidelity music from text descriptions such as "a calming violin melody backed by a distorted guitar riff". MusicLM casts the process of conditional music generation as a hierarchical sequence-to-sequence modeling task, and it generates music at 24 kHz that remains consistent over several minutes. Our experiments show that MusicLM outperforms previous systems both in audio quality and adherence to the text description. Moreover, we demonstrate that MusicLM can be conditioned on both text and a melody in that it can transform whistled and hummed melodies according to the style described in a text caption. To support future research, we publicly release MusicCaps, a dataset composed of 5.5k music-text pairs, with rich text descriptions provided by human experts.  ( 2 min )
    f-divergences and their applications in lossy compression and bounding generalization error. (arXiv:2206.11042v3 [cs.IT] UPDATED)
    In this paper, we provide three applications for $f$-divergences: (i) we introduce Sanov's upper bound on the tail probability of the sum of independent random variables based on super-modular $f$-divergence and show that our generalized Sanov's bound strictly improves over ordinary one, (ii) we consider the lossy compression problem which studies the set of achievable rates for a given distortion and code length. We extend the rate-distortion function using mutual $f$-information and provide new and strictly better bounds on achievable rates in the finite blocklength regime using super-modular $f$-divergences, and (iii) we provide a connection between the generalization error of algorithms with bounded input/output mutual $f$-information and a generalized rate-distortion problem. This connection allows us to bound the generalization error of learning algorithms using lower bounds on the $f$-rate-distortion function. Our bound is based on a new lower bound on the rate-distortion function that (for some examples) strictly improves over previously best-known bounds.  ( 2 min )
    Overcoming the Pitfalls of Prediction Error in Operator Learning for Bilevel Planning. (arXiv:2208.07737v2 [cs.AI] UPDATED)
    Bilevel planning, in which a high-level search over an abstraction of an environment is used to guide low-level decision-making, is an effective approach to solving long-horizon tasks in continuous state and action spaces. Recent work has shown how to enable such bilevel planning by learning action and transition model abstractions in the form of symbolic operators and neural samplers. In this work, we show that existing symbolic operator learning approaches fall short in many natural environments where agent actions tend to cause a large number of irrelevant propositions to change. This is primarily because they attempt to learn operators that optimize the prediction error with respect to observed changes in the propositions. To overcome this issue, we propose to learn operators that only model changes necessary for abstract planning to achieve the specified goal. Experimentally, we show that our approach learns operators that lead to efficient planning across 10 different hybrid robotics domains, including 4 from the challenging BEHAVIOR-100 benchmark, with generalization to novel initial states, goals, and objects.  ( 2 min )
    Efficient Aggregated Kernel Tests using Incomplete $U$-statistics. (arXiv:2206.09194v3 [stat.ML] UPDATED)
    We propose a series of computationally efficient nonparametric tests for the two-sample, independence, and goodness-of-fit problems, using the Maximum Mean Discrepancy (MMD), Hilbert Schmidt Independence Criterion (HSIC), and Kernel Stein Discrepancy (KSD), respectively. Our test statistics are incomplete $U$-statistics, with a computational cost that interpolates between linear time in the number of samples, and quadratic time, as associated with classical $U$-statistic tests. The three proposed tests aggregate over several kernel bandwidths to detect departures from the null on various scales: we call the resulting tests MMDAggInc, HSICAggInc and KSDAggInc. This procedure provides a solution to the fundamental kernel selection problem as we can aggregate a large number of kernels with several bandwidths without incurring a significant loss of test power. For the test thresholds, we derive a quantile bound for wild bootstrapped incomplete $U$-statistics, which is of independent interest. We derive non-asymptotic uniform separation rates for MMDAggInc and HSICAggInc, and quantify exactly the trade-off between computational efficiency and the attainable rates: this result is novel for tests based on incomplete $U$-statistics, to our knowledge. We further show that in the quadratic-time case, the wild bootstrap incurs no penalty to test power over the more widespread permutation-based approach, since both attain the same minimax optimal rates (which in turn match the rates that use oracle quantiles). We support our claims with numerical experiments on the trade-off between computational efficiency and test power. In all three testing frameworks, the linear-time versions of our proposed tests perform at least as well as the current linear-time state-of-the-art tests.  ( 2 min )
    BridgeTower: Building Bridges Between Encoders in Vision-Language Representation Learning. (arXiv:2206.08657v3 [cs.CV] UPDATED)
    Vision-Language (VL) models with the Two-Tower architecture have dominated visual-language representation learning in recent years. Current VL models either use lightweight uni-modal encoders and learn to extract, align and fuse both modalities simultaneously in a deep cross-modal encoder, or feed the last-layer uni-modal representations from the deep pre-trained uni-modal encoders into the top cross-modal encoder. Both approaches potentially restrict vision-language representation learning and limit model performance. In this paper, we propose Bridge-Tower, which introduces multiple bridge layers that build a connection between the top layers of uni-modal encoders and each layer of the cross-modal encoder. This enables effective bottom-up cross-modal alignment and fusion between visual and textual representations of different semantic levels of pre-trained uni-modal encoders in the cross-modal encoder. Pre-trained with only 4M images, Bridge-Tower achieves state-of-the-art performance on various downstream vision-language tasks. In particular, on the VQAv2 test-std set, Bridge-Tower achieves an accuracy of 78.73%, outperforming the previous state-of-the-art model METER by 1.09% with the same pre-training data and almost negligible additional parameters and computational costs. Notably, when further scaling the model, Bridge-Tower achieves an accuracy of 81.15%, surpassing models that are pre-trained on orders-of-magnitude larger datasets. Code and checkpoints are available at \url{https://github.com/microsoft/BridgeTower}.  ( 2 min )
    Ontology-enhanced Prompt-tuning for Few-shot Learning. (arXiv:2201.11332v1 [cs.CL] CROSS LISTED)
    Few-shot Learning (FSL) is aimed to make predictions based on a limited number of samples. Structured data such as knowledge graphs and ontology libraries has been leveraged to benefit the few-shot setting in various tasks. However, the priors adopted by the existing methods suffer from challenging knowledge missing, knowledge noise, and knowledge heterogeneity, which hinder the performance for few-shot learning. In this study, we explore knowledge injection for FSL with pre-trained language models and propose ontology-enhanced prompt-tuning (OntoPrompt). Specifically, we develop the ontology transformation based on the external knowledge graph to address the knowledge missing issue, which fulfills and converts structure knowledge to text. We further introduce span-sensitive knowledge injection via a visible matrix to select informative knowledge to handle the knowledge noise issue. To bridge the gap between knowledge and text, we propose a collective training algorithm to optimize representations jointly. We evaluate our proposed OntoPrompt in three tasks, including relation extraction, event extraction, and knowledge graph completion, with eight datasets. Experimental results demonstrate that our approach can obtain better few-shot performance than baselines.  ( 2 min )
    Predicting Wind-Driven Spatial Deposition through Simulated Color Images using Deep Autoencoders. (arXiv:2202.01762v3 [cs.LG] UPDATED)
    For centuries, scientists have observed nature to understand the laws that govern the physical world. The traditional process of turning observations into physical understanding is slow. Imperfect models are constructed and tested to explain relationships in data. Powerful new algorithms can enable computers to learn physics by observing images and videos. Inspired by this idea, instead of training machine learning models using physical quantities, we used images, that is, pixel information. For this work, and as a proof of concept, the physics of interest are wind-driven spatial patterns. These phenomena include features in Aeolian dunes and volcanic ash deposition, wildfire smoke, and air pollution plumes. We use computer model simulations of spatial deposition patterns to approximate images from a hypothetical imaging device whose outputs are red, green, and blue (RGB) color images with channel values ranging from 0 to 255. In this paper, we explore deep convolutional neural network-based autoencoders to exploit relationships in wind-driven spatial patterns, which commonly occur in geosciences, and reduce their dimensionality. Reducing the data dimension size with an encoder enables training deep, fully connected neural network models linking geographic and meteorological scalar input quantities to the encoded space. Once this is achieved, full spatial patterns are reconstructed using the decoder. We demonstrate this approach on images of spatial deposition from a pollution source, where the encoder compresses the dimensionality to 0.02% of the original size, and the full predictive model performance on test data achieves a normalized root mean squared error of 8%, a figure of merit in space of 94% and a precision-recall area under the curve of 0.93.  ( 3 min )
    Parsel: A (De-)compositional Framework for Algorithmic Reasoning with Language Models. (arXiv:2212.10561v2 [cs.CL] UPDATED)
    Despite recent success in large language model (LLM) reasoning, LLMs struggle with hierarchical multi-step reasoning tasks like generating complex programs. For these tasks, humans often start with a high-level algorithmic design and implement each part gradually. We introduce Parsel, a framework enabling automatic implementation and validation of complex algorithms with code LLMs, taking hierarchical function descriptions in natural language as input. We show that Parsel can be used across domains requiring hierarchical reasoning, including program synthesis, robotic planning, and theorem proving. We show that LLMs generating Parsel solve more competition-level problems in the APPS dataset, resulting in pass rates that are over 75% higher than prior results from directly sampling AlphaCode and Codex, while often using a smaller sample budget. We also find that LLM-generated robotic plans using Parsel as an intermediate language are more than twice as likely to be considered accurate than directly generated plans. Lastly, we explore how Parsel addresses LLM limitations and discuss how Parsel may be useful for human programmers.  ( 2 min )
    Extending Adversarial Attacks to Produce Adversarial Class Probability Distributions. (arXiv:2004.06383v3 [cs.LG] UPDATED)
    Despite the remarkable performance and generalization levels of deep learning models in a wide range of artificial intelligence tasks, it has been demonstrated that these models can be easily fooled by the addition of imperceptible yet malicious perturbations to natural inputs. These altered inputs are known in the literature as adversarial examples. In this paper, we propose a novel probabilistic framework to generalize and extend adversarial attacks in order to produce a desired probability distribution for the classes when we apply the attack method to a large number of inputs. This novel attack paradigm provides the adversary with greater control over the target model, thereby exposing, in a wide range of scenarios, threats against deep learning models that cannot be conducted by the conventional paradigms. We introduce four different strategies to efficiently generate such attacks, and illustrate our approach by extending multiple adversarial attack algorithms. We also experimentally validate our approach for the spoken command classification task and the Tweet emotion classification task, two exemplary machine learning problems in the audio and text domain, respectively. Our results demonstrate that we can closely approximate any probability distribution for the classes while maintaining a high fooling rate and even prevent the attacks from being detected by label-shift detection methods.  ( 2 min )
    Adaptive Gradient Methods with Local Guarantees. (arXiv:2203.01400v3 [cs.LG] UPDATED)
    Adaptive gradient methods are the method of choice for optimization in machine learning and used to train the largest deep models. In this paper we study the problem of learning a local preconditioner, that can change as the data is changing along the optimization trajectory. We propose an adaptive gradient method that has provable adaptive regret guarantees vs. the best local preconditioner. To derive this guarantee, we prove a new adaptive regret bound in online learning that improves upon previous adaptive online learning methods. We demonstrate the robustness of our method in automatically choosing the optimal learning rate schedule for popular benchmarking tasks in vision and language domains. Without the need to manually tune a learning rate schedule, our method can, in a single run, achieve comparable and stable task accuracy as a fine-tuned optimizer.  ( 2 min )
    Flowification: Everything is a Normalizing Flow. (arXiv:2205.15209v3 [cs.LG] UPDATED)
    The two key characteristics of a normalizing flow is that it is invertible (in particular, dimension preserving) and that it monitors the amount by which it changes the likelihood of data points as samples are propagated along the network. Recently, multiple generalizations of normalizing flows have been introduced that relax these two conditions. On the other hand, neural networks only perform a forward pass on the input, there is neither a notion of an inverse of a neural network nor is there one of its likelihood contribution. In this paper we argue that certain neural network architectures can be enriched with a stochastic inverse pass and that their likelihood contribution can be monitored in a way that they fall under the generalized notion of a normalizing flow mentioned above. We term this enrichment flowification. We prove that neural networks only containing linear layers, convolutional layers and invertible activations such as LeakyReLU can be flowified and evaluate them in the generative setting on image datasets.  ( 2 min )
    SoftMatch: Addressing the Quantity-Quality Trade-off in Semi-supervised Learning. (arXiv:2301.10921v1 [cs.LG])
    The critical challenge of Semi-Supervised Learning (SSL) is how to effectively leverage the limited labeled data and massive unlabeled data to improve the model's generalization performance. In this paper, we first revisit the popular pseudo-labeling methods via a unified sample weighting formulation and demonstrate the inherent quantity-quality trade-off problem of pseudo-labeling with thresholding, which may prohibit learning. To this end, we propose SoftMatch to overcome the trade-off by maintaining both high quantity and high quality of pseudo-labels during training, effectively exploiting the unlabeled data. We derive a truncated Gaussian function to weight samples based on their confidence, which can be viewed as a soft version of the confidence threshold. We further enhance the utilization of weakly-learned classes by proposing a uniform alignment approach. In experiments, SoftMatch shows substantial improvements across a wide variety of benchmarks, including image, text, and imbalanced classification.  ( 2 min )
    Explaining Visual Biases as Words by Generating Captions. (arXiv:2301.11104v1 [cs.LG])
    We aim to diagnose the potential biases in image classifiers. To this end, prior works manually labeled biased attributes or visualized biased features, which need high annotation costs or are often ambiguous to interpret. Instead, we leverage two types (generative and discriminative) of pre-trained vision-language models to describe the visual bias as a word. Specifically, we propose bias-to-text (B2T), which generates captions of the mispredicted images using a pre-trained captioning model to extract the common keywords that may describe visual biases. Then, we categorize the bias type as spurious correlation or majority bias by checking if it is specific or agnostic to the class, based on the similarity of class-wise mispredicted images and the keyword upon a pre-trained vision-language joint embedding space, e.g., CLIP. We demonstrate that the proposed simple and intuitive scheme can recover well-known gender and background biases, and discover novel ones in real-world datasets. Moreover, we utilize B2T to compare the classifiers using different architectures or training methods. Finally, we show that one can obtain debiased classifiers using the B2T bias keywords and CLIP, in both zero-shot and full-shot manners, without using any human annotation on the bias.  ( 2 min )
    On the Convergence of No-Regret Learning Dynamics in Time-Varying Games. (arXiv:2301.11241v1 [cs.LG])
    Most of the literature on learning in games has focused on the restrictive setting where the underlying repeated game does not change over time. Much less is known about the convergence of no-regret learning algorithms in dynamic multiagent settings. In this paper, we characterize the convergence of \emph{optimistic gradient descent (OGD)} in time-varying games by drawing a strong connection with \emph{dynamic regret}. Our framework yields sharp convergence bounds for the equilibrium gap of OGD in zero-sum games parameterized on the \emph{minimal} first-order variation of the Nash equilibria and the second-order variation of the payoff matrices, subsuming known results for static games. Furthermore, we establish improved \emph{second-order} variation bounds under strong convexity-concavity, as long as each game is repeated multiple times. Our results also apply to time-varying \emph{general-sum} multi-player games via a bilinear formulation of correlated equilibria, which has novel implications for meta-learning and for obtaining refined variation-dependent regret bounds, addressing questions left open in prior papers. Finally, we leverage our framework to also provide new insights on dynamic regret guarantees in static games.  ( 2 min )
    Box$^2$EL: Concept and Role Box Embeddings for the Description Logic EL++. (arXiv:2301.11118v1 [cs.AI])
    Representation learning in the form of semantic embeddings has been successfully applied to a variety of tasks in natural language processing and knowledge graphs. Recently, there has been growing interest in developing similar methods for learning embeddings of entire ontologies. We propose Box$^2$EL, a novel method for representation learning of ontologies in the Description Logic EL++, which represents both concepts and roles as boxes (i.e. axis-aligned hyperrectangles), such that the logical structure of the ontology is preserved. We theoretically prove the soundness of our model and conduct an extensive empirical evaluation, in which we achieve state-of-the-art results in subsumption prediction, link prediction, and deductive reasoning. As part of our evaluation, we introduce a novel benchmark for evaluating EL++ embedding models on predicting subsumptions involving both atomic and complex concepts.  ( 2 min )
    Finding Regions of Counterfactual Explanations via Robust Optimization. (arXiv:2301.11113v1 [cs.LG])
    Counterfactual explanations play an important role in detecting bias and improving the explainability of data-driven classification models. A counterfactual explanation (CE) is a minimal perturbed data point for which the decision of the model changes. Most of the existing methods can only provide one CE, which may not be achievable for the user. In this work we derive an iterative method to calculate robust CEs, i.e. CEs that remain valid even after the features are slightly perturbed. To this end, our method provides a whole region of CEs allowing the user to choose a suitable recourse to obtain a desired outcome. We use algorithmic ideas from robust optimization and prove convergence results for the most common machine learning methods including logistic regression, decision trees, random forests, and neural networks. Our experiments show that our method can efficiently generate globally optimal robust CEs for a variety of common data sets and classification models.  ( 2 min )
    Neural Inverse Operators for Solving PDE Inverse Problems. (arXiv:2301.11167v1 [cs.LG])
    A large class of inverse problems for PDEs are only well-defined as mappings from operators to functions. Existing operator learning frameworks map functions to functions and need to be modified to learn inverse maps from data. We propose a novel architecture termed Neural Inverse Operators (NIOs) to solve these PDE inverse problems. Motivated by the underlying mathematical structure, NIO is based on a suitable composition of DeepONets and FNOs to approximate mappings from operators to functions. A variety of experiments are presented to demonstrate that NIOs significantly outperform baselines and solve PDE inverse problems robustly, accurately and are several orders of magnitude faster than existing direct and PDE-constrained optimization methods.  ( 2 min )
    On the Mathematics of Diffusion Models. (arXiv:2301.11108v1 [cs.LG])
    This paper attempts to present the stochastic differential equations of diffusion models in a manner that is accessible to a broad audience. The diffusion process is defined over a population density in R^d. Of particular interest is a population of images. In a diffusion model one first defines a diffusion process that takes a sample from the population and gradually adds noise until only noise remains. The fundamental idea is to sample from the population by a reverse-diffusion process mapping pure noise to a population sample. The diffusion process is defined independent of any ``interpretation'' but can be analyzed using the mathematics of variational auto-encoders (the ``VAE interpretation'') or the Fokker-Planck equation (the ``score-matching intgerpretation''). Both analyses yield reverse-diffusion methods involving the score function. The Fokker-Planck analysis yields a family of reverse-diffusion SDEs parameterized by any desired level of reverse-diffusion noise including zero (deterministic reverse-diffusion). The VAE analysis yields the reverse-diffusion SDE at the same noise level as the diffusion SDE. The VAE analysis also yields a useful expression for computing the population probabilities of a given point (image). This formula for the probability of a given point does not seem to follow naturally from the Fokker-Planck analysis. Much, but apparently not all, of the mathematics presented here can be found in the literature. Attributions are given at the end of the paper.  ( 2 min )
    Causal Counterfactuals for Improving the Robustness of Reinforcement Learning. (arXiv:2211.05551v2 [cs.LG] UPDATED)
    Reinforcement learning (RL) is applied in a wide variety of fields. RL enables agents to learn tasks autonomously by interacting with the environment. The more critical the tasks are, the higher the demand for the robustness of the RL systems. Causal RL combines RL and causal inference to make RL more robust. Causal RL agents use a causal representation to capture the invariant causal mechanisms that can be transferred from one task to another. Currently, there is limited research in Causal RL, and existing solutions are usually not complete or feasible for real-world applications. In this work, we propose CausalCF, the first complete Causal RL solution incorporating ideas from Causal Curiosity and CoPhy. Causal Curiosity provides an approach for using interventions, and CoPhy is modified to enable the RL agent to perform counterfactuals. We apply CausalCF to complex robotic tasks and show that it improves the RL agent's robustness using a realistic simulation environment called CausalWorld.  ( 2 min )
    New Approach to Malware Detection Using Optimized Convolutional Neural Network. (arXiv:2301.11161v1 [cs.CR])
    Cyber-crimes have become a multi-billion-dollar industry in the recent years. Most cybercrimes/attacks involve deploying some type of malware. Malware that viciously targets every industry, every sector, every enterprise and even individuals has shown its capabilities to take entire business organizations offline and cause significant financial damage in billions of dollars annually. Malware authors are constantly evolving in their attack strategies and sophistication and are developing malware that is difficult to detect and can lay dormant in the background for quite some time in order to evade security controls. Given the above argument, Traditional approaches to malware detection are no longer effective. As a result, deep learning models have become an emerging trend to detect and classify malware. This paper proposes a new convolutional deep learning neural network to accurately and effectively detect malware with high precision. This paper is different than most other papers in the literature in that it uses an expert data science approach by developing a convolutional neural network from scratch to establish a baseline of the performance model first, explores and implements an improvement model from the baseline model, and finally it evaluates the performance of the final model. The baseline model initially achieves 98% accurate rate but after increasing the depth of the CNN model, its accuracy reaches 99.183 which outperforms most of the CNN models in the literature. Finally, to further solidify the effectiveness of this CNN model, we use the improved model to make predictions on new malware samples within our dataset.  ( 2 min )
    Planning Automated Driving with Accident Experience Referencing and Common-sense Inferencing. (arXiv:2301.10892v1 [cs.RO])
    Although a typical autopilot system far surpasses humans in term of sensing accuracy, performance stability and response agility, such a system is still far behind humans in the wisdom of understanding an unfamiliar environment with creativity, adaptivity and resiliency. Current AD brains are basically expert systems featuring logical computations, which resemble the thinking flow of a left brain working at tactical level. A right brain is needed to upgrade the safety of automated driving vehicle onto next generation by making intuitive strategical judgements that can supervise the tactical action planning. In this work, we present the concept of an Automated Driving Strategical Brain (ADSB): a framework of a scene perception and scene safety evaluation system that works at a higher abstraction level, incorporating experience referencing, common-sense inferring and goal-and-value judging capabilities, to provide a contextual perspective for decision making within automated driving planning. The ADSB brain architecture is made up of the Experience Referencing Engine (ERE), the Common-sense Referencing Engine (CIE) and the Goal and Value Keeper (GVK). 1,614,748 cases from FARS/CRSS database of NHTSA in the period 1975 to 2018 are used for the training of ERE model. The kernel of CIE is a trained model, COMET-BART by ATOMIC, which can be used to provide directional advice when tactical-level environmental perception conclusions are ambiguous; it can also use future scenario models to remind tactical-level decision systems to plan ahead of a perceived hazard scene. GVK can take in any additional expert-hand-written rules that are of qualitative nature. Moreover, we believe that with good scalability, the ADSB approach provides a potential solution to the problem of long-tail corner cases encountered in the validation of a rule-based planning algorithm.  ( 2 min )
    Proximal Causal Learning of Heterogeneous Treatment Effects. (arXiv:2301.10913v1 [stat.ML])
    Efficiently and flexibly estimating treatment effect heterogeneity is an important task in a wide variety of settings ranging from medicine to marketing, and there are a considerable number of promising conditional average treatment effect estimators currently available. These, however, typically rely on the assumption that the measured covariates are enough to justify conditional exchangeability. We propose the P-learner, motivated by the R-learner, a tailored two-stage loss function for learning heterogeneous treatment effects in settings where exchangeability given observed covariates is an implausible assumption, and we wish to rely on proxy variables for causal inference. Our proposed estimator can be implemented by off-the-shelf loss-minimizing machine learning methods, which in the case of kernel regression satisfies an oracle bound on the estimated error as long as the nuisance components are estimated reasonably well.  ( 2 min )
    Efficient Hyperdimensional Computing. (arXiv:2301.10902v1 [cs.LG])
    Hyperdimensional computing (HDC) uses binary vectors of high dimensions to perform classification. Due to its simplicity and massive parallelism, HDC can be highly energy-efficient and well-suited for resource-constrained platforms. However, in trading off orthogonality with efficiency, hypervectors may use tens of thousands of dimensions. In this paper, we will examine the necessity for such high dimensions. In particular, we give a detailed theoretical analysis of the relationship among dimensions of hypervectors, accuracy, and orthogonality. The main conclusion of this study is that a much lower dimension, typically less than 100, can also achieve similar or even higher detecting accuracy compared with other state-of-the-art HDC models. Based on this insight, we propose a suite of novel techniques to build HDC models that use binary hypervectors of dimensions that are orders of magnitude smaller than those found in the state-of-the-art HDC models, yet yield equivalent or even improved accuracy and efficiency. For image classification, we achieved an HDC accuracy of 96.88\% with a dimension of only 32 on the MNIST dataset. We further explore our methods on more complex datasets like CIFAR-10 and show the limits of HDC computing.  ( 2 min )
    Incorporating Prior Knowledge into Neural Networks through an Implicit Composite Kernel. (arXiv:2205.07384v5 [cs.LG] UPDATED)
    It is challenging to guide neural network (NN) learning with prior knowledge. In contrast, many known properties, such as spatial smoothness or seasonality, are straightforward to model by choosing an appropriate kernel in a Gaussian process (GP). Many deep learning applications could be enhanced by modeling such known properties. For example, convolutional neural networks (CNNs) are frequently used in remote sensing, which is subject to strong seasonal effects. We propose to blend the strengths of deep learning and the clear modeling capabilities of GPs by using a composite kernel that combines a kernel implicitly defined by a neural network with a second kernel function chosen to model known properties (e.g., seasonality). We implement this idea by combining a deep network and an efficient mapping based on the Nystrom approximation, which we call Implicit Composite Kernel (ICK). We then adopt a sample-then-optimize approach to approximate the full GP posterior distribution. We demonstrate that ICK has superior performance and flexibility on both synthetic and real-world data sets. We believe that ICK framework can be used to include prior information into neural networks in many applications.  ( 2 min )
    Re-embedding data to strengthen recovery guarantees of clustering. (arXiv:2301.10901v1 [cs.LG])
    We propose a clustering method that involves chaining four known techniques into a pipeline yielding an algorithm with stronger recovery guarantees than any of the four components separately. Given $n$ points in $\mathbb R^d$, the first component of our pipeline, which we call leapfrog distances, is reminiscent of density-based clustering, yielding an $n\times n$ distance matrix. The leapfrog distances are then translated to new embeddings using multidimensional scaling and spectral methods, two other known techniques, yielding new embeddings of the $n$ points in $\mathbb R^{d'}$, where $d'$ satisfies $d'\ll d$ in general. Finally, sum-of-norms (SON) clustering is applied to the re-embedded points. Although the fourth step (SON clustering) can in principle be replaced by any other clustering method, our focus is on provable guarantees of recovery of underlying structure. Therefore, we establish that the re-embedding improves recovery SON clustering, since SON clustering is a well-studied method that already has provable guarantees.  ( 2 min )
    Improving Graph Generation by Restricting Graph Bandwidth. (arXiv:2301.10857v1 [cs.LG])
    Deep graph generative modeling has proven capable of learning the distribution of complex, multi-scale structures characterizing real-world graphs. However, one of the main limitations of existing methods is their large output space, which limits generation scalability and hinders accurate modeling of the underlying distribution. To overcome these limitations, we propose a novel approach that significantly reduces the output space of existing graph generative models. Specifically, starting from the observation that many real-world graphs have low graph bandwidth, we restrict graph bandwidth during training and generation. Our strategy improves both generation scalability and quality without increasing architectural complexity or reducing expressiveness. Our approach is compatible with existing graph generative methods, and we describe its application to both autoregressive and one-shot models. We extensively validate our strategy on synthetic and real datasets, including molecular graphs. Our experiments show that, in addition to improving generation efficiency, our approach consistently improves generation quality and reconstruction accuracy. The implementation is made available.  ( 2 min )
    A Practical Influence Approximation for Privacy-Preserving Data Filtering in Federated Learning. (arXiv:2205.11518v2 [cs.CR] UPDATED)
    Federated Learning by nature is susceptible to low-quality, corrupted, or even malicious data that can severely degrade the quality of the learned model. Traditional techniques for data valuation cannot be applied as the data is never revealed. We present a novel technique for filtering, and scoring data based on a practical influence approximation (`lazy' influence) that can be implemented in a privacy-preserving manner. Each participant uses his own data to evaluate the influence of another participant's batch, and reports to the center an obfuscated score using differential privacy. Our technique allows for highly effective filtering of corrupted data in a variety of applications. Importantly, we show that most of the corrupted data can be filtered out (recall of $>90\%$, and even up to $100\%$), even under really strong privacy guarantees ($\varepsilon \leq 1$).  ( 2 min )
    Qualitative Analysis of a Graph Transformer Approach to Addressing Hate Speech: Adapting to Dynamically Changing Content. (arXiv:2301.10871v1 [cs.LG])
    Our work advances an approach for predicting hate speech in social media, drawing out the critical need to consider the discussions that follow a post to successfully detect when hateful discourse may arise. Using graph transformer networks, coupled with modelling attention and BERT-level natural language processing, our approach can capture context and anticipate upcoming anti-social behaviour. In this paper, we offer a detailed qualitative analysis of this solution for hate speech detection in social networks, leading to insights into where the method has the most impressive outcomes in comparison with competitors and identifying scenarios where there are challenges to achieving ideal performance. Included is an exploration of the kinds of posts that permeate social media today, including the use of hateful images. This suggests avenues for extending our model to be more comprehensive. A key insight is that the focus on reasoning about the concept of context positions us well to be able to support multi-modal analysis of online posts. We conclude with a reflection on how the problem we are addressing relates especially well to the theme of dynamic change, a critical concern for all AI solutions for social impact. We also comment briefly on how mental health well-being can be advanced with our work, through curated content attuned to the extent of hate in posts.  ( 2 min )
    Counterfactual Analysis in Dynamic Latent State Models. (arXiv:2205.13832v2 [cs.LG] UPDATED)
    We provide an optimization-based framework to perform counterfactual analysis in a dynamic model with hidden states. Our framework is grounded in the "abduction, action, and prediction" approach to answer counterfactual queries and handles two key challenges where (1) the states are hidden and (2) the model is dynamic. Recognizing the lack of knowledge on the underlying causal mechanism and the possibility of infinitely many such mechanisms, we optimize over this space and compute upper and lower bounds on the counterfactual quantity of interest. Our work brings together ideas from causality, state-space models, simulation, and optimization, and we apply it on a breast cancer case study. To the best of our knowledge, we are the first to compute lower and upper bounds on a counterfactual query in a dynamic latent-state model.  ( 2 min )
    Assistive Recipe Editing through Critiquing. (arXiv:2205.02454v2 [cs.CL] UPDATED)
    There has recently been growing interest in the automatic generation of cooking recipes that satisfy some form of dietary restrictions, thanks in part to the availability of online recipe data. Prior studies have used pre-trained language models, or relied on small paired recipe data (e.g., a recipe paired with a similar one that satisfies a dietary constraint). However, pre-trained language models generate inconsistent or incoherent recipes, and paired datasets are not available at scale. We address these deficiencies with RecipeCrit, a hierarchical denoising auto-encoder that edits recipes given ingredient-level critiques. The model is trained for recipe completion to learn semantic relationships within recipes. Our work's main innovation is our unsupervised critiquing module that allows users to edit recipes by interacting with the predicted ingredients; the system iteratively rewrites recipes to satisfy users' feedback. Experiments on the Recipe1M recipe dataset show that our model can more effectively edit recipes compared to strong language-modeling baselines, creating recipes that satisfy user constraints and are more correct, serendipitous, coherent, and relevant as measured by human judges.  ( 2 min )
    Scale-Free Adversarial Multi-Armed Bandit with Arbitrary Feedback Delays. (arXiv:2110.13400v3 [cs.LG] UPDATED)
    We consider the Scale-Free Adversarial Multi-Armed Bandit (MAB) problem with unrestricted feedback delays. In contrast to the standard assumption that all losses are $[0,1]$-bounded, in our setting, losses can fall in a general bounded interval $[-L, L]$, unknown to the agent beforehand. Furthermore, the feedback of each arm pull can experience arbitrary delays. We propose a novel approach named Scale-Free Delayed INF (SFD-INF) for this novel setting, which combines a recent "convex combination trick" together with a novel doubling and skipping technique. We then present two instances of SFD-INF, each with carefully designed delay-adapted learning scales. The first one SFD-TINF uses $\frac 12$-Tsallis entropy regularizer and can achieve $\widetilde{\mathcal O}(\sqrt{K(D+T)}L)$ regret when the losses are non-negative, where $K$ is the number of actions, $T$ is the number of steps, and $D$ is the total feedback delay. This bound nearly matches the $\Omega((\sqrt{KT}+\sqrt{D\log K})L)$ lower-bound when regarding $K$ as a constant independent of $T$. The second one, SFD-LBINF, works for general scale-free losses and achieves a small-loss style adaptive regret bound $\widetilde{\mathcal O}(\sqrt{K\mathbb{E}[\tilde{\mathfrak L}_T^2]}+\sqrt{KDL})$, which falls to the $\widetilde{\mathcal O}(\sqrt{K(D+T)}L)$ regret in the worst case and is thus more general than SFD-TINF despite a more complicated analysis and several extra logarithmic dependencies. Moreover, both instances also outperform the existing algorithms for non-delayed (i.e., $D=0$) scale-free adversarial MAB problems, which can be of independent interest.  ( 2 min )
    Stochastic Online Fisher Markets: Static Pricing Limits and Adaptive Enhancements. (arXiv:2205.00825v3 [cs.GT] UPDATED)
    In a Fisher market, agents (users) spend a budget of (artificial) currency to buy goods that maximize their utilities while a central planner sets prices on capacity-constrained goods such that the market clears. However, the efficacy of pricing schemes in achieving an equilibrium outcome in Fisher markets typically relies on complete knowledge of users' budgets and utilities and requires that transactions happen in a static market wherein all users are present simultaneously. As a result, we study an online variant of Fisher markets, wherein budget-constrained users with privately known utility and budget parameters, drawn i.i.d. from a distribution $\mathcal{D}$, enter the market sequentially. In this setting, we develop an algorithm that adjusts prices solely based on observations of user consumption, i.e., revealed preference feedback, and achieves a regret and capacity violation of $O(\sqrt{n})$, where $n$ is the number of users and the good capacities scale as $O(n)$. Here, our regret measure is the optimality gap in the objective of the Eisenberg-Gale program between an online algorithm and an offline oracle with complete information on users' budgets and utilities. To establish the efficacy of our approach, we show that any uniform (static) pricing algorithm, including one that sets expected equilibrium prices with complete knowledge of the distribution $\mathcal{D}$, cannot achieve both a regret and constraint violation of less than $\Omega(\sqrt{n})$. While our revealed preference algorithm requires no knowledge of the distribution $\mathcal{D}$, we show that if $\mathcal{D}$ is known, then an adaptive variant of expected equilibrium pricing achieves $O(\log(n))$ regret and constant capacity violation for discrete distributions. Finally, we present numerical experiments to demonstrate the performance of our revealed preference algorithm relative to several benchmarks.  ( 3 min )
    Revisiting the Adversarial Robustness-Accuracy Tradeoff in Robot Learning. (arXiv:2204.07373v2 [cs.RO] UPDATED)
    Adversarial training (i.e., training on adversarially perturbed input data) is a well-studied method for making neural networks robust to potential adversarial attacks during inference. However, the improved robustness does not come for free but rather is accompanied by a decrease in overall model accuracy and performance. Recent work has shown that, in practical robot learning applications, the effects of adversarial training do not pose a fair trade-off but inflict a net loss when measured in holistic robot performance. This work revisits the robustness-accuracy trade-off in robot learning by systematically analyzing if recent advances in robust training methods and theory in conjunction with adversarial robot learning, are capable of making adversarial training suitable for real-world robot applications. We evaluate three different robot learning tasks ranging from autonomous driving in a high-fidelity environment amenable to sim-to-real deployment to mobile robot navigation and gesture recognition. Our results demonstrate that, while these techniques make incremental improvements on the trade-off on a relative scale, the negative impact on the nominal accuracy caused by adversarial training still outweighs the improved robustness by an order of magnitude. We conclude that although progress is happening, further advances in robust learning methods are necessary before they can benefit robot learning tasks in practice.  ( 2 min )
    Privacy preserving n-party scalar product protocol. (arXiv:2112.09436v4 [cs.CR] UPDATED)
    Privacy-preserving machine learning enables the training of models on decentralized datasets without the need to reveal the data, both on horizontal and vertically partitioned data. However, it relies on specialized techniques and algorithms to perform the necessary computations. The privacy preserving scalar product protocol, which enables the dot product of vectors without revealing them, is one popular example for its versatility. Unfortunately, the solutions currently proposed in the literature focus mainly on two-party scenarios, even though scenarios with a higher number of data parties are becoming more relevant. For example when performing analyses that require counting the number of samples which fulfill certain criteria defined across various sites, such as calculating the information gain at a node in a decision tree. In this paper we propose a generalization of the protocol for an arbitrary number of parties, based on an existing two-party method. Our proposed solution relies on a recursive resolution of smaller scalar products. After describing our proposed method, we discuss potential scalability issues. Finally, we describe the privacy guarantees and identify any concerns, as well as comparing the proposed method to the original solution in this aspect.  ( 2 min )
    A Light-weight Deep Human Activity Recognition Algorithm Using Multi-knowledge Distillation. (arXiv:2107.07331v4 [cs.LG] UPDATED)
    Inertial sensor-based human activity recognition (HAR) is the base of many human-centered mobile applications. Deep learning-based fine-grained HAR models enable accurate classification in various complex application scenarios. Nevertheless, the large storage and computational overhead of the existing fine-grained deep HAR models hinder their widespread deployment on resource-limited platforms. Inspired by the knowledge distillation's reasonable model compression and potential performance improvement capability, we design a multi-level HAR modeling pipeline called Stage-Logits-Memory Distillation (SMLDist) based on the widely-used MobileNet. By paying more attention to the frequency-related features during the distillation process, the SMLDist improves the HAR classification robustness of the students. We also propose an auto-search mechanism in the heterogeneous classifiers to improve classification performance. Extensive simulation results demonstrate that SMLDist outperforms various state-of-the-art HAR frameworks in accuracy and F1 macro score. The practical evaluation of the Jetson Xavier AGX platform shows that the SMLDist model is both energy-efficient and computation-efficient. These experiments validate the reasonable balance between the robustness and efficiency of the proposed model. The comparative experiments of knowledge distillation on six public datasets also demonstrate that the SMLDist outperforms other advanced knowledge distillation methods of students' performance, which verifies the good generalization of the SMLDist on other classification tasks, including but not limited to HAR.  ( 2 min )
    Textual Explanations and Critiques in Recommendation Systems. (arXiv:2205.07268v2 [cs.LG] UPDATED)
    Artificial intelligence and machine learning algorithms have become ubiquitous. Although they offer a wide range of benefits, their adoption in decision-critical fields is limited by their lack of interpretability, particularly with textual data. Moreover, with more data available than ever before, it has become increasingly important to explain automated predictions. Generally, users find it difficult to understand the underlying computational processes and interact with the models, especially when the models fail to generate the outcomes or explanations, or both, correctly. This problem highlights the growing need for users to better understand the models' inner workings and gain control over their actions. This dissertation focuses on two fundamental challenges of addressing this need. The first involves explanation generation: inferring high-quality explanations from text documents in a scalable and data-driven manner. The second challenge consists in making explanations actionable, and we refer to it as critiquing. This dissertation examines two important applications in natural language processing and recommendation tasks. Overall, we demonstrate that interpretability does not come at the cost of reduced performance in two consequential applications. Our framework is applicable to other fields as well. This dissertation presents an effective means of closing the gap between promise and practice in artificial intelligence.  ( 2 min )
    RoFL: Robustness of Secure Federated Learning. (arXiv:2107.03311v4 [cs.CR] UPDATED)
    Even though recent years have seen many attacks exposing severe vulnerabilities in Federated Learning (FL), a holistic understanding of what enables these attacks and how they can be mitigated effectively is still lacking. In this work, we demystify the inner workings of existing (targeted) attacks. We provide new insights into why these attacks are possible and why a definitive solution to FL robustness is challenging. We show that the need for ML algorithms to memorize tail data has significant implications for FL integrity. This phenomenon has largely been studied in the context of privacy; our analysis sheds light on its implications for ML integrity. We show that certain classes of severe attacks can be mitigated effectively by enforcing constraints such as norm bounds on clients' updates. We investigate how to efficiently incorporate these constraints into secure FL protocols in the single-server setting. Based on this, we propose RoFL, a new secure FL system that extends secure aggregation with privacy-preserving input validation. Specifically, RoFL can enforce constraints such as $L_2$ and $L_\infty$ bounds on high-dimensional encrypted model updates.  ( 2 min )
    Telling Stories from Computational Notebooks: AI-Assisted Presentation Slides Creation for Presenting Data Science Work. (arXiv:2203.11085v3 [cs.HC] UPDATED)
    Creating presentation slides is a critical but time-consuming task for data scientists. While researchers have proposed many AI techniques to lift data scientists' burden on data preparation and model selection, few have targeted the presentation creation task. Based on the needs identified from a formative study, this paper presents NB2Slides, an AI system that facilitates users to compose presentations of their data science work. NB2Slides uses deep learning methods as well as example-based prompts to generate slides from computational notebooks, and take users' input (e.g., audience background) to structure the slides. NB2Slides also provides an interactive visualization that links the slides with the notebook to help users further edit the slides. A follow-up user evaluation with 12 data scientists shows that participants believed NB2Slides can improve efficiency and reduces the complexity of creating slides. Yet, participants questioned the future of full automation and suggested a human-AI collaboration paradigm.  ( 2 min )
    Constant matters: Fine-grained Complexity of Differentially Private Continual Observation. (arXiv:2202.11205v5 [cs.DS] UPDATED)
    We study fine-grained error bounds for differentially private algorithms for counting under continual observation. Our main insight is that the matrix mechanism when using lower-triangular matrices can be used in the continual observation model. More specifically, we give an explicit factorization for the counting matrix $M_\mathsf{count}$ and upper bound the error explicitly. We also give a fine-grained analysis, specifying the exact constant in the upper bound. Our analysis is based on upper and lower bounds of the {\em completely bounded norm} (cb-norm) of $M_\mathsf{count}$. Along the way, we improve the best-known bound of 28 years by Mathias (SIAM Journal on Matrix Analysis and Applications, 1993) on the cb-norm of $M_\mathsf{count}$ for a large range of the dimension of $M_\mathsf{count}$. Furthermore, we are the first to give concrete error bounds for various problems under continual observation such as binary counting, maintaining a histogram, releasing an approximately cut-preserving synthetic graph, many graph-based statistics, and substring and episode counting. Finally, we note that our result can be used to get a fine-grained error bound for non-interactive local learning {and the first lower bounds on the additive error for $(\epsilon,\delta)$-differentially-private counting under continual observation.} Subsequent to this work, Henzinger et al. (SODA2023) showed that our factorization also achieves fine-grained mean-squared error.  ( 2 min )
    Banker Online Mirror Descent. (arXiv:2106.08943v2 [cs.LG] UPDATED)
    We propose Banker-OMD, a novel framework generalizing the classical Online Mirror Descent (OMD) technique in online learning algorithm design. Banker-OMD allows algorithms to robustly handle delayed feedback, and offers a general methodology for achieving $\tilde{O}(\sqrt{T} + \sqrt{D})$-style regret bounds in various delayed-feedback online learning tasks, where $T$ is the time horizon length and $D$ is the total feedback delay. We demonstrate the power of Banker-OMD with applications to three important bandit scenarios with delayed feedback, including delayed adversarial Multi-armed bandits (MAB), delayed adversarial linear bandits, and a novel delayed best-of-both-worlds MAB setting. Banker-OMD achieves nearly-optimal performance in all the three settings. In particular, it leads to the first delayed adversarial linear bandit algorithm achieving $\tilde{O}(\text{poly}(n)(\sqrt{T} + \sqrt{D}))$ regret.  ( 2 min )
    Learning to Act Safely with Limited Exposure and Almost Sure Certainty. (arXiv:2105.08748v3 [eess.SY] UPDATED)
    This paper puts forward the concept that learning to take safe actions in unknown environments, even with probability one guarantees, can be achieved without the need for an unbounded number of exploratory trials. This is indeed possible, provided that one is willing to navigate trade-offs between optimality, level of exposure to unsafe events, and the maximum detection time of unsafe actions. We illustrate this concept in two complementary settings. We first focus on the canonical multi-armed bandit problem and study the intrinsic trade-offs of learning safety in the presence of uncertainty. Under mild assumptions on sufficient exploration, we provide an algorithm that provably detects all unsafe machines in an (expected) finite number of rounds. The analysis also unveils a trade-off between the number of rounds needed to secure the environment and the probability of discarding safe machines. We then consider the problem of finding optimal policies for a Markov Decision Process (MDP) with almost sure constraints. We show that the action-value function satisfies a barrier-based decomposition which allows for the identification of feasible policies independently of the reward process. Using this decomposition, we develop a Barrier-learning algorithm, that identifies such unsafe state-action pairs in a finite expected number of steps. Our analysis further highlights a trade-off between the time lag for the underlying MDP necessary to detect unsafe actions, and the level of exposure to unsafe events. Simulations corroborate our theoretical findings, further illustrating the aforementioned trade-offs, and suggesting that safety constraints can speed up the learning process.  ( 2 min )
    Joint Training of Deep Ensembles Fails Due to Learner Collusion. (arXiv:2301.11323v1 [cs.LG])
    Ensembles of machine learning models have been well established as a powerful method of improving performance over a single model. Traditionally, ensembling algorithms train their base learners independently or sequentially with the goal of optimizing their joint performance. In the case of deep ensembles of neural networks, we are provided with the opportunity to directly optimize the true objective: the joint performance of the ensemble as a whole. Surprisingly, however, directly minimizing the loss of the ensemble appears to rarely be applied in practice. Instead, most previous research trains individual models independently with ensembling performed post hoc. In this work, we show that this is for good reason - joint optimization of ensemble loss results in degenerate behavior. We approach this problem by decomposing the ensemble objective into the strength of the base learners and the diversity between them. We discover that joint optimization results in a phenomenon in which base learners collude to artificially inflate their apparent diversity. This pseudo-diversity fails to generalize beyond the training data, causing a larger generalization gap. We proceed to demonstrate the practical implications of this effect finding that, in some cases, a balance between independent training and joint optimization can improve performance over the former while avoiding the degeneracies of the latter.  ( 2 min )
    A Context-based Multi-task Hierarchical Inverse Reinforcement Learning Algorithm. (arXiv:2210.01969v2 [cs.LG] UPDATED)
    Multi-task Imitation Learning (MIL) aims to train a policy capable of performing a distribution of tasks, which is essential for general-purpose robots, based on multi-task expert demonstrations. Existing MIL algorithms suffer from low data efficiency and poor performance on complex long-horizontal tasks. We develop Multi-task Hierarchical Adversarial Inverse Reinforcement Learning (MH-AIRL) to learn hierarchically-structured multi-task policies, which is more beneficial for compositional tasks with long horizons and has higher expert data efficiency through identifying and transferring reusable basic skills across tasks. To realize this, MH-AIRL effectively synthesizes context-based multi-task learning, AIRL (an IL approach), and hierarchical policy learning. Further, MH-AIRL can be adopted to demonstrations without the task or skill annotations (i.e., state-action pairs only) which are more accessible in practice. Theoretical justifications are provided for each module of MH-AIRL, and evaluations on challenging multi-task settings demonstrate superior performance and transferability of the multi-task policies learned with MH-AIRL as compared to SOTA MIL baselines.  ( 2 min )
    Your diffusion model secretly knows the dimension of the data manifold. (arXiv:2212.12611v2 [cs.LG] UPDATED)
    In this work, we propose a novel framework for estimating the dimension of the data manifold using a trained diffusion model. A diffusion model approximates the score function i.e. the gradient of the log density of a noise-corrupted version of the target distribution for varying levels of corruption. If the data concentrates around a manifold embedded in the high-dimensional ambient space, then as the level of corruption decreases, the score function points towards the manifold, as this direction becomes the direction of maximal likelihood increase. Therefore, for small levels of corruption, the diffusion model provides us with access to an approximation of the normal bundle of the data manifold. This allows us to estimate the dimension of the tangent space, thus, the intrinsic dimension of the data manifold. To the best of our knowledge, our method is the first deep-learning based estimator of the data manifold dimension and it outperforms well established statistical estimators in controlled experiments on both Euclidean and image data.  ( 2 min )
    Recursive deep learning framework for forecasting the decadal world economic outlook. (arXiv:2301.10874v1 [cs.LG])
    Gross domestic product (GDP) is the most widely used indicator in macroeconomics and the main tool for measuring a country's economic ouput. Due to the diversity and complexity of the world economy, a wide range of models have been used, but there are challenges in making decadal GDP forecasts given unexpected changes such as pandemics and wars. Deep learning models are well suited for modeling temporal sequences have been applied for time series forecasting. In this paper, we develop a deep learning framework to forecast the GDP growth rate of the world economy over a decade. We use Penn World Table as the source of our data, taking data from 1980 to 2019, across 13 countries, such as Australia, China, India, the United States and so on. We test multiple deep learning models, LSTM, BD-LSTM, ED-LSTM and CNN, and compared their results with the traditional time series model (ARIMA,VAR). Our results indicate that ED-LSTM is the best performing model. We present a recursive deep learning framework to predict the GDP growth rate in the next ten years. We predict that most countries will experience economic growth slowdown, stagnation or even recession within five years; only China, France and India are predicted to experience stable, or increasing, GDP growth.  ( 2 min )
    Robust Vocal Quality Feature Embeddings for Dysphonic Voice Detection. (arXiv:2211.09858v2 [cs.SD] UPDATED)
    Approximately 1.2% of the world's population has impaired voice production. As a result, automatic dysphonic voice detection has attracted considerable academic and clinical interest. However, existing methods for automated voice assessment often fail to generalize outside the training conditions or to other related applications. In this paper, we propose a deep learning framework for generating acoustic feature embeddings sensitive to vocal quality and robust across different corpora. A contrastive loss is combined with a classification loss to train our deep learning model jointly. Data warping methods are used on input voice samples to improve the robustness of our method. Empirical results demonstrate that our method not only achieves high in-corpus and cross-corpus classification accuracy but also generates good embeddings sensitive to voice quality and robust across different corpora. We also compare our results against three baseline methods on clean and three variations of deteriorated in-corpus and cross-corpus datasets and demonstrate that the proposed model consistently outperforms the baseline methods.  ( 2 min )
    Reliable Decision from Multiple Subtasks through Threshold Optimization: Content Moderation in the Wild. (arXiv:2208.07522v5 [cs.LG] UPDATED)
    Social media platforms struggle to protect users from harmful content through content moderation. These platforms have recently leveraged machine learning models to cope with the vast amount of user-generated content daily. Since moderation policies vary depending on countries and types of products, it is common to train and deploy the models per policy. However, this approach is highly inefficient, especially when the policies change, requiring dataset re-labeling and model re-training on the shifted data distribution. To alleviate this cost inefficiency, social media platforms often employ third-party content moderation services that provide prediction scores of multiple subtasks, such as predicting the existence of underage personnel, rude gestures, or weapons, instead of directly providing final moderation decisions. However, making a reliable automated moderation decision from the prediction scores of the multiple subtasks for a specific target policy has not been widely explored yet. In this study, we formulate real-world scenarios of content moderation and introduce a simple yet effective threshold optimization method that searches the optimal thresholds of the multiple subtasks to make a reliable moderation decision in a cost-effective way. Extensive experiments demonstrate that our approach shows better performance in content moderation compared to existing threshold optimization methods and heuristics.  ( 2 min )
    Iterative Teaching by Label Synthesis. (arXiv:2110.14432v5 [cs.LG] UPDATED)
    In this paper, we consider the problem of iterative machine teaching, where a teacher provides examples sequentially based on the current iterative learner. In contrast to previous methods that have to scan over the entire pool and select teaching examples from it in each iteration, we propose a label synthesis teaching framework where the teacher randomly selects input teaching examples (e.g., images) and then synthesizes suitable outputs (e.g., labels) for them. We show that this framework can avoid costly example selection while still provably achieving exponential teachability. We propose multiple novel teaching algorithms in this framework. Finally, we empirically demonstrate the value of our framework.  ( 2 min )
    VAuLT: Augmenting the Vision-and-Language Transformer for Sentiment Classification on Social Media. (arXiv:2208.09021v3 [cs.CV] UPDATED)
    We propose the Vision-and-Augmented-Language Transformer (VAuLT). VAuLT is an extension of the popular Vision-and-Language Transformer (ViLT), and improves performance on vision-and-language (VL) tasks that involve more complex text inputs than image captions while having minimal impact on training and inference efficiency. ViLT, importantly, enables efficient training and inference in VL tasks, achieved by encoding images using a linear projection of patches instead of an object detector. However, it is pretrained on captioning datasets, where the language input is simple, literal, and descriptive, therefore lacking linguistic diversity. So, when working with multimedia data in the wild, such as multimodal social media data, there is a notable shift from captioning language data, as well as diversity of tasks. We indeed find evidence that the language capacity of ViLT is lacking. The key insight and novelty of VAuLT is to propagate the output representations of a large language model (LM) like BERT to the language input of ViLT. We show that joint training of the LM and ViLT can yield relative improvements up to 20% over ViLT and achieve state-of-the-art or comparable performance on VL tasks involving richer language inputs and affective constructs, such as for Target-Oriented Sentiment Classification in TWITTER-2015 and TWITTER-2017, and Sentiment Classification in MVSA-Single and MVSA-Multiple. Our code is available at https://github.com/gchochla/VAuLT.  ( 2 min )
    KSD Aggregated Goodness-of-fit Test. (arXiv:2202.00824v5 [stat.ML] UPDATED)
    We investigate properties of goodness-of-fit tests based on the Kernel Stein Discrepancy (KSD). We introduce a strategy to construct a test, called KSDAgg, which aggregates multiple tests with different kernels. KSDAgg avoids splitting the data to perform kernel selection (which leads to a loss in test power), and rather maximises the test power over a collection of kernels. We provide non-asymptotic guarantees on the power of KSDAgg: we show it achieves the smallest uniform separation rate of the collection, up to a logarithmic term. For compactly supported densities with bounded model score function, we derive the rate for KSDAgg over restricted Sobolev balls; this rate corresponds to the minimax optimal rate over unrestricted Sobolev balls, up to an iterated logarithmic term. KSDAgg can be computed exactly in practice as it relies either on a parametric bootstrap or on a wild bootstrap to estimate the quantiles and the level corrections. In particular, for the crucial choice of bandwidth of a fixed kernel, it avoids resorting to arbitrary heuristics (such as median or standard deviation) or to data splitting. We find on both synthetic and real-world data that KSDAgg outperforms other state-of-the-art quadratic-time adaptive KSD-based goodness-of-fit testing procedures.  ( 2 min )
    Characterizing the Influence of Graph Elements. (arXiv:2210.07441v2 [cs.LG] UPDATED)
    Influence function, a method from robust statistics, measures the changes of model parameters or some functions about model parameters concerning the removal or modification of training instances. It is an efficient and useful post-hoc method for studying the interpretability of machine learning models without the need for expensive model re-training. Recently, graph convolution networks (GCNs), which operate on graph data, have attracted a great deal of attention. However, there is no preceding research on the influence functions of GCNs to shed light on the effects of removing training nodes/edges from an input graph. Since the nodes/edges in a graph are interdependent in GCNs, it is challenging to derive influence functions for GCNs. To fill this gap, we started with the simple graph convolution (SGC) model that operates on an attributed graph and formulated an influence function to approximate the changes in model parameters when a node or an edge is removed from an attributed graph. Moreover, we theoretically analyzed the error bound of the estimated influence of removing an edge. We experimentally validated the accuracy and effectiveness of our influence estimation function. In addition, we showed that the influence function of an SGC model could be used to estimate the impact of removing training nodes/edges on the test performance of the SGC without re-training the model. Finally, we demonstrated how to use influence functions to guide the adversarial attacks on GCNs effectively.  ( 2 min )
    Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task. (arXiv:2210.13382v3 [cs.LG] UPDATED)
    Language models show a surprising range of capabilities, but the source of their apparent competence is unclear. Do these networks just memorize a collection of surface statistics, or do they rely on internal representations of the process that generates the sequences they see? We investigate this question by applying a variant of the GPT model to the task of predicting legal moves in a simple board game, Othello. Although the network has no a priori knowledge of the game or its rules, we uncover evidence of an emergent nonlinear internal representation of the board state. Interventional experiments indicate this representation can be used to control the output of the network and create "latent saliency maps" that can help explain predictions in human terms.  ( 2 min )
    Review of Natural Language Processing in Pharmacology. (arXiv:2208.10228v2 [cs.CL] UPDATED)
    Natural language processing (NLP) is an area of artificial intelligence that applies information technologies to process the human language, understand it to a certain degree, and use it in various applications. This area has rapidly developed in the last few years and now employs modern variants of deep neural networks to extract relevant patterns from large text corpora. The main objective of this work is to survey the recent use of NLP in the field of pharmacology. As our work shows, NLP is a highly relevant information extraction and processing approach for pharmacology. It has been used extensively, from intelligent searches through thousands of medical documents to finding traces of adversarial drug interactions in social media. We split our coverage into five categories to survey modern NLP methodology, commonly addressed tasks, relevant textual data, knowledge bases, and useful programming libraries. We split each of the five categories into appropriate subcategories, describe their main properties and ideas, and summarize them in a tabular form. The resulting survey presents a comprehensive overview of the area, useful to practitioners and interested observers.  ( 2 min )
    Cut and Learn for Unsupervised Object Detection and Instance Segmentation. (arXiv:2301.11320v1 [cs.CV])
    We propose Cut-and-LEaRn (CutLER), a simple approach for training unsupervised object detection and segmentation models. We leverage the property of self-supervised models to 'discover' objects without supervision and amplify it to train a state-of-the-art localization model without any human labels. CutLER first uses our proposed MaskCut approach to generate coarse masks for multiple objects in an image and then learns a detector on these masks using our robust loss function. We further improve the performance by self-training the model on its predictions. Compared to prior work, CutLER is simpler, compatible with different detection architectures, and detects multiple objects. CutLER is also a zero-shot unsupervised detector and improves detection performance AP50 by over 2.7 times on 11 benchmarks across domains like video frames, paintings, sketches, etc. With finetuning, CutLER serves as a low-shot detector surpassing MoCo-v2 by 7.3% APbox and 6.6% APmask on COCO when training with 5% labels.  ( 2 min )
    Smoothed Online Learning for Prediction in Piecewise Affine Systems. (arXiv:2301.11187v1 [stat.ML])
    The problem of piecewise affine (PWA) regression and planning is of foundational importance to the study of online learning, control, and robotics, where it provides a theoretically and empirically tractable setting to study systems undergoing sharp changes in the dynamics. Unfortunately, due to the discontinuities that arise when crossing into different ``pieces,'' learning in general sequential settings is impossible and practical algorithms are forced to resort to heuristic approaches. This paper builds on the recently developed smoothed online learning framework and provides the first algorithms for prediction and simulation in PWA systems whose regret is polynomial in all relevant problem parameters under a weak smoothness assumption; moreover, our algorithms are efficient in the number of calls to an optimization oracle. We further apply our results to the problems of one-step prediction and multi-step simulation regret in piecewise affine dynamical systems, where the learner is tasked with simulating trajectories and regret is measured in terms of the Wasserstein distance between simulated and true data. Along the way, we develop several technical tools of more general interest.  ( 2 min )
    Uncertain Evidence in Probabilistic Models and Stochastic Simulators. (arXiv:2210.12236v2 [stat.ML] UPDATED)
    We consider the problem of performing Bayesian inference in probabilistic models where observations are accompanied by uncertainty, referred to as "uncertain evidence." We explore how to interpret uncertain evidence, and by extension the importance of proper interpretation as it pertains to inference about latent variables. We consider a recently-proposed method "distributional evidence" as well as revisit two older methods: Jeffrey's rule and virtual evidence. We devise guidelines on how to account for uncertain evidence and we provide new insights, particularly regarding consistency. To showcase the impact of different interpretations of the same uncertain evidence, we carry out experiments in which one interpretation is defined as "correct." We then compare inference results from each different interpretation illustrating the importance of careful consideration of uncertain evidence.  ( 2 min )
    Certified Interpretability Robustness for Class Activation Mapping. (arXiv:2301.11324v1 [cs.LG])
    Interpreting machine learning models is challenging but crucial for ensuring the safety of deep networks in autonomous driving systems. Due to the prevalence of deep learning based perception models in autonomous vehicles, accurately interpreting their predictions is crucial. While a variety of such methods have been proposed, most are shown to lack robustness. Yet, little has been done to provide certificates for interpretability robustness. Taking a step in this direction, we present CORGI, short for Certifiably prOvable Robustness Guarantees for Interpretability mapping. CORGI is an algorithm that takes in an input image and gives a certifiable lower bound for the robustness of the top k pixels of its CAM interpretability map. We show the effectiveness of CORGI via a case study on traffic sign data, certifying lower bounds on the minimum adversarial perturbation not far from (4-5x) state-of-the-art attack methods.  ( 2 min )
    Predictive Crypto-Asset Automated Market Making Architecture for Decentralized Finance using Deep Reinforcement Learning. (arXiv:2211.01346v2 [q-fin.TR] UPDATED)
    The study proposes a quote-driven predictive automated market maker (AMM) platform with on-chain custody and settlement functions, alongside off-chain predictive reinforcement learning capabilities to improve liquidity provision of real-world AMMs. The proposed AMM architecture is an augmentation to the Uniswap V3, a cryptocurrency AMM protocol, by utilizing a novel market equilibrium pricing for reduced divergence and slippage loss. Further, the proposed architecture involves a predictive AMM capability, utilizing a deep hybrid Long Short-Term Memory (LSTM) and Q-learning reinforcement learning framework that looks to improve market efficiency through better forecasts of liquidity concentration ranges, so liquidity starts moving to expected concentration ranges, prior to asset price movement, so that liquidity utilization is improved. The augmented protocol framework is expected have practical real-world implications, by (i) reducing divergence loss for liquidity providers, (ii) reducing slippage for crypto-asset traders, while (iii) improving capital efficiency for liquidity provision for the AMM protocol. To our best knowledge, there are no known protocol or literature that are proposing similar deep learning-augmented AMM that achieves similar capital efficiency and loss minimization objectives for practical real-world applications.  ( 2 min )
    What you need to know to train recurrent neural networks to make Flip Flops memories and more. (arXiv:2010.07858v3 [cs.LG] UPDATED)
    Training neural networks to perform different tasks is relevant across various disciplines beyond Machine Learning. In particular, Recurrent Neural Networks (RNNs) are of great interest to different scientific communities. Open-source frameworks dedicated to Machine Learning, such as Tensorflow [1] and Keras [2] have produced significant changes in the development of technologies that we currently use. One relevant problem that can be approached with them is how to build the models to study dynamical systems and the brain. Specifically, how to extract the relevant information to answer the scientific questions of interest. The purpose of the present work is to contribute to this aim by analyzing a temporal processing task, in this case, a 3-bit Flip Flop memory. The modelling procedure in every step is shown: from equations to the software development. The networks obtained were analyzed to describe the dynamics and to show different visualization and analysis tools. The code developed in this premier is also provided to be used for modelling other tasks or systems.  ( 2 min )
    Neural Continuous-Discrete State Space Models for Irregularly-Sampled Time Series. (arXiv:2301.11308v1 [cs.LG])
    Learning accurate predictive models of real-world dynamic phenomena (e.g., climate, biological) remains a challenging task. One key issue is that the data generated by both natural and artificial processes often comprise time series that are irregularly sampled and/or contain missing observations. In this work, we propose the Neural Continuous-Discrete State Space Model (NCDSSM) for continuous-time modeling of time series through discrete-time observations. NCDSSM employs auxiliary variables to disentangle recognition from dynamics, thus requiring amortized inference only for the auxiliary variables. Leveraging techniques from continuous-discrete filtering theory, we demonstrate how to perform accurate Bayesian inference for the dynamic states. We propose three flexible parameterizations of the latent dynamics and an efficient training objective that marginalizes the dynamic states during inference. Empirical results on multiple benchmark datasets across various domains show improved imputation and forecasting performance of NCDSSM over existing models.  ( 2 min )
    Trajectory-Aware Eligibility Traces for Off-Policy Reinforcement Learning. (arXiv:2301.11321v1 [cs.LG])
    Off-policy learning from multistep returns is crucial for sample-efficient reinforcement learning, but counteracting off-policy bias without exacerbating variance is challenging. Classically, off-policy bias is corrected in a per-decision manner: past temporal-difference errors are re-weighted by the instantaneous Importance Sampling (IS) ratio after each action via eligibility traces. Many off-policy algorithms rely on this mechanism, along with differing protocols for cutting the IS ratios to combat the variance of the IS estimator. Unfortunately, once a trace has been fully cut, the effect cannot be reversed. This has led to the development of credit-assignment strategies that account for multiple past experiences at a time. These trajectory-aware methods have not been extensively analyzed, and their theoretical justification remains uncertain. In this paper, we propose a multistep operator that can express both per-decision and trajectory-aware methods. We prove convergence conditions for our operator in the tabular setting, establishing the first guarantees for several existing methods as well as many new ones. Finally, we introduce Recency-Bounded Importance Sampling (RBIS), which leverages trajectory awareness to perform robustly across $\lambda$-values in an off-policy control task.  ( 2 min )
    Anatomy-aware and acquisition-agnostic joint registration with SynthMorph. (arXiv:2301.11329v1 [eess.IV])
    Affine image registration is a cornerstone of medical-image processing and analysis. While classical algorithms can achieve excellent accuracy, they solve a time-consuming optimization for every new image pair. Deep-learning (DL) methods learn a function that maps an image pair to an output transform. Evaluating the functions is fast, but capturing large transforms can be challenging, and networks tend to struggle if a test-image characteristic shifts from the training domain, such as the contrast or resolution. A majority of affine methods are also agnostic to the anatomy the user wishes to align; the registration will be inaccurate if algorithms consider all structures in the image. We address these shortcomings with a fast, robust, and easy-to-use DL tool for affine and deformable registration of any brain image without preprocessing, right off the MRI scanner. First, we rigorously analyze how competing architectures learn affine transforms across a diverse set of neuroimaging data, aiming to truly capture the behavior of methods in the real world. Second, we leverage a recent strategy to train networks with wildly varying images synthesized from label maps, yielding robust performance across acquisition specifics. Third, we optimize the spatial overlap of select anatomical labels, which enables networks to distinguish between anatomy of interest and irrelevant structures, removing the need for preprocessing that excludes content that would otherwise reduce the accuracy of anatomy-specific registration. We combine the affine model with prior work on deformable registration and test brain-specific registration across a landscape of MRI protocols unseen at training, demonstrating consistent and improved accuracy compared to existing tools. We distribute our code and tool at https://w3id.org/synthmorph, providing a single complete end-to-end solution for registration of brain MRI.  ( 3 min )
    Coin Sampling: Gradient-Based Bayesian Inference without Learning Rates. (arXiv:2301.11294v1 [stat.ML])
    In recent years, particle-based variational inference (ParVI) methods such as Stein variational gradient descent (SVGD) have grown in popularity as scalable methods for Bayesian inference. Unfortunately, the properties of such methods invariably depend on hyperparameters such as the learning rate, which must be carefully tuned by the practitioner in order to ensure convergence to the target measure at a suitable rate. In this paper, we introduce a suite of new particle-based methods for scalable Bayesian inference based on coin betting, which are entirely learning-rate free. We illustrate the performance of our approach on a range of numerical examples, including several high-dimensional models and datasets, demonstrating comparable performance to other ParVI algorithms.  ( 2 min )
    ZiCo: Zero-shot NAS via Inverse Coefficient of Variation on Gradients. (arXiv:2301.11300v1 [cs.LG])
    Neural Architecture Search (NAS) is widely used to automatically design the neural network with the best performance among a large number of candidate architectures. To reduce the search time, zero-shot NAS aims at designing training-free proxies that can predict the test performance of a given architecture. However, as shown recently, none of the zero-shot proxies proposed to date can actually work consistently better than a naive proxy, namely, the number of network parameters (#Params). To improve this state of affairs, as the main theoretical contribution, we first reveal how some specific gradient properties across different samples impact the convergence rate and generalization capacity of neural networks. Based on this theoretical analysis, we propose a new zero-shot proxy, ZiCo, the first proxy that works consistently better than #Params. We demonstrate that ZiCo works better than State-Of-The-Art (SOTA) proxies on several popular NAS-Benchmarks (NASBench101, NATSBench-SSS/TSS, TransNASBench-101) for multiple applications (e.g., image classification/reconstruction and pixel-level prediction). Finally, we demonstrate that the optimal architectures found via ZiCo are as competitive as the ones found by one-shot and multi-shot NAS methods, but with much less search time. For example, ZiCo-based NAS can find optimal architectures with 78.1%, 79.4%, and 80.4% test accuracy under inference budgets of 450M, 600M, and 1000M FLOPs on ImageNet within 0.4 GPU days.  ( 2 min )
    Open Problems in Applied Deep Learning. (arXiv:2301.11316v1 [cs.LG])
    This work formulates the machine learning mechanism as a bi-level optimization problem. The inner level optimization loop entails minimizing a properly chosen loss function evaluated on the training data. This is nothing but the well-studied training process in pursuit of optimal model parameters. The outer level optimization loop is less well-studied and involves maximizing a properly chosen performance metric evaluated on the validation data. This is what we call the "iteration process", pursuing optimal model hyper-parameters. Among many other degrees of freedom, this process entails model engineering (e.g., neural network architecture design) and management, experiment tracking, dataset versioning and augmentation. The iteration process could be automated via Automatic Machine Learning (AutoML) or left to the intuitions of machine learning students, engineers, and researchers. Regardless of the route we take, there is a need to reduce the computational cost of the iteration step and as a direct consequence reduce the carbon footprint of developing artificial intelligence algorithms. Despite the clean and unified mathematical formulation of the iteration step as a bi-level optimization problem, its solutions are case specific and complex. This work will consider such cases while increasing the level of complexity from supervised learning to semi-supervised, self-supervised, unsupervised, few-shot, federated, reinforcement, and physics-informed learning. As a consequence of this exercise, this proposal surfaces a plethora of open problems in the field, many of which can be addressed in parallel.  ( 2 min )
    BayesSpeech: A Bayesian Transformer Network for Automatic Speech Recognition. (arXiv:2301.11276v1 [eess.AS])
    Recent developments using End-to-End Deep Learning models have been shown to have near or better performance than state of the art Recurrent Neural Networks (RNNs) on Automatic Speech Recognition tasks. These models tend to be lighter weight and require less training time than traditional RNN-based approaches. However, these models take frequentist approach to weight training. In theory, network weights are drawn from a latent, intractable probability distribution. We introduce BayesSpeech for end-to-end Automatic Speech Recognition. BayesSpeech is a Bayesian Transformer Network where these intractable posteriors are learned through variational inference and the local reparameterization trick without recurrence. We show how the introduction of variance in the weights leads to faster training time and near state-of-the-art performance on LibriSpeech-960.  ( 2 min )
    Principled Reinforcement Learning with Human Feedback from Pairwise or $K$-wise Comparisons. (arXiv:2301.11270v1 [cs.LG])
    We provide a theoretical framework for Reinforcement Learning with Human Feedback (RLHF). Our analysis shows that when the true reward function is linear, the widely used maximum likelihood estimator (MLE) converges under both the Bradley-Terry-Luce (BTL) model and the Plackett-Luce (PL) model. However, we show that when training a policy based on the learned reward model, MLE fails while a pessimistic MLE provides policies with improved performance under certain coverage assumptions. Additionally, we demonstrate that under the PL model, the true MLE and an alternative MLE that splits the $K$-wise comparison into pairwise comparisons both converge. Moreover, the true MLE is asymptotically more efficient. Our results validate the empirical success of existing RLHF algorithms in InstructGPT and provide new insights for algorithm design. Furthermore, our results unify the problem of RLHF and Max Entropy Inverse Reinforcement Learning, and provide the first sample complexity bound for both problems.  ( 2 min )
    Graph Encoder Ensemble for Simultaneous Vertex Embedding and Community Detection. (arXiv:2301.11290v1 [cs.SI])
    In this paper we propose a novel and computationally efficient method to simultaneously achieve vertex embedding, community detection, and community size determination. By utilizing a normalized one-hot graph encoder and a new rank-based cluster size measure, the proposed graph encoder ensemble algorithm achieves excellent numerical performance throughout a variety of simulations and real data experiments.  ( 2 min )
    Understanding Finetuning for Factual Knowledge Extraction from Language Models. (arXiv:2301.11293v1 [cs.CL])
    Language models (LMs) pretrained on large corpora of text from the web have been observed to contain large amounts of various types of knowledge about the world. This observation has led to a new and exciting paradigm in knowledge graph construction where, instead of manual curation or text mining, one extracts knowledge from the parameters of an LM. Recently, it has been shown that finetuning LMs on a set of factual knowledge makes them produce better answers to queries from a different set, thus making finetuned LMs a good candidate for knowledge extraction and, consequently, knowledge graph construction. In this paper, we analyze finetuned LMs for factual knowledge extraction. We show that along with its previously known positive effects, finetuning also leads to a (potentially harmful) phenomenon which we call Frequency Shock, where at the test time the model over-predicts rare entities that appear in the training set and under-predicts common entities that do not appear in the training set enough times. We show that Frequency Shock leads to a degradation in the predictions of the model and beyond a point, the harm from Frequency Shock can even outweigh the positive effects of finetuning, making finetuning harmful overall. We then consider two solutions to remedy the identified negative effect: 1- model mixing and 2- mixture finetuning with the LM's pre-training task. The two solutions combined lead to significant improvements compared to vanilla finetuning.  ( 2 min )
    Text-To-4D Dynamic Scene Generation. (arXiv:2301.11280v1 [cs.CV])
    We present MAV3D (Make-A-Video3D), a method for generating three-dimensional dynamic scenes from text descriptions. Our approach uses a 4D dynamic Neural Radiance Field (NeRF), which is optimized for scene appearance, density, and motion consistency by querying a Text-to-Video (T2V) diffusion-based model. The dynamic video output generated from the provided text can be viewed from any camera location and angle, and can be composited into any 3D environment. MAV3D does not require any 3D or 4D data and the T2V model is trained only on Text-Image pairs and unlabeled videos. We demonstrate the effectiveness of our approach using comprehensive quantitative and qualitative experiments and show an improvement over previously established internal baselines. To the best of our knowledge, our method is the first to generate 3D dynamic scenes given a text description.  ( 2 min )
    Classification of vertices on social networks by multiple approaches. (arXiv:2301.11288v1 [cs.SI])
    Due to the advent of the expressions of data other than tabular formats, the topological compositions which make samples interrelated came into prominence. Analogically, those networks can be interpreted as social connections, dataflow maps, citation influence graphs, protein bindings, etc. However, in the case of social networks, it is highly crucial to evaluate the labels of discrete communities. The reason underneath for such a study is the non-negligible importance of analyzing graph networks to partition the vertices by using the topological features of network graphs, solely. For each of these interaction-based entities, a social graph, a mailing dataset, and two citation sets are selected as the testbench repositories. This paper, it was not only assessed the most valuable method but also determined how graph neural networks work and the need to improve against non-neural network approaches which are faster and computationally cost-effective. Also, this paper showed a limit to be excesses by prospective graph neural network variations by using the topological features of networks trialed.  ( 2 min )
    Gaussian process regression and conditional Karhunen-Lo\'{e}ve models for data assimilation in inverse problems. (arXiv:2301.11279v1 [cs.LG])
    We present a model inversion algorithm, CKLEMAP, for data assimilation and parameter estimation in partial differential equation models of physical systems with spatially heterogeneous parameter fields. These fields are approximated using low-dimensional conditional Karhunen-Lo\'{e}ve expansions, which are constructed using Gaussian process regression models of these fields trained on the parameters' measurements. We then assimilate measurements of the state of the system and compute the maximum a posteriori estimate of the CKLE coefficients by solving a nonlinear least-squares problem. When solving this optimization problem, we efficiently compute the Jacobian of the vector objective by exploiting the sparsity structure of the linear system of equations associated with the forward solution of the physics problem. The CKLEMAP method provides better scalability compared to the standard MAP method. In the MAP method, the number of unknowns to be estimated is equal to the number of elements in the numerical forward model. On the other hand, in CKLEMAP, the number of unknowns (CKLE coefficients) is controlled by the smoothness of the parameter field and the number of measurements, and is in general much smaller than the number of discretization nodes, which leads to a significant reduction of computational cost with respect to the standard MAP method. To show its advantage in scalability, we apply CKLEMAP to estimate the transmissivity field in a two-dimensional steady-state subsurface flow model of the Hanford Site by assimilating synthetic measurements of transmissivity and hydraulic head. We find that the execution time of CKLEMAP scales nearly linearly as $N^{1.33}$, where $N$ is the number of discretization nodes, while the execution time of standard MAP scales as $N^{2.91}$. The CKLEMAP method improved execution time without sacrificing accuracy when compared to the standard MAP.  ( 3 min )
    Real-Time Digital Twins: Vision and Research Directions for 6G and Beyond. (arXiv:2301.11283v1 [eess.SP])
    This article presents a vision where \textit{real-time} digital twins of the physical wireless environments are continuously updated using multi-modal sensing data from the distributed infrastructure and user devices, and are used to make communication and sensing decisions. This vision is mainly enabled by the advances in precise 3D maps, multi-modal sensing, ray-tracing computations, and machine/deep learning. This article details this vision, explains the different approaches for constructing and utilizing these real-time digital twins, discusses the applications and open problems, and presents a research platform that can be used to investigate various digital twin research directions.  ( 2 min )
    Online Convex Optimization with Stochastic Constraints: Zero Constraint Violation and Bandit Feedback. (arXiv:2301.11267v1 [math.OC])
    This paper studies online convex optimization with stochastic constraints. We propose a variant of the drift-plus-penalty algorithm that guarantees $O(\sqrt{T})$ expected regret and zero constraint violation, after a fixed number of iterations, which improves the vanilla drift-plus-penalty method with $O(\sqrt{T})$ constraint violation. Our algorithm is oblivious to the length of the time horizon $T$, in contrast to the vanilla drift-plus-penalty method. This is based on our novel drift lemma that provides time-varying bounds on the virtual queue drift and, as a result, leads to time-varying bounds on the expected virtual queue length. Moreover, we extend our framework to stochastic-constrained online convex optimization under two-point bandit feedback. We show that by adapting our algorithmic framework to the bandit feedback setting, we may still achieve $O(\sqrt{T})$ expected regret and zero constraint violation, improving upon the previous work for the case of identical constraint functions. Numerical results demonstrate our theoretical results.  ( 2 min )
    AlignGraph: A Group of Generative Models for Graphs. (arXiv:2301.11273v1 [cs.SI])
    It is challenging for generative models to learn a distribution over graphs because of the lack of permutation invariance: nodes may be ordered arbitrarily across graphs, and standard graph alignment is combinatorial and notoriously expensive. We propose AlignGraph, a group of generative models that combine fast and efficient graph alignment methods with a family of deep generative models that are invariant to node permutations. Our experiments demonstrate that our framework successfully learns graph distributions, outperforming competitors by 25% -560% in relevant performance scores.  ( 2 min )
    Maximum Optimality Margin: A Unified Approach for Contextual Linear Programming and Inverse Linear Programming. (arXiv:2301.11260v1 [cs.LG])
    In this paper, we study the predict-then-optimize problem where the output of a machine learning prediction task is used as the input of some downstream optimization problem, say, the objective coefficient vector of a linear program. The problem is also known as predictive analytics or contextual linear programming. The existing approaches largely suffer from either (i) optimization intractability (a non-convex objective function)/statistical inefficiency (a suboptimal generalization bound) or (ii) requiring strong condition(s) such as no constraint or loss calibration. We develop a new approach to the problem called \textit{maximum optimality margin} which designs the machine learning loss function by the optimality condition of the downstream optimization. The max-margin formulation enjoys both computational efficiency and good theoretical properties for the learning procedure. More importantly, our new approach only needs the observations of the optimal solution in the training data rather than the objective function, which makes it a new and natural approach to the inverse linear programming problem under both contextual and context-free settings; we also analyze the proposed method under both offline and online settings, and demonstrate its performance using numerical experiments.  ( 2 min )
    A Benchmark Study by using various Machine Learning Models for Predicting Covid-19 trends. (arXiv:2301.11257v1 [cs.LG])
    Machine learning and deep learning play vital roles in predicting diseases in the medical field. Machine learning algorithms are widely classified as supervised, unsupervised, and reinforcement learning. This paper contains a detailed description of our experimental research work in that we used a supervised machine-learning algorithm to build our model for outbreaks of the novel Coronavirus that has spread over the whole world and caused many deaths, which is one of the most disastrous Pandemics in the history of the world. The people suffered physically and economically to survive in this lockdown. This work aims to understand better how machine learning, ensemble, and deep learning models work and are implemented in the real dataset. In our work, we are going to analyze the current trend or pattern of the coronavirus and then predict the further future of the covid-19 confirmed cases or new cases by training the past Covid-19 dataset by using the machine learning algorithm such as Linear Regression, Polynomial Regression, K-nearest neighbor, Decision Tree, Support Vector Machine and Random forest algorithm are used to train the model. The decision tree and the Random Forest algorithm perform better than SVR in this work. The performance of SVR and lasso regression are low in all prediction areas Because the SVR is challenging to separate the data using the hyperplane for this type of problem. So SVR mostly gives a lower performance in this problem. Ensemble (Voting, Bagging, and Stacking) and deep learning models(ANN) also predict well. After the prediction, we evaluated the model using MAE, MSE, RMSE, and MAPE. This work aims to find the trend/pattern of the covid-19.  ( 3 min )
    Molecular Language Model as Multi-task Generator. (arXiv:2301.11259v1 [cs.LG])
    Molecule generation with desired properties has grown immensely in popularity by disruptively changing the way scientists design molecular structures and providing support for chemical and materials design. However, despite the promising outcome, previous machine learning-based deep generative models suffer from a reliance on complex, task-specific fine-tuning, limited dimensional latent spaces, or the quality of expert rules. In this work, we propose MolGen, a pre-trained molecular language model that effectively learns and shares knowledge across multiple generation tasks and domains. Specifically, we pre-train MolGen with the chemical language SELFIES on more than 100 million unlabelled molecules. We further propose multi-task molecular prefix tuning across several molecular generation tasks and different molecular domains (synthetic & natural products) with a self-feedback mechanism. Extensive experiments show that MolGen can obtain superior performances on well-known molecular generation benchmark datasets. The further analysis illustrates that MolGen can accurately capture the distribution of molecules, implicitly learn their structural characteristics, and efficiently explore the chemical space with the guidance of multi-task molecular prefix tuning. Codes, datasets, and the pre-trained model will be available in https://github.com/zjunlp/MolGen.  ( 2 min )
    BiBench: Benchmarking and Analyzing Network Binarization. (arXiv:2301.11233v1 [cs.CV])
    Network binarization emerges as one of the most promising compression approaches offering extraordinary computation and memory savings by minimizing the bit-width. However, recent research has shown that applying existing binarization algorithms to diverse tasks, architectures, and hardware in realistic scenarios is still not straightforward. Common challenges of binarization, such as accuracy degradation and efficiency limitation, suggest that its attributes are not fully understood. To close this gap, we present BiBench, a rigorously designed benchmark with in-depth analysis for network binarization. We first carefully scrutinize the requirements of binarization in the actual production and define evaluation tracks and metrics for a comprehensive and fair investigation. Then, we evaluate and analyze a series of milestone binarization algorithms that function at the operator level and with extensive influence. Our benchmark reveals that 1) the binarized operator has a crucial impact on the performance and deployability of binarized networks; 2) the accuracy of binarization varies significantly across different learning tasks and neural architectures; 3) binarization has demonstrated promising efficiency potential on edge devices despite the limited hardware support. The results and analysis also lead to a promising paradigm for accurate and efficient binarization. We believe that BiBench will contribute to the broader adoption of binarization and serve as a foundation for future research.  ( 2 min )
    Semi-Supervised Image Captioning by Adversarially Propagating Labeled Data. (arXiv:2301.11174v1 [cs.CV])
    We present a novel data-efficient semi-supervised framework to improve the generalization of image captioning models. Constructing a large-scale labeled image captioning dataset is an expensive task in terms of labor, time, and cost. In contrast to manually annotating all the training samples, separately collecting uni-modal datasets is immensely easier, e.g., a large-scale image dataset and a sentence dataset. We leverage such massive unpaired image and caption data upon standard paired data by learning to associate them. To this end, our proposed semi-supervised learning method assigns pseudo-labels to unpaired samples in an adversarial learning fashion, where the joint distribution of image and caption is learned. Our method trains a captioner to learn from a paired data and to progressively associate unpaired data. This approach shows noticeable performance improvement even in challenging scenarios including out-of-task data (i.e., relational captioning, where the target task is different from the unpaired data) and web-crawled data. We also show that our proposed method is theoretically well-motivated and has a favorable global optimal property. Our extensive and comprehensive empirical results both on (1) image-based and (2) dense region-based captioning datasets followed by comprehensive analysis on the scarcely-paired COCO dataset demonstrate the consistent effectiveness of our semisupervised learning method with unpaired data compared to competing methods.  ( 2 min )
    Causal Graph Discovery from Self and Mutually Exciting Time Series. (arXiv:2301.11197v1 [cs.LG])
    We present a generalized linear structural causal model, coupled with a novel data-adaptive linear regularization, to recover causal directed acyclic graphs (DAGs) from time series. By leveraging a recently developed stochastic monotone Variational Inequality (VI) formulation, we cast the causal discovery problem as a general convex optimization. Furthermore, we develop a non-asymptotic recovery guarantee and quantifiable uncertainty by solving a linear program to establish confidence intervals for a wide range of non-linear monotone link functions. We validate our theoretical results and show the competitive performance of our method via extensive numerical experiments. Most importantly, we demonstrate the effectiveness of our approach in recovering highly interpretable causal DAGs over Sepsis Associated Derangements (SADs) while achieving comparable prediction performance to powerful ``black-box'' models such as XGBoost. Thus, the future adoption of our proposed method to conduct continuous surveillance of high-risk patients by clinicians is much more likely.  ( 2 min )
    Returning The Favour: When Regression Benefits From Probabilistic Causal Knowledge. (arXiv:2301.11214v1 [stat.ML])
    A directed acyclic graph (DAG) provides valuable prior knowledge that is often discarded in regression tasks in machine learning. We show that the independences arising from the presence of collider structures in DAGs provide meaningful inductive biases, which constrain the regression hypothesis space and improve predictive performance. We introduce collider regression, a framework to incorporate probabilistic causal knowledge from a collider in a regression problem. When the hypothesis space is a reproducing kernel Hilbert space, we prove a strictly positive generalisation benefit under mild assumptions and provide closed-form estimators of the empirical risk minimiser. Experiments on synthetic and climate model data demonstrate performance gains of the proposed methodology.  ( 2 min )
    A Graph Neural Network with Negative Message Passing for Graph Coloring. (arXiv:2301.11164v1 [cs.LG])
    Graph neural networks have received increased attention over the past years due to their promising ability to handle graph-structured data, which can be found in many real-world problems such as recommended systems and drug synthesis. Most existing research focuses on using graph neural networks to solve homophilous problems, but little attention has been paid to heterophily-type problems. In this paper, we propose a graph network model for graph coloring, which is a class of representative heterophilous problems. Different from the conventional graph networks, we introduce negative message passing into the proposed graph neural network for more effective information exchange in handling graph coloring problems. Moreover, a new loss function taking into account the self-information of the nodes is suggested to accelerate the learning process. Experimental studies are carried out to compare the proposed graph model with five state-of-the-art algorithms on ten publicly available graph coloring problems and one real-world application. Numerical results demonstrate the effectiveness of the proposed graph neural network.  ( 2 min )
    Deep Laplacian-based Options for Temporally-Extended Exploration. (arXiv:2301.11181v1 [cs.LG])
    Selecting exploratory actions that generate a rich stream of experience for better learning is a fundamental challenge in reinforcement learning (RL). An approach to tackle this problem consists in selecting actions according to specific policies for an extended period of time, also known as options. A recent line of work to derive such exploratory options builds upon the eigenfunctions of the graph Laplacian. Importantly, until now these methods have been mostly limited to tabular domains where (1) the graph Laplacian matrix was either given or could be fully estimated, (2) performing eigendecomposition on this matrix was computationally tractable, and (3) value functions could be learned exactly. Additionally, these methods required a separate option discovery phase. These assumptions are fundamentally not scalable. In this paper we address these limitations and show how recent results for directly approximating the eigenfunctions of the Laplacian can be leveraged to truly scale up options-based exploration. To do so, we introduce a fully online deep RL algorithm for discovering Laplacian-based options and evaluate our approach on a variety of pixel-based tasks. We compare to several state-of-the-art exploration methods and show that our approach is effective, general, and especially promising in non-stationary settings.  ( 2 min )
    Flex-Net: A Graph Neural Network Approach to Resource Management in Flexible Duplex Networks. (arXiv:2301.11166v1 [cs.NI])
    Flexible duplex networks allow users to dynamically employ uplink and downlink channels without static time scheduling, thereby utilizing the network resources efficiently. This work investigates the sum-rate maximization of flexible duplex networks. In particular, we consider a network with pairwise-fixed communication links. Corresponding combinatorial optimization is a non-deterministic polynomial (NP)-hard without a closed-form solution. In this respect, the existing heuristics entail high computational complexity, raising a scalability issue in large networks. Motivated by the recent success of Graph Neural Networks (GNNs) in solving NP-hard wireless resource management problems, we propose a novel GNN architecture, named Flex-Net, to jointly optimize the communication direction and transmission power. The proposed GNN produces near-optimal performance meanwhile maintaining a low computational complexity compared to the most commonly used techniques. Furthermore, our numerical results shed light on the advantages of using GNNs in terms of sample complexity, scalability, and generalization capability.  ( 2 min )
    Train Hard, Fight Easy: Robust Meta Reinforcement Learning. (arXiv:2301.11147v1 [cs.LG])
    A major challenge of reinforcement learning (RL) in real-world applications is the variation between environments, tasks or clients. Meta-RL (MRL) addresses this issue by learning a meta-policy that adapts to new tasks. Standard MRL methods optimize the average return over tasks, but often suffer from poor results in tasks of high risk or difficulty. This limits system reliability whenever test tasks are not known in advance. In this work, we propose a robust MRL objective with a controlled robustness level. Optimization of analogous robust objectives in RL often leads to both biased gradients and data inefficiency. We prove that the former disappears in MRL, and address the latter via the novel Robust Meta RL algorithm (RoML). RoML is a meta-algorithm that generates a robust version of any given MRL algorithm, by identifying and over-sampling harder tasks throughout training. We demonstrate that RoML learns substantially different meta-policies and achieves robust returns on several navigation and continuous control benchmarks.  ( 2 min )
    Which Experiences Are Influential for Your Agent? Policy Iteration with Turn-over Dropout. (arXiv:2301.11168v1 [cs.LG])
    In reinforcement learning (RL) with experience replay, experiences stored in a replay buffer influence the RL agent's performance. Information about the influence is valuable for various purposes, including experience cleansing and analysis. One method for estimating the influence of individual experiences is agent comparison, but it is prohibitively expensive when there is a large number of experiences. In this paper, we present PI+ToD as a method for efficiently estimating the influence of experiences. PI+ToD is a policy iteration that efficiently estimates the influence of experiences by utilizing turn-over dropout. We demonstrate the efficiency of PI+ToD with experiments in MuJoCo environments.  ( 2 min )
    Convolutional Learning on Simplicial Complexes. (arXiv:2301.11163v1 [cs.LG])
    We propose a simplicial complex convolutional neural network (SCCNN) to learn data representations on simplicial complexes. It performs convolutions based on the multi-hop simplicial adjacencies via common faces and cofaces independently and captures the inter-simplicial couplings, generalizing state-of-the-art. Upon studying symmetries of the simplicial domain and the data space, it is shown to be permutation and orientation equivariant, thus, incorporating such inductive biases. Based on the Hodge theory, we perform a spectral analysis to understand how SCCNNs regulate data in different frequencies, showing that the convolutions via faces and cofaces operate in two orthogonal data spaces. Lastly, we study the stability of SCCNNs to domain deformations and examine the effects of various factors. Empirical results show the benefits of higher-order convolutions and inter-simplicial couplings in simplex prediction and trajectory prediction.  ( 2 min )
    simple diffusion: End-to-end diffusion for high resolution images. (arXiv:2301.11093v1 [cs.CV])
    Currently, applying diffusion models in pixel space of high resolution images is difficult. Instead, existing approaches focus on diffusion in lower dimensional spaces (latent diffusion), or have multiple super-resolution levels of generation referred to as cascades. The downside is that these approaches add additional complexity to the diffusion framework. This paper aims to improve denoising diffusion for high resolution images while keeping the model as simple as possible. The paper is centered around the research question: How can one train a standard denoising diffusion models on high resolution images, and still obtain performance comparable to these alternate approaches? The four main findings are: 1) the noise schedule should be adjusted for high resolution images, 2) It is sufficient to scale only a particular part of the architecture, 3) dropout should be added at specific locations in the architecture, and 4) downsampling is an effective strategy to avoid high resolution feature maps. Combining these simple yet effective techniques, we achieve state-of-the-art on image generation among diffusion models without sampling modifiers on ImageNet.  ( 2 min )
    Federated Learning over Coupled Graphs. (arXiv:2301.11099v1 [cs.LG])
    Graphs are widely used to represent the relations among entities. When one owns the complete data, an entire graph can be easily built, therefore performing analysis on the graph is straightforward. However, in many scenarios, it is impractical to centralize the data due to data privacy concerns. An organization or party only keeps a part of the whole graph data, i.e., graph data is isolated from different parties. Recently, Federated Learning (FL) has been proposed to solve the data isolation issue, mainly for Euclidean data. It is still a challenge to apply FL on graph data because graphs contain topological information which is notorious for its non-IID nature and is hard to partition. In this work, we propose a novel FL framework for graph data, FedCog, to efficiently handle coupled graphs that are a kind of distributed graph data, but widely exist in a variety of real-world applications such as mobile carriers' communication networks and banks' transaction networks. We theoretically prove the correctness and security of FedCog. Experimental results demonstrate that our method FedCog significantly outperforms traditional FL methods on graphs. Remarkably, our FedCog improves the accuracy of node classification tasks by up to 14.7%.  ( 2 min )
    FedHQL: Federated Heterogeneous Q-Learning. (arXiv:2301.11135v1 [cs.LG])
    Federated Reinforcement Learning (FedRL) encourages distributed agents to learn collectively from each other's experience to improve their performance without exchanging their raw trajectories. The existing work on FedRL assumes that all participating agents are homogeneous, which requires all agents to share the same policy parameterization (e.g., network architectures and training configurations). However, in real-world applications, agents are often in disagreement about the architecture and the parameters, possibly also because of disparate computational budgets. Because homogeneity is not given in practice, we introduce the problem setting of Federated Reinforcement Learning with Heterogeneous And bLack-box agEnts (FedRL-HALE). We present the unique challenges this new setting poses and propose the Federated Heterogeneous Q-Learning (FedHQL) algorithm that principally addresses these challenges. We empirically demonstrate the efficacy of FedHQL in boosting the sample efficiency of heterogeneous agents with distinct policy parameterization using standard RL tasks.  ( 2 min )
    Learning from Multiple Independent Advisors in Multi-agent Reinforcement Learning. (arXiv:2301.11153v1 [cs.LG])
    Multi-agent reinforcement learning typically suffers from the problem of sample inefficiency, where learning suitable policies involves the use of many data samples. Learning from external demonstrators is a possible solution that mitigates this problem. However, most prior approaches in this area assume the presence of a single demonstrator. Leveraging multiple knowledge sources (i.e., advisors) with expertise in distinct aspects of the environment could substantially speed up learning in complex environments. This paper considers the problem of simultaneously learning from multiple independent advisors in multi-agent reinforcement learning. The approach leverages a two-level Q-learning architecture, and extends this framework from single-agent to multi-agent settings. We provide principled algorithms that incorporate a set of advisors by both evaluating the advisors at each state and subsequently using the advisors to guide action selection. We also provide theoretical convergence and sample complexity guarantees. Experimentally, we validate our approach in three different test-beds and show that our algorithms give better performances than baselines, can effectively integrate the combined expertise of different advisors, and learn to ignore bad advice.  ( 2 min )
    SQ Lower Bounds for Random Sparse Planted Vector Problem. (arXiv:2301.11124v1 [cs.LG])
    Consider the setting where a $\rho$-sparse Rademacher vector is planted in a random $d$-dimensional subspace of $R^n$. A classical question is how to recover this planted vector given a random basis in this subspace. A recent result by [ZSWB21] showed that the Lattice basis reduction algorithm can recover the planted vector when $n\geq d+1$. Although the algorithm is not expected to tolerate inverse polynomial amount of noise, it is surprising because it was previously shown that recovery cannot be achieved by low degree polynomials when $n\ll \rho^2 d^{2}$ [MW21]. A natural question is whether we can derive an Statistical Query (SQ) lower bound matching the previous low degree lower bound in [MW21]. This will - imply that the SQ lower bound can be surpassed by lattice based algorithms; - predict the computational hardness when the planted vector is perturbed by inverse polynomial amount of noise. In this paper, we prove such an SQ lower bound. In particular, we show that super-polynomial number of VSTAT queries is needed to solve the easier statistical testing problem when $n\ll \rho^2 d^{2}$ and $\rho\gg \frac{1}{\sqrt{d}}$. The most notable technique we used to derive the SQ lower bound is the almost equivalence relationship between SQ lower bound and low degree lower bound [BBH+20, MW21].  ( 2 min )
    Bayesian Detection of Mesoscale Structures in Pathway Data on Graphs. (arXiv:2301.11120v1 [stat.ME])
    Mesoscale structures are an integral part of the abstraction and analysis of complex systems. They reveal a node's function in the network, and facilitate our understanding of the network dynamics. For example, they can represent communities in social or citation networks, roles in corporate interactions, or core-periphery structures in transportation networks. We usually detect mesoscale structures under the assumption of independence of interactions. Still, in many cases, the interactions invalidate this assumption by occurring in a specific order. Such patterns emerge in pathway data; to capture them, we have to model the dependencies between interactions using higher-order network models. However, the detection of mesoscale structures in higher-order networks is still under-researched. In this work, we derive a Bayesian approach that simultaneously models the optimal partitioning of nodes in groups and the optimal higher-order network dynamics between the groups. In synthetic data we demonstrate that our method can recover both standard proximity-based communities and role-based groupings of nodes. In synthetic and real world data we show that it can compete with baseline techniques, while additionally providing interpretable abstractions of network dynamics.  ( 2 min )
    Minerva: A File-Based Ransomware Detector. (arXiv:2301.11050v1 [cs.CR])
    Ransomware is a rapidly evolving type of malware designed to encrypt user files on a device, making them inaccessible in order to exact a ransom. Ransomware attacks resulted in billions of dollars in damages in recent years and are expected to cause hundreds of billions more in the next decade. With current state-of-the-art process-based detectors being heavily susceptible to evasion attacks, no comprehensive solution to this problem is available today. This paper presents Minerva, a new approach to ransomware detection. Unlike current methods focused on identifying ransomware based on process-level behavioral modeling, Minerva detects ransomware by building behavioral profiles of files based on all the operations they receive in a time window. Minerva addresses some of the critical challenges associated with process-based approaches, specifically their vulnerability to complex evasion attacks. Our evaluation of Minerva demonstrates its effectiveness in detecting ransomware attacks, including those that are able to bypass existing defenses. Our results show that Minerva identifies ransomware activity with an average accuracy of 99.45% and an average recall of 99.66%, with 99.97% of ransomware detected within 1 second.  ( 2 min )
    Incomplete Multi-view Clustering via Prototype-based Imputation. (arXiv:2301.11045v1 [cs.LG])
    In this paper, we study how to achieve two characteristics highly-expected by incomplete multi-view clustering (IMvC). Namely, i) instance commonality refers to that within-cluster instances should share a common pattern, and ii) view versatility refers to that cross-view samples should own view-specific patterns. To this end, we design a novel dual-stream model which employs a dual attention layer and a dual contrastive learning loss to learn view-specific prototypes and model the sample-prototype relationship. When the view is missed, our model performs data recovery using the prototypes in the missing view and the sample-prototype relationship inherited from the observed view. Thanks to our dual-stream model, both cluster- and view-specific information could be captured, and thus the instance commonality and view versatility could be preserved to facilitate IMvC. Extensive experiments demonstrate the superiority of our method on six challenging benchmarks compared with 11 approaches. The code will be released.  ( 2 min )
    Random Grid Neural Processes for Parametric Partial Differential Equations. (arXiv:2301.11040v1 [cs.LG])
    We introduce a new class of spatially stochastic physics and data informed deep latent models for parametric partial differential equations (PDEs) which operate through scalable variational neural processes. We achieve this by assigning probability measures to the spatial domain, which allows us to treat collocation grids probabilistically as random variables to be marginalised out. Adapting this spatial statistics view, we solve forward and inverse problems for parametric PDEs in a way that leads to the construction of Gaussian process models of solution fields. The implementation of these random grids poses a unique set of challenges for inverse physics informed deep learning frameworks and we propose a new architecture called Grid Invariant Convolutional Networks (GICNets) to overcome these challenges. We further show how to incorporate noisy data in a principled manner into our physics informed model to improve predictions for problems where data may be available but whose measurement location does not coincide with any fixed mesh or grid. The proposed method is tested on a nonlinear Poisson problem, Burgers equation, and Navier-Stokes equations, and we provide extensive numerical comparisons. We demonstrate significant computational advantages over current physics informed neural learning methods for parametric PDEs while improving the predictive capabilities and flexibility of these models.  ( 2 min )
    Inspecting class hierarchies in classification-based metric learning models. (arXiv:2301.11065v1 [cs.LG])
    Most classification models treat all misclassifications equally. However, different classes may be related, and these hierarchical relationships must be considered in some classification problems. These problems can be addressed by using hierarchical information during training. Unfortunately, this information is not available for all datasets. Many classification-based metric learning methods use class representatives in embedding space to represent different classes. The relationships among the learned class representatives can then be used to estimate class hierarchical structures. If we have a predefined class hierarchy, the learned class representatives can be assessed to determine whether the metric learning model learned semantic distances that match our prior knowledge. In this work, we train a softmax classifier and three metric learning models with several training options on benchmark and real-world datasets. In addition to the standard classification accuracy, we evaluate the hierarchical inference performance by inspecting learned class representatives and the hierarchy-informed performance, i.e., the classification performance, and the metric learning performance by considering predefined hierarchical structures. Furthermore, we investigate how the considered measures are affected by various models and training options. When our proposed ProxyDR model is trained without using predefined hierarchical structures, the hierarchical inference performance is significantly better than that of the popular NormFace model. Additionally, our model enhances some hierarchy-informed performance measures under the same training options. We also found that convolutional neural networks (CNNs) with random weights correspond to the predefined hierarchies better than random chance.  ( 2 min )
    PerfSAGE: Generalized Inference Performance Predictor for Arbitrary Deep Learning Models on Edge Devices. (arXiv:2301.10999v1 [cs.LG])
    The ability to accurately predict deep neural network (DNN) inference performance metrics, such as latency, power, and memory footprint, for an arbitrary DNN on a target hardware platform is essential to the design of DNN based models. This ability is critical for the (manual or automatic) design, optimization, and deployment of practical DNNs for a specific hardware deployment platform. Unfortunately, these metrics are slow to evaluate using simulators (where available) and typically require measurement on the target hardware. This work describes PerfSAGE, a novel graph neural network (GNN) that predicts inference latency, energy, and memory footprint on an arbitrary DNN TFlite graph (TFL, 2017). In contrast, previously published performance predictors can only predict latency and are restricted to pre-defined construction rules or search spaces. This paper also describes the EdgeDLPerf dataset of 134,912 DNNs randomly sampled from four task search spaces and annotated with inference performance metrics from three edge hardware platforms. Using this dataset, we train PerfSAGE and provide experimental results that demonstrate state-of-the-art prediction accuracy with a Mean Absolute Percentage Error of <5% across all targets and model search spaces. These results: (1) Outperform previous state-of-art GNN-based predictors (Dudziak et al., 2020), (2) Accurately predict performance on accelerators (a shortfall of non-GNN-based predictors (Zhang et al., 2021)), and (3) Demonstrate predictions on arbitrary input graphs without modifications to the feature extractor.  ( 2 min )
    WL meet VC. (arXiv:2301.11039v1 [cs.LG])
    Recently, many works studied the expressive power of graph neural networks (GNNs) by linking it to the $1$-dimensional Weisfeiler--Leman algorithm ($1\text{-}\mathsf{WL}$). Here, the $1\text{-}\mathsf{WL}$ is a well-studied heuristic for the graph isomorphism problem, which iteratively colors or partitions a graph's vertex set. While this connection has led to significant advances in understanding and enhancing GNNs' expressive power, it does not provide insights into their generalization performance, i.e., their ability to make meaningful predictions beyond the training set. In this paper, we study GNNs' generalization ability through the lens of Vapnik--Chervonenkis (VC) dimension theory in two settings, focusing on graph-level predictions. First, when no upper bound on the graphs' order is known, we show that the bitlength of GNNs' weights tightly bounds their VC dimension. Further, we derive an upper bound for GNNs' VC dimension using the number of colors produced by the $1\text{-}\mathsf{WL}$. Secondly, when an upper bound on the graphs' order is known, we show a tight connection between the number of graphs distinguishable by the $1\text{-}\mathsf{WL}$ and GNNs' VC dimension. Our empirical study confirms the validity of our theoretical findings.  ( 2 min )
    Multi-Agent congestion cost minimization with linear function approximation. (arXiv:2301.10993v1 [cs.LG])
    This work considers multiple agents traversing a network from a source node to the goal node. The cost to an agent for traveling a link has a private as well as a congestion component. The agent's objective is to find a path to the goal node with minimum overall cost in a decentralized way. We model this as a fully decentralized multi-agent reinforcement learning problem and propose a novel multi-agent congestion cost minimization (MACCM) algorithm. Our MACCM algorithm uses linear function approximations of transition probabilities and the global cost function. In the absence of a central controller and to preserve privacy, agents communicate the cost function parameters to their neighbors via a time-varying communication network. Moreover, each agent maintains its estimate of the global state-action value, which is updated via a multi-agent extended value iteration (MAEVI) sub-routine. We show that our MACCM algorithm achieves a sub-linear regret. The proof requires the convergence of cost function parameters, the MAEVI algorithm, and analysis of the regret bounds induced by the MAEVI triggering condition for each agent. We implement our algorithm on a two node network with multiple links to validate it. We first identify the optimal policy, the optimal number of agents going to the goal node in each period. We observe that the average regret is close to zero for 2 and 3 agents. The optimal policy captures the trade-off between the minimum cost of staying at a node and the congestion cost of going to the goal node. Our work is a generalization of learning the stochastic shortest path problem.  ( 2 min )
    Time-sensitive Learning for Heterogeneous Federated Edge Intelligence. (arXiv:2301.10977v1 [cs.LG])
    Real-time machine learning has recently attracted significant interest due to its potential to support instantaneous learning, adaptation, and decision making in a wide range of application domains, including self-driving vehicles, intelligent transportation, and industry automation. We investigate real-time ML in a federated edge intelligence (FEI) system, an edge computing system that implements federated learning (FL) solutions based on data samples collected and uploaded from decentralized data networks. FEI systems often exhibit heterogenous communication and computational resource distribution, as well as non-i.i.d. data samples, resulting in long model training time and inefficient resource utilization. Motivated by this fact, we propose a time-sensitive federated learning (TS-FL) framework to minimize the overall run-time for collaboratively training a shared ML model. Training acceleration solutions for both TS-FL with synchronous coordination (TS-FL-SC) and asynchronous coordination (TS-FL-ASC) are investigated. To address straggler effect in TS-FL-SC, we develop an analytical solution to characterize the impact of selecting different subsets of edge servers on the overall model training time. A server dropping-based solution is proposed to allow slow-performance edge servers to be removed from participating in model training if their impact on the resulting model accuracy is limited. A joint optimization algorithm is proposed to minimize the overall time consumption of model training by selecting participating edge servers, local epoch number. We develop an analytical expression to characterize the impact of staleness effect of asynchronous coordination and straggler effect of FL on the time consumption of TS-FL-ASC. Experimental results show that TS-FL-SC and TS-FL-ASC can provide up to 63% and 28% of reduction, in the overall model training time, respectively.  ( 2 min )
    A Fully First-Order Method for Stochastic Bilevel Optimization. (arXiv:2301.10945v1 [math.OC])
    We consider stochastic unconstrained bilevel optimization problems when only the first-order gradient oracles are available. While numerous optimization methods have been proposed for tackling bilevel problems, existing methods either tend to require possibly expensive calculations regarding Hessians of lower-level objectives, or lack rigorous finite-time performance guarantees. In this work, we propose a Fully First-order Stochastic Approximation (F2SA) method, and study its non-asymptotic convergence properties. Specifically, we show that F2SA converges to an $\epsilon$-stationary solution of the bilevel problem after $\epsilon^{-7/2}, \epsilon^{-5/2}$, and $\epsilon^{-3/2}$ iterations (each iteration using $O(1)$ samples) when stochastic noises are in both level objectives, only in the upper-level objective, and not present (deterministic settings), respectively. We further show that if we employ momentum-assisted gradient estimators, the iteration complexities can be improved to $\epsilon^{-5/2}, \epsilon^{-4/2}$, and $\epsilon^{-3/2}$, respectively. We demonstrate even superior practical performance of the proposed method over existing second-order based approaches on MNIST data-hypercleaning experiments.  ( 2 min )
    Graph Neural Networks can Recover the Hidden Features Solely from the Graph Structure. (arXiv:2301.10956v1 [cs.LG])
    Graph Neural Networks (GNNs) are popular models for graph learning problems. GNNs show strong empirical performance in many practical tasks. However, the theoretical properties have not been completely elucidated. In this paper, we investigate whether GNNs can exploit the graph structure from the perspective of the expressive power of GNNs. In our analysis, we consider graph generation processes that are controlled by hidden node features, which contain all information about the graph structure. A typical example of this framework is kNN graphs constructed from the hidden features. In our main results, we show that GNNs can recover the hidden node features from the input graph alone, even when all node features, including the hidden features themselves and any indirect hints, are unavailable. GNNs can further use the recovered node features for downstream tasks. These results show that GNNs can fully exploit the graph structure by themselves, and in effect, GNNs can use both the hidden and explicit node features for downstream tasks. In the experiments, we confirm the validity of our results by showing that GNNs can accurately recover the hidden features using a GNN architecture built based on our theoretical analysis.  ( 2 min )
    Visiting Distant Neighbors in Graph Convolutional Networks. (arXiv:2301.10960v1 [cs.LG])
    We extend the graph convolutional network method for deep learning on graph data to higher order in terms of neighboring nodes. In order to construct representations for a node in a graph, in addition to the features of the node and its immediate neighboring nodes, we also include more distant nodes in the calculations. In experimenting with a number of publicly available citation graph datasets, we show that this higher order neighbor visiting pays off by outperforming the original model especially when we have a limited number of available labeled data points for the training of the model.  ( 2 min )
    Privacy-Preserving Joint Edge Association and Power Optimization for the Internet of Vehicles via Federated Multi-Agent Reinforcement Learning. (arXiv:2301.11014v1 [cs.LG])
    Proactive edge association is capable of improving wireless connectivity at the cost of increased handover (HO) frequency and energy consumption, while relying on a large amount of private information sharing required for decision making. In order to improve the connectivity-cost trade-off without privacy leakage, we investigate the privacy-preserving joint edge association and power allocation (JEAPA) problem in the face of the environmental uncertainty and the infeasibility of individual learning. Upon modelling the problem by a decentralized partially observable Markov Decision Process (Dec-POMDP), it is solved by federated multi-agent reinforcement learning (FMARL) through only sharing encrypted training data for federatively learning the policy sought. Our simulation results show that the proposed solution strikes a compelling trade-off, while preserving a higher privacy level than the state-of-the-art solutions.  ( 2 min )
    Neural Dynamic Focused Topic Model. (arXiv:2301.10988v1 [cs.CL])
    Topic models and all their variants analyse text by learning meaningful representations through word co-occurrences. As pointed out by Williamson et al. (2010), such models implicitly assume that the probability of a topic to be active and its proportion within each document are positively correlated. This correlation can be strongly detrimental in the case of documents created over time, simply because recent documents are likely better described by new and hence rare topics. In this work we leverage recent advances in neural variational inference and present an alternative neural approach to the dynamic Focused Topic Model. Indeed, we develop a neural model for topic evolution which exploits sequences of Bernoulli random variables in order to track the appearances of topics, thereby decoupling their activities from their proportions. We evaluate our model on three different datasets (the UN general debates, the collection of NeurIPS papers, and the ACL Anthology dataset) and show that it (i) outperforms state-of-the-art topic models in generalization tasks and (ii) performs comparably to them on prediction tasks, while employing roughly the same number of parameters, and converging about two times faster. Source code to reproduce our experiments is available online.  ( 2 min )
    On the Importance of Noise Scheduling for Diffusion Models. (arXiv:2301.10972v1 [cs.CV])
    We empirically study the effect of noise scheduling strategies for denoising diffusion generative models. There are three findings: (1) the noise scheduling is crucial for the performance, and the optimal one depends on the task (e.g., image sizes), (2) when increasing the image size, the optimal noise scheduling shifts towards a noisier one (due to increased redundancy in pixels), and (3) simply scaling the input data by a factor of $b$ while keeping the noise schedule function fixed (equivalent to shifting the logSNR by $\log b$) is a good strategy across image sizes. This simple recipe, when combined with recently proposed Recurrent Interface Network (RIN), yields state-of-the-art pixel-based diffusion models for high-resolution images on ImageNet, enabling single-stage, end-to-end generation of diverse and high-fidelity images at 1024$\times$1024 resolution for the first time (without upsampling/cascades).  ( 2 min )
    Learning Large Scale Sparse Models. (arXiv:2301.10958v1 [stat.ML])
    In this work, we consider learning sparse models in large scale settings, where the number of samples and the feature dimension can grow as large as millions or billions. Two immediate issues occur under such challenging scenario: (i) computational cost; (ii) memory overhead. In particular, the memory issue precludes a large volume of prior algorithms that are based on batch optimization technique. To remedy the problem, we propose to learn sparse models such as Lasso in an online manner where in each iteration, only one randomly chosen sample is revealed to update a sparse iterate. Thereby, the memory cost is independent of the sample size and gradient evaluation for one sample is efficient. Perhaps amazingly, we find that with the same parameter, sparsity promoted by batch methods is not preserved in online fashion. We analyze such interesting phenomenon and illustrate some effective variants including mini-batch methods and a hard thresholding based stochastic gradient algorithm. Extensive experiments are carried out on a public dataset which supports our findings and algorithms.  ( 2 min )
    SparDA: Accelerating Dynamic Sparse Deep Neural Networks via Sparse-Dense Transformation. (arXiv:2301.10936v1 [cs.LG])
    Due to its high cost-effectiveness, sparsity has become the most important approach for building efficient deep-learning models. However, commodity accelerators are built mainly for efficient dense computation, creating a huge gap for general sparse computation to leverage. Existing solutions have to use time-consuming compiling to improve the efficiency of sparse kernels in an ahead-of-time manner and thus are limited to static sparsity. A wide range of dynamic sparsity opportunities is missed because their sparsity patterns are only known at runtime. This limits the future of building more biological brain-like neural networks that should be dynamically and sparsely activated. In this paper, we bridge the gap between sparse computation and commodity accelerators by proposing a system, called Spider, for efficiently executing deep learning models with dynamic sparsity. We identify an important property called permutation invariant that applies to most deep-learning computations. The property enables Spider (1) to extract dynamic sparsity patterns of tensors that are only known at runtime with little overhead; and (2) to transform the dynamic sparse computation into an equivalent dense computation which has been extremely optimized on commodity accelerators. Extensive evaluation on diverse models shows Spider can extract and transform dynamic sparsity with negligible overhead but brings up to 9.4x speedup over state-of-art solutions.  ( 2 min )
    Affective Faces for Goal-Driven Dyadic Communication. (arXiv:2301.10939v1 [cs.CV])
    We introduce a video framework for modeling the association between verbal and non-verbal communication during dyadic conversation. Given the input speech of a speaker, our approach retrieves a video of a listener, who has facial expressions that would be socially appropriate given the context. Our approach further allows the listener to be conditioned on their own goals, personalities, or backgrounds. Our approach models conversations through a composition of large language models and vision-language models, creating internal representations that are interpretable and controllable. To study multimodal communication, we propose a new video dataset of unscripted conversations covering diverse topics and demographics. Experiments and visualizations show our approach is able to output listeners that are significantly more socially appropriate than baselines. However, many challenges remain, and we release our dataset publicly to spur further progress. See our website for video results, data, and code: https://realtalk.cs.columbia.edu.  ( 2 min )
    Super-Resolution Analysis via Machine Learning: A Survey for Fluid Flows. (arXiv:2301.10937v1 [physics.flu-dyn])
    This paper surveys machine-learning-based super-resolution reconstruction for vortical flows. Super resolution aims to find the high-resolution flow fields from low-resolution data and is generally an approach used in image reconstruction. In addition to surveying a variety of recent super-resolution applications, we provide case studies of super-resolution analysis for an example of two-dimensional decaying isotropic turbulence. We demonstrate that physics-inspired model designs enable successful reconstruction of vortical flows from spatially limited measurements. We also discuss the challenges and outlooks of machine-learning-based super-resolution analysis for fluid flow applications. The insights gained from this study can be leveraged for super-resolution analysis of numerical and experimental flow data.  ( 2 min )
    On the Global Convergence of Risk-Averse Policy Gradient Methods with Dynamic Time-Consistent Risk Measures. (arXiv:2301.10932v1 [cs.LG])
    Risk-sensitive reinforcement learning (RL) has become a popular tool to control the risk of uncertain outcomes and ensure reliable performance in various sequential decision-making problems. While policy gradient methods have been developed for risk-sensitive RL, it remains unclear if these methods enjoy the same global convergence guarantees as in the risk-neutral case. In this paper, we consider a class of dynamic time-consistent risk measures, called Expected Conditional Risk Measures (ECRMs), and derive policy gradient updates for ECRM-based objective functions. Under both constrained direct parameterization and unconstrained softmax parameterization, we provide global convergence of the corresponding risk-averse policy gradient algorithms. We further test a risk-averse variant of REINFORCE algorithm on a stochastic Cliffwalk environment to demonstrate the efficacy of our algorithm and the importance of risk control.  ( 2 min )
    Efficient Trust Region-Based Safe Reinforcement Learning with Low-Bias Distributional Actor-Critic. (arXiv:2301.10923v1 [cs.LG])
    To apply reinforcement learning (RL) to real-world applications, agents are required to adhere to the safety guidelines of their respective domains. Safe RL can effectively handle the guidelines by converting them into constraints of the RL problem. In this paper, we develop a safe distributional RL method based on the trust region method, which can satisfy constraints consistently. However, policies may not meet the safety guidelines due to the estimation bias of distributional critics, and importance sampling required for the trust region method can hinder performance due to its significant variance. Hence, we enhance safety performance through the following approaches. First, we train distributional critics to have low estimation biases using proposed target distributions where bias-variance can be traded off. Second, we propose novel surrogates for the trust region method expressed with Q-functions using the reparameterization trick. Additionally, depending on initial policy settings, there can be no policy satisfying constraints within a trust region. To handle this infeasible issue, we propose a gradient integration method which guarantees to find a policy satisfying all constraints from an unsafe initial policy. From extensive experiments, the proposed method with risk-averse constraints shows minimal constraint violations while achieving high returns compared to existing safe RL methods.  ( 2 min )
    Partial advantage estimator for proximal policy optimization. (arXiv:2301.10920v1 [cs.LG])
    Estimation of value in policy gradient methods is a fundamental problem. Generalized Advantage Estimation (GAE) is an exponentially-weighted estimator of an advantage function similar to $\lambda$-return. It substantially reduces the variance of policy gradient estimates at the expense of bias. In practical applications, a truncated GAE is used due to the incompleteness of the trajectory, which results in a large bias during estimation. To address this challenge, instead of using the entire truncated GAE, we propose to take a part of it when calculating updates, which significantly reduces the bias resulting from the incomplete trajectory. We perform experiments in MuJoCo and $\mu$RTS to investigate the effect of different partial coefficient and sampling lengths. We show that our partial GAE approach yields better empirical results in both environments.  ( 2 min )
    SuperFed: Weight Shared Federated Learning. (arXiv:2301.10879v1 [cs.LG])
    Federated Learning (FL) is a well-established technique for privacy preserving distributed training. Much attention has been given to various aspects of FL training. A growing number of applications that consume FL-trained models, however, increasingly operate under dynamically and unpredictably variable conditions, rendering a single model insufficient. We argue for training a global family of models cost efficiently in a federated fashion. Training them independently for different tradeoff points incurs $O(k)$ cost for any k architectures of interest, however. Straightforward applications of FL techniques to recent weight-shared training approaches is either infeasible or prohibitively expensive. We propose SuperFed - an architectural framework that incurs $O(1)$ cost to co-train a large family of models in a federated fashion by leveraging weight-shared learning. We achieve an order of magnitude cost savings on both communication and computation by proposing two novel training mechanisms: (a) distribution of weight-shared models to federated clients, (b) central aggregation of arbitrarily overlapping weight-shared model parameters. The combination of these mechanisms is shown to reach an order of magnitude (9.43x) reduction in computation and communication cost for training a $5*10^{18}$-sized family of models, compared to independently training as few as $k = 9$ DNNs without any accuracy loss.  ( 2 min )
    Unsupervised Protein-Ligand Binding Energy Prediction via Neural Euler's Rotation Equation. (arXiv:2301.10814v1 [q-bio.BM])
    Protein-ligand binding prediction is a fundamental problem in AI-driven drug discovery. Prior work focused on supervised learning methods using a large set of binding affinity data for small molecules, but it is hard to apply the same strategy to other drug classes like antibodies as labelled data is limited. In this paper, we explore unsupervised approaches and reformulate binding energy prediction as a generative modeling task. Specifically, we train an energy-based model on a set of unlabelled protein-ligand complexes using SE(3) denoising score matching and interpret its log-likelihood as binding affinity. Our key contribution is a new equivariant rotation prediction network called Neural Euler's Rotation Equations (NERE) for SE(3) score matching. It predicts a rotation by modeling the force and torque between protein and ligand atoms, where the force is defined as the gradient of an energy function with respect to atom coordinates. We evaluate NERE on protein-ligand and antibody-antigen binding affinity prediction benchmarks. Our model outperforms all unsupervised baselines (physics-based and statistical potentials) and matches supervised learning methods in the antibody case.  ( 2 min )
    Joint action loss for proximal policy optimization. (arXiv:2301.10919v1 [cs.LG])
    PPO (Proximal Policy Optimization) is a state-of-the-art policy gradient algorithm that has been successfully applied to complex computer games such as Dota 2 and Honor of Kings. In these environments, an agent makes compound actions consisting of multiple sub-actions. PPO uses clipping to restrict policy updates. Although clipping is simple and effective, it is not efficient in its sample use. For compound actions, most PPO implementations consider the joint probability (density) of sub-actions, which means that if the ratio of a sample (state compound-action pair) exceeds the range, the gradient the sample produces is zero. Instead, for each sub-action we calculate the loss separately, which is less prone to clipping during updates thereby making better use of samples. Further, we propose a multi-action mixed loss that combines joint and separate probabilities. We perform experiments in Gym-$\mu$RTS and MuJoCo. Our hybrid model improves performance by more than 50\% in different MuJoCo environments compared to OpenAI's PPO benchmark results. And in Gym-$\mu$RTS, we find the sub-action loss outperforms the standard PPO approach, especially when the clip range is large. Our findings suggest this method can better balance the use-efficiency and quality of samples.  ( 2 min )
    Distilling Cognitive Backdoor Patterns within an Image. (arXiv:2301.10908v1 [cs.LG])
    This paper proposes a simple method to distill and detect backdoor patterns within an image: \emph{Cognitive Distillation} (CD). The idea is to extract the "minimal essence" from an input image responsible for the model's prediction. CD optimizes an input mask to extract a small pattern from the input image that can lead to the same model output (i.e., logits or deep features). The extracted pattern can help understand the cognitive mechanism of a model on clean vs. backdoor images and is thus called a \emph{Cognitive Pattern} (CP). Using CD and the distilled CPs, we uncover an interesting phenomenon of backdoor attacks: despite the various forms and sizes of trigger patterns used by different attacks, the CPs of backdoor samples are all surprisingly and suspiciously small. One thus can leverage the learned mask to detect and remove backdoor examples from poisoned training datasets. We conduct extensive experiments to show that CD can robustly detect a wide range of advanced backdoor attacks. We also show that CD can potentially be applied to help detect potential biases from face datasets. Code is available at \url{https://github.com/HanxunH/CognitiveDistillation}.  ( 2 min )
    Experimenting with an Evaluation Framework for Imbalanced Data Learning (EFIDL). (arXiv:2301.10888v1 [cs.LG])
    Introduction Data imbalance is one of the crucial issues in big data analysis with fewer labels. For example, in real-world healthcare data, spam detection labels, and financial fraud detection datasets. Many data balance methods were introduced to improve machine learning algorithms' performance. Research claims SMOTE and SMOTE-based data-augmentation (generate new data points) methods could improve algorithm performance. However, we found in many online tutorials, the valuation methods were applied based on synthesized datasets that introduced bias into the evaluation, and the performance got a false improvement. In this study, we proposed, a new evaluation framework for imbalanced data learning methods. We have experimented on five data balance methods and whether the performance of algorithms will improve or not. Methods We collected 8 imbalanced healthcare datasets with different imbalanced rates from different domains. Applied 6 data augmentation methods with 11 machine learning methods testing if the data augmentation will help with improving machine learning performance. We compared the traditional data augmentation evaluation methods with our proposed cross-validation evaluation framework Results Using traditional data augmentation evaluation meta hods will give a false impression of improving the performance. However, our proposed evaluation method shows data augmentation has limited ability to improve the results. Conclusion EFIDL is more suitable for evaluating the prediction performance of an ML method when data are augmented. Using an unsuitable evaluation framework will give false results. Future researchers should consider the evaluation framework we proposed when dealing with augmented datasets. Our experiments showed data augmentation does not help improve ML prediction performance.  ( 2 min )
    GPU-based Private Information Retrieval for On-Device Machine Learning Inference. (arXiv:2301.10904v1 [cs.CR])
    On-device machine learning (ML) inference can enable the use of private user data on user devices without remote servers. However, a pure on-device solution to private ML inference is impractical for many applications that rely on embedding tables that are too large to be stored on-device. To overcome this barrier, we propose the use of private information retrieval (PIR) to efficiently and privately retrieve embeddings from servers without sharing any private information during on-device ML inference. As off-the-shelf PIR algorithms are usually too computationally intensive to directly use for latency-sensitive inference tasks, we 1) develop a novel algorithm for accelerating PIR on GPUs, and 2) co-design PIR with the downstream ML application to obtain further speedup. Our GPU acceleration strategy improves system throughput by more than $20 \times$ over an optimized CPU PIR implementation, and our co-design techniques obtain over $5 \times$ additional throughput improvement at fixed model quality. Together, on various on-device ML applications such as recommendation and language modeling, our system on a single V100 GPU can serve up to $100,000$ queries per second -- a $>100 \times$ throughput improvement over a naively implemented system -- while maintaining model accuracy, and limiting inference communication and response latency to within $300$KB and $<100$ms respectively.  ( 2 min )
    Learning Gradients of Convex Functions with Monotone Gradient Networks. (arXiv:2301.10862v1 [cs.LG])
    While much effort has been devoted to deriving and studying effective convex formulations of signal processing problems, the gradients of convex functions also have critical applications ranging from gradient-based optimization to optimal transport. Recent works have explored data-driven methods for learning convex objectives, but learning their monotone gradients is seldom studied. In this work, we propose Cascaded and Modular Monotone Gradient Networks (C-MGN and M-MGN respectively), two monotone gradient neural network architectures for directly learning the gradients of convex functions. We show that our networks are simpler to train, learn monotone gradient fields more accurately, and use significantly fewer parameters than state of the art methods. We further demonstrate their ability to learn optimal transport mappings to augment driving image data.  ( 2 min )
    Reef-insight: A framework for reef habitat mapping with clustering methods via remote sensing. (arXiv:2301.10876v1 [cs.LG])
    Environmental damage has been of much concern, particularly coastal areas and the oceans given climate change and drastic effects of pollution and extreme climate events. Our present day analytical capabilities along with the advancements in information acquisition techniques such as remote sensing can be utilized for the management and study of coral reef ecosystems. In this paper, we present Reef-insight, an unsupervised machine learning framework that features advanced clustering methods and remote sensing for reef community mapping. Our framework compares different clustering methods to evaluate them for reef community mapping using remote sensing data. We evaluate four major clustering approaches such as k- means, hierarchical clustering, Gaussian mixture model, and density-based clustering based on qualitative and visual assessment. We utilise remote sensing data featuring Heron reef island region in the Great Barrier Reef of Australia. Our results indicate that clustering methods using remote sensing data can well identify benthic and geomorphic clusters that are found in reefs when compared to other studies. Our results indicate that Reef-insight can generate detailed reef community maps outlining distinct reef habitats and has the potential to enable further insights for reef restoration projects. We release our framework as open source software to enable its extension to different parts of the world  ( 2 min )
    Partial Mobilization: Tracking Multilingual Information Flows Amongst Russian Media Outlets and Telegram. (arXiv:2301.10856v1 [cs.CY])
    In response to disinformation and propaganda from Russian online media following the Russian invasion of Ukraine, Russian outlets including Russia Today and Sputnik News were banned throughout Europe. Many of these Russian outlets, in order to reach their audiences, began to heavily promote their content on messaging services like Telegram. In this work, to understand this phenomenon, we study how 16 Russian media outlets have interacted with and utilized 732 Telegram channels throughout 2022. To do this, we utilize a multilingual version of the foundational model MPNet to embed articles and Telegram messages in a shared embedding space and semantically compare content. Leveraging a parallelized version of DP-Means clustering, we perform paragraph-level topic/narrative extraction and time-series analysis with Hawkes Processes. With this approach, across our websites, we find between 2.3% (ura.news) and 26.7% (ukraina.ru) of their content originated/resulted from activity on Telegram. Finally, tracking the spread of individual narratives, we measure the rate at which these websites and channels disseminate content within the Russian media ecosystem.  ( 2 min )
    Automatic Intrinsic Reward Shaping for Exploration in Deep Reinforcement Learning. (arXiv:2301.10886v1 [cs.LG])
    We present AIRS: Automatic Intrinsic Reward Shaping that intelligently and adaptively provides high-quality intrinsic rewards to enhance exploration in reinforcement learning (RL). More specifically, AIRS selects shaping function from a predefined set based on the estimated task return in real-time, providing reliable exploration incentives and alleviating the biased objective problem. Moreover, we develop an intrinsic reward toolkit to provide efficient and reliable implementations of diverse intrinsic reward approaches. We test AIRS on various tasks of Procgen games and DeepMind Control Suite. Extensive simulation demonstrates that AIRS can outperform the benchmarking schemes and achieve superior performance with simple architecture.  ( 2 min )
    When Layers Play the Lottery, all Tickets Win at Initialization. (arXiv:2301.10835v1 [cs.LG])
    Pruning is a standard technique for reducing the computational cost of deep networks. Many advances in pruning leverage concepts from the Lottery Ticket Hypothesis (LTH). LTH reveals that inside a trained dense network exists sparse subnetworks (tickets) able to achieve similar accuracy (i.e., win the lottery - winning tickets). Pruning at initialization focuses on finding winning tickets without training a dense network. Studies on these concepts share the trend that subnetworks come from weight or filter pruning. In this work, we investigate LTH and pruning at initialization from the lens of layer pruning. First, we confirm the existence of winning tickets when the pruning process removes layers. Leveraged by this observation, we propose to discover these winning tickets at initialization, eliminating the requirement of heavy computational resources for training the initial (over-parameterized) dense network. Extensive experiments show that our winning tickets notably speed up the training phase and reduce up to 51% of carbon emission, an important step towards democratization and green Artificial Intelligence. Beyond computational benefits, our winning tickets exhibit robustness against adversarial and out-of-distribution examples. Finally, we show that our subnetworks easily win the lottery at initialization while tickets from filter removal (the standard structured LTH) hardly become winning tickets.  ( 2 min )
    RobustPdM: Designing Robust Predictive Maintenance against Adversarial Attacks. (arXiv:2301.10822v1 [cs.CR])
    The state-of-the-art predictive maintenance (PdM) techniques have shown great success in reducing maintenance costs and downtime of complicated machines while increasing overall productivity through extensive utilization of Internet-of-Things (IoT) and Deep Learning (DL). Unfortunately, IoT sensors and DL algorithms are both prone to cyber-attacks. For instance, DL algorithms are known for their susceptibility to adversarial examples. Such adversarial attacks are vastly under-explored in the PdM domain. This is because the adversarial attacks in the computer vision domain for classification tasks cannot be directly applied to the PdM domain for multivariate time series (MTS) regression tasks. In this work, we propose an end-to-end methodology to design adversarially robust PdM systems by extensively analyzing the effect of different types of adversarial attacks and proposing a novel adversarial defense technique for DL-enabled PdM models. First, we propose novel MTS Projected Gradient Descent (PGD) and MTS PGD with random restarts (PGD_r) attacks. Then, we evaluate the impact of MTS PGD and PGD_r along with MTS Fast Gradient Sign Method (FGSM) and MTS Basic Iterative Method (BIM) on Long Short-Term Memory (LSTM), Gated Recurrent Unit (GRU), Convolutional Neural Network (CNN), and Bi-directional LSTM based PdM system. Our results using NASA's turbofan engine dataset show that adversarial attacks can cause a severe defect (up to 11X) in the RUL prediction, outperforming the effectiveness of the state-of-the-art PdM attacks by 3X. Furthermore, we present a novel approximate adversarial training method to defend against adversarial attacks. We observe that approximate adversarial training can significantly improve the robustness of PdM models (up to 54X) and outperforms the state-of-the-art PdM defense methods by offering 3X more robustness.  ( 2 min )
    Salesforce CausalAI Library: A Fast and Scalable Framework for Causal Analysis of Time Series and Tabular Data. (arXiv:2301.10859v1 [cs.LG])
    We introduce the Salesforce CausalAI Library, an open-source library for causal analysis using observational data. It supports causal discovery and causal inference for tabular and time series data, of both discrete and continuous types. This library includes algorithms that handle linear and non-linear causal relationships between variables, and uses multi-processing for speed-up. We also include a data generator capable of generating synthetic data with specified structural equation model for both the aforementioned data formats and types, that helps users control the ground-truth causal process while investigating various algorithms. Finally, we provide a user interface (UI) that allows users to perform causal analysis on data without coding. The goal of this library is to provide a fast and flexible solution for a variety of problems in the domain of causality. This technical report describes the Salesforce CausalAI API along with its capabilities, the implementations of the supported algorithms, and experiments demonstrating their performance and speed. Our library is available at \url{https://github.com/salesforce/causalai}.  ( 2 min )
    Improved Bitcoin Price Prediction based on COVID-19 data. (arXiv:2301.10840v1 [cs.LG])
    Social turbulence can affect people financial decisions, causing changes in spending and saving. During a global turbulence as significant as the COVID-19 pandemic, such changes are inevitable. Here we examine how the effects of COVID-19 on various jurisdictions influenced the global price of Bitcoin. We hypothesize that lock downs and expectations of economic recession erode people trust in fiat (government-issued) currencies, thus elevating cryptocurrencies. Hence, we expect to identify a causal relation between the turbulence caused by the pandemic, demand for Bitcoin, and ultimately its price. To test the hypothesis, we merged datasets of Bitcoin prices and COVID-19 cases and deaths. We also engineered extra features and applied statistical and machine learning (ML) models. We applied a Random Forest model (RF) to identify and rank the feature importance, and ran a Long Short-Term Memory (LSTM) model on Bitcoin prices data set twice: with and without accounting for COVID-19 related features. We find that adding COVID-19 data into the LSTM model improved prediction of Bitcoin prices.  ( 2 min )
    On the inconsistency of separable losses for structured prediction. (arXiv:2301.10810v1 [cs.LG])
    In this paper, we prove that separable negative log-likelihood losses for structured prediction are not necessarily Bayes consistent, or, in other words, minimizing these losses may not result in a model that predicts the most probable structure in the data distribution for a given input. This fact opens the question of whether these losses are well-adapted for structured prediction and, if so, why.  ( 2 min )
    Unravelling physics beyond the standard model with classical and quantum anomaly detection. (arXiv:2301.10787v1 [hep-ex])
    Much hope for finding new physics phenomena at microscopic scale relies on the observations obtained from High Energy Physics experiments, like the ones performed at the Large Hadron Collider (LHC). However, current experiments do not indicate clear signs of new physics that could guide the development of additional Beyond Standard Model (BSM) theories. Identifying signatures of new physics out of the enormous amount of data produced at the LHC falls into the class of anomaly detection and constitutes one of the greatest computational challenges. In this article, we propose a novel strategy to perform anomaly detection in a supervised learning setting, based on the artificial creation of anomalies through a random process. For the resulting supervised learning problem, we successfully apply classical and quantum Support Vector Classifiers (CSVC and QSVC respectively) to identify the artificial anomalies among the SM events. Even more promising, we find that employing an SVC trained to identify the artificial anomalies, it is possible to identify realistic BSM events with high accuracy. In parallel, we also explore the potential of quantum algorithms for improving the classification accuracy and provide plausible conditions for the best exploitation of this novel computational paradigm.  ( 2 min )
    Increasing Fairness in Compromise on Accuracy via Weighted Vote with Learning Guarantees. (arXiv:2301.10813v1 [cs.LG])
    As the bias issue is being taken more and more seriously in widely applied machine learning systems, the decrease in accuracy in most cases deeply disturbs researchers when increasing fairness. To address this problem, we present a novel analysis of the expected fairness quality via weighted vote, suitable for both binary and multi-class classification. The analysis takes the correction of biased predictions by ensemble members into account and provides learning bounds that are amenable to efficient minimisation. We further propose a pruning method based on this analysis and the concepts of domination and Pareto optimality, which is able to increase fairness under a prerequisite of little or even no accuracy decline. The experimental results indicate that the proposed learning bounds are faithful and that the proposed pruning method can indeed increase ensemble fairness without much accuracy degradation.  ( 2 min )
    Gene-SGAN: a method for discovering disease subtypes with imaging and genetic signatures via multi-view weakly-supervised deep clustering. (arXiv:2301.10772v1 [q-bio.QM])
    Disease heterogeneity has been a critical challenge for precision diagnosis and treatment, especially in neurologic and neuropsychiatric diseases. Many diseases can display multiple distinct brain phenotypes across individuals, potentially reflecting disease subtypes that can be captured using MRI and machine learning methods. However, biological interpretability and treatment relevance are limited if the derived subtypes are not associated with genetic drivers or susceptibility factors. Herein, we describe Gene-SGAN - a multi-view, weakly-supervised deep clustering method - which dissects disease heterogeneity by jointly considering phenotypic and genetic data, thereby conferring genetic correlations to the disease subtypes and associated endophenotypic signatures. We first validate the generalizability, interpretability, and robustness of Gene-SGAN in semi-synthetic experiments. We then demonstrate its application to real multi-site datasets from 28,858 individuals, deriving subtypes of Alzheimer's disease and brain endophenotypes associated with hypertension, from MRI and SNP data. Derived brain phenotypes displayed significant differences in neuroanatomical patterns, genetic determinants, biological and clinical biomarkers, indicating potentially distinct underlying neuropathologic processes, genetic drivers, and susceptibility factors. Overall, Gene-SGAN is broadly applicable to disease subtyping and endophenotype discovery, and is herein tested on disease-related, genetically-driven neuroimaging phenotypes.  ( 2 min )
    Quantum anomaly detection in the latent space of proton collision events at the LHC. (arXiv:2301.10780v1 [quant-ph])
    We propose a new strategy for anomaly detection at the LHC based on unsupervised quantum machine learning algorithms. To accommodate the constraints on the problem size dictated by the limitations of current quantum hardware we develop a classical convolutional autoencoder. The designed quantum anomaly detection models, namely an unsupervised kernel machine and two clustering algorithms, are trained to find new-physics events in the latent representation of LHC data produced by the autoencoder. The performance of the quantum algorithms is benchmarked against classical counterparts on different new-physics scenarios and its dependence on the dimensionality of the latent space and the size of the training dataset is studied. For kernel-based anomaly detection, we identify a regime where the quantum model significantly outperforms its classical counterpart. An instance of the kernel machine is implemented on a quantum computer to verify its suitability for available hardware. We demonstrate that the observed consistent performance advantage is related to the inherent quantum properties of the circuit used.  ( 2 min )
    Evaluating Probabilistic Classifiers: The Triptych. (arXiv:2301.10803v1 [stat.ME])
    Probability forecasts for binary outcomes, often referred to as probabilistic classifiers or confidence scores, are ubiquitous in science and society, and methods for evaluating and comparing them are in great demand. We propose and study a triptych of diagnostic graphics that focus on distinct and complementary aspects of forecast performance: The reliability diagram addresses calibration, the receiver operating characteristic (ROC) curve diagnoses discrimination ability, and the Murphy diagram visualizes overall predictive performance and value. A Murphy curve shows a forecast's mean elementary scores, including the widely used misclassification rate, and the area under a Murphy curve equals the mean Brier score. For a calibrated forecast, the reliability curve lies on the diagonal, and for competing calibrated forecasts, the ROC and Murphy curves share the same number of crossing points. We invoke the recently developed CORP (Consistent, Optimally binned, Reproducible, and Pool-Adjacent-Violators (PAV) algorithm based) approach to craft reliability diagrams and decompose a mean score into miscalibration (MCB), discrimination (DSC), and uncertainty (UNC) components. Plots of the DSC measure of discrimination ability versus the calibration metric MCB visualize classifier performance across multiple competitors. The proposed tools are illustrated in empirical examples from astrophysics, economics, and social science.  ( 2 min )
    Graph Neural Tangent Kernel: Convergence on Large Graphs. (arXiv:2301.10808v1 [cs.LG])
    Graph neural networks (GNNs) achieve remarkable performance in graph machine learning tasks but can be hard to train on large-graph data, where their learning dynamics are not well understood. We investigate the training dynamics of large-graph GNNs using graph neural tangent kernels (GNTKs) and graphons. In the limit of large width, optimization of an overparametrized NN is equivalent to kernel regression on the NTK. Here, we investigate how the GNTK evolves as another independent dimension is varied: the graph size. We use graphons to define limit objects -- graphon NNs for GNNs, and graphon NTKs for GNTKs, and prove that, on a sequence of growing graphs, the GNTKs converge to the graphon NTK. We further prove that the eigenspaces of the GNTK, which are related to the problem learning directions and associated learning speeds, converge to the spectrum of the GNTK. This implies that in the large-graph limit, the GNTK fitted on a graph of moderate size can be used to solve the same task on the large-graph and infer the learning dynamics of the large-graph GNN. These results are verified empirically on node regression and node classification tasks.  ( 2 min )
    Generative Tertiary Structure-based RNA Design. (arXiv:2301.10774v1 [q-bio.BM])
    Learning from 3D biological macromolecules with artificial intelligence technologies has been an emerging area. Computational protein design, known as the inverse of protein structure prediction, aims to generate protein sequences that will fold into the defined structure. Analogous to protein design, RNA design is also an important topic in synthetic biology, which aims to generate RNA sequences by given structures. However, existing RNA design methods mainly focus on the secondary structure, ignoring the informative tertiary structure, which is commonly used in protein design. To explore the complex coupling between RNA sequence and 3D structure, we introduce an RNA tertiary structure modeling method to efficiently capture useful information from the 3D structure of RNA. For a fair comparison, we collect abundant RNA data and split the data according to tertiary structures. With the standard dataset, we conduct a benchmark by employing structure-based protein design approaches with our RNA tertiary structure modeling method. We believe our work will stimulate the future development of tertiary structure-based RNA design and bridge the gap between the RNA 3D structures and sequences.  ( 2 min )
  • Open

    Flowification: Everything is a Normalizing Flow. (arXiv:2205.15209v3 [cs.LG] UPDATED)
    The two key characteristics of a normalizing flow is that it is invertible (in particular, dimension preserving) and that it monitors the amount by which it changes the likelihood of data points as samples are propagated along the network. Recently, multiple generalizations of normalizing flows have been introduced that relax these two conditions. On the other hand, neural networks only perform a forward pass on the input, there is neither a notion of an inverse of a neural network nor is there one of its likelihood contribution. In this paper we argue that certain neural network architectures can be enriched with a stochastic inverse pass and that their likelihood contribution can be monitored in a way that they fall under the generalized notion of a normalizing flow mentioned above. We term this enrichment flowification. We prove that neural networks only containing linear layers, convolutional layers and invertible activations such as LeakyReLU can be flowified and evaluate them in the generative setting on image datasets.  ( 2 min )
    Smoothed Online Learning for Prediction in Piecewise Affine Systems. (arXiv:2301.11187v1 [stat.ML])
    The problem of piecewise affine (PWA) regression and planning is of foundational importance to the study of online learning, control, and robotics, where it provides a theoretically and empirically tractable setting to study systems undergoing sharp changes in the dynamics. Unfortunately, due to the discontinuities that arise when crossing into different ``pieces,'' learning in general sequential settings is impossible and practical algorithms are forced to resort to heuristic approaches. This paper builds on the recently developed smoothed online learning framework and provides the first algorithms for prediction and simulation in PWA systems whose regret is polynomial in all relevant problem parameters under a weak smoothness assumption; moreover, our algorithms are efficient in the number of calls to an optimization oracle. We further apply our results to the problems of one-step prediction and multi-step simulation regret in piecewise affine dynamical systems, where the learner is tasked with simulating trajectories and regret is measured in terms of the Wasserstein distance between simulated and true data. Along the way, we develop several technical tools of more general interest.  ( 2 min )
    Two-step interpretable modeling of Intensive Care Acquired Infections. (arXiv:2301.11146v1 [stat.AP])
    We present a novel methodology for integrating high resolution longitudinal data with the dynamic prediction capabilities of survival models. The aim is two-fold: to improve the predictive power while maintaining interpretability of the models. To go beyond the black box paradigm of artificial neural networks, we propose a parsimonious and robust semi-parametric approach (i.e., a landmarking competing risks model) that combines routinely collected low-resolution data with predictive features extracted from a convolutional neural network, that was trained on high resolution time-dependent information. We then use saliency maps to analyze and explain the extra predictive power of this model. To illustrate our methodology, we focus on healthcare-associated infections in patients admitted to an intensive care unit.  ( 2 min )
    Principled Reinforcement Learning with Human Feedback from Pairwise or $K$-wise Comparisons. (arXiv:2301.11270v1 [cs.LG])
    We provide a theoretical framework for Reinforcement Learning with Human Feedback (RLHF). Our analysis shows that when the true reward function is linear, the widely used maximum likelihood estimator (MLE) converges under both the Bradley-Terry-Luce (BTL) model and the Plackett-Luce (PL) model. However, we show that when training a policy based on the learned reward model, MLE fails while a pessimistic MLE provides policies with improved performance under certain coverage assumptions. Additionally, we demonstrate that under the PL model, the true MLE and an alternative MLE that splits the $K$-wise comparison into pairwise comparisons both converge. Moreover, the true MLE is asymptotically more efficient. Our results validate the empirical success of existing RLHF algorithms in InstructGPT and provide new insights for algorithm design. Furthermore, our results unify the problem of RLHF and Max Entropy Inverse Reinforcement Learning, and provide the first sample complexity bound for both problems.  ( 2 min )
    WL meet VC. (arXiv:2301.11039v1 [cs.LG])
    Recently, many works studied the expressive power of graph neural networks (GNNs) by linking it to the $1$-dimensional Weisfeiler--Leman algorithm ($1\text{-}\mathsf{WL}$). Here, the $1\text{-}\mathsf{WL}$ is a well-studied heuristic for the graph isomorphism problem, which iteratively colors or partitions a graph's vertex set. While this connection has led to significant advances in understanding and enhancing GNNs' expressive power, it does not provide insights into their generalization performance, i.e., their ability to make meaningful predictions beyond the training set. In this paper, we study GNNs' generalization ability through the lens of Vapnik--Chervonenkis (VC) dimension theory in two settings, focusing on graph-level predictions. First, when no upper bound on the graphs' order is known, we show that the bitlength of GNNs' weights tightly bounds their VC dimension. Further, we derive an upper bound for GNNs' VC dimension using the number of colors produced by the $1\text{-}\mathsf{WL}$. Secondly, when an upper bound on the graphs' order is known, we show a tight connection between the number of graphs distinguishable by the $1\text{-}\mathsf{WL}$ and GNNs' VC dimension. Our empirical study confirms the validity of our theoretical findings.  ( 2 min )
    Extending Adversarial Attacks to Produce Adversarial Class Probability Distributions. (arXiv:2004.06383v3 [cs.LG] UPDATED)
    Despite the remarkable performance and generalization levels of deep learning models in a wide range of artificial intelligence tasks, it has been demonstrated that these models can be easily fooled by the addition of imperceptible yet malicious perturbations to natural inputs. These altered inputs are known in the literature as adversarial examples. In this paper, we propose a novel probabilistic framework to generalize and extend adversarial attacks in order to produce a desired probability distribution for the classes when we apply the attack method to a large number of inputs. This novel attack paradigm provides the adversary with greater control over the target model, thereby exposing, in a wide range of scenarios, threats against deep learning models that cannot be conducted by the conventional paradigms. We introduce four different strategies to efficiently generate such attacks, and illustrate our approach by extending multiple adversarial attack algorithms. We also experimentally validate our approach for the spoken command classification task and the Tweet emotion classification task, two exemplary machine learning problems in the audio and text domain, respectively. Our results demonstrate that we can closely approximate any probability distribution for the classes while maintaining a high fooling rate and even prevent the attacks from being detected by label-shift detection methods.  ( 2 min )
    Maximum Optimality Margin: A Unified Approach for Contextual Linear Programming and Inverse Linear Programming. (arXiv:2301.11260v1 [cs.LG])
    In this paper, we study the predict-then-optimize problem where the output of a machine learning prediction task is used as the input of some downstream optimization problem, say, the objective coefficient vector of a linear program. The problem is also known as predictive analytics or contextual linear programming. The existing approaches largely suffer from either (i) optimization intractability (a non-convex objective function)/statistical inefficiency (a suboptimal generalization bound) or (ii) requiring strong condition(s) such as no constraint or loss calibration. We develop a new approach to the problem called \textit{maximum optimality margin} which designs the machine learning loss function by the optimality condition of the downstream optimization. The max-margin formulation enjoys both computational efficiency and good theoretical properties for the learning procedure. More importantly, our new approach only needs the observations of the optimal solution in the training data rather than the objective function, which makes it a new and natural approach to the inverse linear programming problem under both contextual and context-free settings; we also analyze the proposed method under both offline and online settings, and demonstrate its performance using numerical experiments.  ( 2 min )
    Inspecting class hierarchies in classification-based metric learning models. (arXiv:2301.11065v1 [cs.LG])
    Most classification models treat all misclassifications equally. However, different classes may be related, and these hierarchical relationships must be considered in some classification problems. These problems can be addressed by using hierarchical information during training. Unfortunately, this information is not available for all datasets. Many classification-based metric learning methods use class representatives in embedding space to represent different classes. The relationships among the learned class representatives can then be used to estimate class hierarchical structures. If we have a predefined class hierarchy, the learned class representatives can be assessed to determine whether the metric learning model learned semantic distances that match our prior knowledge. In this work, we train a softmax classifier and three metric learning models with several training options on benchmark and real-world datasets. In addition to the standard classification accuracy, we evaluate the hierarchical inference performance by inspecting learned class representatives and the hierarchy-informed performance, i.e., the classification performance, and the metric learning performance by considering predefined hierarchical structures. Furthermore, we investigate how the considered measures are affected by various models and training options. When our proposed ProxyDR model is trained without using predefined hierarchical structures, the hierarchical inference performance is significantly better than that of the popular NormFace model. Additionally, our model enhances some hierarchy-informed performance measures under the same training options. We also found that convolutional neural networks (CNNs) with random weights correspond to the predefined hierarchies better than random chance.  ( 2 min )
    Uncertain Evidence in Probabilistic Models and Stochastic Simulators. (arXiv:2210.12236v2 [stat.ML] UPDATED)
    We consider the problem of performing Bayesian inference in probabilistic models where observations are accompanied by uncertainty, referred to as "uncertain evidence." We explore how to interpret uncertain evidence, and by extension the importance of proper interpretation as it pertains to inference about latent variables. We consider a recently-proposed method "distributional evidence" as well as revisit two older methods: Jeffrey's rule and virtual evidence. We devise guidelines on how to account for uncertain evidence and we provide new insights, particularly regarding consistency. To showcase the impact of different interpretations of the same uncertain evidence, we carry out experiments in which one interpretation is defined as "correct." We then compare inference results from each different interpretation illustrating the importance of careful consideration of uncertain evidence.  ( 2 min )
    Your diffusion model secretly knows the dimension of the data manifold. (arXiv:2212.12611v2 [cs.LG] UPDATED)
    In this work, we propose a novel framework for estimating the dimension of the data manifold using a trained diffusion model. A diffusion model approximates the score function i.e. the gradient of the log density of a noise-corrupted version of the target distribution for varying levels of corruption. If the data concentrates around a manifold embedded in the high-dimensional ambient space, then as the level of corruption decreases, the score function points towards the manifold, as this direction becomes the direction of maximal likelihood increase. Therefore, for small levels of corruption, the diffusion model provides us with access to an approximation of the normal bundle of the data manifold. This allows us to estimate the dimension of the tangent space, thus, the intrinsic dimension of the data manifold. To the best of our knowledge, our method is the first deep-learning based estimator of the data manifold dimension and it outperforms well established statistical estimators in controlled experiments on both Euclidean and image data.  ( 2 min )
    Neural Continuous-Discrete State Space Models for Irregularly-Sampled Time Series. (arXiv:2301.11308v1 [cs.LG])
    Learning accurate predictive models of real-world dynamic phenomena (e.g., climate, biological) remains a challenging task. One key issue is that the data generated by both natural and artificial processes often comprise time series that are irregularly sampled and/or contain missing observations. In this work, we propose the Neural Continuous-Discrete State Space Model (NCDSSM) for continuous-time modeling of time series through discrete-time observations. NCDSSM employs auxiliary variables to disentangle recognition from dynamics, thus requiring amortized inference only for the auxiliary variables. Leveraging techniques from continuous-discrete filtering theory, we demonstrate how to perform accurate Bayesian inference for the dynamic states. We propose three flexible parameterizations of the latent dynamics and an efficient training objective that marginalizes the dynamic states during inference. Empirical results on multiple benchmark datasets across various domains show improved imputation and forecasting performance of NCDSSM over existing models.  ( 2 min )
    Causal Inference with Hidden Mediators. (arXiv:2111.02927v2 [math.ST] UPDATED)
    Proximal causal inference was recently proposed as a framework to identify causal effects from observational data in the presence of hidden confounders for which proxies are available. In this paper, we extend the proximal causal inference approach to settings where identification of causal effects hinges upon a set of mediators which are not observed, yet error prone proxies of the hidden mediators are measured. Specifically, (i) We establish causal hidden mediation analysis, which extends classical causal mediation analysis methods for identifying natural direct and indirect effects under no unmeasured confounding to a setting where the mediator of interest is hidden, but proxies of it are available. (ii) We establish hidden front-door criterion, which extends the classical front-door criterion to allow for hidden mediators for which proxies are available. (iii) We show that the identification of a certain causal effect called population intervention indirect effect remains possible with hidden mediators in settings where challenges in (i) and (ii) might co-exist. We view (i)-(iii) as important steps towards the practical application of front-door criteria and mediation analysis as mediators are almost always measured with error and thus, the most one can hope for in practice is that the measurements are at best proxies of mediating mechanisms. We propose identification approaches for the parameters of interest in our considered models. For the estimation aspect, we propose an influence function-based estimation method and provide an analysis for the robustness of the estimators.  ( 2 min )
    Incorporating Prior Knowledge into Neural Networks through an Implicit Composite Kernel. (arXiv:2205.07384v5 [cs.LG] UPDATED)
    It is challenging to guide neural network (NN) learning with prior knowledge. In contrast, many known properties, such as spatial smoothness or seasonality, are straightforward to model by choosing an appropriate kernel in a Gaussian process (GP). Many deep learning applications could be enhanced by modeling such known properties. For example, convolutional neural networks (CNNs) are frequently used in remote sensing, which is subject to strong seasonal effects. We propose to blend the strengths of deep learning and the clear modeling capabilities of GPs by using a composite kernel that combines a kernel implicitly defined by a neural network with a second kernel function chosen to model known properties (e.g., seasonality). We implement this idea by combining a deep network and an efficient mapping based on the Nystrom approximation, which we call Implicit Composite Kernel (ICK). We then adopt a sample-then-optimize approach to approximate the full GP posterior distribution. We demonstrate that ICK has superior performance and flexibility on both synthetic and real-world data sets. We believe that ICK framework can be used to include prior information into neural networks in many applications.  ( 2 min )
    On the Dissipation of Ideal Hamiltonian Monte Carlo Sampler. (arXiv:2209.07438v2 [stat.CO] UPDATED)
    We report on what seems to be an intriguing connection between variable integration time and partial velocity refreshment of Ideal Hamiltonian Monte Carlo samplers, both of which can be used for reducing the dissipative behavior of the dynamics. More concretely, we show that on quadratic potentials, efficiency can be improved through these means by a $\sqrt{\kappa}$ factor in Wasserstein-2 distance, compared to classical constant integration time, fully refreshed HMC. We additionally explore the benefit of randomized integrators for simulating the Hamiltonian dynamics under higher order regularity conditions.  ( 2 min )
    Conformal Prediction for Trustworthy Detection of Railway Signals. (arXiv:2301.11136v1 [stat.ML])
    We present an application of conformal prediction, a form of uncertainty quantification with guarantees, to the detection of railway signals. State-of-the-art architectures are tested and the most promising one undergoes the process of conformalization, where a correction is applied to the predicted bounding boxes (i.e. to their height and width) such that they comply with a predefined probability of success. We work with a novel exploratory dataset of images taken from the perspective of a train operator, as a first step to build and validate future trustworthy machine learning models for the detection of railway signals.  ( 2 min )
    Scale-Free Adversarial Multi-Armed Bandit with Arbitrary Feedback Delays. (arXiv:2110.13400v3 [cs.LG] UPDATED)
    We consider the Scale-Free Adversarial Multi-Armed Bandit (MAB) problem with unrestricted feedback delays. In contrast to the standard assumption that all losses are $[0,1]$-bounded, in our setting, losses can fall in a general bounded interval $[-L, L]$, unknown to the agent beforehand. Furthermore, the feedback of each arm pull can experience arbitrary delays. We propose a novel approach named Scale-Free Delayed INF (SFD-INF) for this novel setting, which combines a recent "convex combination trick" together with a novel doubling and skipping technique. We then present two instances of SFD-INF, each with carefully designed delay-adapted learning scales. The first one SFD-TINF uses $\frac 12$-Tsallis entropy regularizer and can achieve $\widetilde{\mathcal O}(\sqrt{K(D+T)}L)$ regret when the losses are non-negative, where $K$ is the number of actions, $T$ is the number of steps, and $D$ is the total feedback delay. This bound nearly matches the $\Omega((\sqrt{KT}+\sqrt{D\log K})L)$ lower-bound when regarding $K$ as a constant independent of $T$. The second one, SFD-LBINF, works for general scale-free losses and achieves a small-loss style adaptive regret bound $\widetilde{\mathcal O}(\sqrt{K\mathbb{E}[\tilde{\mathfrak L}_T^2]}+\sqrt{KDL})$, which falls to the $\widetilde{\mathcal O}(\sqrt{K(D+T)}L)$ regret in the worst case and is thus more general than SFD-TINF despite a more complicated analysis and several extra logarithmic dependencies. Moreover, both instances also outperform the existing algorithms for non-delayed (i.e., $D=0$) scale-free adversarial MAB problems, which can be of independent interest.  ( 2 min )
    Evaluating Probabilistic Classifiers: The Triptych. (arXiv:2301.10803v1 [stat.ME])
    Probability forecasts for binary outcomes, often referred to as probabilistic classifiers or confidence scores, are ubiquitous in science and society, and methods for evaluating and comparing them are in great demand. We propose and study a triptych of diagnostic graphics that focus on distinct and complementary aspects of forecast performance: The reliability diagram addresses calibration, the receiver operating characteristic (ROC) curve diagnoses discrimination ability, and the Murphy diagram visualizes overall predictive performance and value. A Murphy curve shows a forecast's mean elementary scores, including the widely used misclassification rate, and the area under a Murphy curve equals the mean Brier score. For a calibrated forecast, the reliability curve lies on the diagonal, and for competing calibrated forecasts, the ROC and Murphy curves share the same number of crossing points. We invoke the recently developed CORP (Consistent, Optimally binned, Reproducible, and Pool-Adjacent-Violators (PAV) algorithm based) approach to craft reliability diagrams and decompose a mean score into miscalibration (MCB), discrimination (DSC), and uncertainty (UNC) components. Plots of the DSC measure of discrimination ability versus the calibration metric MCB visualize classifier performance across multiple competitors. The proposed tools are illustrated in empirical examples from astrophysics, economics, and social science.  ( 2 min )
    Banker Online Mirror Descent. (arXiv:2106.08943v2 [cs.LG] UPDATED)
    We propose Banker-OMD, a novel framework generalizing the classical Online Mirror Descent (OMD) technique in online learning algorithm design. Banker-OMD allows algorithms to robustly handle delayed feedback, and offers a general methodology for achieving $\tilde{O}(\sqrt{T} + \sqrt{D})$-style regret bounds in various delayed-feedback online learning tasks, where $T$ is the time horizon length and $D$ is the total feedback delay. We demonstrate the power of Banker-OMD with applications to three important bandit scenarios with delayed feedback, including delayed adversarial Multi-armed bandits (MAB), delayed adversarial linear bandits, and a novel delayed best-of-both-worlds MAB setting. Banker-OMD achieves nearly-optimal performance in all the three settings. In particular, it leads to the first delayed adversarial linear bandit algorithm achieving $\tilde{O}(\text{poly}(n)(\sqrt{T} + \sqrt{D}))$ regret.  ( 2 min )
    Proximal Causal Learning of Heterogeneous Treatment Effects. (arXiv:2301.10913v1 [stat.ML])
    Efficiently and flexibly estimating treatment effect heterogeneity is an important task in a wide variety of settings ranging from medicine to marketing, and there are a considerable number of promising conditional average treatment effect estimators currently available. These, however, typically rely on the assumption that the measured covariates are enough to justify conditional exchangeability. We propose the P-learner, motivated by the R-learner, a tailored two-stage loss function for learning heterogeneous treatment effects in settings where exchangeability given observed covariates is an implausible assumption, and we wish to rely on proxy variables for causal inference. Our proposed estimator can be implemented by off-the-shelf loss-minimizing machine learning methods, which in the case of kernel regression satisfies an oracle bound on the estimated error as long as the nuisance components are estimated reasonably well.  ( 2 min )
    KSD Aggregated Goodness-of-fit Test. (arXiv:2202.00824v5 [stat.ML] UPDATED)
    We investigate properties of goodness-of-fit tests based on the Kernel Stein Discrepancy (KSD). We introduce a strategy to construct a test, called KSDAgg, which aggregates multiple tests with different kernels. KSDAgg avoids splitting the data to perform kernel selection (which leads to a loss in test power), and rather maximises the test power over a collection of kernels. We provide non-asymptotic guarantees on the power of KSDAgg: we show it achieves the smallest uniform separation rate of the collection, up to a logarithmic term. For compactly supported densities with bounded model score function, we derive the rate for KSDAgg over restricted Sobolev balls; this rate corresponds to the minimax optimal rate over unrestricted Sobolev balls, up to an iterated logarithmic term. KSDAgg can be computed exactly in practice as it relies either on a parametric bootstrap or on a wild bootstrap to estimate the quantiles and the level corrections. In particular, for the crucial choice of bandwidth of a fixed kernel, it avoids resorting to arbitrary heuristics (such as median or standard deviation) or to data splitting. We find on both synthetic and real-world data that KSDAgg outperforms other state-of-the-art quadratic-time adaptive KSD-based goodness-of-fit testing procedures.  ( 2 min )
    Coin Sampling: Gradient-Based Bayesian Inference without Learning Rates. (arXiv:2301.11294v1 [stat.ML])
    In recent years, particle-based variational inference (ParVI) methods such as Stein variational gradient descent (SVGD) have grown in popularity as scalable methods for Bayesian inference. Unfortunately, the properties of such methods invariably depend on hyperparameters such as the learning rate, which must be carefully tuned by the practitioner in order to ensure convergence to the target measure at a suitable rate. In this paper, we introduce a suite of new particle-based methods for scalable Bayesian inference based on coin betting, which are entirely learning-rate free. We illustrate the performance of our approach on a range of numerical examples, including several high-dimensional models and datasets, demonstrating comparable performance to other ParVI algorithms.  ( 2 min )
    Re-embedding data to strengthen recovery guarantees of clustering. (arXiv:2301.10901v1 [cs.LG])
    We propose a clustering method that involves chaining four known techniques into a pipeline yielding an algorithm with stronger recovery guarantees than any of the four components separately. Given $n$ points in $\mathbb R^d$, the first component of our pipeline, which we call leapfrog distances, is reminiscent of density-based clustering, yielding an $n\times n$ distance matrix. The leapfrog distances are then translated to new embeddings using multidimensional scaling and spectral methods, two other known techniques, yielding new embeddings of the $n$ points in $\mathbb R^{d'}$, where $d'$ satisfies $d'\ll d$ in general. Finally, sum-of-norms (SON) clustering is applied to the re-embedded points. Although the fourth step (SON clustering) can in principle be replaced by any other clustering method, our focus is on provable guarantees of recovery of underlying structure. Therefore, we establish that the re-embedding improves recovery SON clustering, since SON clustering is a well-studied method that already has provable guarantees.  ( 2 min )
    Granger Causal Chain Discovery for Sepsis-Associated Derangements via Multivariate Hawkes Processes. (arXiv:2209.04480v4 [stat.AP] UPDATED)
    Modern health care systems are conducting continuous, automated surveillance of the electronic medical record (EMR) to identify adverse events with increasing frequency; however, many events such as sepsis do not have elucidated prodromes (i.e., event chains) that can be used to identify and intercept the adverse event early in its course. Currently, there does not exist reliable framework for discovering or describing causal chains that precede adverse hospital events. Clinically relevant and interpretable results require a framework that can (1) infer temporal interactions across multiple patient features found in EMR data (e.g., labs, vital signs, etc.) and (2) can identify patterns that precede and are specific to an impending adverse event (e.g., sepsis). In this work, we propose a linear multivariate Hawkes process model, coupled with ReLU link function, to recover a Granger Causal (GC) graph with both exciting and inhibiting effects. We develop a scalable two-phase gradient-based method to maximize a surrogate-likelihood and estimate the problem parameters, which is shown to be effective via extensive numerical simulation. Our method is subsequently extended to a data set of patients admitted to an academic level 1 trauma center located in Atalanta, GA, where the estimated GC graph identifies several highly interpretable chains that precede sepsis. Here, we demonstrate the effectiveness of our approach in learning a GC graph over Sepsis Associated Derangements (SADs), but it can be generalized to other applications with similar requirements.  ( 2 min )
    Efficient Aggregated Kernel Tests using Incomplete $U$-statistics. (arXiv:2206.09194v3 [stat.ML] UPDATED)
    We propose a series of computationally efficient nonparametric tests for the two-sample, independence, and goodness-of-fit problems, using the Maximum Mean Discrepancy (MMD), Hilbert Schmidt Independence Criterion (HSIC), and Kernel Stein Discrepancy (KSD), respectively. Our test statistics are incomplete $U$-statistics, with a computational cost that interpolates between linear time in the number of samples, and quadratic time, as associated with classical $U$-statistic tests. The three proposed tests aggregate over several kernel bandwidths to detect departures from the null on various scales: we call the resulting tests MMDAggInc, HSICAggInc and KSDAggInc. This procedure provides a solution to the fundamental kernel selection problem as we can aggregate a large number of kernels with several bandwidths without incurring a significant loss of test power. For the test thresholds, we derive a quantile bound for wild bootstrapped incomplete $U$-statistics, which is of independent interest. We derive non-asymptotic uniform separation rates for MMDAggInc and HSICAggInc, and quantify exactly the trade-off between computational efficiency and the attainable rates: this result is novel for tests based on incomplete $U$-statistics, to our knowledge. We further show that in the quadratic-time case, the wild bootstrap incurs no penalty to test power over the more widespread permutation-based approach, since both attain the same minimax optimal rates (which in turn match the rates that use oracle quantiles). We support our claims with numerical experiments on the trade-off between computational efficiency and test power. In all three testing frameworks, the linear-time versions of our proposed tests perform at least as well as the current linear-time state-of-the-art tests.  ( 2 min )
    Bayesian Detection of Mesoscale Structures in Pathway Data on Graphs. (arXiv:2301.11120v1 [stat.ME])
    Mesoscale structures are an integral part of the abstraction and analysis of complex systems. They reveal a node's function in the network, and facilitate our understanding of the network dynamics. For example, they can represent communities in social or citation networks, roles in corporate interactions, or core-periphery structures in transportation networks. We usually detect mesoscale structures under the assumption of independence of interactions. Still, in many cases, the interactions invalidate this assumption by occurring in a specific order. Such patterns emerge in pathway data; to capture them, we have to model the dependencies between interactions using higher-order network models. However, the detection of mesoscale structures in higher-order networks is still under-researched. In this work, we derive a Bayesian approach that simultaneously models the optimal partitioning of nodes in groups and the optimal higher-order network dynamics between the groups. In synthetic data we demonstrate that our method can recover both standard proximity-based communities and role-based groupings of nodes. In synthetic and real world data we show that it can compete with baseline techniques, while additionally providing interpretable abstractions of network dynamics.  ( 2 min )
    Minimax estimation of discontinuous optimal transport maps: The semi-discrete case. (arXiv:2301.11302v1 [math.ST])
    We consider the problem of estimating the optimal transport map between two probability distributions, $P$ and $Q$ in $\mathbb R^d$, on the basis of i.i.d. samples. All existing statistical analyses of this problem require the assumption that the transport map is Lipschitz, a strong requirement that, in particular, excludes any examples where the transport map is discontinuous. As a first step towards developing estimation procedures for discontinuous maps, we consider the important special case where the data distribution $Q$ is a discrete measure supported on a finite number of points in $\mathbb R^d$. We study a computationally efficient estimator initially proposed by Pooladian and Niles-Weed (2021), based on entropic optimal transport, and show in the semi-discrete setting that it converges at the minimax-optimal rate $n^{-1/2}$, independent of dimension. Other standard map estimation techniques both lack finite-sample guarantees in this setting and provably suffer from the curse of dimensionality. We confirm these results in numerical experiments, and provide experiments for other settings, not covered by our theory, which indicate that the entropic estimator is a promising methodology for other discontinuous transport map estimation problems.  ( 2 min )
    Returning The Favour: When Regression Benefits From Probabilistic Causal Knowledge. (arXiv:2301.11214v1 [stat.ML])
    A directed acyclic graph (DAG) provides valuable prior knowledge that is often discarded in regression tasks in machine learning. We show that the independences arising from the presence of collider structures in DAGs provide meaningful inductive biases, which constrain the regression hypothesis space and improve predictive performance. We introduce collider regression, a framework to incorporate probabilistic causal knowledge from a collider in a regression problem. When the hypothesis space is a reproducing kernel Hilbert space, we prove a strictly positive generalisation benefit under mild assumptions and provide closed-form estimators of the empirical risk minimiser. Experiments on synthetic and climate model data demonstrate performance gains of the proposed methodology.  ( 2 min )
    Learning from Mistakes: Self-Regularizing Hierarchical Semantic Representations in Point Cloud Segmentation. (arXiv:2301.11145v1 [cs.CV])
    Recent advances in autonomous robotic technologies have highlighted the growing need for precise environmental analysis. LiDAR semantic segmentation has gained attention to accomplish fine-grained scene understanding by acting directly on raw content provided by sensors. Recent solutions showed how different learning techniques can be used to improve the performance of the model, without any architectural or dataset change. Following this trend, we present a coarse-to-fine setup that LEArns from classification mistaKes (LEAK) derived from a standard model. First, classes are clustered into macro groups according to mutual prediction errors; then, the learning process is regularized by: (1) aligning class-conditional prototypical feature representation for both fine and coarse classes, (2) weighting instances with a per-class fairness index. Our LEAK approach is very general and can be seamlessly applied on top of any segmentation architecture; indeed, experimental results showed that it enables state-of-the-art performances on different architectures, datasets and tasks, while ensuring more balanced class-wise results and faster convergence.  ( 2 min )
    Learning Large Scale Sparse Models. (arXiv:2301.10958v1 [stat.ML])
    In this work, we consider learning sparse models in large scale settings, where the number of samples and the feature dimension can grow as large as millions or billions. Two immediate issues occur under such challenging scenario: (i) computational cost; (ii) memory overhead. In particular, the memory issue precludes a large volume of prior algorithms that are based on batch optimization technique. To remedy the problem, we propose to learn sparse models such as Lasso in an online manner where in each iteration, only one randomly chosen sample is revealed to update a sparse iterate. Thereby, the memory cost is independent of the sample size and gradient evaluation for one sample is efficient. Perhaps amazingly, we find that with the same parameter, sparsity promoted by batch methods is not preserved in online fashion. We analyze such interesting phenomenon and illustrate some effective variants including mini-batch methods and a hard thresholding based stochastic gradient algorithm. Extensive experiments are carried out on a public dataset which supports our findings and algorithms.  ( 2 min )
    On the inconsistency of separable losses for structured prediction. (arXiv:2301.10810v1 [cs.LG])
    In this paper, we prove that separable negative log-likelihood losses for structured prediction are not necessarily Bayes consistent, or, in other words, minimizing these losses may not result in a model that predicts the most probable structure in the data distribution for a given input. This fact opens the question of whether these losses are well-adapted for structured prediction and, if so, why.  ( 2 min )
    simple diffusion: End-to-end diffusion for high resolution images. (arXiv:2301.11093v1 [cs.CV])
    Currently, applying diffusion models in pixel space of high resolution images is difficult. Instead, existing approaches focus on diffusion in lower dimensional spaces (latent diffusion), or have multiple super-resolution levels of generation referred to as cascades. The downside is that these approaches add additional complexity to the diffusion framework. This paper aims to improve denoising diffusion for high resolution images while keeping the model as simple as possible. The paper is centered around the research question: How can one train a standard denoising diffusion models on high resolution images, and still obtain performance comparable to these alternate approaches? The four main findings are: 1) the noise schedule should be adjusted for high resolution images, 2) It is sufficient to scale only a particular part of the architecture, 3) dropout should be added at specific locations in the architecture, and 4) downsampling is an effective strategy to avoid high resolution feature maps. Combining these simple yet effective techniques, we achieve state-of-the-art on image generation among diffusion models without sampling modifiers on ImageNet.  ( 2 min )
    Graph Encoder Ensemble for Simultaneous Vertex Embedding and Community Detection. (arXiv:2301.11290v1 [cs.SI])
    In this paper we propose a novel and computationally efficient method to simultaneously achieve vertex embedding, community detection, and community size determination. By utilizing a normalized one-hot graph encoder and a new rank-based cluster size measure, the proposed graph encoder ensemble algorithm achieves excellent numerical performance throughout a variety of simulations and real data experiments.  ( 2 min )
    Random Grid Neural Processes for Parametric Partial Differential Equations. (arXiv:2301.11040v1 [cs.LG])
    We introduce a new class of spatially stochastic physics and data informed deep latent models for parametric partial differential equations (PDEs) which operate through scalable variational neural processes. We achieve this by assigning probability measures to the spatial domain, which allows us to treat collocation grids probabilistically as random variables to be marginalised out. Adapting this spatial statistics view, we solve forward and inverse problems for parametric PDEs in a way that leads to the construction of Gaussian process models of solution fields. The implementation of these random grids poses a unique set of challenges for inverse physics informed deep learning frameworks and we propose a new architecture called Grid Invariant Convolutional Networks (GICNets) to overcome these challenges. We further show how to incorporate noisy data in a principled manner into our physics informed model to improve predictions for problems where data may be available but whose measurement location does not coincide with any fixed mesh or grid. The proposed method is tested on a nonlinear Poisson problem, Burgers equation, and Navier-Stokes equations, and we provide extensive numerical comparisons. We demonstrate significant computational advantages over current physics informed neural learning methods for parametric PDEs while improving the predictive capabilities and flexibility of these models.  ( 2 min )
    On the Global Convergence of Risk-Averse Policy Gradient Methods with Dynamic Time-Consistent Risk Measures. (arXiv:2301.10932v1 [cs.LG])
    Risk-sensitive reinforcement learning (RL) has become a popular tool to control the risk of uncertain outcomes and ensure reliable performance in various sequential decision-making problems. While policy gradient methods have been developed for risk-sensitive RL, it remains unclear if these methods enjoy the same global convergence guarantees as in the risk-neutral case. In this paper, we consider a class of dynamic time-consistent risk measures, called Expected Conditional Risk Measures (ECRMs), and derive policy gradient updates for ECRM-based objective functions. Under both constrained direct parameterization and unconstrained softmax parameterization, we provide global convergence of the corresponding risk-averse policy gradient algorithms. We further test a risk-averse variant of REINFORCE algorithm on a stochastic Cliffwalk environment to demonstrate the efficacy of our algorithm and the importance of risk control.  ( 2 min )
    Graph Neural Tangent Kernel: Convergence on Large Graphs. (arXiv:2301.10808v1 [cs.LG])
    Graph neural networks (GNNs) achieve remarkable performance in graph machine learning tasks but can be hard to train on large-graph data, where their learning dynamics are not well understood. We investigate the training dynamics of large-graph GNNs using graph neural tangent kernels (GNTKs) and graphons. In the limit of large width, optimization of an overparametrized NN is equivalent to kernel regression on the NTK. Here, we investigate how the GNTK evolves as another independent dimension is varied: the graph size. We use graphons to define limit objects -- graphon NNs for GNNs, and graphon NTKs for GNTKs, and prove that, on a sequence of growing graphs, the GNTKs converge to the graphon NTK. We further prove that the eigenspaces of the GNTK, which are related to the problem learning directions and associated learning speeds, converge to the spectrum of the GNTK. This implies that in the large-graph limit, the GNTK fitted on a graph of moderate size can be used to solve the same task on the large-graph and infer the learning dynamics of the large-graph GNN. These results are verified empirically on node regression and node classification tasks.  ( 2 min )

  • Open

    ChatGPT can definitely print Russian propaganda including why Prime Minister Justin Trudeau should be charged with treason despite its Wikipedia page
    submitted by /u/Robinsonc1988 [link] [comments]  ( 40 min )
    Text-To-4D Dynamic Scene Generation
    submitted by /u/bperki8 [link] [comments]  ( 40 min )
    Bright Eye: mobile AI app that generates art, code, poems, essays, short stories, answers questions, and more!
    Bright Eye: mobile AI app that generates art, code, poems, essays, short stories, answers questions, and more! Hey guys, I’m the cofounder of a tech startup focused on providing free AI services. We’re one of the first mobile multipurpose AI apps. We’ve developed a pretty cool app that offers AI services like image generation, code generation, image captioning, and more for free. We’re sort of like a Swiss Army knife of generative and analytical AI. We’ve released a new feature called AAIA(Ask AI Anything), which is capable of answering all types of questions, even requests to generate literature, storylines, answer questions and more, (think of chatgpt). We’d love to have some people try it out, give us feedback, and keep in touch with us. https://apps.apple.com/us/app/bright-eye/id1593932475 submitted by /u/SonnyDoge22 [link] [comments]  ( 41 min )
    We need AI to take over all creative mediums and hobbies
    Before you quick draw shoot the messenger here, I want you to hear me out. This is coming from a person who enjoys drawing digital art and a person who used to make music for fun. Humans tend to be very selfish creatures, not in an evil sense, but rather a "phew, glad that was you and not me" sense, like if someone stepped in a pile of of dog poop that sucks for them, but it's not your problem. Once everybodies creative (not labor intenseive) livelyhoods get taken, then yeah, I could see how now AI replicating humanity could be a problem. Then maybe after that, people can enjoy not being replaced. That is all I have to say, god bless my reddit karma, I can feel this subreddit getting ready to dislike bomb this, but as long as this message gets out, that's all I want. submitted by /u/Zima_Re-L [link] [comments]  ( 41 min )
    AI Dream 150 - MY INCREDIBLE DREAM VISUALIZED BY AI - Part3 TEASER - AI ...
    submitted by /u/LordPewPew777 [link] [comments]  ( 40 min )
    Humanity May Reach Singularity Within Just 7 Years, Trend Shows
    submitted by /u/Tao_Dragon [link] [comments]  ( 40 min )
    VoiceGPT - ChatGPT Voice Assistant
    submitted by /u/nickbild [link] [comments]  ( 40 min )
    The Big Tech Royal Rumble for AI
    submitted by /u/foundersblock [link] [comments]  ( 40 min )
    🚀Rodin: 3D Avatars Using Diffusion
    submitted by /u/oridnary_artist [link] [comments]  ( 40 min )
    Google MusicLM turns language into music
    submitted by /u/Number_5_alive [link] [comments]  ( 40 min )
    Looking to Convert Photos to Video - Anything Out There?
    I want to be able to upload several photos (10 - 100) and have the AI generate a video using those photos. I've seen Genmo, but that only uses one photo at a time. Is there anything that may be able to do this? Thanks submitted by /u/venicerocco [link] [comments]  ( 40 min )
    Freaky A.I concept..
    submitted by /u/KTMark [link] [comments]  ( 40 min )
    fully AI made video i found on yt
    https://www.youtube.com/watch?v=Vw-t826JcDQ submitted by /u/Optimal_Studio_2050 [link] [comments]  ( 40 min )
    Is there any AI that is able to decipher the lyrics of a song?
    Specifically this one: https://www.youtube.com/watch?v=MFv7apjatwM&ab_channel=Lux-Topic If there is no current AI that is able to listen to a song and write down the lyrics accurately, then I provide this idea freely. submitted by /u/A_Very_Horny_Zed [link] [comments]  ( 40 min )
    📌[Searchcolab] Voice Cloning + Image Processing + Lip Syncing. Link in comments
    submitted by /u/Maleficent_Suit1591 [link] [comments]  ( 40 min )
    Pix2Pix AI Model Inside Stable Diffusion Installation Guide!
    submitted by /u/PuppetHere [link] [comments]  ( 40 min )
    What will happen to the internet once ChatGPT disintermediates websites?
    If ChatGPT and AI search engines become the central place for acquiring knowledge, essentially scraping and disintermediating websites; what commercial incentive will many website owners have to generate new content once their ad revenue is gone? What will happen to the internet? submitted by /u/DelPrive235 [link] [comments]  ( 41 min )
    🧬 The Age of A.I., Longevity and Biotech - Is a "Synthetic biology singularity" coming?
    submitted by /u/BackgroundResult [link] [comments]  ( 41 min )
    Outsmarting AI Detection Tools: How to Make Your AI-Generated Content Fly Under the Radar
    submitted by /u/PapaDudu [link] [comments]  ( 43 min )
    Don’t Chat With ChatGPT: Amazon’s Warning To Employees
    submitted by /u/liquidocelotYT [link] [comments]  ( 40 min )
    what is the best way to upscale comics? BSRGAN doesn't work well with text.
    submitted by /u/mhczbnoykrqvzazfth [link] [comments]  ( 41 min )
    📌[Searchcolab] Text-To-4D Dynamic Scene Generation.
    submitted by /u/Maleficent_Suit1591 [link] [comments]  ( 41 min )
    Why is AI better at visual art than music generation?
    I've just been wondering why people have been able to develop AI models such as GANs and more recently diffusion models that are so good at creating images but music generation has remained somewhat stagnant. From what I understand, there are two ways AI are generally trained on music: through MIDI files and using raw audio. The MIDI files are probably easier for the AI to work with but there's massive loss of relevant data that an AI would need to really become proficient in music generation. On the other hand, AIs trained on the raw audio tend to produce somewhat fuzzy audio that often includes what sounds like nearly distinguishable lyrics. I'm personally convinced the latter is the method that will be universally used in the future, but it clearly has a long way to go. So does anyone have any insights regarding why it's straggling a bit? Perhaps it's because software like Stable Diffusion are piggybacking off the image recognition advancements that have been in very high demand the past couple decades? Perhaps music is, in fact, a more complicated or more strict artform and our human minds are biased to think that these two tasks should be roughly equivalent in difficulty? It's certainly not due to a lack of data to train on, so maybe we just don't have a model that's really suited to analyzing and imitating music. submitted by /u/No-Phrase1116 [link] [comments]  ( 44 min )
  • Open

    MusicLM: Generating Music From Text
    submitted by /u/nickb [link] [comments]  ( 40 min )
    🚀Rodin: 3D Avatars Using Diffusion
    submitted by /u/oridnary_artist [link] [comments]  ( 40 min )
  • Open

    Pygame window does not closes and kernel freezes
    I was rendering 'MountainCar-v0' environment in "human" render_mode. When I call env.close(), my pygame window doesn't closes automatically and I need to Force Quit it, and unfortunately my kernel dies. I'm using Jupyter Notebook. Following are my versions: PC: Mac OS Ventura 13.1 - Python: Python 3.9.9 - Jupyter:jupyter 1.0.0 - gym: 0.26.2 - ipykernel: 6.6.0 - ipython: 7.30.1 ​ Please let me know how I can solve this issue, Thanks! submitted by /u/Character-Yellow-796 [link] [comments]  ( 41 min )
    Tuning hyperparamters in RL.
    Hello everyone. I have a question about hyperparameter tuning for PPO. usually, when we tune them ( using Optuna for example), each set of hyperparameters is tested for a small number of steps ( 150000 for example ) then we pick the ones that yielded the best reward. Yet, in the long run, the asymptotic convergence might not be reached by the best hyperparameters found during tuning but rather, other hyperparameters that performed worse during tuning are the ones that might results in better asymptotic convergence. Is there any way to overcome this issue and maybe find the best hyperparameters without having to test for a large number of steps. Note: I'm using a custom environment with a continuous action space and image observations. submitted by /u/Many_Reception_4921 [link] [comments]  ( 42 min )
    Model-based hierarchical reinforcement learning
    Hi, do you know papers that combine model-based and hierarchical reinforcement learning, where also the lower level is a model-based approach? I cannot find sufficient paper about it submitted by /u/aika98oe [link] [comments]  ( 42 min )
  • Open

    [D] Monitoring and Retraining Models with Label-Changing Interventions
    When a trained ML model is implemented to predict an adverse event, the user might take steps to avoid that event. The outcome of the user's actions could either be successful or failure. In training with strictly observational data, a typical confusion matrix contains: ŷ=0, y=0 -> True Negative ŷ=0, y=1 -> False Negative ŷ=1, y=1 -> True Positive ŷ=1, y=0 -> False Positive When using the model, some results get confounded if the user acts based on the predictions ŷ=0, y=0, a=0 -> True Negative, No Intervention ŷ=0, y=1, a=0 -> False Negative, No Intervention ŷ=1, y=1, a=1 -> True Positive, Failed Intervention ŷ=1, y=0, a=1 -> False Positive OR Successful Intervention Ignoring the possibility that the intervention caused the adverse event, the involvement of the user may lead to an increase in the number of false positives that are perceived. Continuous monitoring becomes difficult due to perceived faster degradation of the model. Furthermore, retraining the model in the future may be hindered by labels that do not accurately reflect the true values. One approach that I've been proposed is to make sure there is always a hold-out set. Allow some random records to get scores, but do never act on them. This gives both a monitoring and retraining dataset. Are there other solutions that people use here? I've found the papers below, but I cannot say that I completely understand how to practically implement them. Monitoring machine learning (ML)-based risk prediction algorithms in the presence of confounding medical interventions (https://arxiv.org/abs/2211.09781) Model updating after interventions paradoxically introduces bias (https://arxiv.org/abs/2010.11530) submitted by /u/waiting4omscs [link] [comments]  ( 43 min )
    [P] Using algorithms or models from papers for commercial use
    Hey! I am reading the GET3D paper by Nvidia. The paper is listed with the Nvidia license which states: 3.3 Use Limitation. The Work and any derivative works thereof only may be used or intended for use non-commercially. The Work or derivative works thereof may be used or intended for use by Nvidia or its affiliates commercially or non-commercially. As used herein, "non-commercially" means for research or evaluation purposes only and not for any direct or indirect monetary gain. Does it mean there is no commercial way of using the ideas in the paper? Is it possible to use the ideas from that paper or any other paper by Nvidia in some product? As the idea from the paper is only the tool or a part of the product but is not the product itself. submitted by /u/romantimm25 [link] [comments]  ( 44 min )
    [D] Google Predoctoral Program (India) 2023
    Has anyone got any interview email? Did they start the interview process? submitted by /u/Around-star [link] [comments]  ( 42 min )
    [D] ImageNet2012 Advice
    I'm currently at the point in my PhD career that I've developed some extremely successful components of CNNs, architecture, activation, etc. Outperforming default choices on CIFAR10, CIFAR100, Flowers, Caltech101, and other smaller datasets. With how success the results currently are, we want to publish to a top tier conference, specifically NeurIPS this Spring, deadline around May 13th. However, we (me and my advisor) agree that to publish at NeurIPS, our developments need to be backed up by ImageNet. The problem is that we have never trained on ImageNet before (so no experience), and have a limited computational budget. Although our university personally owns 2 A100 40 GB GPUs that we can use, they are shared within the entire university, so a 2 day job takes about 1 week in queue (don't know if we can get the results in time by May). On the other hand, we also don't know if we can get a $2500 grant in time to use cloud resources. For those who have trained on ImageNet, what are some common pitfalls, best ways to transfer data, downloading the dataset, etc? If you performed it on the cloud, how did you do so? How long was your time to train? Expenses? Did you run each model once or three times? Early stopping using validation or test set? NOTE: We will only be using Tensorflow... submitted by /u/MyActualUserName99 [link] [comments]  ( 44 min )
    [D] Best large language model for Named Entity Extraction?
    I'd like to extract named entities, something like this: "[Text]: Microsoft (the word being a portmanteau of "microcomputer software") was founded by Bill Gates on April 4, 1975, to develop and sell BASIC interpreters for the Altair 8800. Steve Ballmer replaced Gates as CEO in 2000, and later envisioned a "devices and services" strategy. [Name]: Steve Ballmer [Position]: CEO [Company]: Microsoft " Tried it on GPT-Neox with 20b parameters with mixed success, is there anything better out there to try for a few-shot learning (without fine tuning)? submitted by /u/TankAttack [link] [comments]  ( 44 min )
    [R] ETLP: Event-based Three-factor Local Plasticity for online learning with neuromorphic hardware
    Neuromorphic perception with event-based sensors, asynchronous hardware and spiking neurons is showing promising results for real-time and energy-efficient inference in embedded systems. The next promise of brain-inspired computing is to enable adaptation to changes at the edge with online learning. However, the parallel and distributed architectures of neuromorphic hardware based on co-localized compute and memory imposes locality constraints to the on-chip learning rules. We propose in this work the Event-based Three-factor Local Plasticity (ETLP) rule that uses (1) the pre-synaptic spike trace, (2) the post-synaptic membrane voltage and (3) a third factor in the form of projected labels with no error calculation, that also serve as update triggers. We apply ETLP with feedforward and recurrent spiking neural networks on visual and auditory event-based pattern recognition, and compare it to Back-Propagation Through Time (BPTT) and eProp. We show a competitive performance in accuracy with a clear advantage in the computational complexity for ETLP. We also show that when using local plasticity, threshold adaptation in spiking neurons and a recurrent topology are necessary to learn spatio-temporal patterns with a rich temporal structure. Finally, we provide a proof of concept hardware implementation of ETLP on FPGA to highlight the simplicity of its computational primitives and how they can be mapped into neuromorphic hardware for online learning with low-energy consumption and real-time interaction. Full paper: https://arxiv.org/abs/2301.08281 submitted by /u/ferquinve [link] [comments]  ( 43 min )
    [D] MusicLM: Generating Music From Text
    How far do you think this can go? Is it a memorization machine or can it create new songs? https://google-research.github.io/seanet/musiclm/examples/ submitted by /u/carlthome [link] [comments]  ( 44 min )
    [R] SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot
    Large Language Models (LLMs) from the Generative Pretrained Transformer (GPT) family have shown remarkable performance on a wide range of tasks, but are difficult to deploy because of their massive size and computational costs. For instance, the top-performing GPT-175B model has 175 billion parameters, which total at least 320GB (counting multiples of 1024) of storage in half-precision (FP16) format, leading it to require at least five A100 GPUs with 80GB of memory each for inference. It is therefore natural that there has been significant interest in reducing these costs via model compression. To date, virtually all existing GPT compression approaches have focused on quantization, that is, reducing the precision of the numerical representation of individual weights. A complementary approa…  ( 47 min )
    [D] Meta AI Residency 2023
    Now that apps are closed did anyone hear back yet? please follow this thread and update your status below, tbh I don't really think I have much of a chance but I'm excited none the less. ​ to follow a thread please press the bell on the top right submitted by /u/BeautyInUgly [link] [comments]  ( 42 min )
    [D] Moving away from Unicode for more equal token representation across global languages?
    Edit: as has been explained in the comments, unicode is not the issue so much as the byte-pair encoding scheme, which artificially limits the vocabulary size of the model and leads to less common language using more tokens. I'd like to discuss the impacts of increasing the vocabulary size on transformer model computational requirements. Many languages, like Chinese, Japanese Kanji, Korean, Telugu, etc use complex logograms to represent words and concepts. Unfortunately, these languages are severely "punished" in GPT3 because they are expensive to tokenize due to the way unicode represents them. Instead of unicode representing them as a single code point, logograms are typically represented as a sum of multiple graphemes, meaning that multiple unicode code points underlie their descriptio…  ( 49 min )
    [D] Why are there no End2End Speech Recognition models using the same Encoder-Decoder learning process as BART (no CTC) ?
    I'm new to CTC. After learning about CTC and its application in End2End training for Speech Recognition, I figured that if we want to generate a target sequence (transcript), given a source sequence features, we could use the vanilla Encoder-Decoder architecture in Transformer (also used in T5, BART, etc) alone, without the need of CTC, yet why people are only using CTC for End2End Speech Recoginition, or using hybrid of CTC and Decoder in some papers ? Thanks. submitted by /u/KarmaCut132 [link] [comments]  ( 43 min )
  • Open

    A Unified and Constructive Framework for the Universality of Neural Networks. (arXiv:2112.14877v3 [cs.LG] UPDATED)
    One of the reasons why many neural networks are capable of replicating complicated tasks or functions is their universal property. Though the past few decades have seen tremendous advances in theories of neural networks, a single constructive framework for neural network universality remains unavailable. This paper is the first effort to provide a unified and constructive framework for the universality of a large class of activation functions including most of existing ones. At the heart of the framework is the concept of neural network approximate identity (nAI). The main result is: {\em any nAI activation function is universal}. It turns out that most of existing activation functions are nAI, and thus universal in the space of continuous functions on compacta. The framework induces {\bf several advantages} over the contemporary counterparts. First, it is constructive with elementary means from functional analysis, probability theory, and numerical analysis. Second, it is the first unified attempt that is valid for most of existing activation functions. Third, as a by product, the framework provides the first universality proof for some of the existing activation functions including Mish, SiLU, ELU, GELU, and etc. Fourth, it provides new proofs for most activation functions. Fifth, it discovers new activation functions with guaranteed universality property. Sixth, for a given activation and error tolerance, the framework provides precisely the architecture of the corresponding one-hidden neural network with predetermined number of neurons, and the values of weights/biases. Seventh, the framework allows us to abstractly present the first universal approximation with favorable non-asymptotic rate.  ( 3 min )
    The Devil is the Classifier: Investigating Long Tail Relation Classification with Decoupling Analysis. (arXiv:2009.07022v1 [cs.LG] CROSS LISTED)
    Long-tailed relation classification is a challenging problem as the head classes may dominate the training phase, thereby leading to the deterioration of the tail performance. Existing solutions usually address this issue via class-balancing strategies, e.g., data re-sampling and loss re-weighting, but all these methods adhere to the schema of entangling learning of the representation and classifier. In this study, we conduct an in-depth empirical investigation into the long-tailed problem and found that pre-trained models with instance-balanced sampling already capture the well-learned representations for all classes; moreover, it is possible to achieve better long-tailed classification ability at low cost by only adjusting the classifier. Inspired by this observation, we propose a robust classifier with attentive relation routing, which assigns soft weights by automatically aggregating the relations. Extensive experiments on two datasets demonstrate the effectiveness of our proposed approach. Code and datasets are available in https://github.com/zjunlp/deepke.  ( 2 min )
    Contrastive Triple Extraction with Generative Transformer. (arXiv:2009.06207v8 [cs.CL] CROSS LISTED)
    Triple extraction is an essential task in information extraction for natural language processing and knowledge graph construction. In this paper, we revisit the end-to-end triple extraction task for sequence generation. Since generative triple extraction may struggle to capture long-term dependencies and generate unfaithful triples, we introduce a novel model, contrastive triple extraction with a generative transformer. Specifically, we introduce a single shared transformer module for encoder-decoder-based generation. To generate faithful results, we propose a novel triplet contrastive training object. Moreover, we introduce two mechanisms to further improve model performance (i.e., batch-wise dynamic attention-masking and triple-wise calibration). Experimental results on three datasets (i.e., NYT, WebNLG, and MIE) show that our approach achieves better performance than that of baselines.  ( 2 min )
    Interaction Modeling with Multiplex Attention. (arXiv:2208.10660v2 [cs.LG] UPDATED)
    Modeling multi-agent systems requires understanding how agents interact. Such systems are often difficult to model because they can involve a variety of types of interactions that layer together to drive rich social behavioral dynamics. Here we introduce a method for accurately modeling multi-agent systems. We present Interaction Modeling with Multiplex Attention (IMMA), a forward prediction model that uses a multiplex latent graph to represent multiple independent types of interactions and attention to account for relations of different strengths. We also introduce Progressive Layer Training, a training strategy for this architecture. We show that our approach outperforms state-of-the-art models in trajectory forecasting and relation inference, spanning three multi-agent scenarios: social navigation, cooperative task achievement, and team sports. We further demonstrate that our approach can improve zero-shot generalization and allows us to probe how different interactions impact agent behavior.  ( 2 min )
    Deep learning in a bilateral brain with hemispheric specialization. (arXiv:2209.06862v6 [q-bio.NC] UPDATED)
    The brains of all bilaterally symmetric animals on Earth are divided into left and right hemispheres. The anatomy and functionality of the hemispheres have a large degree of overlap, but there are asymmetries and they specialize to possess different attributes. Several studies have used computational models to mimic hemispheric asymmetries with a focus on reproducing human data on semantic and visual processing tasks. In this study, we aimed to understand how dual hemispheres could interact in a given task. We propose a bilateral artificial neural network that imitates a lateralization observed in nature: that the left hemisphere specializes in specificity and the right in generalities. We used two ResNet-9 convolutional neural networks with different training objectives and tested it on an image classification task. Our analysis found that the hemispheres represent complementary features that are exploited by a network head which implements a type of weighted attention. The bilateral architecture outperformed a range of baselines of similar representational capacity that don't exploit differential specialization. The results demonstrate the efficacy of bilateralism, contribute to an understanding of bilateralism in biological brains and the architecture serves as an inductive bias when designing new AI systems.  ( 2 min )
    Long-tail Relation Extraction via Knowledge Graph Embeddings and Graph Convolution Networks. (arXiv:1903.01306v1 [cs.IR] CROSS LISTED)
    We propose a distance supervised relation extraction approach for long-tailed, imbalanced data which is prevalent in real-world settings. Here, the challenge is to learn accurate "few-shot" models for classes existing at the tail of the class distribution, for which little data is available. Inspired by the rich semantic correlations between classes at the long tail and those at the head, we take advantage of the knowledge from data-rich classes at the head of the distribution to boost the performance of the data-poor classes at the tail. First, we propose to leverage implicit relational knowledge among class labels from knowledge graph embeddings and learn explicit relational knowledge using graph convolution networks. Second, we integrate that relational knowledge into relation extraction model by coarse-to-fine knowledge-aware attention mechanism. We demonstrate our results for a large-scale benchmark dataset which show that our approach significantly outperforms other baselines, especially for long-tail relations.  ( 2 min )
    Towards Dynamic Stability Assessment of Power Grid Topologies using Graph Neural Networks. (arXiv:2206.06369v3 [cs.LG] UPDATED)
    To mitigate climate change, the share of renewable energies in power production needs to be increased. Renewables introduce new challenges to power grids regarding the dynamic stability due to decentralization, reduced inertia and volatility in production. Since dynamic stability simulations are intractable and exceedingly expensive for large grids, graph neural networks (GNNs) are a promising method to reduce the computational effort of analyzing dynamic stability of power grids. We provide new datasets of dynamic stability of synthetic power grids and find that GNNs are surprisingly effective at predicting the highly non-linear targets from topological information only. Furthermore, we use GNNs to demonstrate the accurate identification of particularly vulnerable nodes in power grids, so-called troublemakers. Lastly, we find that GNNs trained on small grids generate accurate predictions on a large synthetic model of the Texan power grid, which illustrates the potential for real-world applications of the presented approach.  ( 2 min )
    Why the pseudo label based semi-supervised learning algorithm is effective?. (arXiv:2211.10039v2 [cs.LG] UPDATED)
    Recently, pseudo label based semi-supervised learning has achieved great success in many fields. The core idea of the pseudo label based semi-supervised learning algorithm is to use the model trained on the labeled data to generate pseudo labels on the unlabeled data, and then train a model to fit the previously generated pseudo labels. We give a theory analysis for why pseudo label based semi-supervised learning is effective in this paper. We mainly compare the generalization error of the model trained under two settings: (1) There are N labeled data. (2) There are N unlabeled data and a suitable initial model. Our analysis shows that, firstly, when the amount of unlabeled data tends to infinity, the pseudo label based semi-supervised learning algorithm can obtain model which have the same generalization error upper bound as model obtained by normally training in the condition of the amount of labeled data tends to infinity. More importantly, we prove that when the amount of unlabeled data is large enough, the generalization error upper bound of the model obtained by pseudo label based semi-supervised learning algorithm can converge to the optimal upper bound with linear convergence rate. We also give the lower bound on sampling complexity to achieve linear convergence rate. Our analysis contributes to understanding the empirical successes of pseudo label-based semi-supervised learning.  ( 2 min )
    ZJUKLAB at SemEval-2021 Task 4: Negative Augmentation with Language Model for Reading Comprehension of Abstract Meaning. (arXiv:2102.12828v3 [cs.CL] CROSS LISTED)
    This paper presents our systems for the three Subtasks of SemEval Task4: Reading Comprehension of Abstract Meaning (ReCAM). We explain the algorithms used to learn our models and the process of tuning the algorithms and selecting the best model. Inspired by the similarity of the ReCAM task and the language pre-training, we propose a simple yet effective technology, namely, negative augmentation with language model. Evaluation results demonstrate the effectiveness of our proposed approach. Our models achieve the 4th rank on both official test sets of Subtask 1 and Subtask 2 with an accuracy of 87.9% and an accuracy of 92.8%, respectively. We further conduct comprehensive model analysis and observe interesting error cases, which may promote future researches.  ( 2 min )
    Ultra-NeRF: Neural Radiance Fields for Ultrasound Imaging. (arXiv:2301.10520v1 [eess.IV])
    We present a physics-enhanced implicit neural representation (INR) for ultrasound (US) imaging that learns tissue properties from overlapping US sweeps. Our proposed method leverages a ray-tracing-based neural rendering for novel view US synthesis. Recent publications demonstrated that INR models could encode a representation of a three-dimensional scene from a set of two-dimensional US frames. However, these models fail to consider the view-dependent changes in appearance and geometry intrinsic to US imaging. In our work, we discuss direction-dependent changes in the scene and show that a physics-inspired rendering improves the fidelity of US image synthesis. In particular, we demonstrate experimentally that our proposed method generates geometrically accurate B-mode images for regions with ambiguous representation owing to view-dependent differences of the US images. We conduct our experiments using simulated B-mode US sweeps of the liver and acquired US sweeps of a spine phantom tracked with a robotic arm. The experiments corroborate that our method generates US frames that enable consistent volume compounding from previously unseen views. To the best of our knowledge, the presented work is the first to address view-dependent US image synthesis using INR.
    Differentiable Prompt Makes Pre-trained Language Models Better Few-shot Learners. (arXiv:2108.13161v7 [cs.CL] CROSS LISTED)
    Large-scale pre-trained language models have contributed significantly to natural language processing by demonstrating remarkable abilities as few-shot learners. However, their effectiveness depends mainly on scaling the model parameters and prompt design, hindering their implementation in most real-world applications. This study proposes a novel pluggable, extensible, and efficient approach named DifferentiAble pRompT (DART), which can convert small language models into better few-shot learners without any prompt engineering. The main principle behind this approach involves reformulating potential natural language processing tasks into the task of a pre-trained language model and differentially optimizing the prompt template as well as the target label with backpropagation. Furthermore, the proposed approach can be: (i) Plugged to any pre-trained language models; (ii) Extended to widespread classification tasks. A comprehensive evaluation of standard NLP tasks demonstrates that the proposed approach achieves a better few-shot performance. Code is available in https://github.com/zjunlp/DART.
    PromptKG: A Prompt Learning Framework for Knowledge Graph Representation Learning and Application. (arXiv:2210.00305v1 [cs.CL] CROSS LISTED)
    Knowledge Graphs (KGs) often have two characteristics: heterogeneous graph structure and text-rich entity/relation information. KG representation models should consider graph structures and text semantics, but no comprehensive open-sourced framework is mainly designed for KG regarding informative text description. In this paper, we present PromptKG, a prompt learning framework for KG representation learning and application that equips the cutting-edge text-based methods, integrates a new prompt learning model and supports various tasks (e.g., knowledge graph completion, question answering, recommendation, and knowledge probing). PromptKG is publicly open-sourced at https://github.com/zjunlp/PromptKG with long-term technical support.
    Conceptualized Representation Learning for Chinese Biomedical Text Mining. (arXiv:2008.10813v1 [cs.CL] CROSS LISTED)
    Biomedical text mining is becoming increasingly important as the number of biomedical documents and web data rapidly grows. Recently, word representation models such as BERT has gained popularity among researchers. However, it is difficult to estimate their performance on datasets containing biomedical texts as the word distributions of general and biomedical corpora are quite different. Moreover, the medical domain has long-tail concepts and terminologies that are difficult to be learned via language models. For the Chinese biomedical text, it is more difficult due to its complex structure and the variety of phrase combinations. In this paper, we investigate how the recently introduced pre-trained language model BERT can be adapted for Chinese biomedical corpora and propose a novel conceptualized representation learning approach. We also release a new Chinese Biomedical Language Understanding Evaluation benchmark (\textbf{ChineseBLUE}). We examine the effectiveness of Chinese pre-trained models: BERT, BERT-wwm, RoBERTa, and our approach. Experimental results on the benchmark show that our approach could bring significant gain. We release the pre-trained model on GitHub: https://github.com/alibaba-research/ChineseBLUE.
    Normal vs. Adversarial: Salience-based Analysis of Adversarial Samples for Relation Extraction. (arXiv:2104.00312v4 [cs.CL] CROSS LISTED)
    Recent neural-based relation extraction approaches, though achieving promising improvement on benchmark datasets, have reported their vulnerability towards adversarial attacks. Thus far, efforts mostly focused on generating adversarial samples or defending adversarial attacks, but little is known about the difference between normal and adversarial samples. In this work, we take the first step to leverage the salience-based method to analyze those adversarial samples. We observe that salience tokens have a direct correlation with adversarial perturbations. We further find the adversarial perturbations are either those tokens not existing in the training set or superficial cues associated with relation labels. To some extent, our approach unveils the characters against adversarial samples. We release an open-source testbed, "DiagnoseAdv" in https://github.com/zjunlp/DiagnoseAdv.
    Reasoning Through Memorization: Nearest Neighbor Knowledge Graph Embeddings. (arXiv:2201.05575v2 [cs.CL] CROSS LISTED)
    Previous knowledge graph embedding approaches usually map entities to representations and utilize score functions to predict the target entities, yet they struggle to reason rare or emerging unseen entities. In this paper, we propose kNN-KGE, a new knowledge graph embedding approach with pre-trained language models, by linearly interpolating its entity distribution with k-nearest neighbors. We compute the nearest neighbors based on the distance in the entity embedding space from the knowledge store. Our approach can allow rare or emerging entities to be memorized explicitly rather than implicitly in model parameters. Experimental results demonstrate that our approach can improve inductive and transductive link prediction results and yield better performance for low-resource settings with only a few triples, which might be easier to reason via explicit memory. Code is available at https://github.com/zjunlp/KNN-KG.
    Data-Driven Certification of Neural Networks with Random Input Noise. (arXiv:2010.01171v2 [cs.LG] UPDATED)
    Methods to certify the robustness of neural networks in the presence of input uncertainty are vital in safety-critical settings. Most certification methods in the literature are designed for adversarial or worst-case inputs, but researchers have recently shown a need for methods that consider random input noise. In this paper, we examine the setting where inputs are subject to random noise coming from an arbitrary probability distribution. We propose a robustness certification method that lower-bounds the probability that network outputs are safe. This bound is cast as a chance-constrained optimization problem, which is then reformulated using input-output samples to make the optimization constraints tractable. We develop sufficient conditions for the resulting optimization to be convex, as well as on the number of samples needed to make the robustness bound hold with overwhelming probability. We show for a special case that the proposed optimization reduces to an intuitive closed-form solution. Case studies on synthetic, MNIST, and CIFAR-10 networks experimentally demonstrate that this method is able to certify robustness against various input noise regimes over larger uncertainty regions than prior state-of-the-art techniques.
    Logic-Based Explainability in Machine Learning. (arXiv:2211.00541v2 [cs.AI] UPDATED)
    The last decade witnessed an ever-increasing stream of successes in Machine Learning (ML). These successes offer clear evidence that ML is bound to become pervasive in a wide range of practical uses, including many that directly affect humans. Unfortunately, the operation of the most successful ML models is incomprehensible for human decision makers. As a result, the use of ML models, especially in high-risk and safety-critical settings is not without concern. In recent years, there have been efforts on devising approaches for explaining ML models. Most of these efforts have focused on so-called model-agnostic approaches. However, all model-agnostic and related approaches offer no guarantees of rigor, hence being referred to as non-formal. For example, such non-formal explanations can be consistent with different predictions, which renders them useless in practice. This paper overviews the ongoing research efforts on computing rigorous model-based explanations of ML models; these being referred to as formal explanations. These efforts encompass a variety of topics, that include the actual definitions of explanations, the characterization of the complexity of computing explanations, the currently best logical encodings for reasoning about different ML models, and also how to make explanations interpretable for human decision makers, among others.
    Disentangled Contrastive Learning for Learning Robust Textual Representations. (arXiv:2104.04907v2 [cs.CL] CROSS LISTED)
    Although the self-supervised pre-training of transformer models has resulted in the revolutionizing of natural language processing (NLP) applications and the achievement of state-of-the-art results with regard to various benchmarks, this process is still vulnerable to small and imperceptible permutations originating from legitimate inputs. Intuitively, the representations should be similar in the feature space with subtle input permutations, while large variations occur with different meanings. This motivates us to investigate the learning of robust textual representation in a contrastive manner. However, it is non-trivial to obtain opposing semantic instances for textual samples. In this study, we propose a disentangled contrastive learning method that separately optimizes the uniformity and alignment of representations without negative sampling. Specifically, we introduce the concept of momentum representation consistency to align features and leverage power normalization while conforming the uniformity. Our experimental results for the NLP benchmarks demonstrate that our approach can obtain better results compared with the baselines, as well as achieve promising improvements with invariance tests and adversarial attacks. The code is available in https://github.com/zxlzr/DCL.
    Multi-Agent Deep Reinforcement Learning for Efficient Passenger Delivery in Urban Air Mobility. (arXiv:2211.06890v2 [cs.MA] UPDATED)
    It has been considered that urban air mobility (UAM), also known as drone-taxi or electrical vertical takeoff and landing (eVTOL), will play a key role in future transportation. By putting UAM into practical future transportation, several benefits can be realized, i.e., (i) the total travel time of passengers can be reduced compared to traditional transportation and (ii) there is no environmental pollution and no special labor costs to operate the system because electric batteries will be used in UAM system. However, there are various dynamic and uncertain factors in the flight environment, i.e., passenger sudden service requests, battery discharge, and collision among UAMs. Therefore, this paper proposes a novel cooperative MADRL algorithm based on centralized training and distributed execution (CTDE) concepts for reliable and efficient passenger delivery in UAM networks. According to the performance evaluation results, we confirm that the proposed algorithm outperforms other existing algorithms in terms of the number of serviced passengers increase (30%) and the waiting time per serviced passenger decrease (26%).
    Self-Supervised Hierarchical Metrical Structure Modeling. (arXiv:2210.17183v2 [cs.SD] UPDATED)
    We propose a novel method to model hierarchical metrical structures for both symbolic music and audio signals in a self-supervised manner with minimal domain knowledge. The model trains and inferences on beat-aligned music signals and predicts an 8-layer hierarchical metrical tree from beat, measure to the section level. The training procedure does not require any hierarchical metrical labeling except for beats, purely relying on the nature of metrical regularity and inter-voice consistency as inductive biases. We show in experiments that the method achieves comparable performance with supervised baselines on multiple metrical structure analysis tasks on both symbolic music and audio signals. All demos, source code and pre-trained models are publicly available on GitHub.
    On Robustness and Bias Analysis of BERT-based Relation Extraction. (arXiv:2009.06206v5 [cs.CL] CROSS LISTED)
    Fine-tuning pre-trained models have achieved impressive performance on standard natural language processing benchmarks. However, the resultant model generalizability remains poorly understood. We do not know, for example, how excellent performance can lead to the perfection of generalization models. In this study, we analyze a fine-tuned BERT model from different perspectives using relation extraction. We also characterize the differences in generalization techniques according to our proposed improvements. From empirical experimentation, we find that BERT suffers a bottleneck in terms of robustness by way of randomizations, adversarial and counterfactual tests, and biases (i.e., selection and semantic). These findings highlight opportunities for future improvements. Our open-sourced testbed DiagnoseRE is available in \url{https://github.com/zjunlp/DiagnoseRE}.
    Kformer: Knowledge Injection in Transformer Feed-Forward Layers. (arXiv:2201.05742v2 [cs.CL] CROSS LISTED)
    Recent days have witnessed a diverse set of knowledge injection models for pre-trained language models (PTMs); however, most previous studies neglect the PTMs' own ability with quantities of implicit knowledge stored in parameters. A recent study has observed knowledge neurons in the Feed Forward Network (FFN), which are responsible for expressing factual knowledge. In this work, we propose a simple model, Kformer, which takes advantage of the knowledge stored in PTMs and external knowledge via knowledge injection in Transformer FFN layers. Empirically results on two knowledge-intensive tasks, commonsense reasoning (i.e., SocialIQA) and medical question answering (i.e., MedQA-USMLE), demonstrate that Kformer can yield better performance than other knowledge injection technologies such as concatenation or attention-based injection. We think the proposed simple model and empirical findings may be helpful for the community to develop more powerful knowledge injection methods. Code available in https://github.com/zjunlp/Kformer.
    Non-Asymptotic Analysis of a UCB-based Top Two Algorithm. (arXiv:2210.05431v2 [stat.ML] UPDATED)
    A Top Two sampling rule for bandit identification is a method which selects the next arm to sample from among two candidate arms, a leader and a challenger. Due to their simplicity and good empirical performance, they have received increased attention in recent years. However, for fixed-confidence best arm identification, theoretical guarantees for Top Two methods have only been obtained in the asymptotic regime, when the error level vanishes. In this paper, we derive the first non-asymptotic upper bound on the expected sample complexity of a Top Two algorithm, which holds for any error level. Our analysis highlights sufficient properties for a regret minimization algorithm to be used as leader. These properties are satisfied by the UCB algorithm, and our proposed UCB-based Top Two algorithm simultaneously enjoys non-asymptotic guarantees and competitive empirical performance.
    On the Probability of Necessity and Sufficiency of Explaining Graph Neural Networks: A Lower Bound Optimization Approach. (arXiv:2212.07056v2 [cs.LG] UPDATED)
    The explainability of Graph Neural Networks (GNNs) is critical to various GNN applications but remains an open challenge. A convincing explanation should be both necessary and sufficient simultaneously. However, existing GNN explaining approaches focus on only one of the two aspects, necessity or sufficiency, or a heuristic trade-off between the two. Theoretically, the Probability of Necessity and Sufficiency (PNS) can be applied to search for the most necessary and sufficient explanation since it can mathematically quantify the necessity and sufficiency of an explanation. Nevertheless, the difficulty of obtaining PNS due to non-monotonicity and the challenge of counterfactual estimation limit its wide use. To address the non-identifiability of PNS, we resort to a lower bound of PNS that can be optimized via counterfactual estimation, and propose Necessary and Sufficient Explanation for GNN (NSEG) via optimizing that lower bound. Specifically, we employ nearest neighbor matching to generate counterfactual samples and leverage continuous masks with a sampling strategy to optimize the lower bound. Empirical study shows that NSEG achieves excellent performance in generating the most necessary and sufficient explanations among a series of state-of-the-art methods.
    Learning to Ask for Data-Efficient Event Argument Extraction. (arXiv:2110.00479v1 [cs.CL] CROSS LISTED)
    Event argument extraction (EAE) is an important task for information extraction to discover specific argument roles. In this study, we cast EAE as a question-based cloze task and empirically analyze fixed discrete token template performance. As generating human-annotated question templates is often time-consuming and labor-intensive, we further propose a novel approach called "Learning to Ask," which can learn optimized question templates for EAE without human annotations. Experiments using the ACE-2005 dataset demonstrate that our method based on optimized questions achieves state-of-the-art performance in both the few-shot and supervised settings.
    Communication-Efficient Diffusion Strategy for Performance Improvement of Federated Learning with Non-IID Data. (arXiv:2207.07493v3 [cs.DC] UPDATED)
    Federated learning (FL) is a novel learning paradigm that addresses the privacy leakage challenge of centralized learning. However, in FL, users with non-independent and identically distributed (non-IID) characteristics can deteriorate the performance of the global model. Specifically, the global model suffers from the weight divergence challenge owing to non-IID data. To address the aforementioned challenge, we propose a novel diffusion strategy of the machine learning (ML) model (FedDif) to maximize the FL performance with non-IID data. In FedDif, users spread local models to neighboring users over D2D communications. FedDif enables the local model to experience different distributions before parameter aggregation. Furthermore, we theoretically demonstrate that FedDif can circumvent the weight divergence challenge. On the theoretical basis, we propose the communication-efficient diffusion strategy of the ML model, which can determine the trade-off between the learning performance and communication cost based on auction theory. The performance evaluation results show that FedDif improves the test accuracy of the global model by 10.37% compared to the baseline FL with non-IID settings. Moreover, FedDif improves the number of consumed sub-frames by 1.28 to 2.85 folds to the latest methods except for the model compression scheme. FedDif also improves the number of transmitted models by 1.43 to 2.67 folds to the latest methods.
    Relation Adversarial Network for Low Resource Knowledge Graph Completion. (arXiv:1911.03091v6 [cs.CL] CROSS LISTED)
    Knowledge Graph Completion (KGC) has been proposed to improve Knowledge Graphs by filling in missing connections via link prediction or relation extraction. One of the main difficulties for KGC is a low resource problem. Previous approaches assume sufficient training triples to learn versatile vectors for entities and relations, or a satisfactory number of labeled sentences to train a competent relation extraction model. However, low resource relations are very common in KGs, and those newly added relations often do not have many known samples for training. In this work, we aim at predicting new facts under a challenging setting where only limited training instances are available. We propose a general framework called Weighted Relation Adversarial Network, which utilizes an adversarial procedure to help adapt knowledge/features learned from high resource relations to different but related low resource relations. Specifically, the framework takes advantage of a relation discriminator to distinguish between samples from different relations, and help learn relation-invariant features more transferable from source relations to target relations. Experimental results show that the proposed approach outperforms previous methods regarding low resource settings for both link prediction and relation extraction.
    Schema-aware Reference as Prompt Improves Data-Efficient Relational Triple and Event Extraction. (arXiv:2210.10709v3 [cs.CL] CROSS LISTED)
    Information Extraction, which aims to extract structural relational triple or event from unstructured texts, often suffers from data scarcity issues. With the development of pre-trained language models, many prompt-based approaches to data-efficient information extraction have been proposed and achieved impressive performance. However, existing prompt learning methods for information extraction are still susceptible to several potential limitations: (i) semantic gap between natural language and output structure knowledge with pre-defined schema; (ii) representation learning with locally individual instances limits the performance given the insufficient features. In this paper, we propose a novel approach of schema-aware Reference As Prompt (RAP), which dynamically leverage schema and knowledge inherited from global (few-shot) training data for each sample. Specifically, we propose a schema-aware reference store, which unifies symbolic schema and relevant textual instances. Then, we employ a dynamic reference integration module to retrieve pertinent knowledge from the datastore as prompts during training and inference. Experimental results demonstrate that RAP can be plugged into various existing models and outperforms baselines in low-resource settings on four datasets of relational triple extraction and event extraction. In addition, we provide comprehensive empirical ablations and case analysis regarding different types and scales of knowledge in order to better understand the mechanisms of RAP. Code is available in https://github.com/zjunlp/RAP.
    LOGEN: Few-shot Logical Knowledge-Conditioned Text Generation with Self-training. (arXiv:2112.01404v2 [cs.CL] CROSS LISTED)
    Natural language generation from structured data mainly focuses on surface-level descriptions, suffering from uncontrollable content selection and low fidelity. Previous works leverage logical forms to facilitate logical knowledge-conditioned text generation. Though achieving remarkable progress, they are data-hungry, which makes the adoption for real-world applications challenging with limited data. To this end, this paper proposes a unified framework for logical knowledge-conditioned text generation in the few-shot setting. With only a few seeds logical forms (e.g., 20/100 shot), our approach leverages self-training and samples pseudo logical forms based on content and structure consistency. Experimental results demonstrate that our approach can obtain better few-shot performance than baselines.
    Document-level Relation Extraction as Semantic Segmentation. (arXiv:2106.03618v2 [cs.CL] CROSS LISTED)
    Document-level relation extraction aims to extract relations among multiple entity pairs from a document. Previously proposed graph-based or transformer-based models utilize the entities independently, regardless of global information among relational triples. This paper approaches the problem by predicting an entity-level relation matrix to capture local and global information, parallel to the semantic segmentation task in computer vision. Herein, we propose a Document U-shaped Network for document-level relation extraction. Specifically, we leverage an encoder module to capture the context information of entities and a U-shaped segmentation module over the image-style feature map to capture global interdependency among triples. Experimental results show that our approach can obtain state-of-the-art performance on three benchmark datasets DocRED, CDR, and GDA.
    Signature Methods in Machine Learning. (arXiv:2206.14674v3 [stat.ML] UPDATED)
    Signature-based techniques give mathematical insight into the interactions between complex streams of evolving data. These insights can be quite naturally translated into numerical approaches to understanding streamed data, and perhaps because of their mathematical precision, have proved useful in analysing streamed data in situations where the data is irregular, and not stationary, and the dimension of the data and the sample sizes are both moderate. Understanding streamed multi-modal data is exponential: a word in $n$ letters from an alphabet of size $d$ can be any one of $d^n$ messages. Signatures remove the exponential amount of noise that arises from sampling irregularity, but an exponential amount of information still remain. This survey aims to stay in the domain where that exponential scaling can be managed directly. Scalability issues are an important challenge in many problems but would require another survey article and further ideas. This survey describes a range of contexts where the data sets are small enough to remove the possibility of massive machine learning, and the existence of small sets of context free and principled features can be used effectively. The mathematical nature of the tools can make their use intimidating to non-mathematicians. The examples presented in this article are intended to bridge this communication gap and provide tractable working examples drawn from the machine learning context. Notebooks are available online for several of these examples. This survey builds on the earlier paper of Ilya Chevryev and Andrey Kormilitzin which had broadly similar aims at an earlier point in the development of this machinery. This article illustrates how the theoretical insights offered by signatures are simply realised in the analysis of application data in a way that is largely agnostic to the data type.
    Variance-Reduced Conservative Policy Iteration. (arXiv:2212.06283v2 [cs.LG] UPDATED)
    We study the sample complexity of reducing reinforcement learning to a sequence of empirical risk minimization problems over the policy space. Such reductions-based algorithms exhibit local convergence in the function space, as opposed to the parameter space for policy gradient algorithms, and thus are unaffected by the possibly non-linear or discontinuous parameterization of the policy class. We propose a variance-reduced variant of Conservative Policy Iteration that improves the sample complexity of producing a $\varepsilon$-functional local optimum from $O(\varepsilon^{-4})$ to $O(\varepsilon^{-3})$. Under state-coverage and policy-completeness assumptions, the algorithm enjoys $\varepsilon$-global optimality after sampling $O(\varepsilon^{-2})$ times, improving upon the previously established $O(\varepsilon^{-3})$ sample requirement.
    Linear TreeShap. (arXiv:2209.08192v2 [cs.LG] UPDATED)
    Decision trees are well-known due to their ease of interpretability. To improve accuracy, we need to grow deep trees or ensembles of trees. These are hard to interpret, offsetting their original benefits. Shapley values have recently become a popular way to explain the predictions of tree-based machine learning models. It provides a linear weighting to features independent of the tree structure. The rise in popularity is mainly due to TreeShap, which solves a general exponential complexity problem in polynomial time. Following extensive adoption in the industry, more efficient algorithms are required. This paper presents a more efficient and straightforward algorithm: Linear TreeShap. Like TreeShap, Linear TreeShap is exact and requires the same amount of memory.
    On the Semi-supervised Expectation Maximization. (arXiv:2211.00537v2 [cs.LG] UPDATED)
    The Expectation Maximization (EM) algorithm is widely used as an iterative modification to maximum likelihood estimation when the data is incomplete. We focus on a semi-supervised case to learn the model from labeled and unlabeled samples. Existing work in the semi-supervised case has focused mainly on performance rather than convergence guarantee, however we focus on the contribution of the labeled samples to the convergence rate. The analysis clearly demonstrates how the labeled samples improve the convergence rate for the exponential family mixture model. In this case, we assume that the population EM (EM with unlimited data) is initialized within the neighborhood of global convergence for the population EM that consists solely of samples that have not been labeled. The analysis for the labeled samples provides a comprehensive description of the convergence rate for the Gaussian mixture model. In addition, we extend the findings for labeled samples and offer an alternative proof for the population EM's convergence rate with unlabeled samples for the symmetric mixture of two Gaussians.
    A Sequential Deep Learning Algorithm for Sampled Mixed-integer Optimisation Problems. (arXiv:2301.10703v1 [math.OC])
    Mixed-integer optimisation problems can be computationally challenging. Here, we introduce and analyse two efficient algorithms with a specific sequential design that are aimed at dealing with sampled problems within this class. At each iteration step of both algorithms, we first test the feasibility of a given test solution for each and every constraint associated with the sampled optimisation at hand, while also identifying those constraints that are violated. Subsequently, an optimisation problem is constructed with a constraint set consisting of the current basis -- namely the smallest set of constraints that fully specifies the current test solution -- as well as constraints related to a limited number of the identified violating samples. We show that both algorithms exhibit finite-time convergence towards the optimal solution. Algorithm 2 features a neural network classifier that notably improves the computational performance compared to Algorithm 1. We establish quantitatively the efficacy of these algorithms by means of three numerical tests: robust optimal power flow, robust unit commitment, and robust random mixed-integer linear program.
    Obstacle Identification and Ellipsoidal Decomposition for Fast Motion Planning in Unknown Dynamic Environments. (arXiv:2209.14233v2 [cs.RO] UPDATED)
    Collision avoidance in the presence of dynamic obstacles in unknown environments is one of the most critical challenges for unmanned systems. In this paper, we present a method that identifies obstacles in terms of ellipsoids to estimate linear and angular obstacle velocities. Our proposed method is based on the idea of any object can be approximately expressed by ellipsoids. To achieve this, we propose a method based on variational Bayesian estimation of Gaussian mixture model, the Kyachiyan algorithm, and a refinement algorithm. Our proposed method does not require knowledge of the number of clusters and can operate in real-time, unlike existing optimization-based methods. In addition, we define an ellipsoid-based feature vector to match obstacles given two timely close point frames. Our method can be applied to any environment with static and dynamic obstacles, including the ones with rotating obstacles. We compare our algorithm with other clustering methods and show that when coupled with a trajectory planner, the overall system can efficiently traverse unknown environments in the presence of dynamic obstacles.
    A General Stochastic Optimization Framework for Convergence Bidding. (arXiv:2210.06543v3 [math.OC] UPDATED)
    Convergence (virtual) bidding is an important part of two-settlement electric power markets as it can effectively reduce discrepancies between the day-ahead and real-time markets. Consequently, there is extensive research into the bidding strategies of virtual participants aiming to obtain optimal bids to submit to the day-ahead market. In this paper, we introduce a price-based general stochastic optimization framework to obtain optimal convergence bid curves. Within this framework, we develop a computationally tractable linear programming-based optimization model, which produces bid prices and volumes simultaneously. We also show that different approximations and simplifications in the general model lead naturally to state-of-the-art convergence bidding approaches, such as self-scheduling and opportunistic approaches. Our general framework also provides a straightforward way to compare the performance of these models, which is demonstrated by numerical experiments on the California (CAISO) market.
    Adversarial De-confounding in Individualised Treatment Effects Estimation. (arXiv:2210.10530v3 [cs.LG] UPDATED)
    Observational studies have recently received significant attention from the machine learning community due to the increasingly available non-experimental observational data and the limitations of the experimental studies, such as considerable cost, impracticality, small and less representative sample sizes, etc. In observational studies, de-confounding is a fundamental problem of individualised treatment effects (ITE) estimation. This paper proposes disentangled representations with adversarial training to selectively balance the confounders in the binary treatment setting for the ITE estimation. The adversarial training of treatment policy selectively encourages treatment-agnostic balanced representations for the confounders and helps to estimate the ITE in the observational studies via counterfactual inference. Empirical results on synthetic and real-world datasets, with varying degrees of confounding, prove that our proposed approach improves the state-of-the-art methods in achieving lower error in the ITE estimation.
    Context-aware Deep Model for Entity Recommendation in Search Engine at Alibaba. (arXiv:1909.04493v1 [cs.IR] CROSS LISTED)
    Entity recommendation, providing search users with an improved experience via assisting them in finding related entities for a given query, has become an indispensable feature of today's search engines. Existing studies typically only consider the queries with explicit entities. They usually fail to handle complex queries that without entities, such as "what food is good for cold weather", because their models could not infer the underlying meaning of the input text. In this work, we believe that contexts convey valuable evidence that could facilitate the semantic modeling of queries, and take them into consideration for entity recommendation. In order to better model the semantics of queries and entities, we learn the representation of queries and entities jointly with attentive deep neural networks. We evaluate our approach using large-scale, real-world search logs from a widely used commercial Chinese search engine. Our system has been deployed in ShenMa Search Engine and you can fetch it in UC Browser of Alibaba. Results from online A/B test suggest that the impression efficiency of click-through rate increased by 5.1% and page view increased by 5.5%.
    Convergence of Random Reshuffling Under The Kurdyka-{\L}ojasiewicz Inequality. (arXiv:2110.04926v4 [math.OC] UPDATED)
    We study the random reshuffling (RR) method for smooth nonconvex optimization problems with a finite-sum structure. Though this method is widely utilized in practice such as the training of neural networks, its convergence behavior is only understood in several limited settings. In this paper, under the well-known Kurdyka-Lojasiewicz (KL) inequality, we establish strong limit-point convergence results for RR with appropriate diminishing step sizes, namely, the whole sequence of iterates generated by RR is convergent and converges to a single stationary point in an almost sure sense. In addition, we derive the corresponding rate of convergence, depending on the KL exponent and the suitably selected diminishing step sizes. When the KL exponent lies in $[0,\frac12]$, the convergence is at a rate of $\mathcal{O}(t^{-1})$ with $t$ counting the iteration number. When the KL exponent belongs to $(\frac12,1)$, our derived convergence rate is of the form $\mathcal{O}(t^{-q})$ with $q\in (0,1)$ depending on the KL exponent. The standard KL inequality-based convergence analysis framework only applies to algorithms with a certain descent property. We conduct a novel convergence analysis for the non-descent RR method with diminishing step sizes based on the KL inequality, which generalizes the standard KL framework. We summarize our main steps and core ideas in an informal analysis framework, which is of independent interest. As a direct application of this framework, we also establish similar strong limit-point convergence results for the reshuffled proximal point method.
    A systematic review of biologically-informed deep learning models for cancer: fundamental trends for encoding and interpreting oncology data. (arXiv:2207.00812v3 [q-bio.QM] UPDATED)
    There is an increasing interest in the use of Deep Learning (DL) based methods as a supporting analytical framework in oncology. However, most direct applications of DL will deliver models with limited transparency and explainability, which constrain their deployment in biomedical settings. This systematic review discusses DL models used to support inference in cancer biology with a particular emphasis on multi-omics analysis. It focuses on how existing models address the need for better dialogue with prior knowledge, biological plausibility and interpretability, fundamental properties in the biomedical domain. For this, we retrieved and analyzed 42 studies focusing on emerging architectural and methodological advances, the encoding of biological domain knowledge and the integration of explainability methods. We discuss the recent evolutionary arch of DL models in the direction of integrating prior biological relational and network knowledge to support better generalisation (e.g. pathways or Protein-Protein-Interaction networks) and interpretability. This represents a fundamental functional shift towards models which can integrate mechanistic and statistical inference aspects. We introduce a concept of bio-centric interpretability and according to its taxonomy, we discuss representational methodologies for the integration of domain prior knowledge in such models. The paper provides a critical outlook into contemporary methods for explainability and interpretabiltiy used in DL for cancer. The analysis points in the direction of a convergence between encoding prior knowledge and improved interpretability. We introduce bio-centric interpretability which is an important step towards formalisation of biological interpretability of DL models and developing methods that are less problem- or application-specific.
    Asymptotic Analysis of Deep Residual Networks. (arXiv:2212.08199v2 [cs.LG] UPDATED)
    We investigate the asymptotic properties of deep Residual networks (ResNets) as the number of layers increases. We first show the existence of scaling regimes for trained weights markedly different from those implicitly assumed in the neural ODE literature. We study the convergence of the hidden state dynamics in these scaling regimes, showing that one may obtain an ODE, a stochastic differential equation (SDE) or neither of these. In particular, our findings point to the existence of a diffusive regime in which the deep network limit is described by a class of stochastic differential equations (SDEs). Finally, we derive the corresponding scaling limits for the backpropagation dynamics.
    To be or not to be stable, that is the question: understanding neural networks for inverse problems. (arXiv:2211.13692v2 [math.NA] UPDATED)
    The solution of linear inverse problems arising, for example, in signal and image processing is a challenging problem since the ill-conditioning amplifies the noise on the data. Recently introduced algorithms based on deep learning overwhelm the more traditional model-based approaches, but they typically suffer from instability with respect to data perturbation. In this paper, we theoretically analyze the trade-off between neural networks stability and accuracy in the solution of linear inverse problems. Moreover, we propose different supervised and unsupervised solutions to increase network stability that maintains good accuracy by inheriting, in the network training, regularization from a model-based iterative scheme. Extensive numerical experiments on image deblurring confirm the theoretical results and the effectiveness of the proposed deep learning-based solutions to stably solve noisy inverse problems.
    Connecting metrics for shape-texture knowledge in computer vision. (arXiv:2301.10608v1 [cs.CV])
    Modern artificial neural networks, including convolutional neural networks and vision transformers, have mastered several computer vision tasks, including object recognition. However, there are many significant differences between the behavior and robustness of these systems and of the human visual system. Deep neural networks remain brittle and susceptible to many changes in the image that do not cause humans to misclassify images. Part of this different behavior may be explained by the type of features humans and deep neural networks use in vision tasks. Humans tend to classify objects according to their shape while deep neural networks seem to rely mostly on texture. Exploring this question is relevant, since it may lead to better performing neural network architectures and to a better understanding of the workings of the vision system of primates. In this work, we advance the state of the art in our understanding of this phenomenon, by extending previous analyses to a much larger set of deep neural network architectures. We found that the performance of models in image classification tasks is highly correlated with their shape bias measured at the output and penultimate layer. Furthermore, our results showed that the number of neurons that represent shape and texture are strongly anti-correlated, thus providing evidence that there is competition between these two types of features. Finally, we observed that while in general there is a correlation between performance and shape bias, there are significant variations between architecture families.
    FewShotTextGCN: K-hop neighborhood regularization for few-shot learning on graphs. (arXiv:2301.10481v1 [cs.CL])
    We present FewShotTextGCN, a novel method designed to effectively utilize the properties of word-document graphs for improved learning in low-resource settings. We introduce K-hop Neighbourhood Regularization, a regularizer for heterogeneous graphs, and show that it stabilizes and improves learning when only a few training samples are available. We furthermore propose a simplification in the graph-construction method, which results in a graph that is $\sim$7 times less dense and yields better performance in little-resource settings while performing on par with the state of the art in high-resource settings. Finally, we introduce a new variant of Adaptive Pseudo-Labeling tailored for word-document graphs. When using as little as 20 samples for training, we outperform a strong TextGCN baseline with 17% in absolute accuracy on average over eight languages. We demonstrate that our method can be applied to document classification without any language model pretraining on a wide range of typologically diverse languages while performing on par with large pretrained language models.
    User-Interactive Offline Reinforcement Learning. (arXiv:2205.10629v2 [cs.LG] UPDATED)
    Offline reinforcement learning algorithms still lack trust in practice due to the risk that the learned policy performs worse than the original policy that generated the dataset or behaves in an unexpected way that is unfamiliar to the user. At the same time, offline RL algorithms are not able to tune their most important hyperparameter - the proximity of the learned policy to the original policy. We propose an algorithm that allows the user to tune this hyperparameter at runtime, thereby addressing both of the above mentioned issues simultaneously. This allows users to start with the original behavior and grant successively greater deviation, as well as stopping at any time when the policy deteriorates or the behavior is too far from the familiar one.
    Near-Optimal No-Regret Learning for Correlated Equilibria in Multi-Player General-Sum Games. (arXiv:2111.06008v3 [cs.LG] UPDATED)
    Recently, Daskalakis, Fishelson, and Golowich (DFG) (NeurIPS`21) showed that if all agents in a multi-player general-sum normal-form game employ Optimistic Multiplicative Weights Update (OMWU), the external regret of every player is $O(\textrm{polylog}(T))$ after $T$ repetitions of the game. We extend their result from external regret to internal regret and swap regret, thereby establishing uncoupled learning dynamics that converge to an approximate correlated equilibrium at the rate of $\tilde{O}(T^{-1})$. This substantially improves over the prior best rate of convergence for correlated equilibria of $O(T^{-3/4})$ due to Chen and Peng (NeurIPS`20), and it is optimal -- within the no-regret framework -- up to polylogarithmic factors in $T$. To obtain these results, we develop new techniques for establishing higher-order smoothness for learning dynamics involving fixed point operations. Specifically, we establish that the no-internal-regret learning dynamics of Stoltz and Lugosi (Mach Learn`05) are equivalently simulated by no-external-regret dynamics on a combinatorial space. This allows us to trade the computation of the stationary distribution on a polynomial-sized Markov chain for a (much more well-behaved) linear transformation on an exponential-sized set, enabling us to leverage similar techniques as DFG to near-optimally bound the internal regret. Moreover, we establish an $O(\textrm{polylog}(T))$ no-swap-regret bound for the classic algorithm of Blum and Mansour (BM) (JMLR`07). We do so by introducing a technique based on the Cauchy Integral Formula that circumvents the more limited combinatorial arguments of DFG. In addition to shedding clarity on the near-optimal regret guarantees of BM, our arguments provide insights into the various ways in which the techniques by DFG can be extended and leveraged in the analysis of more involved learning algorithms.
    Distributed Control of Partial Differential Equations Using Convolutional Reinforcement Learning. (arXiv:2301.10737v1 [cs.LG])
    We present a convolutional framework which significantly reduces the complexity and thus, the computational effort for distributed reinforcement learning control of dynamical systems governed by partial differential equations (PDEs). Exploiting translational invariances, the high-dimensional distributed control problem can be transformed into a multi-agent control problem with many identical, uncoupled agents. Furthermore, using the fact that information is transported with finite velocity in many cases, the dimension of the agents' environment can be drastically reduced using a convolution operation over the state space of the PDE. In this setting, the complexity can be flexibly adjusted via the kernel width or by using a stride greater than one. Moreover, scaling from smaller to larger systems -- or the transfer between different domains -- becomes a straightforward task requiring little effort. We demonstrate the performance of the proposed framework using several PDE examples with increasing complexity, where stabilization is achieved by training a low-dimensional deep deterministic policy gradient agent using minimal computing resources.
    Tighter Bounds on the Expressivity of Transformer Encoders. (arXiv:2301.10743v1 [cs.LG])
    Characterizing neural networks in terms of better-understood formal systems has the potential to yield new insights into the power and limitations of these networks. Doing so for transformers remains an active area of research. Bhattamishra and others have shown that transformer encoders are at least as expressive as a certain kind of counter machine, while Merrill and Sabharwal have shown that fixed-precision transformer encoders recognize only languages in uniform $TC^0$. We connect and strengthen these results by identifying a variant of first-order logic with counting quantifiers that is simultaneously an upper bound for fixed-precision transformer encoders and a lower bound for transformer encoders. This brings us much closer than before to an exact characterization of the languages that transformer encoders recognize.
    Spatio-Temporal Graph Neural Networks: A Survey. (arXiv:2301.10569v1 [cs.LG])
    Graph Neural Networks have gained huge interest in the past few years. These powerful algorithms expanded deep learning models to non-Euclidean space and were able to achieve state of art performance in various applications including recommender systems and social networks. However, this performance is based on static graph structures assumption which limits the Graph Neural Networks performance when the data varies with time. Temporal Graph Neural Networks are extension of Graph Neural Networks that takes the time factor into account. Recently, various Temporal Graph Neural Network algorithms were proposed and achieved superior performance compared to other deep learning algorithms in several time dependent applications. This survey discusses interesting topics related to Spatio temporal Graph Neural Networks, including algorithms, application, and open challenges.
    Evaluation of the syllables pronunciation quality in speech rehabilitation through the solution of the classification problem. (arXiv:2301.10585v1 [cs.LG])
    The solution of the problem of assessing the quality of the pronunciation of syllables during speech rehabilitation after surgical treatment of oncological diseases of the organs of the speech-forming tract is considered in the work. The assessment is carried out by solving the problem of classifying syllables into two classes: before and immediately after surgical treatment. A classifier is built on the basis of the LSTM neural network and trained on the records before the operation and immediately after it, before the start of speech rehabilitation. The measure of assessing the quality of syllables pronunciation in the process of rehabilitation is the metric of belonging to the class before the operation. A study is being made of the influence of taking into account problematic phonemes, the gender of the patient, his individual characteristics on the resulting estimates of the quality of pronunciation. A comparison with existing types of syllable pronunciation quality assessments is carried out, recommendations are given for the practical application of the resulting new class of pronunciation quality assessments.
    Prediction of COVID-19 by Its Variants using Multivariate Data-driven Deep Learning Models. (arXiv:2301.10616v1 [cs.CE])
    The Coronavirus Disease 2019 or the COVID-19 pandemic has swept almost all parts of the world since the first case was found in Wuhan, China, in December 2019. With the increasing number of COVID-19 cases in the world, SARS-CoV-2 has mutated into various variants. Given the increasingly dangerous conditions of the pandemic, it is crucial to know when the pandemic will stop by predicting confirmed cases of COVID-19. Therefore, many studies have raised COVID-19 as a case study to overcome the ongoing pandemic using the Deep Learning method, namely LSTM, with reasonably accurate results and small error values. LSTM training is used to predict confirmed cases of COVID-19 based on variants that have been identified using ECDC's COVID-19 dataset containing confirmed cases of COVID-19 that have been identified from 30 countries in Europe. Tests were conducted using the LSTM and BiLSTM models with the addition of RNN as comparisons on hidden size and layer size. The obtained result showed that in testing hidden sizes 25, 50, 75 to 100, the RNN model provided better results, with the minimum MSE value of 0.01 and the RMSE value of 0.012 for B.1.427/B.1.429 variant with hidden size 100. In further testing of layer sizes 2, 3, 4, and 5, the result shows that the BiLSTM model provided better results, with minimum MSE value of 0.01 and the RMSE of 0.01 for the B.1.427/B.1.429 variant with hidden size 100 and layer size 2.
    Learning Policies with Zero or Bounded Constraint Violation for Constrained MDPs. (arXiv:2106.02684v3 [cs.LG] UPDATED)
    We address the issue of safety in reinforcement learning. We pose the problem in an episodic framework of a constrained Markov decision process. Existing results have shown that it is possible to achieve a reward regret of $\tilde{\mathcal{O}}(\sqrt{K})$ while allowing an $\tilde{\mathcal{O}}(\sqrt{K})$ constraint violation in $K$ episodes. A critical question that arises is whether it is possible to keep the constraint violation even smaller. We show that when a strictly safe policy is known, then one can confine the system to zero constraint violation with arbitrarily high probability while keeping the reward regret of order $\tilde{\mathcal{O}}(\sqrt{K})$. The algorithm which does so employs the principle of optimistic pessimism in the face of uncertainty to achieve safe exploration. When no strictly safe policy is known, though one is known to exist, then it is possible to restrict the system to bounded constraint violation with arbitrarily high probability. This is shown to be realized by a primal-dual algorithm with an optimistic primal estimate and a pessimistic dual update.
    Trainable Loss Weights in Super-Resolution. (arXiv:2301.10575v1 [cs.CV])
    In recent years, research on super-resolution has primarily focused on the development of unsupervised models, blind networks, and the use of optimization methods in non-blind models. But, limited research has discussed the loss function in the super-resolution process. The majority of those studies have only used perceptual similarity in a conventional way. This is while the development of appropriate loss can improve the quality of other methods as well. In this article, a new weighting method for pixel-wise loss is proposed. With the help of this method, it is possible to use trainable weights based on the general structure of the image and its perceptual features while maintaining the advantages of pixel-wise loss. Also, a criterion for comparing weights of loss is introduced so that the weights can be estimated directly by a convolutional neural network using this criterion. In addition, in this article, the expectation-maximization method is used for the simultaneous estimation super-resolution network and weighting network. In addition, a new activation function, called "FixedSum", is introduced which can keep the sum of all components of vector constants while keeping the output components between zero and one. As shown in the experimental results section, weighted loss by the proposed method leads to better results than the unweighted loss in both signal-to-noise and perceptual similarity senses.
    Certifiable 3D Object Pose Estimation: Foundations, Learning Models, and Self-Training. (arXiv:2206.11215v3 [cs.CV] UPDATED)
    We consider a certifiable object pose estimation problem, where -- given a partial point cloud of an object -- the goal is to not only estimate the object pose, but also to provide a certificate of correctness for the resulting estimate. Our first contribution is a general theory of certification for end-to-end perception models. In particular, we introduce the notion of $\zeta$-correctness, which bounds the distance between an estimate and the ground truth. We show that $\zeta$-correctness can be assessed by implementing two certificates: (i) a certificate of observable correctness, that asserts if the model output is consistent with the input data and prior information, (ii) a certificate of non-degeneracy, that asserts whether the input data is sufficient to compute a unique estimate. Our second contribution is to apply this theory and design a new learning-based certifiable pose estimator. We propose C-3PO, a semantic-keypoint-based pose estimation model, augmented with the two certificates, to solve the certifiable pose estimation problem. C-3PO also includes a keypoint corrector, implemented as a differentiable optimization layer, that can correct large detection errors (e.g. due to the sim-to-real gap). Our third contribution is a novel self-supervised training approach that uses our certificate of observable correctness to provide the supervisory signal to C-3PO during training. In it, the model trains only on the observably correct input-output pairs, in each training iteration. As training progresses, we see that the observably correct input-output pairs grow, eventually reaching near 100% in many cases. Our experiments show that (i) standard semantic-keypoint-based methods outperform more recent alternatives, (ii) C-3PO further improves performance and significantly outperforms all the baselines, and (iii) C-3PO's certificates are able to discern correct pose estimates.
    Two Efficient Ridge Solutions for the Incremental Broad Learning System on Added Inputs. (arXiv:1911.07292v5 [cs.LG] UPDATED)
    This paper proposes the recursive and square-root BLS algorithms to improve the original BLS for new added inputs, which utilize the inverse and inverse Cholesky factor of the Hermitian matrix in the ridge inverse, respectively, to update the ridge solution. The recursive BLS updates the inverse by the matrix inversion lemma, while the square-root BLS updates the upper-triangular inverse Cholesky factor by multiplying it with an upper-triangular intermediate matrix. When the added p training samples are more than the total k nodes in the network, i.e., p>k, the inverse of a sum of matrices is applied to take a smaller matrix inversion or inverse Cholesky factorization. For the distributed BLS with data-parallelism, we introduce the parallel implementation of the square-root BLS, which is deduced from the parallel implementation of the inverse Cholesky factorization. The original BLS based on the generalized inverse with the ridge regression assumes the ridge parameter lamda->0 in the ridge inverse. When lambda->0 is not satisfied, the numerical experiments on the MNIST and NORB datasets show that both the proposed ridge solutions improve the testing accuracy of the original BLS, and the improvement becomes more significant as lambda is bigger. On the other hand, compared to the original BLS, both the proposed BLS algorithms theoretically require less complexities, and are significantly faster in the simulations on the MNIST dataset. The speedups in total training time of the recursive and square-root BLS algorithms over the original BLS are 4.41 and 6.92 respectively when p > k, and are 2.80 and 1.59 respectively when p < k.
    Certifying Neural Network Robustness to Random Input Noise from Samples. (arXiv:2010.07532v2 [cs.LG] UPDATED)
    Methods to certify the robustness of neural networks in the presence of input uncertainty are vital in safety-critical settings. Most certification methods in the literature are designed for adversarial input uncertainty, but researchers have recently shown a need for methods that consider random uncertainty. In this paper, we propose a novel robustness certification method that upper bounds the probability of misclassification when the input noise follows an arbitrary probability distribution. This bound is cast as a chance-constrained optimization problem, which is then reformulated using input-output samples to replace the optimization constraints. The resulting optimization reduces to a linear program with an analytical solution. Furthermore, we develop a sufficient condition on the number of samples needed to make the misclassification bound hold with overwhelming probability. Our case studies on MNIST classifiers show that this method is able to certify a uniform infinity-norm uncertainty region with a radius of nearly 50 times larger than what the current state-of-the-art method can certify.
    Bridging Text and Knowledge with Multi-Prototype Embedding for Few-Shot Relational Triple Extraction. (arXiv:2010.16059v1 [cs.CL] CROSS LISTED)
    Current supervised relational triple extraction approaches require huge amounts of labeled data and thus suffer from poor performance in few-shot settings. However, people can grasp new knowledge by learning a few instances. To this end, we take the first step to study the few-shot relational triple extraction, which has not been well understood. Unlike previous single-task few-shot problems, relational triple extraction is more challenging as the entities and relations have implicit correlations. In this paper, We propose a novel multi-prototype embedding network model to jointly extract the composition of relational triples, namely, entity pairs and corresponding relations. To be specific, we design a hybrid prototypical learning mechanism that bridges text and knowledge concerning both entities and relations. Thus, implicit correlations between entities and relations are injected. Additionally, we propose a prototype-aware regularization to learn more representative prototypes. Experimental results demonstrate that the proposed method can improve the performance of the few-shot triple extraction.
    LightNER: A Lightweight Tuning Paradigm for Low-resource NER via Pluggable Prompting. (arXiv:2109.00720v5 [cs.CL] CROSS LISTED)
    Most NER methods rely on extensive labeled data for model training, which struggles in the low-resource scenarios with limited training data. Existing dominant approaches usually suffer from the challenge that the target domain has different label sets compared with a resource-rich source domain, which can be concluded as class transfer and domain transfer. In this paper, we propose a lightweight tuning paradigm for low-resource NER via pluggable prompting (LightNER). Specifically, we construct the unified learnable verbalizer of entity categories to generate the entity span sequence and entity categories without any label-specific classifiers, thus addressing the class transfer issue. We further propose a pluggable guidance module by incorporating learnable parameters into the self-attention layer as guidance, which can re-modulate the attention and adapt pre-trained weights. Note that we only tune those inserted module with the whole parameter of the pre-trained language model fixed, thus, making our approach lightweight and flexible for low-resource scenarios and can better transfer knowledge across domains. Experimental results show that LightNER can obtain comparable performance in the standard supervised setting and outperform strong baselines in low-resource settings. Code is in https://github.com/zjunlp/DeepKE/tree/main/example/ner/few-shot.
    VN-Transformer: Rotation-Equivariant Attention for Vector Neurons. (arXiv:2206.04176v3 [cs.CV] UPDATED)
    Rotation equivariance is a desirable property in many practical applications such as motion forecasting and 3D perception, where it can offer benefits like sample efficiency, better generalization, and robustness to input perturbations. Vector Neurons (VN) is a recently developed framework offering a simple yet effective approach for deriving rotation-equivariant analogs of standard machine learning operations by extending one-dimensional scalar neurons to three-dimensional "vector neurons." We introduce a novel "VN-Transformer" architecture to address several shortcomings of the current VN models. Our contributions are: $(i)$ we derive a rotation-equivariant attention mechanism which eliminates the need for the heavy feature preprocessing required by the original Vector Neurons models; $(ii)$ we extend the VN framework to support non-spatial attributes, expanding the applicability of these models to real-world datasets; $(iii)$ we derive a rotation-equivariant mechanism for multi-scale reduction of point-cloud resolution, greatly speeding up inference and training; $(iv)$ we show that small tradeoffs in equivariance ($\epsilon$-approximate equivariance) can be used to obtain large improvements in numerical stability and training robustness on accelerated hardware, and we bound the propagation of equivariance violations in our models. Finally, we apply our VN-Transformer to 3D shape classification and motion forecasting with compelling results.
    PULL: Reactive Log Anomaly Detection Based On Iterative PU Learning. (arXiv:2301.10681v1 [cs.LG])
    Due to the complexity of modern IT services, failures can be manifold, occur at any stage, and are hard to detect. For this reason, anomaly detection applied to monitoring data such as logs allows gaining relevant insights to improve IT services steadily and eradicate failures. However, existing anomaly detection methods that provide high accuracy often rely on labeled training data, which are time-consuming to obtain in practice. Therefore, we propose PULL, an iterative log analysis method for reactive anomaly detection based on estimated failure time windows provided by monitoring systems instead of labeled data. Our attention-based model uses a novel objective function for weak supervision deep learning that accounts for imbalanced data and applies an iterative learning strategy for positive and unknown samples (PU learning) to identify anomalous logs. Our evaluation shows that PULL consistently outperforms ten benchmark baselines across three different datasets and detects anomalous log messages with an F1-score of more than 0.99 even within imprecise failure time windows.
    Probing Taxonomic and Thematic Embeddings for Taxonomic Information. (arXiv:2301.10656v1 [cs.CL])
    Modelling taxonomic and thematic relatedness is important for building AI with comprehensive natural language understanding. The goal of this paper is to learn more about how taxonomic information is structurally encoded in embeddings. To do this, we design a new hypernym-hyponym probing task and perform a comparative probing study of taxonomic and thematic SGNS and GloVe embeddings. Our experiments indicate that both types of embeddings encode some taxonomic information, but the amount, as well as the geometric properties of the encodings, are independently related to both the encoder architecture, as well as the embedding training data. Specifically, we find that only taxonomic embeddings carry taxonomic information in their norm, which is determined by the underlying distribution in the data.
    XLM-V: Overcoming the Vocabulary Bottleneck in Multilingual Masked Language Models. (arXiv:2301.10472v1 [cs.CL])
    Large multilingual language models typically rely on a single vocabulary shared across 100+ languages. As these models have increased in parameter count and depth, vocabulary size has remained largely unchanged. This vocabulary bottleneck limits the representational capabilities of multilingual models like XLM-R. In this paper, we introduce a new approach for scaling to very large multilingual vocabularies by de-emphasizing token sharing between languages with little lexical overlap and assigning vocabulary capacity to achieve sufficient coverage for each individual language. Tokenizations using our vocabulary are typically more semantically meaningful and shorter compared to XLM-R. Leveraging this improved vocabulary, we train XLM-V, a multilingual language model with a one million token vocabulary. XLM-V outperforms XLM-R on every task we tested on ranging from natural language inference (XNLI), question answering (MLQA, XQuAD, TyDiQA), and named entity recognition (WikiAnn) to low-resource tasks (Americas NLI, MasakhaNER).
    A Boosting Approach to Reinforcement Learning. (arXiv:2108.09767v2 [cs.LG] UPDATED)
    Reducing reinforcement learning to supervised learning is a well-studied and effective approach that leverages the benefits of compact function approximation to deal with large-scale Markov decision processes. Independently, the boosting methodology (e.g. AdaBoost) has proven to be indispensable in designing efficient and accurate classification algorithms by combining inaccurate rules-of-thumb. In this paper, we take a further step: we reduce reinforcement learning to a sequence of weak learning problems. Since weak learners perform only marginally better than random guesses, such subroutines constitute a weaker assumption than the availability of an accurate supervised learning oracle. We prove that the sample complexity and running time bounds of the proposed method do not explicitly depend on the number of states. While existing results on boosting operate on convex losses, the value function over policies is non-convex. We show how to use a non-convex variant of the Frank-Wolfe method for boosting, that additionally improves upon the known sample complexity and running time even for reductions to supervised learning.
    Transfer Learning in Deep Learning Models for Building Load Forecasting: Case of Limited Data. (arXiv:2301.10663v1 [cs.LG])
    Precise load forecasting in buildings could increase the bill savings potential and facilitate optimized strategies for power generation planning. With the rapid evolution of computer science, data-driven techniques, in particular the Deep Learning models, have become a promising solution for the load forecasting problem. These models have showed accurate forecasting results; however, they need abundance amount of historical data to maintain the performance. Considering the new buildings and buildings with low resolution measuring equipment, it is difficult to get enough historical data from them, leading to poor forecasting performance. In order to adapt Deep Learning models for buildings with limited and scarce data, this paper proposes a Building-to-Building Transfer Learning framework to overcome the problem and enhance the performance of Deep Learning models. The transfer learning approach was applied to a new technique known as Transformer model due to its efficacy in capturing data trends. The performance of the algorithm was tested on a large commercial building with limited data. The result showed that the proposed approach improved the forecasting accuracy by 56.8% compared to the case of conventional deep learning where training from scratch is used. The paper also compared the proposed Transformer model to other sequential deep learning models such as Long-short Term Memory (LSTM) and Recurrent Neural Network (RNN). The accuracy of the transformer model outperformed other models by reducing the root mean square error to 0.009, compared to LSTM with 0.011 and RNN with 0.051.
    Dimensionality Expansion of Load Monitoring Time Series and Transfer Learning for EMS. (arXiv:2204.02802v3 [cs.LG] UPDATED)
    Energy management systems (EMS) rely on (non)-intrusive load monitoring (N)ILM to monitor and manage appliances and help residents be more energy efficient and thus more frugal. The robustness as well as the transfer potential of the most promising machine learning solutions for (N)ILM is not yet fully understood as they are trained and evaluated on relatively limited data. In this paper, we propose a new approach for load monitoring in building EMS based on dimensionality expansion of time series and transfer learning. We perform an extensive evaluation on 5 different low-frequency datasets. The proposed feature dimensionality expansion using video-like transformation and resource-aware deep learning architecture achieves an average weighted F1 score of 0.88 across the datasets with 29 appliances and is computationally more efficient compared to the state-of-the-art imaging methods. Investigating the proposed method for cross-dataset intra-domain transfer learning, we find that 1) our method performs with an average weighted F1 score of 0.80 while requiring 3-times fewer epochs for model training compared to the non-transfer approach, 2) can achieve an F1 score of 0.75 with only 230 data samples, and 3) our transfer approach outperforms the state-of-the-art in precision drop by up to 12 percentage points for unseen appliances.
    What are the Machine Learning best practices reported by practitioners on Stack Exchange?. (arXiv:2301.10516v1 [cs.SE])
    Machine Learning (ML) is being used in multiple disciplines due to its powerful capability to infer relationships within data. In particular, Software Engineering (SE) is one of those disciplines in which ML has been used for multiple tasks, like software categorization, bugs prediction, and testing. In addition to the multiple ML applications, some studies have been conducted to detect and understand possible pitfalls and issues when using ML. However, to the best of our knowledge, only a few studies have focused on presenting ML best practices or guidelines for the application of ML in different domains. In addition, the practices and literature presented in previous literature (i) are domain-specific (e.g., concrete practices in biomechanics), (ii) describe few practices, or (iii) the practices lack rigorous validation and are presented in gray literature. In this paper, we present a study listing 127 ML best practices systematically mining 242 posts of 14 different Stack Exchange (STE) websites and validated by four independent ML experts. The list of practices is presented in a set of categories related to different stages of the implementation process of an ML-enabled system; for each practice, we include explanations and examples. In all the practices, the provided examples focus on SE tasks. We expect this list of practices could help practitioners to understand better the practices and use ML in a more informed way, in particular newcomers to this new area that sits at the intersection of software engineering and machine learning.
    Infinitesimal gradient boosting. (arXiv:2104.13208v2 [stat.ML] UPDATED)
    We define infinitesimal gradient boosting as a limit of the popular tree-based gradient boosting algorithm from machine learning. The limit is considered in the vanishing-learning-rate asymptotic, that is when the learning rate tends to zero and the number of gradient trees is rescaled accordingly. For this purpose, we introduce a new class of randomized regression trees bridging totally randomized trees and Extra Trees and using a softmax distribution for binary splitting. Our main result is the convergence of the associated stochastic algorithm and the characterization of the limiting procedure as the unique solution of a nonlinear ordinary differential equation in a infinite dimensional function space. Infinitesimal gradient boosting defines a smooth path in the space of continuous functions along which the training error decreases, the residuals remain centered and the total variation is well controlled.
    Backward Compatibility During Data Updates by Weight Interpolation. (arXiv:2301.10546v1 [cs.LG])
    Backward compatibility of model predictions is a desired property when updating a machine learning driven application. It allows to seamlessly improve the underlying model without introducing regression bugs. In classification tasks these bugs occur in the form of negative flips. This means an instance that was correctly classified by the old model is now classified incorrectly by the updated model. This has direct negative impact on the user experience of such systems e.g. a frequently used voice assistant query is suddenly misclassified. A common reason to update the model is when new training data becomes available and needs to be incorporated. Simply retraining the model with the updated data introduces the unwanted negative flips. We study the problem of regression during data updates and propose Backward Compatible Weight Interpolation (BCWI). This method interpolates between the weights of the old and new model and we show in extensive experiments that it reduces negative flips without sacrificing the improved accuracy of the new model. BCWI is straight forward to implement and does not increase inference cost. We also explore the use of importance weighting during interpolation and averaging the weights of multiple new models in order to further reduce negative flips.
    Meta-Learning PAC-Bayes Priors in Model Averaging. (arXiv:1912.11252v3 [cs.LG] UPDATED)
    Nowadays model uncertainty has become one of the most important problems in both academia and industry. In this paper, we mainly consider the scenario in which we have a common model set used for model averaging instead of selecting a single final model via a model selection procedure to account for this model's uncertainty to improve the reliability and accuracy of inferences. Here one main challenge is to learn the prior over the model set. To tackle this problem, we propose two data-based algorithms to get proper priors for model averaging. One is for meta-learner, the analysts should use historical similar tasks to extract the information about the prior. The other one is for base-learner, a subsampling method is used to deal with the data step by step. Theoretically, an upper bound of risk for our algorithm is presented to guarantee the performance of the worst situation. In practice, both methods perform well in simulations and real data studies, especially with poor-quality data.
    Truthful Self-Play. (arXiv:2106.03007v4 [stat.ML] UPDATED)
    We present a general optimization framework for emergent belief-state representation without any supervision. We employed the common configuration of multiagent reinforcement learning and communication to improve exploration coverage over an environment by leveraging the knowledge of each agent. In this paper, we obtained that recurrent neural nets (RNNs) with shared weights are highly biased in partially observable environments because of their noncooperativity. To address this, we designated an unbiased version of self-play via mechanism design, also known as reverse game theory, to clarify unbiased knowledge at the Bayesian Nash equilibrium. The key idea is to add imaginary rewards using the peer prediction mechanism, i.e., a mechanism for mutually criticizing information in a decentralized environment. Numerical analyses, including StarCraft exploration tasks with up to 20 agents and off-the-shelf RNNs, demonstrate the state-of-the-art performance.
    Automated multilingual detection of Pro-Kremlin propaganda in newspapers and Telegram posts. (arXiv:2301.10604v1 [cs.CL])
    The full-scale conflict between the Russian Federation and Ukraine generated an unprecedented amount of news articles and social media data reflecting opposing ideologies and narratives. These polarized campaigns have led to mutual accusations of misinformation and fake news, shaping an atmosphere of confusion and mistrust for readers worldwide. This study analyses how the media affected and mirrored public opinion during the first month of the war using news articles and Telegram news channels in Ukrainian, Russian, Romanian and English. We propose and compare two methods of multilingual automated pro-Kremlin propaganda identification, based on Transformers and linguistic features. We analyse the advantages and disadvantages of both methods, their adaptability to new genres and languages, and ethical considerations of their usage for content moderation. With this work, we aim to lay the foundation for further development of moderation tools tailored to the current conflict.  ( 2 min )
    E(n)-equivariant Graph Neural Cellular Automata. (arXiv:2301.10497v1 [cs.LG])
    Cellular automata (CAs) are computational models exhibiting rich dynamics emerging from the local interaction of cells arranged in a regular lattice. Graph CAs (GCAs) generalise standard CAs by allowing for arbitrary graphs rather than regular lattices, similar to how Graph Neural Networks (GNNs) generalise Convolutional NNs. Recently, Graph Neural CAs (GNCAs) have been proposed as models built on top of standard GNNs that can be trained to approximate the transition rule of any arbitrary GCA. Existing GNCAs are anisotropic in the sense that their transition rules are not equivariant to translation, rotation, and reflection of the nodes' spatial locations. However, it is desirable for instances related by such transformations to be treated identically by the model. By replacing standard graph convolutions with E(n)-equivariant ones, we avoid anisotropy by design and propose a class of isotropic automata that we call E(n)-GNCAs. These models are lightweight, but can nevertheless handle large graphs, capture complex dynamics and exhibit emergent self-organising behaviours. We showcase the broad and successful applicability of E(n)-GNCAs on three different tasks: (i) pattern formation, (ii) graph auto-encoding, and (iii) simulation of E(n)-equivariant dynamical systems.  ( 2 min )
    When to Trust Aggregated Gradients: Addressing Negative Client Sampling in Federated Learning. (arXiv:2301.10400v1 [cs.LG])
    Federated Learning has become a widely-used framework which allows learning a global model on decentralized local datasets under the condition of protecting local data privacy. However, federated learning faces severe optimization difficulty when training samples are not independently and identically distributed (non-i.i.d.). In this paper, we point out that the client sampling practice plays a decisive role in the aforementioned optimization difficulty. We find that the negative client sampling will cause the merged data distribution of currently sampled clients heavily inconsistent with that of all available clients, and further make the aggregated gradient unreliable. To address this issue, we propose a novel learning rate adaptation mechanism to adaptively adjust the server learning rate for the aggregated gradient in each round, according to the consistency between the merged data distribution of currently sampled clients and that of all available clients. Specifically, we make theoretical deductions to find a meaningful and robust indicator that is positively related to the optimal server learning rate and can effectively reflect the merged data distribution of sampled clients, and we utilize it for the server learning rate adaptation. Extensive experiments on multiple image and text classification tasks validate the great effectiveness of our method.  ( 2 min )
    Channel-wise Mixed-precision Assignment for DNN Inference on Constrained Edge Nodes. (arXiv:2206.08852v2 [cs.LG] UPDATED)
    Quantization is widely employed in both cloud and edge systems to reduce the memory occupation, latency, and energy consumption of deep neural networks. In particular, mixed-precision quantization, i.e., the use of different bit-widths for different portions of the network, has been shown to provide excellent efficiency gains with limited accuracy drops, especially with optimized bit-width assignments determined by automated Neural Architecture Search (NAS) tools. State-of-the-art mixed-precision works layer-wise, i.e., it uses different bit-widths for the weights and activations tensors of each network layer. In this work, we widen the search space, proposing a novel NAS that selects the bit-width of each weight tensor channel independently. This gives the tool the additional flexibility of assigning a higher precision only to the weights associated with the most informative features. Testing on the MLPerf Tiny benchmark suite, we obtain a rich collection of Pareto-optimal models in the accuracy vs model size and accuracy vs energy spaces. When deployed on the MPIC RISC-V edge processor, our networks reduce the memory and energy for inference by up to 63% and 27% respectively compared to a layer-wise approach, for the same accuracy.
    Improved Stock Price Movement Classification Using News Articles Based on Embeddings and Label Smoothing. (arXiv:2301.10458v1 [cs.LG])
    Stock price movement prediction is a challenging and essential problem in finance. While it is well established in modern behavioral finance that the share prices of related stocks often move after the release of news via reactions and overreactions of investors, how to capture the relationships between price movements and news articles via quantitative models is an active area research; existing models have achieved success with variable degrees. In this paper, we propose to improve stock price movement classification using news articles by incorporating regularization and optimization techniques from deep learning. More specifically, we capture the dependencies between news articles and stocks through embeddings and bidirectional recurrent neural networks as in recent models. We further incorporate weight decay, batch normalization, dropout, and label smoothing to improve the generalization of the trained models. To handle high fluctuations of validation accuracy of batch normalization, we propose dual-phase training to realize the improvements reliably. Our experimental results on a commonly used dataset show significant improvements, achieving average accuracy of 80.7% on the test set, which is more than 10.0% absolute improvement over existing models. Our ablation studies show batch normalization and label smoothing are most effective, leading to 6.0% and 3.4% absolute improvement, respectively on average.  ( 2 min )
    SGCN: Exploiting Compressed-Sparse Features in Deep Graph Convolutional Network Accelerators. (arXiv:2301.10388v1 [cs.LG])
    Graph convolutional networks (GCNs) are becoming increasingly popular as they overcome the limited applicability of prior neural networks. A GCN takes as input an arbitrarily structured graph and executes a series of layers which exploit the graph's structure to calculate their output features. One recent trend in GCNs is the use of deep network architectures. As opposed to the traditional GCNs which only span around two to five layers deep, modern GCNs now incorporate tens to hundreds of layers with the help of residual connections. From such deep GCNs, we find an important characteristic that they exhibit very high intermediate feature sparsity. We observe that with deep layers and residual connections, the number of zeros in the intermediate features sharply increases. This reveals a new opportunity for accelerators to exploit in GCN executions that was previously not present. In this paper, we propose SGCN, a fast and energy-efficient GCN accelerator which fully exploits the sparse intermediate features of modern GCNs. SGCN suggests several techniques to achieve significantly higher performance and energy efficiency than the existing accelerators. First, SGCN employs a GCN-friendly feature compression format. We focus on reducing the off-chip memory traffic, which often is the bottleneck for GCN executions. Second, we propose microarchitectures for seamlessly handling the compressed feature format. Third, to better handle locality in the existence of the varying sparsity, SGCN employs sparsity-aware cooperation. Sparsity-aware cooperation creates a pattern that exhibits multiple reuse windows, such that the cache can capture diverse sizes of working sets and therefore adapt to the varying level of sparsity. We show that SGCN achieves 1.71x speedup and 43.9% higher energy efficiency compared to the existing accelerators.  ( 2 min )
    Overcoming Prior Misspecification in Online Learning to Rank. (arXiv:2301.10651v1 [cs.LG])
    The recent literature on online learning to rank (LTR) has established the utility of prior knowledge to Bayesian ranking bandit algorithms. However, a major limitation of existing work is the requirement for the prior used by the algorithm to match the true prior. In this paper, we propose and analyze adaptive algorithms that address this issue and additionally extend these results to the linear and generalized linear models. We also consider scalar relevance feedback on top of click feedback. Moreover, we demonstrate the efficacy of our algorithms using both synthetic and real-world experiments.  ( 2 min )
    Banker Online Mirror Descent: A Universal Approach for Delayed Online Bandit Learning. (arXiv:2301.10500v1 [cs.LG])
    We propose `Banker-OMD`, a novel framework generalizing the classical Online Mirror Descent (OMD) technique in the online learning literature. The `Banker-OMD` framework almost completely decouples feedback delay handling and the task-specific OMD algorithm design, thus allowing the easy design of new algorithms capable of easily and robustly handling feedback delays. Specifically, it offers a general methodology for achieving $\tilde{\mathcal O}(\sqrt{T} + \sqrt{D})$-style regret bounds in online bandit learning tasks with delayed feedback, where $T$ is the number of rounds and $D$ is the total feedback delay. We demonstrate the power of \texttt{Banker-OMD} by applications to two important bandit learning scenarios with delayed feedback, including delayed scale-free adversarial Multi-Armed Bandits (MAB) and delayed adversarial linear bandits. `Banker-OMD` leads to the first delayed scale-free adversarial MAB algorithm achieving $\tilde{\mathcal O}(\sqrt{K(D+T)}L)$ regret and the first delayed adversarial linear bandit algorithm achieving $\tilde{\mathcal O}(\text{poly}(n)(\sqrt{T} + \sqrt{D}))$ regret. As a corollary, the first application also implies $\tilde{\mathcal O}(\sqrt{KT}L)$ regret for non-delayed scale-free adversarial MABs, which is the first to match the $\Omega(\sqrt{KT}L)$ lower bound up to logarithmic factors and can be of independent interest.  ( 2 min )
    Integrating Local Real Data with Global Gradient Prototypes for Classifier Re-Balancing in Federated Long-Tailed Learning. (arXiv:2301.10394v1 [cs.LG])
    Federated Learning (FL) has become a popular distributed learning paradigm that involves multiple clients training a global model collaboratively in a data privacy-preserving manner. However, the data samples usually follow a long-tailed distribution in the real world, and FL on the decentralized and long-tailed data yields a poorly-behaved global model severely biased to the head classes with the majority of the training samples. To alleviate this issue, decoupled training has recently been introduced to FL, considering it has achieved promising results in centralized long-tailed learning by re-balancing the biased classifier after the instance-balanced training. However, the current study restricts the capacity of decoupled training in federated long-tailed learning with a sub-optimal classifier re-trained on a set of pseudo features, due to the unavailability of a global balanced dataset in FL. In this work, in order to re-balance the classifier more effectively, we integrate the local real data with the global gradient prototypes to form the local balanced datasets, and thus re-balance the classifier during the local training. Furthermore, we introduce an extra classifier in the training phase to help model the global data distribution, which addresses the problem of contradictory optimization goals caused by performing classifier re-balancing locally. Extensive experiments show that our method consistently outperforms the existing state-of-the-art methods in various settings.  ( 2 min )
    ViDeBERTa: A powerful pre-trained language model for Vietnamese. (arXiv:2301.10439v1 [cs.CL])
    This paper presents ViDeBERTa, a new pre-trained monolingual language model for Vietnamese, with three versions - ViDeBERTa_xsmall, ViDeBERTa_base, and ViDeBERTa_large, which are pre-trained on a large-scale corpus of high-quality and diverse Vietnamese texts using DeBERTa architecture. Although many successful pre-trained language models based on Transformer have been widely proposed for the English language, there are still few pre-trained models for Vietnamese, a low-resource language, that perform good results on downstream tasks, especially Question answering. We fine-tune and evaluate our model on three important natural language downstream tasks, Part-of-speech tagging, Named-entity recognition, and Question answering. The empirical results demonstrate that ViDeBERTa with far fewer parameters surpasses the previous state-of-the-art models on multiple Vietnamese-specific natural language understanding tasks. Notably, ViDeBERTa_base with 86M parameters, which is only about 23% of PhoBERT_large with 370M parameters, still performs the same or better results than the previous state-of-the-art model. Our ViDeBERTa models are available at: https://github.com/HySonLab/ViDeBERTa.  ( 2 min )
    Capacity Analysis of Vector Symbolic Architectures. (arXiv:2301.10352v1 [cs.LG])
    Hyperdimensional computing (HDC) is a biologically-inspired framework that uses high-dimensional vectors and various vector operations to represent and manipulate symbols. The ensemble of a particular vector space and two vector operations (one addition-like for "bundling" and one outer-product-like for "binding") form what is called a "vector symbolic architecture" (VSA). While VSAs have been employed in numerous applications and studied empirically, many theoretical questions about VSAs remain open. We provide theoretical analyses for the *representation capacities* of three popular VSAs: MAP-I, MAP-B, and Binary Sparse. Representation capacity here refers to upper bounds on the dimensions of the VSA vectors required to perform certain symbolic tasks (such as testing for set membership $i \in S$ and estimating set intersection sizes $|S \cap T|$) to a given degree of accuracy. We also describe a relationship between the MAP-I VSA to Hopfield networks, which are simple models of associative memory, and analyze the ability of Hopfield networks to perform some of the same tasks that are typically asked of VSAs. Our analysis of MAP-I casts the VSA vectors as the outputs of *sketching* (dimensionality reduction) algorithms such as the Johnson-Lindenstrauss transform; this provides a clean, simple framework for obtaining bounds on MAP-I's representation capacity. We also provide, to our knowledge, the first analysis of testing set membership in a bundle of general pairwise bindings from MAP-I. Binary sparse VSAs are well-known to be related to Bloom filters; we give analyses of set intersection for Bloom and Counting Bloom filters. Our analysis of MAP-B and Binary Sparse bundling include new applications of several concentration inequalities.  ( 2 min )
    Imitating Human Behaviour with Diffusion Models. (arXiv:2301.10677v1 [cs.AI])
    Diffusion models have emerged as powerful generative models in the text-to-image domain. This paper studies their application as observation-to-action models for imitating human behaviour in sequential environments. Human behaviour is stochastic and multimodal, with structured correlations between action dimensions. Meanwhile, standard modelling choices in behaviour cloning are limited in their expressiveness and may introduce bias into the cloned policy. We begin by pointing out the limitations of these choices. We then propose that diffusion models are an excellent fit for imitating human behaviour, since they learn an expressive distribution over the joint action space. We introduce several innovations to make diffusion models suitable for sequential environments; designing suitable architectures, investigating the role of guidance, and developing reliable sampling strategies. Experimentally, diffusion models closely match human demonstrations in a simulated robotic control task and a modern 3D gaming environment.  ( 2 min )
    HealthEdge: A Machine Learning-Based Smart Healthcare Framework for Prediction of Type 2 Diabetes in an Integrated IoT, Edge, and Cloud Computing System. (arXiv:2301.10450v1 [cs.LG])
    Diabetes Mellitus has no permanent cure to date and is one of the leading causes of death globally. The alarming increase in diabetes calls for the need to take precautionary measures to avoid/predict the occurrence of diabetes. This paper proposes HealthEdge, a machine learning-based smart healthcare framework for type 2 diabetes prediction in an integrated IoT-edge-cloud computing system. Numerical experiments and comparative analysis were carried out between the two most used machine learning algorithms in the literature, Random Forest (RF) and Logistic Regression (LR), using two real-life diabetes datasets. The results show that RF predicts diabetes with 6% more accuracy on average compared to LR.  ( 2 min )
    AutoCost: Evolving Intrinsic Cost for Zero-violation Reinforcement Learning. (arXiv:2301.10339v1 [cs.LG])
    Safety is a critical hurdle that limits the application of deep reinforcement learning (RL) to real-world control tasks. To this end, constrained reinforcement learning leverages cost functions to improve safety in constrained Markov decision processes. However, such constrained RL methods fail to achieve zero violation even when the cost limit is zero. This paper analyzes the reason for such failure, which suggests that a proper cost function plays an important role in constrained RL. Inspired by the analysis, we propose AutoCost, a simple yet effective framework that automatically searches for cost functions that help constrained RL to achieve zero-violation performance. We validate the proposed method and the searched cost function on the safe RL benchmark Safety Gym. We compare the performance of augmented agents that use our cost function to provide additive intrinsic costs with baseline agents that use the same policy learners but with only extrinsic costs. Results show that the converged policies with intrinsic costs in all environments achieve zero constraint violation and comparable performance with baselines.  ( 2 min )
    A Data-Centric Approach for Improving Adversarial Training Through the Lens of Out-of-Distribution Detection. (arXiv:2301.10454v1 [cs.LG])
    Current machine learning models achieve super-human performance in many real-world applications. Still, they are susceptible against imperceptible adversarial perturbations. The most effective solution for this problem is adversarial training that trains the model with adversarially perturbed samples instead of original ones. Various methods have been developed over recent years to improve adversarial training such as data augmentation or modifying training attacks. In this work, we examine the same problem from a new data-centric perspective. For this purpose, we first demonstrate that the existing model-based methods can be equivalent to applying smaller perturbation or optimization weights to the hard training examples. By using this finding, we propose detecting and removing these hard samples directly from the training procedure rather than applying complicated algorithms to mitigate their effects. For detection, we use maximum softmax probability as an effective method in out-of-distribution detection since we can consider the hard samples as the out-of-distribution samples for the whole data distribution. Our results on SVHN and CIFAR-10 datasets show the effectiveness of this method in improving the adversarial training without adding too much computational cost.  ( 2 min )
    Near-Optimal No-Regret Learning in General Games. (arXiv:2108.06924v2 [cs.LG] UPDATED)
    We show that Optimistic Hedge -- a common variant of multiplicative-weights-updates with recency bias -- attains ${\rm poly}(\log T)$ regret in multi-player general-sum games. In particular, when every player of the game uses Optimistic Hedge to iteratively update her strategy in response to the history of play so far, then after $T$ rounds of interaction, each player experiences total regret that is ${\rm poly}(\log T)$. Our bound improves, exponentially, the $O({T}^{1/2})$ regret attainable by standard no-regret learners in games, the $O(T^{1/4})$ regret attainable by no-regret learners with recency bias (Syrgkanis et al., 2015), and the ${O}(T^{1/6})$ bound that was recently shown for Optimistic Hedge in the special case of two-player games (Chen & Pen, 2020). A corollary of our bound is that Optimistic Hedge converges to coarse correlated equilibrium in general games at a rate of $\tilde{O}\left(\frac 1T\right)$.
    A Provable Splitting Approach for Symmetric Nonnegative Matrix Factorization. (arXiv:2301.10499v1 [cs.LG])
    The symmetric Nonnegative Matrix Factorization (NMF), a special but important class of the general NMF, has found numerous applications in data analysis such as various clustering tasks. Unfortunately, designing fast algorithms for the symmetric NMF is not as easy as for its nonsymmetric counterpart, since the latter admits the splitting property that allows state-of-the-art alternating-type algorithms. To overcome this issue, we first split the decision variable and transform the symmetric NMF to a penalized nonsymmetric one, paving the way for designing efficient alternating-type algorithms. We then show that solving the penalized nonsymmetric reformulation returns a solution to the original symmetric NMF. Moreover, we design a family of alternating-type algorithms and show that they all admit strong convergence guarantee: the generated sequence of iterates is convergent and converges at least sublinearly to a critical point of the original symmetric NMF. Finally, we conduct experiments on both synthetic data and real image clustering to support our theoretical results and demonstrate the performance of the alternating-type algorithms.  ( 2 min )
    RDIS: Random Drop Imputation with Self-Training for Incomplete Time Series Data. (arXiv:2010.10075v2 [cs.LG] UPDATED)
    Time-series data with missing values are commonly encountered in many fields, such as healthcare, meteorology, and robotics. The imputation aims to fill the missing values with valid values. Most imputation methods trained the models implicitly because missing values have no ground truth. In this paper, we propose Random Drop Imputation with Self-training (RDIS), a novel training method for time-series data imputation models. In RDIS, we generate extra missing values by applying a random drop on the observed values in incomplete data. We can explicitly train the imputation models by filling in the randomly dropped values. In addition, we adopt self-training with pseudo values to exploit the original missing values. To improve the quality of pseudo values, we set the threshold and filter them by calculating the entropy. To verify the effectiveness of RDIS on the time series imputation, we test RDIS to various imputation models and achieve competitive results on two real-world datasets.
    ARDIAS: AI-Enhanced Research Management, Discovery, and Advisory System. (arXiv:2301.10577v1 [cs.CL])
    In this work, we present ARDIAS, a web-based application that aims to provide researchers with a full suite of discovery and collaboration tools. ARDIAS currently allows searching for authors and articles by name and gaining insights into the research topics of a particular researcher. With the aid of AI-based tools, ARDIAS aims to recommend potential collaborators and topics to researchers. In the near future, we aim to add tools that allow researchers to communicate with each other and start new projects.
    Multimodal Analogical Reasoning over Knowledge Graphs. (arXiv:2210.00312v3 [cs.CL] UPDATED)
    Analogical reasoning is fundamental to human cognition and holds an important place in various fields. However, previous studies mainly focus on single-modal analogical reasoning and ignore taking advantage of structure knowledge. Notably, the research in cognitive psychology has demonstrated that information from multimodal sources always brings more powerful cognitive transfer than single modality sources. To this end, we introduce the new task of multimodal analogical reasoning over knowledge graphs, which requires multimodal reasoning ability with the help of background knowledge. Specifically, we construct a Multimodal Analogical Reasoning dataSet (MARS) and a multimodal knowledge graph MarKG. We evaluate with multimodal knowledge graph embedding and pre-trained Transformer baselines, illustrating the potential challenges of the proposed task. We further propose a novel model-agnostic Multimodal analogical reasoning framework with Transformer (MarT) motivated by the structure mapping theory, which can obtain better performance. Code and datasets are available in https://github.com/zjunlp/MKG_Analogy.
    OCD: Learning to Overfit with Conditional Diffusion Models. (arXiv:2210.00471v4 [cs.LG] UPDATED)
    We present a dynamic model in which the weights are conditioned on an input sample x and are learned to match those that would be obtained by finetuning a base model on x and its label y. This mapping between an input sample and network weights is approximated by a denoising diffusion model. The diffusion model we employ focuses on modifying a single layer of the base model and is conditioned on the input, activations, and output of this layer. Since the diffusion model is stochastic in nature, multiple initializations generate different networks, forming an ensemble, which leads to further improvements. Our experiments demonstrate the wide applicability of the method for image classification, 3D reconstruction, tabular data, speech separation, and natural language processing. Our code is available at https://github.com/ShaharLutatiPersonal/OCD
    Batch Bayesian Optimization on Permutations using the Acquisition Weighted Kernel. (arXiv:2102.13382v2 [stat.ML] UPDATED)
    In this work we propose a batch Bayesian optimization method for combinatorial problems on permutations, which is well suited for expensive-to-evaluate objectives. We first introduce LAW, an efficient batch acquisition method based on determinantal point processes using the acquisition weighted kernel. Relying on multiple parallel evaluations, LAW enables accelerated search on combinatorial spaces. We then apply the framework to permutation problems, which have so far received little attention in the Bayesian Optimization literature, despite their practical importance. We call this method LAW2ORDER. On the theoretical front, we prove that LAW2ORDER has vanishing simple regret by showing that the batch cumulative regret is sublinear. Empirically, we assess the method on several standard combinatorial problems involving permutations such as quadratic assignment, flowshop scheduling and the traveling salesman, as well as on a structure learning task.
    A blob method for inhomogeneous diffusion with applications to multi-agent control and sampling. (arXiv:2202.12927v3 [math.AP] UPDATED)
    As a counterpoint to classical stochastic particle methods for linear diffusion equations, we develop a deterministic particle method for the weighted porous medium equation (WPME) and prove its convergence on bounded time intervals. This generalizes related work on blob methods for unweighted porous medium equations. From a numerical analysis perspective, our method has several advantages: it is meshfree, preserves the gradient flow structure of the underlying PDE, converges in arbitrary dimension, and captures the correct asymptotic behavior in simulations. That our method succeeds in capturing the long time behavior of WPME is significant from the perspective of related problems in quantization. Just as the Fokker-Planck equation provides a way to quantize a probability measure $\bar{\rho}$ by evolving an empirical measure according to stochastic Langevin dynamics so that the empirical measure flows toward $\bar{\rho}$, our particle method provides a way to quantize $\bar{\rho}$ according to deterministic particle dynamics approximating WMPE. In this way, our method has natural applications to multi-agent coverage algorithms and sampling probability measures. A specific case of our method corresponds exactly to confined mean-field dynamics of training a two-layer neural network for a radial basis function activation function. From this perspective, our convergence result shows that, in the overparametrized regime and as the variance of the radial basis functions goes to zero, the continuum limit is given by WPME. This generalizes previous results, which considered the case of a uniform data distribution, to the more general inhomogeneous setting. As a consequence of our convergence result, we identify conditions on the target function and data distribution for which convexity of the energy landscape emerges in the continuum limit.
    Learning to Rank Normalized Entropy Curves with Differentiable Window Transformation. (arXiv:2301.10443v1 [cs.LG])
    Recent automated machine learning systems often use learning curves ranking models to inform decisions about when to stop unpromising trials and identify better model configurations. In this paper, we present a novel learning curve ranking model specifically tailored for ranking normalized entropy (NE) learning curves, which are commonly used in online advertising and recommendation systems. Our proposed model, self-Adaptive Curve Transformation augmented Relative curve Ranking (ACTR2), features an adaptive curve transformation layer that transforms raw lifetime NE curves into composite window NE curves with the window sizes adaptively optimized based on both the position on the learning curve and the curve's dynamics. We also introduce a novel differentiable indexing method for the proposed adaptive curve transformation, which allows gradients with respect to the discrete indices to flow freely through the curve transformation layer, enabling the learned window sizes to be updated flexibly during training. Additionally, we propose a pairwise curve ranking architecture that directly models the difference between the two learning curves and is better at capturing subtle changes in relative performance that may not be evident when modeling each curve individually as the existing approaches did. Our extensive experiments on a real-world NE curve dataset demonstrate the effectiveness of each key component of ACTR2 and its improved performance over the state-of-the-art.
    MLPGradientFlow: going with the flow of multilayer perceptrons (and finding minima fast and accurately). (arXiv:2301.10638v1 [cs.LG])
    MLPGradientFlow is a software package to solve numerically the gradient flow differential equation $\dot \theta = -\nabla \mathcal L(\theta; \mathcal D)$, where $\theta$ are the parameters of a multi-layer perceptron, $\mathcal D$ is some data set, and $\nabla \mathcal L$ is the gradient of a loss function. We show numerically that adaptive first- or higher-order integration methods based on Runge-Kutta schemes have better accuracy and convergence speed than gradient descent with the Adam optimizer. However, we find Newton's method and approximations like BFGS preferable to find fixed points (local and global minima of $\mathcal L$) efficiently and accurately. For small networks and data sets, gradients are usually computed faster than in pytorch and Hessian are computed at least $5\times$ faster. Additionally, the package features an integrator for a teacher-student setup with bias-free, two-layer networks trained with standard Gaussian input in the limit of infinite data. The code is accessible at https://github.com/jbrea/MLPGradientFlow.jl.  ( 2 min )
    Understanding and Improving Deep Graph Neural Networks: A Probabilistic Graphical Model Perspective. (arXiv:2301.10536v1 [cs.LG])
    Recently, graph-based models designed for downstream tasks have significantly advanced research on graph neural networks (GNNs). GNN baselines based on neural message-passing mechanisms such as GCN and GAT perform worse as the network deepens. Therefore, numerous GNN variants have been proposed to tackle this performance degradation problem, including many deep GNNs. However, a unified framework is still lacking to connect these existing models and interpret their effectiveness at a high level. In this work, we focus on deep GNNs and propose a novel view for understanding them. We establish a theoretical framework via inference on a probabilistic graphical model. Given the fixed point equation (FPE) derived from the variational inference on the Markov random fields, the deep GNNs, including JKNet, GCNII, DGCN, and the classical GNNs, such as GCN, GAT, and APPNP, can be regarded as different approximations of the FPE. Moreover, given this framework, more accurate approximations of FPE are brought, guiding us to design a more powerful GNN: coupling graph neural network (CoGNet). Extensive experiments are carried out on citation networks and natural language processing downstream tasks. The results demonstrate that the CoGNet outperforms the SOTA models.  ( 2 min )
    DEJA VU: Continual Model Generalization For Unseen Domains. (arXiv:2301.10418v1 [cs.LG])
    In real-world applications, deep learning models often run in non-stationary environments where the target data distribution continually shifts over time. There have been numerous domain adaptation (DA) methods in both online and offline modes to improve cross-domain adaptation ability. However, these DA methods typically only provide good performance after a long period of adaptation, and perform poorly on new domains before and during adaptation - in what we call the "Unfamiliar Period", especially when domain shifts happen suddenly and significantly. On the other hand, domain generalization (DG) methods have been proposed to improve the model generalization ability on unadapted domains. However, existing DG works are ineffective for continually changing domains due to severe catastrophic forgetting of learned knowledge. To overcome these limitations of DA and DG in handling the Unfamiliar Period during continual domain shift, we propose RaTP, a framework that focuses on improving models' target domain generalization (TDG) capability, while also achieving effective target domain adaptation (TDA) capability right after training on certain domains and forgetting alleviation (FA) capability on past domains. RaTP includes a training-free data augmentation module to prepare data for TDG, a novel pseudo-labeling mechanism to provide reliable supervision for TDA, and a prototype contrastive alignment algorithm to align different domains for achieving TDG, TDA and FA. Extensive experiments on Digits, PACS, and DomainNet demonstrate that RaTP significantly outperforms state-of-the-art works from Continual DA, Source-Free DA, Test-Time/Online DA, Single DG, Multiple DG and Unified DA&DG in TDG, and achieves comparable TDA and FA capabilities.  ( 2 min )
    Off-Policy Evaluation for Action-Dependent Non-Stationary Environments. (arXiv:2301.10330v1 [cs.LG])
    Methods for sequential decision-making are often built upon a foundational assumption that the underlying decision process is stationary. This limits the application of such methods because real-world problems are often subject to changes due to external factors (passive non-stationarity), changes induced by interactions with the system itself (active non-stationarity), or both (hybrid non-stationarity). In this work, we take the first steps towards the fundamental challenge of on-policy and off-policy evaluation amidst structured changes due to active, passive, or hybrid non-stationarity. Towards this goal, we make a higher-order stationarity assumption such that non-stationarity results in changes over time, but the way changes happen is fixed. We propose, OPEN, an algorithm that uses a double application of counterfactual reasoning and a novel importance-weighted instrument-variable regression to obtain both a lower bias and a lower variance estimate of the structure in the changes of a policy's past performances. Finally, we show promising results on how OPEN can be used to predict future performances for several domains inspired by real-world applications that exhibit non-stationarity.  ( 2 min )
    Exact Fractional Inference via Re-Parametrization & Interpolation between Tree-Re-Weighted- and Belief Propagation- Algorithms. (arXiv:2301.10369v1 [cs.LG])
    Inference efforts -- required to compute partition function, $Z$, of an Ising model over a graph of $N$ ``spins" -- are most likely exponential in $N$. Efficient variational methods, such as Belief Propagation (BP) and Tree Re-Weighted (TRW) algorithms, compute $Z$ approximately minimizing respective (BP- or TRW-) free energy. We generalize the variational scheme building a $\lambda$-fractional-homotopy, $Z^{(\lambda)}$, where $\lambda=0$ and $\lambda=1$ correspond to TRW- and BP-approximations, respectively, and $Z^{(\lambda)}$ decreases with $\lambda$ monotonically. Moreover, this fractional scheme guarantees that in the attractive (ferromagnetic) case $Z^{(TRW)}\geq Z^{(\lambda)}\geq Z^{(BP)}$, and there exists a unique (``exact") $\lambda_*$ such that, $Z=Z^{(\lambda_*)}$. Generalizing the re-parametrization approach of \cite{wainwright_tree-based_2002} and the loop series approach of \cite{chertkov_loop_2006}, we show how to express $Z$ as a product, $\forall \lambda:\ Z=Z^{(\lambda)}{\cal Z}^{(\lambda)}$, where the multiplicative correction, ${\cal Z}^{(\lambda)}$, is an expectation over a node-independent probability distribution built from node-wise fractional marginals. Our theoretical analysis is complemented by extensive experiments with models from Ising ensembles over planar and random graphs of medium- and large- sizes. The empirical study yields a number of interesting observations, such as (a) ability to estimate ${\cal Z}^{(\lambda)}$ with $O(N^4)$ fractional samples; (b) suppression of $\lambda_*$ fluctuations with increase in $N$ for instances from a particular random Ising ensemble.
    On Batching Variable Size Inputs for Training End-to-End Speech Enhancement Systems. (arXiv:2301.10587v1 [cs.SD])
    The performance of neural network-based speech enhancement systems is primarily influenced by the model architecture, whereas training times and computational resource utilization are primarily affected by training parameters such as the batch size. Since noisy and reverberant speech mixtures can have different duration, a batching strategy is required to handle variable size inputs during training, in particular for state-of-the-art end-to-end systems. Such strategies usually strive a compromise between zero-padding and data randomization, and can be combined with a dynamic batch size for a more consistent amount of data in each batch. However, the effect of these practices on resource utilization and more importantly network performance is not well documented. This paper is an empirical study of the effect of different batching strategies and batch sizes on the training statistics and speech enhancement performance of a Conv-TasNet, evaluated in both matched and mismatched conditions. We find that using a small batch size during training improves performance in both conditions for all batching strategies. Moreover, using sorted or bucket batching with a dynamic batch size allows for reduced training time and GPU memory usage while achieving similar performance compared to random batching with a fixed batch size.
    Audience-Centric Natural Language Generation via Style Infusion. (arXiv:2301.10283v1 [cs.CL])
    Adopting contextually appropriate, audience-tailored linguistic styles is critical to the success of user-centric language generation systems (e.g., chatbots, computer-aided writing, dialog systems). While existing approaches demonstrate textual style transfer with large volumes of parallel or non-parallel data, we argue that grounding style on audience-independent external factors is innately limiting for two reasons. First, it is difficult to collect large volumes of audience-specific stylistic data. Second, some stylistic objectives (e.g., persuasiveness, memorability, empathy) are hard to define without audience feedback. In this paper, we propose the novel task of style infusion - infusing the stylistic preferences of audiences in pretrained language generation models. Since humans are better at pairwise comparisons than direct scoring - i.e., is Sample-A more persuasive/polite/empathic than Sample-B - we leverage limited pairwise human judgments to bootstrap a style analysis model and augment our seed set of judgments. We then infuse the learned textual style in a GPT-2 based text generator while balancing fluency and style adoption. With quantitative and qualitative assessments, we show that our infusion approach can generate compelling stylized examples with generic text prompts. The code and data are accessible at https://github.com/CrowdDynamicsLab/StyleInfusion.  ( 2 min )
    NIAPU: network-informed adaptive positive-unlabeled learning for disease gene identification. (arXiv:2108.06158v4 [cs.LG] UPDATED)
    Gene-disease associations are fundamental for understanding disease etiology and developing effective interventions and treatments. Identifying genes not yet associated with a disease due to a lack of studies is a challenging task in which prioritization based on prior knowledge is an important element. The computational search for new candidate disease genes may be eased by positive-unlabeled learning, the machine learning setting in which only a subset of instances are labeled as positive while the rest of the data set is unlabeled. In this work, we propose a set of effective network-based features to be used in a novel Markov diffusion-based multi-class labeling strategy for putative disease gene discovery. The performances of the new labeling algorithm and the effectiveness of the proposed features have been tested on ten different disease data sets using three machine learning algorithms. The new features have been compared against classical topological and functional/ontological features and a set of network- and biological-derived features already used in gene discovery tasks. The predictive power of the integrated methodology in searching for new disease genes has been found to be competitive against state-of-the-art algorithms.
    Voint Cloud: Multi-View Point Cloud Representation for 3D Understanding. (arXiv:2111.15363v2 [cs.CV] UPDATED)
    Multi-view projection methods have demonstrated promising performance on 3D understanding tasks like 3D classification and segmentation. However, it remains unclear how to combine such multi-view methods with the widely available 3D point clouds. Previous methods use unlearned heuristics to combine features at the point level. To this end, we introduce the concept of the multi-view point cloud (Voint cloud), representing each 3D point as a set of features extracted from several view-points. This novel 3D Voint cloud representation combines the compactness of 3D point cloud representation with the natural view-awareness of multi-view representation. Naturally, we can equip this new representation with convolutional and pooling operations. We deploy a Voint neural network (VointNet) to learn representations in the Voint space. Our novel representation achieves \sota performance on 3D classification, shape retrieval, and robust 3D part segmentation on standard benchmarks ( ScanObjectNN, ShapeNet Core55, and ShapeNet Parts).
    Lightweight Neural Architecture Search for Temporal Convolutional Networks at the Edge. (arXiv:2301.10281v1 [cs.LG])
    Neural Architecture Search (NAS) is quickly becoming the go-to approach to optimize the structure of Deep Learning (DL) models for complex tasks such as Image Classification or Object Detection. However, many other relevant applications of DL, especially at the edge, are based on time-series processing and require models with unique features, for which NAS is less explored. This work focuses in particular on Temporal Convolutional Networks (TCNs), a convolutional model for time-series processing that has recently emerged as a promising alternative to more complex recurrent architectures. We propose the first NAS tool that explicitly targets the optimization of the most peculiar architectural parameters of TCNs, namely dilation, receptive-field and number of features in each layer. The proposed approach searches for networks that offer good trade-offs between accuracy and number of parameters/operations, enabling an efficient deployment on embedded platforms. We test the proposed NAS on four real-world, edge-relevant tasks, involving audio and bio-signals. Results show that, starting from a single seed network, our method is capable of obtaining a rich collection of Pareto optimal architectures, among which we obtain models with the same accuracy as the seed, and 15.9-152x fewer parameters. Compared to three state-of-the-art NAS tools, ProxylessNAS, MorphNet and FBNetV2, our method explores a larger search space for TCNs (up to 10^12x) and obtains superior solutions, while requiring low GPU memory and search time. We deploy our NAS outputs on two distinct edge devices, the multicore GreenWaves Technology GAP8 IoT processor and the single-core STMicroelectronics STM32H7 microcontroller. With respect to the state-of-the-art hand-tuned models, we reduce latency and energy of up to 5.5x and 3.8x on the two targets respectively, without any accuracy loss.  ( 3 min )
    Learning Dynamical Systems from Data: A Simple Cross-Validation Perspective, Part V: Sparse Kernel Flows for 132 Chaotic Dynamical Systems. (arXiv:2301.10321v1 [stat.ML])
    Regressing the vector field of a dynamical system from a finite number of observed states is a natural way to learn surrogate models for such systems. A simple and interpretable way to learn a dynamical system from data is to interpolate its vector-field with a data-adapted kernel which can be learned by using Kernel Flows. The method of Kernel Flows is a trainable machine learning method that learns the optimal parameters of a kernel based on the premise that a kernel is good if there is no significant loss in accuracy if half of the data is used. The objective function could be a short-term prediction or some other objective for other variants of Kernel Flows). However, this method is limited by the choice of the base kernel. In this paper, we introduce the method of \emph{Sparse Kernel Flows } in order to learn the ``best'' kernel by starting from a large dictionary of kernels. It is based on sparsifying a kernel that is a linear combination of elemental kernels. We apply this approach to a library of 132 chaotic systems.  ( 2 min )
    Pre-computed memory or on-the-fly encoding? A hybrid approach to retrieval augmentation makes the most of your compute. (arXiv:2301.10448v1 [cs.CL])
    Retrieval-augmented language models such as Fusion-in-Decoder are powerful, setting the state of the art on a variety of knowledge-intensive tasks. However, they are also expensive, due to the need to encode a large number of retrieved passages. Some work avoids this cost by pre-encoding a text corpus into a memory and retrieving dense representations directly. However, pre-encoding memory incurs a severe quality penalty as the memory representations are not conditioned on the current input. We propose LUMEN, a hybrid between these two extremes, pre-computing the majority of the retrieval representation and completing the encoding on the fly using a live encoder that is conditioned on the question and fine-tuned for the task. We show that LUMEN significantly outperforms pure memory on multiple question-answering tasks while being much cheaper than FiD, and outperforms both for any given compute budget. Moreover, the advantage of LUMEN over FiD increases with model size.  ( 2 min )
    One Model for All Domains: Collaborative Domain-Prefix Tuning for Cross-Domain NER. (arXiv:2301.10410v1 [cs.CL])
    Cross-domain NER is a challenging task to address the low-resource problem in practical scenarios. Previous typical solutions mainly obtain a NER model by pre-trained language models (PLMs) with data from a rich-resource domain and adapt it to the target domain. Owing to the mismatch issue among entity types in different domains, previous approaches normally tune all parameters of PLMs, ending up with an entirely new NER model for each domain. Moreover, current models only focus on leveraging knowledge in one general source domain while failing to successfully transfer knowledge from multiple sources to the target. To address these issues, we introduce Collaborative Domain-Prefix Tuning for cross-domain NER (CP-NER) based on text-to-text generative PLMs. Specifically, we present text-to-text generation grounding domain-related instructors to transfer knowledge to new domain NER tasks without structural modifications. We utilize frozen PLMs and conduct collaborative domain-prefix tuning to stimulate the potential of PLMs to handle NER tasks across various domains. Experimental results on the Cross-NER benchmark show that the proposed approach has flexible transfer ability and performs better on both one-source and multiple-source cross-domain NER tasks. Codes will be available in https://github.com/zjunlp/DeepKE/tree/main/example/ner/cross.  ( 2 min )
    Editing Language Model-based Knowledge Graph Embeddings. (arXiv:2301.10405v1 [cs.CL])
    Recently decades have witnessed the empirical success of framing Knowledge Graph (KG) embeddings via language models. However, language model-based KG embeddings are usually deployed as static artifacts, which are challenging to modify without re-training after deployment. To address this issue, we propose a new task of editing language model-based KG embeddings in this paper. The proposed task aims to enable data-efficient and fast updates to KG embeddings without damaging the performance of the rest. We build four new datasets: E-FB15k237, A-FB15k237, E-WN18RR, and A-WN18RR, and evaluate several knowledge editing baselines demonstrating the limited ability of previous models to handle the proposed challenging task. We further propose a simple yet strong baseline dubbed KGEditor, which utilizes additional parametric layers of the hyper network to edit/add facts. Comprehensive experimental results demonstrate that KGEditor can perform better when updating specific facts while not affecting the rest with low training resources. Code and datasets will be available in https://github.com/zjunlp/PromptKG/tree/main/deltaKG.  ( 2 min )
    Designing Data: Proactive Data Collection and Iteration for Machine Learning. (arXiv:2301.10319v1 [cs.HC])
    Lack of diversity in data collection has caused significant failures in machine learning (ML) applications. While ML developers perform post-collection interventions, these are time intensive and rarely comprehensive. Thus, new methods to track and manage data collection, iteration, and model training are necessary for evaluating whether datasets reflect real world variability. We present designing data, an iterative, bias mitigating approach to data collection connecting HCI concepts with ML techniques. Our process includes (1) Pre-Collection Planning, to reflexively prompt and document expected data distributions; (2) Collection Monitoring, to systematically encourage sampling diversity; and (3) Data Familiarity, to identify samples that are unfamiliar to a model through Out-of-Distribution (OOD) methods. We instantiate designing data through our own data collection and applied ML case study. We find models trained on "designed" datasets generalize better across intersectional groups than those trained on similarly sized but less targeted datasets, and that data familiarity is effective for debugging datasets.  ( 2 min )
    Weakly Supervised Headline Dependency Parsing. (arXiv:2301.10371v1 [cs.CL])
    English news headlines form a register with unique syntactic properties that have been documented in linguistics literature since the 1930s. However, headlines have received surprisingly little attention from the NLP syntactic parsing community. We aim to bridge this gap by providing the first news headline corpus of Universal Dependencies annotated syntactic dependency trees, which enables us to evaluate existing state-of-the-art dependency parsers on news headlines. To improve English news headline parsing accuracies, we develop a projection method to bootstrap silver training data from unlabeled news headline-article lead sentence pairs. Models trained on silver headline parses demonstrate significant improvements in performance over models trained solely on gold-annotated long-form texts. Ultimately, we find that, although projected silver training data improves parser performance across different news outlets, the improvement is moderated by constructions idiosyncratic to outlet.  ( 2 min )
    Exact and rapid linear clustering of networks with dynamic programming. (arXiv:2301.10403v1 [cs.SI])
    We study the problem of clustering networks whose nodes have imputed or physical positions in a single dimension, such as prestige hierarchies or the similarity dimension of hyperbolic embeddings. Existing algorithms, such as the critical gap method and other greedy strategies, only offer approximate solutions. Here, we introduce a dynamic programming approach that returns provably optimal solutions in polynomial time -- O(n^2) steps -- for a broad class of clustering objectives. We demonstrate the algorithm through applications to synthetic and empirical networks, and show that it outperforms existing heuristics by a significant margin, with a similar execution time.  ( 2 min )
    Predicting mental health using social media: A roadmap for future development. (arXiv:2301.10453v1 [cs.IR])
    Mental disorders such as depression and suicidal ideation are hazardous, affecting more than 300 million people over the world. However, on social media, mental disorder symptoms can be observed, and automated approaches are increasingly capable of detecting them. The considerable number of social media users and the tremendous quantity of user-generated data on social platforms provide a unique opportunity for researchers to distinguish patterns that correlate with mental status. This research offers a roadmap for analysis, where mental state detection can be based on machine learning techniques. We describe the common approaches for predicting and identifying the disorder using user-generated content. This research is organized according to the data collection, feature extraction, and prediction algorithms. Furthermore, we review several recent studies conducted to explore different features of candidate profiles and their analytical methods. Following, we debate various aspects of the development of experimental auto-detection frameworks for identifying users who suffer from disorders, and we conclude with a discussion of future trends. The introduced methods can help complement screening procedures, identify at-risk people through social media monitoring on a large scale, and make disorders easier to treat in the future.  ( 2 min )
    Parameterizing the cost function of Dynamic Time Warping with application to time series classification. (arXiv:2301.10350v1 [cs.LG])
    Dynamic Time Warping (DTW) is a popular time series distance measure that aligns the points in two series with one another. These alignments support warping of the time dimension to allow for processes that unfold at differing rates. The distance is the minimum sum of costs of the resulting alignments over any allowable warping of the time dimension. The cost of an alignment of two points is a function of the difference in the values of those points. The original cost function was the absolute value of this difference. Other cost functions have been proposed. A popular alternative is the square of the difference. However, to our knowledge, this is the first investigation of both the relative impacts of using different cost functions and the potential to tune cost functions to different tasks. We do so in this paper by using a tunable cost function {\lambda}{\gamma} with parameter {\gamma}. We show that higher values of {\gamma} place greater weight on larger pairwise differences, while lower values place greater weight on smaller pairwise differences. We demonstrate that training {\gamma} significantly improves the accuracy of both the DTW nearest neighbor and Proximity Forest classifiers.  ( 2 min )
    Data Consistent Deep Rigid MRI Motion Correction. (arXiv:2301.10365v1 [eess.IV])
    Motion artifacts are a pervasive problem in MRI, leading to misdiagnosis or mischaracterization in population-level imaging studies. Current retrospective rigid intra-slice motion correction techniques jointly optimize estimates of the image and the motion parameters. In this paper, we use a deep network to reduce the joint image-motion parameter search to a search over rigid motion parameters alone. Our network produces a reconstruction as a function of two inputs: corrupted k-space data and motion parameters. We train the network using simulated, motion-corrupted k-space data generated from known motion parameters. At test-time, we estimate unknown motion parameters by minimizing a data consistency loss between the motion parameters, the network-based image reconstruction given those parameters, and the acquired measurements. Intra-slice motion correction experiments on simulated and realistic 2D fast spin echo brain MRI achieve high reconstruction fidelity while retaining the benefits of explicit data consistency-based optimization. Our code is publicly available at https://www.github.com/nalinimsingh/neuroMoCo.  ( 2 min )
    Multilingual Multiaccented Multispeaker TTS with RADTTS. (arXiv:2301.10335v1 [cs.SD])
    We work to create a multilingual speech synthesis system which can generate speech with the proper accent while retaining the characteristics of an individual voice. This is challenging to do because it is expensive to obtain bilingual training data in multiple languages, and the lack of such data results in strong correlations that entangle speaker, language, and accent, resulting in poor transfer capabilities. To overcome this, we present a multilingual, multiaccented, multispeaker speech synthesis model based on RADTTS with explicit control over accent, language, speaker and fine-grained $F_0$ and energy features. Our proposed model does not rely on bilingual training data. We demonstrate an ability to control synthesized accent for any speaker in an open-source dataset comprising of 7 accents. Human subjective evaluation demonstrates that our model can better retain a speaker's voice and accent quality than controlled baselines while synthesizing fluent speech in all target languages and accents in our dataset.  ( 2 min )
    Generating Multidimensional Clusters With Support Lines. (arXiv:2301.10327v1 [cs.LG])
    Synthetic data is essential for assessing clustering techniques, complementing and extending real data, and allowing for a more complete coverage of a given problem's space. In turn, synthetic data generators have the potential of creating vast amounts of data -- a crucial activity when real-world data is at premium -- while providing a well-understood generation procedure and an interpretable instrument for methodically investigating cluster analysis algorithms. Here, we present \textit{Clugen}, a modular procedure for synthetic data generation, capable of creating multidimensional clusters supported by line segments using arbitrary distributions. \textit{Clugen} is open source, 100\% unit tested and fully documented, and is available for the Python, R, Julia and MATLAB/Octave ecosystems. We demonstrate that our proposal is able to produce rich and varied results in various dimensions, is fit for use in the assessment of clustering algorithms, and has the potential to be a widely used framework in diverse clustering-related research tasks.  ( 2 min )
    Score Matching via Differentiable Physics. (arXiv:2301.10250v1 [cs.LG])
    Diffusion models based on stochastic differential equations (SDEs) gradually perturb a data distribution $p(\mathbf{x})$ over time by adding noise to it. A neural network is trained to approximate the score $\nabla_\mathbf{x} \log p_t(\mathbf{x})$ at time $t$, which can be used to reverse the corruption process. In this paper, we focus on learning the score field that is associated with the time evolution according to a physics operator in the presence of natural non-deterministic physical processes like diffusion. A decisive difference to previous methods is that the SDE underlying our approach transforms the state of a physical system to another state at a later time. For that purpose, we replace the drift of the underlying SDE formulation with a differentiable simulator or a neural network approximation of the physics. We propose different training strategies based on the so-called probability flow ODE to fit a training set of simulation trajectories and discuss their relation to the score matching objective. For inference, we sample plausible trajectories that evolve towards a given end state using the reverse-time SDE and demonstrate the competitiveness of our approach for different challenging inverse problems.  ( 2 min )
    Interactive-Chain-Prompting: Ambiguity Resolution for Crosslingual Conditional Generation with Interaction. (arXiv:2301.10309v1 [cs.LG])
    Crosslingual conditional generation (e.g., machine translation) has long enjoyed the benefits of scaling. Nonetheless, there are still issues that scale alone may not overcome. A source query in one language, for instance, may yield several translation options in another language without any extra context. Only one translation could be acceptable however, depending on the translator's preferences and goals. Choosing the incorrect option might significantly affect translation usefulness and quality. We propose a novel method interactive-chain prompting -- a series of question, answering and generation intermediate steps between a Translator model and a User model -- that reduces translations into a list of subproblems addressing ambiguities and then resolving such subproblems before producing the final text to be translated. To check ambiguity resolution capabilities and evaluate translation quality, we create a dataset exhibiting different linguistic phenomena which leads to ambiguities at inference for four languages. To encourage further exploration in this direction, we release all datasets. We note that interactive-chain prompting, using eight interactions as exemplars, consistently surpasses prompt-based methods with direct access to background information to resolve ambiguities.  ( 2 min )
    Towards Robust Metrics for Concept Representation Evaluation. (arXiv:2301.10367v1 [cs.LG])
    Recent work on interpretability has focused on concept-based explanations, where deep learning models are explained in terms of high-level units of information, referred to as concepts. Concept learning models, however, have been shown to be prone to encoding impurities in their representations, failing to fully capture meaningful features of their inputs. While concept learning lacks metrics to measure such phenomena, the field of disentanglement learning has explored the related notion of underlying factors of variation in the data, with plenty of metrics to measure the purity of such factors. In this paper, we show that such metrics are not appropriate for concept learning and propose novel metrics for evaluating the purity of concept representations in both approaches. We show the advantage of these metrics over existing ones and demonstrate their utility in evaluating the robustness of concept representations and interventions performed on them. In addition, we show their utility for benchmarking state-of-the-art methods from both families and find that, contrary to common assumptions, supervision alone may not be sufficient for pure concept representations.  ( 2 min )
    Language Model Detoxification in Dialogue with Contextualized Stance Control. (arXiv:2301.10368v1 [cs.CL])
    To reduce the toxic degeneration in a pretrained Language Model (LM), previous work on Language Model detoxification has focused on reducing the toxicity of the generation itself (self-toxicity) without consideration of the context. As a result, a type of implicit offensive language where the generations support the offensive language in the context is ignored. Different from the LM controlling tasks in previous work, where the desired attributes are fixed for generation, the desired stance of the generation depends on the offensiveness of the context. Therefore, we propose a novel control method to do context-dependent detoxification with the stance taken into consideration. We introduce meta prefixes to learn the contextualized stance control strategy and to generate the stance control prefix according to the input context. The generated stance prefix is then combined with the toxicity control prefix to guide the response generation. Experimental results show that our proposed method can effectively learn the context-dependent stance control strategies while keeping a low self-toxicity of the underlying LM.  ( 2 min )
    ClimaX: A foundation model for weather and climate. (arXiv:2301.10343v1 [cs.LG])
    Most state-of-the-art approaches for weather and climate modeling are based on physics-informed numerical models of the atmosphere. These approaches aim to model the non-linear dynamics and complex interactions between multiple variables, which are challenging to approximate. Additionally, many such numerical models are computationally intensive, especially when modeling the atmospheric phenomenon at a fine-grained spatial and temporal resolution. Recent data-driven approaches based on machine learning instead aim to directly solve a downstream forecasting or projection task by learning a data-driven functional mapping using deep neural networks. However, these networks are trained using curated and homogeneous climate datasets for specific spatiotemporal tasks, and thus lack the generality of numerical models. We develop and demonstrate ClimaX, a flexible and generalizable deep learning model for weather and climate science that can be trained using heterogeneous datasets spanning different variables, spatio-temporal coverage, and physical groundings. ClimaX extends the Transformer architecture with novel encoding and aggregation blocks that allow effective use of available compute while maintaining general utility. ClimaX is pre-trained with a self-supervised learning objective on climate datasets derived from CMIP6. The pre-trained ClimaX can then be fine-tuned to address a breadth of climate and weather tasks, including those that involve atmospheric variables and spatio-temporal scales unseen during pretraining. Compared to existing data-driven baselines, we show that this generality in ClimaX results in superior performance on benchmarks for weather forecasting and climate projections, even when pretrained at lower resolutions and compute budgets.  ( 2 min )
    Evolve Smoothly, Fit Consistently: Learning Smooth Latent Dynamics For Advection-Dominated Systems. (arXiv:2301.10391v1 [cs.LG])
    We present a data-driven, space-time continuous framework to learn surrogatemodels for complex physical systems described by advection-dominated partialdifferential equations. Those systems have slow-decaying Kolmogorovn-widththat hinders standard methods, including reduced order modeling, from producinghigh-fidelity simulations at low cost. In this work, we construct hypernetwork-based latent dynamical models directly on the parameter space of a compactrepresentation network. We leverage the expressive power of the network and aspecially designed consistency-inducing regularization to obtain latent trajectoriesthat are both low-dimensional and smooth. These properties render our surrogatemodels highly efficient at inference time. We show the efficacy of our frameworkby learning models that generate accurate multi-step rollout predictions at muchfaster inference speed compared to competitors, for several challenging examples.  ( 2 min )
    Learned Interferometric Imaging for the SPIDER Instrument. (arXiv:2301.10260v1 [astro-ph.IM])
    The Segmented Planar Imaging Detector for Electro-Optical Reconnaissance (SPIDER) is an optical interferometric imaging device that aims to offer an alternative to the large space telescope designs of today with reduced size, weight and power consumption. This is achieved through interferometric imaging. State-of-the-art methods for reconstructing images from interferometric measurements adopt proximal optimization techniques, which are computationally expensive and require handcrafted priors. In this work we present two data-driven approaches for reconstructing images from measurements made by the SPIDER instrument. These approaches use deep learning to learn prior information from training data, increasing the reconstruction quality, and significantly reducing the computation time required to recover images by orders of magnitude. Reconstruction time is reduced to ${\sim} 10$ milliseconds, opening up the possibility of real-time imaging with SPIDER for the first time. Furthermore, we show that these methods can also be applied in domains where training data is scarce, such as astronomical imaging, by leveraging transfer learning from domains where plenty of training data are available.  ( 2 min )
  • Open

    Two Efficient Ridge Solutions for the Incremental Broad Learning System on Added Inputs. (arXiv:1911.07292v5 [cs.LG] UPDATED)
    This paper proposes the recursive and square-root BLS algorithms to improve the original BLS for new added inputs, which utilize the inverse and inverse Cholesky factor of the Hermitian matrix in the ridge inverse, respectively, to update the ridge solution. The recursive BLS updates the inverse by the matrix inversion lemma, while the square-root BLS updates the upper-triangular inverse Cholesky factor by multiplying it with an upper-triangular intermediate matrix. When the added p training samples are more than the total k nodes in the network, i.e., p>k, the inverse of a sum of matrices is applied to take a smaller matrix inversion or inverse Cholesky factorization. For the distributed BLS with data-parallelism, we introduce the parallel implementation of the square-root BLS, which is deduced from the parallel implementation of the inverse Cholesky factorization. The original BLS based on the generalized inverse with the ridge regression assumes the ridge parameter lamda->0 in the ridge inverse. When lambda->0 is not satisfied, the numerical experiments on the MNIST and NORB datasets show that both the proposed ridge solutions improve the testing accuracy of the original BLS, and the improvement becomes more significant as lambda is bigger. On the other hand, compared to the original BLS, both the proposed BLS algorithms theoretically require less complexities, and are significantly faster in the simulations on the MNIST dataset. The speedups in total training time of the recursive and square-root BLS algorithms over the original BLS are 4.41 and 6.92 respectively when p > k, and are 2.80 and 1.59 respectively when p < k.
    Certifying Neural Network Robustness to Random Input Noise from Samples. (arXiv:2010.07532v2 [cs.LG] UPDATED)
    Methods to certify the robustness of neural networks in the presence of input uncertainty are vital in safety-critical settings. Most certification methods in the literature are designed for adversarial input uncertainty, but researchers have recently shown a need for methods that consider random uncertainty. In this paper, we propose a novel robustness certification method that upper bounds the probability of misclassification when the input noise follows an arbitrary probability distribution. This bound is cast as a chance-constrained optimization problem, which is then reformulated using input-output samples to replace the optimization constraints. The resulting optimization reduces to a linear program with an analytical solution. Furthermore, we develop a sufficient condition on the number of samples needed to make the misclassification bound hold with overwhelming probability. Our case studies on MNIST classifiers show that this method is able to certify a uniform infinity-norm uncertainty region with a radius of nearly 50 times larger than what the current state-of-the-art method can certify.
    Infinitesimal gradient boosting. (arXiv:2104.13208v2 [stat.ML] UPDATED)
    We define infinitesimal gradient boosting as a limit of the popular tree-based gradient boosting algorithm from machine learning. The limit is considered in the vanishing-learning-rate asymptotic, that is when the learning rate tends to zero and the number of gradient trees is rescaled accordingly. For this purpose, we introduce a new class of randomized regression trees bridging totally randomized trees and Extra Trees and using a softmax distribution for binary splitting. Our main result is the convergence of the associated stochastic algorithm and the characterization of the limiting procedure as the unique solution of a nonlinear ordinary differential equation in a infinite dimensional function space. Infinitesimal gradient boosting defines a smooth path in the space of continuous functions along which the training error decreases, the residuals remain centered and the total variation is well controlled.
    Batch Bayesian Optimization on Permutations using the Acquisition Weighted Kernel. (arXiv:2102.13382v2 [stat.ML] UPDATED)
    In this work we propose a batch Bayesian optimization method for combinatorial problems on permutations, which is well suited for expensive-to-evaluate objectives. We first introduce LAW, an efficient batch acquisition method based on determinantal point processes using the acquisition weighted kernel. Relying on multiple parallel evaluations, LAW enables accelerated search on combinatorial spaces. We then apply the framework to permutation problems, which have so far received little attention in the Bayesian Optimization literature, despite their practical importance. We call this method LAW2ORDER. On the theoretical front, we prove that LAW2ORDER has vanishing simple regret by showing that the batch cumulative regret is sublinear. Empirically, we assess the method on several standard combinatorial problems involving permutations such as quadratic assignment, flowshop scheduling and the traveling salesman, as well as on a structure learning task.
    Data-Driven Certification of Neural Networks with Random Input Noise. (arXiv:2010.01171v2 [cs.LG] UPDATED)
    Methods to certify the robustness of neural networks in the presence of input uncertainty are vital in safety-critical settings. Most certification methods in the literature are designed for adversarial or worst-case inputs, but researchers have recently shown a need for methods that consider random input noise. In this paper, we examine the setting where inputs are subject to random noise coming from an arbitrary probability distribution. We propose a robustness certification method that lower-bounds the probability that network outputs are safe. This bound is cast as a chance-constrained optimization problem, which is then reformulated using input-output samples to make the optimization constraints tractable. We develop sufficient conditions for the resulting optimization to be convex, as well as on the number of samples needed to make the robustness bound hold with overwhelming probability. We show for a special case that the proposed optimization reduces to an intuitive closed-form solution. Case studies on synthetic, MNIST, and CIFAR-10 networks experimentally demonstrate that this method is able to certify robustness against various input noise regimes over larger uncertainty regions than prior state-of-the-art techniques.
    Non-Asymptotic Analysis of a UCB-based Top Two Algorithm. (arXiv:2210.05431v2 [stat.ML] UPDATED)
    A Top Two sampling rule for bandit identification is a method which selects the next arm to sample from among two candidate arms, a leader and a challenger. Due to their simplicity and good empirical performance, they have received increased attention in recent years. However, for fixed-confidence best arm identification, theoretical guarantees for Top Two methods have only been obtained in the asymptotic regime, when the error level vanishes. In this paper, we derive the first non-asymptotic upper bound on the expected sample complexity of a Top Two algorithm, which holds for any error level. Our analysis highlights sufficient properties for a regret minimization algorithm to be used as leader. These properties are satisfied by the UCB algorithm, and our proposed UCB-based Top Two algorithm simultaneously enjoys non-asymptotic guarantees and competitive empirical performance.
    RDIS: Random Drop Imputation with Self-Training for Incomplete Time Series Data. (arXiv:2010.10075v2 [cs.LG] UPDATED)
    Time-series data with missing values are commonly encountered in many fields, such as healthcare, meteorology, and robotics. The imputation aims to fill the missing values with valid values. Most imputation methods trained the models implicitly because missing values have no ground truth. In this paper, we propose Random Drop Imputation with Self-training (RDIS), a novel training method for time-series data imputation models. In RDIS, we generate extra missing values by applying a random drop on the observed values in incomplete data. We can explicitly train the imputation models by filling in the randomly dropped values. In addition, we adopt self-training with pseudo values to exploit the original missing values. To improve the quality of pseudo values, we set the threshold and filter them by calculating the entropy. To verify the effectiveness of RDIS on the time series imputation, we test RDIS to various imputation models and achieve competitive results on two real-world datasets.
    Posterior Covariance Information Criterion for Weighted Inference. (arXiv:2106.13694v4 [stat.ME] UPDATED)
    For predictive evaluation based on quasi-posterior distributions, we develop a new information criterion, the posterior covariance information criterion (PCIC. PCIC generalises the widely applicable information criterion WAIC so as to effectively handle predictive scenarios where likelihoods for the estimation and the evaluation of the model may be different. A typical example of such scenarios is the weighted likelihood inference, including prediction under covariate shift and counterfactual prediction. The proposed criterion utilises a posterior covariance form and is computed by using only one Markov chain Monte Carlo run. Through numerical examples, we demonstrate how PCIC can apply in practice. Further, we show that PCIC is asymptotically unbiased to the quasi-Bayesian generalization error under mild conditions in weighted inference with both regular and singular statistical models.
    Meta-Learning PAC-Bayes Priors in Model Averaging. (arXiv:1912.11252v3 [cs.LG] UPDATED)
    Nowadays model uncertainty has become one of the most important problems in both academia and industry. In this paper, we mainly consider the scenario in which we have a common model set used for model averaging instead of selecting a single final model via a model selection procedure to account for this model's uncertainty to improve the reliability and accuracy of inferences. Here one main challenge is to learn the prior over the model set. To tackle this problem, we propose two data-based algorithms to get proper priors for model averaging. One is for meta-learner, the analysts should use historical similar tasks to extract the information about the prior. The other one is for base-learner, a subsampling method is used to deal with the data step by step. Theoretically, an upper bound of risk for our algorithm is presented to guarantee the performance of the worst situation. In practice, both methods perform well in simulations and real data studies, especially with poor-quality data.
    Truthful Self-Play. (arXiv:2106.03007v4 [stat.ML] UPDATED)
    We present a general optimization framework for emergent belief-state representation without any supervision. We employed the common configuration of multiagent reinforcement learning and communication to improve exploration coverage over an environment by leveraging the knowledge of each agent. In this paper, we obtained that recurrent neural nets (RNNs) with shared weights are highly biased in partially observable environments because of their noncooperativity. To address this, we designated an unbiased version of self-play via mechanism design, also known as reverse game theory, to clarify unbiased knowledge at the Bayesian Nash equilibrium. The key idea is to add imaginary rewards using the peer prediction mechanism, i.e., a mechanism for mutually criticizing information in a decentralized environment. Numerical analyses, including StarCraft exploration tasks with up to 20 agents and off-the-shelf RNNs, demonstrate the state-of-the-art performance.
    On the Semi-supervised Expectation Maximization. (arXiv:2211.00537v2 [cs.LG] UPDATED)
    The Expectation Maximization (EM) algorithm is widely used as an iterative modification to maximum likelihood estimation when the data is incomplete. We focus on a semi-supervised case to learn the model from labeled and unlabeled samples. Existing work in the semi-supervised case has focused mainly on performance rather than convergence guarantee, however we focus on the contribution of the labeled samples to the convergence rate. The analysis clearly demonstrates how the labeled samples improve the convergence rate for the exponential family mixture model. In this case, we assume that the population EM (EM with unlimited data) is initialized within the neighborhood of global convergence for the population EM that consists solely of samples that have not been labeled. The analysis for the labeled samples provides a comprehensive description of the convergence rate for the Gaussian mixture model. In addition, we extend the findings for labeled samples and offer an alternative proof for the population EM's convergence rate with unlabeled samples for the symmetric mixture of two Gaussians.
    Imitating Human Behaviour with Diffusion Models. (arXiv:2301.10677v1 [cs.AI])
    Diffusion models have emerged as powerful generative models in the text-to-image domain. This paper studies their application as observation-to-action models for imitating human behaviour in sequential environments. Human behaviour is stochastic and multimodal, with structured correlations between action dimensions. Meanwhile, standard modelling choices in behaviour cloning are limited in their expressiveness and may introduce bias into the cloned policy. We begin by pointing out the limitations of these choices. We then propose that diffusion models are an excellent fit for imitating human behaviour, since they learn an expressive distribution over the joint action space. We introduce several innovations to make diffusion models suitable for sequential environments; designing suitable architectures, investigating the role of guidance, and developing reliable sampling strategies. Experimentally, diffusion models closely match human demonstrations in a simulated robotic control task and a modern 3D gaming environment.
    Signature Methods in Machine Learning. (arXiv:2206.14674v3 [stat.ML] UPDATED)
    Signature-based techniques give mathematical insight into the interactions between complex streams of evolving data. These insights can be quite naturally translated into numerical approaches to understanding streamed data, and perhaps because of their mathematical precision, have proved useful in analysing streamed data in situations where the data is irregular, and not stationary, and the dimension of the data and the sample sizes are both moderate. Understanding streamed multi-modal data is exponential: a word in $n$ letters from an alphabet of size $d$ can be any one of $d^n$ messages. Signatures remove the exponential amount of noise that arises from sampling irregularity, but an exponential amount of information still remain. This survey aims to stay in the domain where that exponential scaling can be managed directly. Scalability issues are an important challenge in many problems but would require another survey article and further ideas. This survey describes a range of contexts where the data sets are small enough to remove the possibility of massive machine learning, and the existence of small sets of context free and principled features can be used effectively. The mathematical nature of the tools can make their use intimidating to non-mathematicians. The examples presented in this article are intended to bridge this communication gap and provide tractable working examples drawn from the machine learning context. Notebooks are available online for several of these examples. This survey builds on the earlier paper of Ilya Chevryev and Andrey Kormilitzin which had broadly similar aims at an earlier point in the development of this machinery. This article illustrates how the theoretical insights offered by signatures are simply realised in the analysis of application data in a way that is largely agnostic to the data type.
    A Unified and Constructive Framework for the Universality of Neural Networks. (arXiv:2112.14877v3 [cs.LG] UPDATED)
    One of the reasons why many neural networks are capable of replicating complicated tasks or functions is their universal property. Though the past few decades have seen tremendous advances in theories of neural networks, a single constructive framework for neural network universality remains unavailable. This paper is the first effort to provide a unified and constructive framework for the universality of a large class of activation functions including most of existing ones. At the heart of the framework is the concept of neural network approximate identity (nAI). The main result is: {\em any nAI activation function is universal}. It turns out that most of existing activation functions are nAI, and thus universal in the space of continuous functions on compacta. The framework induces {\bf several advantages} over the contemporary counterparts. First, it is constructive with elementary means from functional analysis, probability theory, and numerical analysis. Second, it is the first unified attempt that is valid for most of existing activation functions. Third, as a by product, the framework provides the first universality proof for some of the existing activation functions including Mish, SiLU, ELU, GELU, and etc. Fourth, it provides new proofs for most activation functions. Fifth, it discovers new activation functions with guaranteed universality property. Sixth, for a given activation and error tolerance, the framework provides precisely the architecture of the corresponding one-hidden neural network with predetermined number of neurons, and the values of weights/biases. Seventh, the framework allows us to abstractly present the first universal approximation with favorable non-asymptotic rate.
    Semiparametric discrete data regression with Monte Carlo inference and prediction. (arXiv:2110.12316v5 [stat.ME] UPDATED)
    Discrete data are abundant and often arise as counts or rounded data. These data commonly exhibit complex distributional features such as zero-inflation, over- or under-dispersion, boundedness, and heaping, which render many parametric models inadequate. Yet even for parametric regression models, approximations such as MCMC typically are needed for posterior inference. This paper introduces a Bayesian modeling and algorithmic framework that enables semiparametric regression analysis for discrete data with Monte Carlo (not MCMC) sampling. The proposed approach pairs a nonparametric marginal model with a latent linear regression model to encourage both flexibility and interpretability, and delivers posterior consistency even under model misspecification. For a parametric or large-sample approximation of this model, we identify a class of conjugate priors with (pseudo) closed-form posteriors. All posterior and predictive distributions are available analytically or via Monte Carlo sampling. These tools are broadly useful for linear regression, nonlinear models via basis expansions, and variable selection with discrete data. Simulation studies demonstrate significant advantages in computing, prediction, estimation, and selection relative to existing alternatives. This novel approach is applied to self-reported mental health data that exhibit zero-inflation, overdispersion, boundedness, and heaping.
    Learning Dynamical Systems from Data: A Simple Cross-Validation Perspective, Part V: Sparse Kernel Flows for 132 Chaotic Dynamical Systems. (arXiv:2301.10321v1 [stat.ML])
    Regressing the vector field of a dynamical system from a finite number of observed states is a natural way to learn surrogate models for such systems. A simple and interpretable way to learn a dynamical system from data is to interpolate its vector-field with a data-adapted kernel which can be learned by using Kernel Flows. The method of Kernel Flows is a trainable machine learning method that learns the optimal parameters of a kernel based on the premise that a kernel is good if there is no significant loss in accuracy if half of the data is used. The objective function could be a short-term prediction or some other objective for other variants of Kernel Flows). However, this method is limited by the choice of the base kernel. In this paper, we introduce the method of \emph{Sparse Kernel Flows } in order to learn the ``best'' kernel by starting from a large dictionary of kernels. It is based on sparsifying a kernel that is a linear combination of elemental kernels. We apply this approach to a library of 132 chaotic systems.

  • Open

    Insane face rendering A.I technology..
    submitted by /u/KTMark [link] [comments]  ( 40 min )
    AI Music - Eminem - 'Slim shady is alive'
    submitted by /u/DANGERD0OM [link] [comments]  ( 40 min )
    Finding the right AI for a specific task
    Hi all, We're developing an internal application that groups customers together based on attributes that adhere to a ruleset on how they should be grouped. It does this fine. However, some nuance is then applied via human effort to modify groupings based on some customer notes (a text string) that sometimes dictate that two customers need to be in different groups for x reason, even if the original grouping adheres to the ruleset. The application itself has a UI that sorts customers into columns, which are manipulated by staff via dragging and dropping a customer/customers between one column and another. I had a thought to employ an AI model that compares the original generated grouping config that our code produces against the modified groupings that staff adjust based on that nuance. The idea is that we could analyze the why of a modification and use that insight to generate better default groupings. Is there a model out there that would be ideally suited for this kind of learning? Keen to dive into it further on my own but any recommendations as a starting point would be great. submitted by /u/premiumnougat [link] [comments]  ( 41 min )
    Create Your Chat GPT-3 Web App with Streamlit in Python
    submitted by /u/pasticciociccio [link] [comments]  ( 40 min )
    What do employers and job seekers need to know about artificial intelligence's role in hiring?
    University of Florida - Warrington College of Business's Mo Wang offers advice for the future of work. Full Story: https://explore.research.ufl.edu/the-future-of-work.html#ai-hiring submitted by /u/ufexplore [link] [comments]  ( 40 min )
    BuzzFeed to Use ChatGPT Creator OpenAI to Help Create Quizzes and Other Content
    submitted by /u/trueslicky [link] [comments]  ( 40 min )
    Member of Congress Reads AI-Generated Speech on House Floor
    submitted by /u/dahmedahe [link] [comments]  ( 6 min )
    Synthesizing the Businessmen-Smile
    submitted by /u/walt74 [link] [comments]  ( 45 min )
    AI Dream 150 - ENTERING DREAMWORLD Part2 TEASER - AI Video vqgan clip
    submitted by /u/LordPewPew777 [link] [comments]  ( 40 min )
    NIST Risk Management Framework Aims to Improve Trustworthiness of Artificial Intelligence
    submitted by /u/Harley109 [link] [comments]  ( 40 min )
    📌[Searchcolab] It's impressive to see how far generative AI has come in the past 5 years. What should we expect the trajectory of this field to be in the next 5 years? Btw the pictures attached are also generated using AI.
    submitted by /u/Maleficent_Suit1591 [link] [comments]  ( 41 min )
    "Father" from Equilibrium movie
    Imagine we could create an AGI/ASI that will protect our values. Something like "Father" from Equilibrium (but that's a dystopian version). Let's call it modern God. What values should it protect? submitted by /u/chuguruk [link] [comments]  ( 40 min )
    InstructPix2Pix lets you edit images using only text prompts
    submitted by /u/much_successes [link] [comments]  ( 40 min )
    what are your favorite AI subreddit?
    submitted by /u/Chaserivx [link] [comments]  ( 40 min )
    AI "Upscale" With Only 1000 Training Examples(All examples were dogs)
    submitted by /u/TheRPGGamerMan [link] [comments]  ( 43 min )
    Proud Pollution A movie script Written by AI
    Movie script: Proud Pollution The website used- (https://www.plot-generator.org.uk/) Be free to share your thoughts on it! ​ Proud Pollution A Screenplay by Mr. Pseudonym EXT. VASQUEZ ROCKS, CALIFORNIA - AFTERNOON Misunderstood piolet FLAMOUS JACK THORNTON is arguing with mean scout MISS HELEN FISH. JACK tries to hug HELEN but she shakes him off. JACK Please, Helen, don't leave me. HELEN I'm sorry Jack, but I'm looking for somebody a bit braver. Somebody who faces his fears head on, inhead-onstead of running away. JACK I am such a person! HELEN frowns. HELEN I'm sorry, Jack. I just don't feel excited by this relationship anymore. HELEN leaves. JACK sits down, looking defeated. Moments later, noble navigator MASTER CUTHBERT MACDONALD barges in looking flustered. JACK Go…  ( 47 min )
    Meta's chief AI scientist says "ChatGPT is not innovative".
    What happened? So, there has been a lot of excitement around OpenAI's ChatGPT which generates natural-language responses to human prompts. But what if... it's not as amazing as we all think it is? Yann LeCun, Meta's chief AI scientist, argues that the program is not innovative. He also states that similar technology has been developed by many companies and research labs, and that ChatGPT is composed of multiple pieces of technology developed over many years by many parties. (sounds salty to me... and I like my cookies sweet!) But maybe Yann has a point! What's happening now? ChatGPT is perceived by many as a unique and innovative program. People are using it everyday to make their lives easier. So, no matter what, ChatGPT is still awesome. What's happening next? It's unclear what will happen next in terms of the development and perception of ChatGPT. However, it can be expected that as AI technology continues to evolve, there will be further advancements on what is possible with this tech. It's likely that ChatGPT will face many spin-offs and competitors this year. If you enjoyed this and want 500+ AI tools, I write a daily AI newsletter: https://chriscookies.beehiiv.com/p/metas-chief-ai-scientist-says-chatgpt-not-innovative-7581 submitted by /u/ZaKodiak [link] [comments]  ( 41 min )
    Looking for some ideas to research about making money using AI!
    Hey yall, I'm looking for ideas of how to make money using language models like ChatGPT. I want to go in depth and research a bit as well as begin designing tutorials based on my experiences of what's the best way to make money. I am open to any suggestions or things that have worked for you all! Thanks! submitted by /u/Chadcash [link] [comments]  ( 41 min )
    What is Atomic AI? - Is AI going to have a Drug development breakthrough soon?
    submitted by /u/BackgroundResult [link] [comments]  ( 40 min )
    AI Video to Fill Missing Frames/Smooth Animation?
    Hey all, Was wondering if you know some kind of AI tool that exists to fill missing frames and therefore smooth animation in animated videos? Trying to get something cleaned up and really nailed in. Thanks! submitted by /u/miseryleech [link] [comments]  ( 40 min )
    ChatGPT: OpenAI’s Last Resort Turns Out To Be A Winner
    submitted by /u/liquidocelotYT [link] [comments]  ( 40 min )
    41 AI Written Articles Out Of 77 On CNET Have Plagiarism And Errors
    submitted by /u/vadhavaniyafaijan [link] [comments]  ( 6 min )
    Chrome Extension that uses AI to write emails.
    submitted by /u/bobsandalex [link] [comments]  ( 40 min )
    I told an AI to freak out on camera. it was ALL made by AI.
    submitted by /u/25dopren [link] [comments]  ( 40 min )
  • Open

    [D] score based vs. Diffusion models
    I know there is a mathematical way to show that the two approaches of score matching models and diffusion models are the same. I wonder, if there in practice/code are the same either? I already tried to find some PyTorch implementations of score based models but didn’t find anything yet - just for diffusion models. submitted by /u/Individual-Cause-616 [link] [comments]  ( 43 min )
    [D] Why are GANs worse than (Latent) Diffusion Models for text2img generation?
    I guess what I'm trying to figure out is, what are the main reasons that DMs are outperforming GANs in text2img generation? Thanks! submitted by /u/TheCockatoo [link] [comments]  ( 46 min )
    [P] A python module to generate optimized prompts & solve different NLP problems using GPT-n based models and return structured python object for easy parsing
    Hi folks, I was working on a personal experimental project related to GPT-3, which I thought of making it open source now. It saves much time while working with LLMs. If you are an industrial researcher or application developer, you probably have worked with GPT-3 apis. A common challenge when utilizing LLMs such as GPT-3 and BLOOM is their tendency to produce uncontrollable & unstructured outputs, making it difficult to use them for various NLP tasks and applications.To address this, we developed Promptify, a library that allows for the use of LLMs to solve NLP problems including Named Entity Recognition, Binary Classification, Multi-Label Classification, and Question-Answering and return a python object for easy parsing to construct additional applications on top of GPT-n based models. Features 🚀 🧙‍♀️ NLP Tasks (NER, Binary Text Classification, Multi-Label Classification etc) in 2 lines of code with no training data required 🔨 Easily add one shot, two shot, or few shot examples to the prompt ✌ Output always provided as a Python object (e.g. list, dictionary) for easy parsing and filtering 💥 Custom examples and samples can be easily added to the prompt 💰 Optimized prompts to reduce OpenAI token costs GITHUB: https://github.com/promptslab/Promptify Examples: https://github.com/promptslab/Promptify/tree/main/examples For quick demo -> Colab I hope it will be helpful in your research. Thanks :) NER example ​ https://preview.redd.it/vnz4mf0i6gea1.png?width=1398&format=png&auto=webp&s=74c70bd9d518423f913c1fb9c68cf2565cf8cffc submitted by /u/aadityaura [link] [comments]  ( 43 min )
    A Watermark for Large Language Models
    submitted by /u/lookinsidemybutthole [link] [comments]  ( 42 min )
    [R] Why Can GPT Learn In-Context? Language Models Secretly Perform Gradient Descent as Meta-Optimizers
    Dec 2022 paper from Microsoft research: https://arxiv.org/abs/2212.10559v2 Large pretrained language models have shown surprising In-Context Learning (ICL) ability. With a few demonstration input-label pairs, they can predict the label for an unseen input without additional parameter updates. Despite the great success in performance, the working mechanism of ICL still remains an open problem. In order to better understand how ICL works, this paper explains language models as meta-optimizers and understands ICL as a kind of implicit finetuning. submitted by /u/currentscurrents [link] [comments]  ( 43 min )
    [Discussion] Github like alternative for ML?
    Versioning and collaboration on code for software engineers is a reasonably solved problem through GitHub since the task at hand predominantly involves just maintaining different copies of just simple vanilla code in different folders. On the other hand, ML engineers face the humungous task of maintaining different versions on not just code, but hyper parameters, data, models, data lineage and labels and storing this on GitHub currently does not allow you to track the changes on each variable well. What are the software/open source tools currently used for the same? Is their a space for a new company to be built here? submitted by /u/angkhandelwal749 [link] [comments]  ( 44 min )
    Are there any projects working at an open source version of Constitutional AI? [D]
    I'm looking into projects which augment the RLHF training approach of chatGPT with explicit rules, such as in https://paperswithcode.com/paper/constitutional-ai-harmlessness-from-ai. Ideally there would be both rules and priority levels between the rules, similarly to the Asimov laws of robotics. The Open-Assistant project (https://github.com/LAION-AI/Open-Assistant) captures the spirit, but it is looking to replicate chatGPT at the moment. submitted by /u/lorepieri [link] [comments]  ( 42 min )
    [D] Quantitative measure for smoothness of NLP autoencoder latent space
    I would like to measure the smoothness of an NLP-autoencoder's latent space. The idea is to sample two Gaussian vectors v1 and v2 in the latent space of the AE, and generate N-1 points between them like so: vi = v1 + (v2 - v1) / (N * i) My idea is to then decode these vectors and measure the BLEU score between d(vi) and d(vi+1) for all N-2 comparisons. Is this idea reasonable, do you have a better one? Is there a technique from AEs with images that can be useful here? submitted by /u/Blutorangensaft [link] [comments]  ( 43 min )
    [D] What are some of your favorite ML research posters?
    And what are your own best practices when creating one (e.g. adding a QR code that links to the GitHub project or paper PDF)? submitted by /u/epistoteles [link] [comments]  ( 42 min )
    [D] Fastest and most accurate model for casing
    What is the state of the art regarding freely available casing models, i.e. DNNs, that try to restore the original casing of a text with uniform (either lowercase or capital letters) casing? I value both speed and accuracy, as I have to process a large corpus of text. submitted by /u/Blutorangensaft [link] [comments]  ( 43 min )
    Few questions about scalability of chatGPT [D]
    I have two questions about chatGPT. I don't come from a machine learning background. I am just a programmer. So bear with me if they sound a bit dumb. I was checking about chatGPT a bit the last week. I went through their papers and also tried out a fine tuning by myself by creating some fictional world and giving it some examples. The first thing I wondered is what is very special about the model than the large data and parameter set it has, that other competitors can't do. I ask this because I have seen a lot of "google killer" discussions in some places. From what I understood from their papers I thought it is something another company with the computing power and the filtered data can have up and running in few months. I see their advantage in rolling out to the public because with feedbacks from actual users all over the world it can potentially be retrained. The second thing I wondered is its scalability. It feels to me that it is a very big challenge to keep it scalable in the future. Currently getting a long text out of it is kind of painful because it has to continuously generate. I think it is continuously calculating with the huge parameter set it has. I wonder also about new trends, if it needs to be retrained. I also used it for a fine tuning, where I created a fictional world with its own law and rules and the fine tuning took hours in the queue - so is it creating separate parameters for my case? that would be a lot considering how much parameter set they have. submitted by /u/besabestin [link] [comments]  ( 50 min )
    [P] EvoTorch 0.4.0 dropped with GPU-accelerated implementations of CMA-ES, MAP-Elites and NSGA-II.
    Find the release notes here: https://github.com/nnaisense/evotorch/releases/tag/v0.4.0 A big highlight is how fast these implementations are! I genuinely believe GPU-acceleration is the future of Evolutionary algorithms, and EvoTorch and its integration into the PyTorch ecosystem is a fantastic enabler for this. To demonstrate the raw speed provided by the new release, I compared EvoTorch's CMA-ES implementation to that provided by the popular pycma package on the 80-dimensional Rastrigin problem and tracked the run-time: Performance was measured over 50 runs on the 80-dimensional Rastrigin problem The crazy thing to note is that when we switch to GPU (Tesla V100), we can efficiently run CMA-ES with population sizes going into 100k+! submitted by /u/NaturalGradient [link] [comments]  ( 45 min )
    Machine learning and black box numerical solver[D]
    Anybody know some methods and techniques for integrating a numerical solver with the neural network .. how do you calculate the gradients of the solver when you don’t know the details of such solver- black box solver. submitted by /u/Due-Wall-915 [link] [comments]  ( 43 min )
    [P] Diffusion models best practices
    I'm about to start an experimental project that involves training a denoising diffusion model on the medical data (small dataset). Could you please share useful resources, tips, tricks and heuristics for dealing with diffusion models? submitted by /u/debrises [link] [comments]  ( 42 min )
  • Open

    "Cheaters Hacked an AI Bot—and Beat the 'Rocket League' Elite"
    submitted by /u/gwern [link] [comments]  ( 40 min )
    Insights and learnings
    Hey all, I am part of an incubator and interested in building developer tooling for reinforcement learning. I would love to understand, from the RL community, what some of the biggest pinpoints are in developing and productionising RL agents. Would love to hear about your implementations too, if you are happy to share! submitted by /u/paramkumar1992 [link] [comments]  ( 41 min )
    Are there papers that do an empirical investigation on DRL hyperparameters?
    Could someone please help with this - https://ai.stackexchange.com/questions/38894/are-there-papers-that-do-an-empirical-investigation-on-drl-hyperparameters submitted by /u/Academic-Rent7800 [link] [comments]  ( 41 min )
    DQN application
    I want to train a DQN model in an off-policy fashion, where my behavior policy is an older agent. I have a big memory of a lot of episodes of this agent. Now I want to find a better policy using DQN. Now I am just wondering, in the "normal" DQN case you would use the experience replay buffer and would update behavior and target policy online (behavior not really online but with the time lag introduced after which these parameters are also updated). In my case, I already have all the experience and would like to learn from it. Do you think it makes sense to use the exact same procedure in this context, so sampling one new action, state and immediate reward, follow up action or could it be better here to use the fact that all experience is already stored to exchange the immediate reward + gamma*maxQ(s',a') with some more future information about the rewards (up to the point of Monte Carlo where you take G_t so the discounted cumulative reward seen during the episode from point t onwards)? submitted by /u/PatrickSVM [link] [comments]  ( 42 min )
    "Imitating Human Behaviour with Diffusion Models", Pearce et al 2023 {MS}
    submitted by /u/gwern [link] [comments]  ( 40 min )
    "Diffusion Policies as an Expressive Policy Class for Offline Reinforcement Learning", Wang et al 2022 {Twitter}
    submitted by /u/gwern [link] [comments]  ( 40 min )
    "Learning with Queried Hints" [on "Online Learning and Bandits with Queried Hints", Bhaskara et al 2022 {G}]
    submitted by /u/gwern [link] [comments]  ( 40 min )
  • Open

    Why can't my neural network recognize my own digits, but it has 97% accuracy on mnist test samples?
    I've been learning about neural networks for the past few days and I decided to write my own in Python. To keep it simple, I didn't write code for convolution layers and such, only fully connected layers with logistic activation function. I first trained it to do XOR with a 2 -> 2 -> 1 layers layout and it worked. Then I tried to train a 28*28 -> 100 -> 10 network on the MNIST digit dataset to recognize digits. When running it on the test samples, the accuracy was 97%, but when running on my own samples it barely ever manages to get it right. Does anyone have any idea on why this would happen? submitted by /u/FidgetSpinzz [link] [comments]  ( 42 min )
    Why add bias instead of subtracting bias?
    Pretty much the title, why do we add the bias instead of subtracting? Also when i watched 3blue1browns video about neural networks nad he said that you subtract the with the bias, but other sources tell me or explain that you simply add the bias in the dot product instead of subtracting. //Newbie submitted by /u/ArthurLCTTheCool [link] [comments]  ( 41 min )
    Machine Learning Framework with Neural Networks for Java
    Hey guys, a buddy of mine and me created a Framework for Machine Learning in Java. It provides the possibilities to train the neural Networks via backpropagation but we also implemented a Genetic Algorithm which can also be used for the training. The project is intended to be used by people who only learn Java in school and want to try out ML without the need of learning Python or complex Java Libraries. It's designed to be easy to use and to be played around with. Qualifications needed to use: basic Java understanding A brief understanding of what Machine Learning is Here is a step-by-step tutorial how to predict diabetes with this framework: https://easy-ml.gitbook.io/easy-ml-for-java/fundamentals/implement-your-first-ai (This example is using the genetic Algorithm, there is already one example in the source code published using the Backpropagation approach but the tutorial for it is gonna follow in the next few days) Please also look at the GitHub repository and leave some feedback about code and design. (Especially considering the ReadMe) https://github.com/tomLamprecht/Easy-ML-For-Java https://easy-ml.gitbook.io/easy-ml-for-java/ (Doc) Right now, I'm working on adding Convolutional Neural Networks as well. Feel free to also check open issues on our GitHub if you want to contribute! :) Thanks so much, and also especially for those who contributed to our project with pull requests. PS: we earn no cent with this project, and we just do it for the experience. So feedback is basically our payment :D (and ofc stars on GitHub hehe) submitted by /u/Lampard557 [link] [comments]  ( 42 min )
  • Open

    Best Egg achieved three times faster ML model training with Amazon SageMaker Automatic Model Tuning
    This post is co-authored by Tristan Miller from Best Egg. Best Egg is a leading financial confidence platform that provides lending products and resources focused on helping people feel more confident as they manage their everyday finances. Since March 2014, Best Egg has delivered $22 billion in consumer personal loans with strong credit performance, welcomed […]  ( 8 min )
  • Open

    What Are Large Language Models Used For?
    AI applications are summarizing articles, writing stories and engaging in long conversations — and large language models are doing the heavy lifting. A large language model, or LLM, is a deep learning algorithm that can recognize, summarize, translate, predict and generate text and other content based on knowledge gained from massive datasets. Large language models Read article >  ( 7 min )
    DLSS 3 Delivers Ultimate Boost in Latest Game Updates on GeForce NOW
    GeForce NOW RTX 4080 SuperPODs are rolling out now, bringing RTX 4080-class performance and features to Ultimate members — including support for NVIDIA Ada Lovelace GPU architecture technologies like NVIDIA DLSS 3.  This GFN Thursday brings updates to some of GeForce NOW’s hottest games that take advantage of these amazing technologies, all from the cloud. Read article >  ( 6 min )
  • Open

    Remove algorithmic filters from what you read
    I typically announce new blog posts from my most relevant twitter account: data science from @DataSciFact, algebra and miscellaneous math from @AlgebraFact, TeX and typography from @TeXtip, etc. If you’d like to be sure that you’re notified of each post, regardless of what algorithms Twitter applies to your feed, you can subscribe to this blog […] Remove algorithmic filters from what you read first appeared on John D. Cook.  ( 5 min )
  • Open

    Robustness through Data Augmentation Loss Consistency. (arXiv:2110.11205v3 [cs.LG] UPDATED)
    While deep learning through empirical risk minimization (ERM) has succeeded at achieving human-level performance at a variety of complex tasks, ERM is not robust to distribution shifts or adversarial attacks. Synthetic data augmentation followed by empirical risk minimization (DA-ERM) is a simple and widely used solution to improve robustness in ERM. In addition, consistency regularization can be applied to further improve the robustness of the model by forcing the representation of the original sample and the augmented one to be similar. However, existing consistency regularization methods are not applicable to covariant data augmentation, where the label in the augmented sample is dependent on the augmentation function. For example, dialog state covaries with named entity when we augment data with a new named entity. In this paper, we propose data augmented loss invariant regularization (DAIR), a simple form of consistency regularization that is applied directly at the loss level rather than intermediate features, making it widely applicable to both invariant and covariant data augmentation regardless of network architecture, problem setup, and task. We apply DAIR to real-world learning problems involving covariant data augmentation: robust neural task-oriented dialog state tracking and robust visual question answering. We also apply DAIR to tasks involving invariant data augmentation: robust regression, robust classification against adversarial attacks, and robust ImageNet classification under distribution shift. Our experiments show that DAIR consistently outperforms ERM and DA-ERM with little marginal computational cost and sets new state-of-the-art results in several benchmarks involving covariant data augmentation. Our code of all experiments is available at: https://github.com/optimization-for-data-driven-science/DAIR.git  ( 3 min )
    Neuronal architecture extracts statistical temporal patterns. (arXiv:2301.10203v1 [q-bio.NC])
    Neuronal systems need to process temporal signals. We here show how higher-order temporal (co-)fluctuations can be employed to represent and process information. Concretely, we demonstrate that a simple biologically inspired feedforward neuronal model is able to extract information from up to the third order cumulant to perform time series classification. This model relies on a weighted linear summation of synaptic inputs followed by a nonlinear gain function. Training both - the synaptic weights and the nonlinear gain function - exposes how the non-linearity allows for the transfer of higher order correlations to the mean, which in turn enables the synergistic use of information encoded in multiple cumulants to maximize the classification accuracy. The approach is demonstrated both on a synthetic and on real world datasets of multivariate time series. Moreover, we show that the biologically inspired architecture makes better use of the number of trainable parameters as compared to a classical machine-learning scheme. Our findings emphasize the benefit of biological neuronal architectures, paired with dedicated learning algorithms, for the processing of information embedded in higher-order statistical cumulants of temporal (co-)fluctuations.  ( 2 min )
    Gradient-adjusted Incremental Target Propagation Provides Effective Credit Assignment in Deep Neural Networks. (arXiv:2102.11598v3 [cs.LG] UPDATED)
    Many of the recent advances in the field of artificial intelligence have been fueled by the highly successful backpropagation of error (BP) algorithm, which efficiently solves the credit assignment problem in artificial neural networks. However, it is unlikely that BP is implemented in its usual form within biological neural networks, because of its reliance on non-local information in propagating error gradients. Since biological neural networks are capable of highly efficient learning and responses from BP trained models can be related to neural responses, it seems reasonable that a biologically viable approximation of BP underlies synaptic plasticity in the brain. Gradient-adjusted incremental target propagation (GAIT-prop or GP for short) has recently been derived directly from BP and has been shown to successfully train networks in a more biologically plausible manner. However, so far, GP has only been shown to work on relatively low-dimensional problems, such as handwritten-digit recognition. This work addresses some of the scaling issues in GP and shows it to perform effective multi-layer credit assignment in deeper networks and on the much more challenging ImageNet dataset.  ( 2 min )
    Denoising Diffusion Probabilistic Models for Generation of Realistic Fully-Annotated Microscopy Image Data Sets. (arXiv:2301.10227v1 [eess.IV])
    Denoising diffusion probabilistic models have shown great potential in generating realistic image data. We show how those models can be used to generate realistic microscopy image data in 2D and 3D based on simulated sketches of cellular structures. Multiple data sets are used as an inspiration to simulate sketches of different cellular structures, allowing to generate fully-annotated image data sets without requiring human interactions. Those data sets are used to train segmentation approaches and demonstrate that annotation-free segmentation of cellular structures in fluorescence microscopy image data can be achieved, thereby leaping towards the ultimate goal of eliminating the necessity of human annotation efforts.  ( 2 min )
    CovidRhythm: A Deep Learning Model for Passive Prediction of Covid-19 using Biobehavioral Rhythms Derived from Wearable Physiological Data. (arXiv:2301.10168v1 [eess.SP])
    To investigate whether a deep learning model can detect Covid-19 from disruptions in the human body's physiological (heart rate) and rest-activity rhythms (rhythmic dysregulation) caused by the SARS-CoV-2 virus. We propose CovidRhythm, a novel Gated Recurrent Unit (GRU) Network with Multi-Head Self-Attention (MHSA) that combines sensor and rhythmic features extracted from heart rate and activity (steps) data gathered passively using consumer-grade smart wearable to predict Covid-19. A total of 39 features were extracted (standard deviation, mean, min/max/avg length of sedentary and active bouts) from wearable sensor data. Biobehavioral rhythms were modeled using nine parameters (mesor, amplitude, acrophase, and intra-daily variability). These features were then input to CovidRhythm for predicting Covid-19 in the incubation phase (one day before biological symptoms manifest). A combination of sensor and biobehavioral rhythm features achieved the highest AUC-ROC of 0.79 [Sensitivity = 0.69, Specificity=0.89, F$_{0.1}$ = 0.76], outperforming prior approaches in discriminating Covid-positive patients from healthy controls using 24 hours of historical wearable physiological. Rhythmic features were the most predictive of Covid-19 infection when utilized either alone or in conjunction with sensor features. Sensor features predicted healthy subjects best. Circadian rest-activity rhythms that combine 24h activity and sleep information were the most disrupted. CovidRhythm demonstrates that biobehavioral rhythms derived from consumer-grade wearable data can facilitate timely Covid-19 detection. To the best of our knowledge, our work is the first to detect Covid-19 using deep learning and biobehavioral rhythms features derived from consumer-grade wearable data.  ( 2 min )
    Mesostructures: Beyond Spectrogram Loss in Differentiable Time-Frequency Analysis. (arXiv:2301.10183v1 [cs.SD])
    Computer musicians refer to mesostructures as the intermediate levels of articulation between the microstructure of waveshapes and the macrostructure of musical forms. Examples of mesostructures include melody, arpeggios, syncopation, polyphonic grouping, and textural contrast. Despite their central role in musical expression, they have received limited attention in deep learning. Currently, autoencoders and neural audio synthesizers are only trained and evaluated at the scale of microstructure: i.e., local amplitude variations up to 100 milliseconds or so. In this paper, we formulate and address the problem of mesostructural audio modeling via a composition of a differentiable arpeggiator and time-frequency scattering. We empirically demonstrate that time--frequency scattering serves as a differentiable model of similarity between synthesis parameters that govern mesostructure. By exposing the sensitivity of short-time spectral distances to time alignment, we motivate the need for a time-invariant and multiscale differentiable time--frequency model of similarity at the level of both local spectra and spectrotemporal modulations.  ( 2 min )
    On the Tradeoff between Energy, Precision, and Accuracy in Federated Quantized Neural Networks. (arXiv:2111.07911v3 [cs.LG] UPDATED)
    Deploying federated learning (FL) over wireless networks with resource-constrained devices requires balancing between accuracy, energy efficiency, and precision. Prior art on FL often requires devices to train deep neural networks (DNNs) using a 32-bit precision level for data representation to improve accuracy. However, such algorithms are impractical for resource-constrained devices since DNNs could require execution of millions of operations. Thus, training DNNs with a high precision level incurs a high energy cost for FL. In this paper, a quantized FL framework, that represents data with a finite level of precision in both local training and uplink transmission, is proposed. Here, the finite level of precision is captured through the use of quantized neural networks (QNNs) that quantize weights and activations in fixed-precision format. In the considered FL model, each device trains its QNN and transmits a quantized training result to the base station. Energy models for the local training and the transmission with the quantization are rigorously derived. An energy minimization problem is formulated with respect to the level of precision while ensuring convergence. To solve the problem, we first analytically derive the FL convergence rate and use a line search method. Simulation results show that our FL framework can reduce energy consumption by up to 53% compared to a standard FL model. The results also shed light on the tradeoff between precision, energy, and accuracy in FL over wireless networks.  ( 2 min )
    Neyman-Pearson Multi-class Classification via Cost-sensitive Learning. (arXiv:2111.04597v2 [stat.ML] UPDATED)
    Most existing classification methods aim to minimize the overall misclassification error rate. However, in applications, different types of errors can have different consequences. Two popular paradigms have been developed to account for this asymmetry issue: the Neyman-Pearson (NP) paradigm and the cost-sensitive (CS) paradigm. Compared to the CS paradigm, the NP paradigm does not require a specification of costs. Most previous works on the NP paradigm focused on the binary case. In this work, we study the multi-class NP problem by connecting it to the CS problem and propose two algorithms. We extend the NP oracle inequalities and consistency from the binary case to the multi-class case, showing that our two algorithms enjoy these properties under certain conditions. The simulation and real data studies demonstrate the effectiveness of our algorithms. To our knowledge, this is the first work to solve the multi-class NP problem via cost-sensitive learning techniques with theoretical guarantees. The proposed algorithms are implemented in the R package npcs on CRAN.  ( 2 min )
    Read the Signs: Towards Invariance to Gradient Descent's Hyperparameter Initialization. (arXiv:2301.10133v1 [cs.LG])
    We propose ActiveLR, an optimization meta algorithm that localizes the learning rate, $\alpha$, and adapts them at each epoch according to whether the gradient at each epoch changes sign or not. This sign-conscious algorithm is aware of whether from the previous step to the current one the update of each parameter has been too large or too small and adjusts the $\alpha$ accordingly. We implement the Active version (ours) of widely used and recently published gradient descent optimizers, namely SGD with momentum, AdamW, RAdam, and AdaBelief. Our experiments on ImageNet, CIFAR-10, WikiText-103, WikiText-2, and PASCAL VOC using different model architectures, such as ResNet and Transformers, show an increase in generalizability and training set fit, and decrease in training time for the Active variants of the tested optimizers. The results also show robustness of the Active variant of these optimizers to different values of the initial learning rate. Furthermore, the detrimental effects of using large mini-batch sizes are mitigated. ActiveLR, thus, alleviates the need for hyper-parameter search for two of the most commonly tuned hyper-parameters that require heavy time and computational costs to pick. We encourage AI researchers and practitioners to use the Active variant of their optimizer of choice for faster training, better generalizability, and reducing carbon footprint of training deep neural networks.  ( 2 min )
    Proportional Fairness in Federated Learning. (arXiv:2202.01666v3 [cs.LG] UPDATED)
    With the increasingly broad deployment of federated learning (FL) systems in the real world, it is critical but challenging to ensure fairness in FL, i.e. reasonably satisfactory performances for each of the numerous diverse clients. In this work, we introduce and study a new fairness notion in FL, called proportional fairness (PF), which is based on the relative change of each client's performance. From its connection with the bargaining games, we propose PropFair, a novel and easy-to-implement algorithm for finding proportionally fair solutions in FL and study its convergence properties. Through extensive experiments on vision and language datasets, we demonstrate that PropFair can approximately find PF solutions, and it achieves a good balance between the average performances of all clients and of the worst 10% clients.  ( 2 min )
    Analysis of Arrhythmia Classification on ECG Dataset. (arXiv:2301.10174v1 [cs.LG])
    The heart is one of the most vital organs in the human body. It supplies blood and nutrients in other parts of the body. Therefore, maintaining a healthy heart is essential. As a heart disorder, arrhythmia is a condition in which the heart's pumping mechanism becomes aberrant. The Electrocardiogram is used to analyze the arrhythmia problem from the ECG signals because of its fewer difficulties and cheapness. The heart peaks shown in the ECG graph are used to detect heart diseases, and the R peak is used to analyze arrhythmia disease. Arrhythmia is grouped into two groups - Tachycardia and Bradycardia for detection. In this paper, we discussed many different techniques such as Deep CNNs, LSTM, SVM, NN classifier, Wavelet, TQWT, etc., that have been used for detecting arrhythmia using various datasets throughout the previous decade. This work shows the analysis of some arrhythmia classification on the ECG dataset. Here, Data preprocessing, feature extraction, classification processes were applied on most research work and achieved better performance for classifying ECG signals to detect arrhythmia. Automatic arrhythmia detection can help cardiologists make the right decisions immediately to save human life. In addition, this research presents various previous research limitations with some challenges in detecting arrhythmia that will help in future research.  ( 2 min )
    Sleep Activity Recognition and Characterization from Multi-Source Passively Sensed Data. (arXiv:2301.10156v1 [eess.SP])
    Sleep constitutes a key indicator of human health, performance, and quality of life. Sleep deprivation has long been related to the onset, development, and worsening of several mental and metabolic disorders, constituting an essential marker for preventing, evaluating, and treating different health conditions. Sleep Activity Recognition methods can provide indicators to assess, monitor, and characterize subjects' sleep-wake cycles and detect behavioral changes. In this work, we propose a general method that continuously operates on passively sensed data from smartphones to characterize sleep and identify significant sleep episodes. Thanks to their ubiquity, these devices constitute an excellent alternative data source to profile subjects' biorhythms in a continuous, objective, and non-invasive manner, in contrast to traditional sleep assessment methods that usually rely on intrusive and subjective procedures. A Heterogeneous Hidden Markov Model is used to model a discrete latent variable process associated with the Sleep Activity Recognition task in a self-supervised way. We validate our results against sleep metrics reported by tested wearables, proving the effectiveness of the proposed approach and advocating its use to assess sleep without more reliable sources.  ( 2 min )
    EEG Opto-processor: epileptic seizure detection using diffractive photonic computing units. (arXiv:2301.10167v1 [eess.SP])
    Electroencephalography (EEG) analysis extracts critical information from brain signals, which has provided fundamental support for various applications, including brain-disease diagnosis and brain-computer interface. However, the real-time processing of large-scale EEG signals at high energy efficiency has placed great challenges for electronic processors on edge computing devices. Here, we propose the EEG opto-processor based on diffractive photonic computing units (DPUs) to effectively process the extracranial and intracranial EEG signals and perform epileptic seizure detection. The signals of EEG channels within a second-time window are optically encoded as inputs to the constructed diffractive neural networks for classification, which monitors the brain state to determine whether it's the symptom of an epileptic seizure or not. We developed both the free-space and integrated DPUs as edge computing systems and demonstrated their applications for real-time epileptic seizure detection with the benchmark datasets, i.e., the CHB-MIT extracranial EEG dataset and Epilepsy-iEEG-Multicenter intracranial EEG dataset, at high computing performance. Along with the channel selection mechanism, both the numerical evaluations and experimental results validated the sufficient high classification accuracies of the proposed opto-processors for supervising the clinical diagnosis. Our work opens up a new research direction of utilizing photonic computing techniques for processing large-scale EEG signals in promoting its broader applications.  ( 2 min )
    Federated Learning Meets Multi-objective Optimization. (arXiv:2006.11489v2 [cs.LG] UPDATED)
    Federated learning has emerged as a promising, massively distributed way to train a joint deep model over large amounts of edge devices while keeping private user data strictly on device. In this work, motivated from ensuring fairness among users and robustness against malicious adversaries, we formulate federated learning as multi-objective optimization and propose a new algorithm FedMGDA+ that is guaranteed to converge to Pareto stationary solutions. FedMGDA+ is simple to implement, has fewer hyperparameters to tune, and refrains from sacrificing the performance of any participating user. We establish the convergence properties of FedMGDA+ and point out its connections to existing approaches. Extensive experiments on a variety of datasets confirm that FedMGDA+ compares favorably against state-of-the-art.  ( 2 min )
    VaiPhy: a Variational Inference Based Algorithm for Phylogeny. (arXiv:2203.01121v3 [q-bio.PE] UPDATED)
    Phylogenetics is a classical methodology in computational biology that today has become highly relevant for medical investigation of single-cell data, e.g., in the context of cancer development. The exponential size of the tree space is, unfortunately, a substantial obstacle for Bayesian phylogenetic inference using Markov chain Monte Carlo based methods since these rely on local operations. And although more recent variational inference (VI) based methods offer speed improvements, they rely on expensive auto-differentiation operations for learning the variational parameters. We propose VaiPhy, a remarkably fast VI based algorithm for approximate posterior inference in an augmented tree space. VaiPhy produces marginal log-likelihood estimates on par with the state-of-the-art methods on real data and is considerably faster since it does not require auto-differentiation. Instead, VaiPhy combines coordinate ascent update equations with two novel sampling schemes: (i) SLANTIS, a proposal distribution for tree topologies in the augmented tree space, and (ii) the JC sampler, to the best of our knowledge, the first-ever scheme for sampling branch lengths directly from the popular Jukes-Cantor model. We compare VaiPhy in terms of density estimation and runtime. Additionally, we evaluate the reproducibility of the baselines. We provide our code on GitHub: \url{https://github.com/Lagergren-Lab/VaiPhy}.  ( 2 min )
    Lowering Detection in Sport Climbing Based on Orientation of the Sensor Enhanced Quickdraw. (arXiv:2301.10164v1 [eess.SP])
    Tracking climbers' activity to improve services and make the best use of their infrastructure is a concern for climbing gyms. Each climbing session must be analyzed from beginning till lowering of the climber. Therefore, spotting the climbers descending is crucial since it indicates when the ascent has come to an end. This problem must be addressed while preserving privacy and convenience of the climbers and the costs of the gyms. To this aim, a hardware prototype is developed to collect data using accelerometer sensors attached to a piece of climbing equipment mounted on the wall, called quickdraw, that connects the climbing rope to the bolt anchors. The corresponding sensors are configured to be energy-efficient, hence become practical in terms of expenses and time consumption for replacement when using in large quantity in a climbing gym. This paper describes hardware specifications, studies data measured by the sensors in ultra-low power mode, detect sensors' orientation patterns during lowering different routes, and develop an supervised approach to identify lowering.  ( 2 min )
    Sequential Graph Attention Learning for Predicting Dynamic Stock Trends (Student Abstract). (arXiv:2301.10153v1 [q-fin.ST])
    The stock market is characterized by a complex relationship between companies and the market. This study combines a sequential graph structure with attention mechanisms to learn global and local information within temporal time. Specifically, our proposed "GAT-AGNN" module compares model performance across multiple industries as well as within single industries. The results show that the proposed framework outperforms the state-of-the-art methods in predicting stock trends across multiple industries on Taiwan Stock datasets.  ( 2 min )
    How Jellyfish Characterise Alternating Group Equivariant Neural Networks. (arXiv:2301.10152v1 [cs.LG])
    We provide a full characterisation of all of the possible alternating group ($A_n$) equivariant neural networks whose layers are some tensor power of $\mathbb{R}^{n}$. In particular, we find a basis of matrices for the learnable, linear, $A_n$-equivariant layer functions between such tensor power spaces in the standard basis of $\mathbb{R}^{n}$. We also describe how our approach generalises to the construction of neural networks that are equivariant to local symmetries.  ( 2 min )
    Computational Solar Energy -- Ensemble Learning Methods for Prediction of Solar Power Generation based on Meteorological Parameters in Eastern India. (arXiv:2301.10159v1 [cs.LG])
    The challenges in applications of solar energy lies in its intermittency and dependency on meteorological parameters such as; solar radiation, ambient temperature, rainfall, wind-speed etc., and many other physical parameters like dust accumulation etc. Hence, it is important to estimate the amount of solar photovoltaic (PV) power generation for a specific geographical location. Machine learning (ML) models have gained importance and are widely used for prediction of solar power plant performance. In this paper, the impact of weather parameters on solar PV power generation is estimated by several Ensemble ML (EML) models like Bagging, Boosting, Stacking, and Voting for the first time. The performance of chosen ML algorithms is validated by field dataset of a 10kWp solar PV power plant in Eastern India region. Furthermore, a complete test-bed framework has been designed for data mining as well as to select appropriate learning models. It also supports feature selection and reduction for dataset to reduce space and time complexity of the learning models. The results demonstrate greater prediction accuracy of around 96% for Stacking and Voting EML models. The proposed work is a generalized one and can be very useful for predicting the performance of large-scale solar PV power plants also.  ( 2 min )
    Inducing Point Allocation for Sparse Gaussian Processes in High-Throughput Bayesian Optimisation. (arXiv:2301.10123v1 [cs.LG])
    Sparse Gaussian Processes are a key component of high-throughput Bayesian Optimisation (BO) loops; however, we show that existing methods for allocating their inducing points severely hamper optimisation performance. By exploiting the quality-diversity decomposition of Determinantal Point Processes, we propose the first inducing point allocation strategy designed specifically for use in BO. Unlike existing methods which seek only to reduce global uncertainty in the objective function, our approach provides the local high-fidelity modelling of promising regions required for precise optimisation. More generally, we demonstrate that our proposed framework provides a flexible way to allocate modelling capacity in sparse models and so is suitable broad range of downstream sequential decision making tasks.  ( 2 min )
    Pex: Memory-efficient Microcontroller Deep Learning through Partial Execution. (arXiv:2211.17246v2 [cs.LG] UPDATED)
    Embedded and IoT devices, largely powered by microcontroller units (MCUs), could be made more intelligent by leveraging on-device deep learning. One of the main challenges of neural network inference on an MCU is the extremely limited amount of read-write on-chip memory (SRAM, < 512 kB). SRAM is consumed by the neural network layer (operator) input and output buffers, which, traditionally, must be in memory (materialised) for an operator to execute. We discuss a novel execution paradigm for microcontroller deep learning, which modifies the execution of neural networks to avoid materialising full buffers in memory, drastically reducing SRAM usage with no computation overhead. This is achieved by exploiting the properties of operators, which can consume/produce a fraction of their input/output at a time. We describe a partial execution compiler, Pex, which produces memory-efficient execution schedules automatically by identifying subgraphs of operators whose execution can be split along the feature ("channel") dimension. Memory usage is reduced further by targeting memory bottlenecks with structured pruning, leading to the co-design of the network architecture and its execution schedule. Our evaluation of image and audio classification models: (a) establishes state-of-the-art performance in low SRAM usage regimes for considered tasks with up to +2.9% accuracy increase; (b) finds that a 4x memory reduction is possible by applying partial execution alone, or up to 10.5x when using the compiler-pruning co-design, while maintaining the classification accuracy compared to prior work; (c) uses the recovered SRAM to process higher resolution inputs instead, increasing accuracy by up to +3.9% on Visual Wake Words.
    Multi-Agent Patrolling with Battery Constraints through Deep Reinforcement Learning. (arXiv:2212.08230v2 [cs.AI] UPDATED)
    Autonomous vehicles are suited for continuous area patrolling problems. However, finding an optimal patrolling strategy can be challenging for many reasons. Firstly, patrolling environments are often complex and can include unknown environmental factors. Secondly, autonomous vehicles can have failures or hardware constraints, such as limited battery life. Importantly, patrolling large areas often requires multiple agents that need to collectively coordinate their actions. In this work, we consider these limitations and propose an approach based on model-free, deep multi-agent reinforcement learning. In this approach, the agents are trained to automatically recharge themselves when required, to support continuous collective patrolling. A distributed homogeneous multi-agent architecture is proposed, where all patrolling agents execute identical policies locally based on their local observations and shared information. This architecture provides a fault-tolerant and robust patrolling system that can tolerate agent failures and allow supplementary agents to be added to replace failed agents or to increase the overall patrol performance. The solution is validated through simulation experiments from multiple perspectives, including the overall patrol performance, the efficiency of battery recharging strategies, and the overall fault tolerance and robustness.
    Dirac signal processing of higher-order topological signals. (arXiv:2301.10137v1 [eess.SP])
    We consider topological signals corresponding to variables supported on nodes, links and triangles of higher-order networks and simplicial complexes. So far such signals are typically processed independently of each other, and algorithms that can enforce a consistent processing of topological signals across different levels are largely lacking. Here we propose Dirac signal processing, an adaptive, unsupervised signal processing algorithm that learns to jointly filter topological signals supported on nodes, links and (filled) triangles of simplicial complexes in a consistent way. The proposed Dirac signal processing algorithm is rooted in algebraic topology and formulated in terms of the discrete Dirac operator which can be interpreted as ``square root" of a higher-order (Hodge) Laplacian matrix acting on nodes, links and triangles of simplicial complexes. We test our algorithms on noisy synthetic data and noisy data of drifters in the ocean and find that the algorithm can learn to efficiently reconstruct the true signals outperforming algorithms based exclusively on the Hodge Laplacian.
    Towards Asteroid Detection in Microlensing Surveys with Deep Learning. (arXiv:2211.02239v2 [astro-ph.EP] UPDATED)
    Asteroids are an indelible part of most astronomical surveys though only a few surveys are dedicated to their detection. Over the years, high cadence microlensing surveys have amassed several terabytes of data while scanning primarily the Galactic Bulge and Magellanic Clouds for microlensing events and thus provide a treasure trove of opportunities for scientific data mining. In particular, numerous asteroids have been observed by visual inspection of selected images. This paper presents novel deep learning-based solutions for the recovery and discovery of asteroids in the microlensing data gathered by the MOA project. Asteroid tracklets can be clearly seen by combining all the observations on a given night and these tracklets inform the structure of the dataset. Known asteroids were identified within these composite images and used for creating the labelled datasets required for supervised learning. Several custom CNN models were developed to identify images with asteroid tracklets. Model ensembling was then employed to reduce the variance in the predictions as well as to improve the generalisation error, achieving a recall of 97.67%. Furthermore, the YOLOv4 object detector was trained to localize asteroid tracklets, achieving a mean Average Precision (mAP) of 90.97%. These trained networks will be applied to 16 years of MOA archival data to find both known and unknown asteroids that have been observed by the survey over the years. The methodologies developed can be adapted for use by other surveys for asteroid recovery and discovery.
    Neural Implicit k-Space for Binning-free Non-Cartesian Cardiac MR Imaging. (arXiv:2212.08479v2 [eess.IV] UPDATED)
    In this work, we propose a novel image reconstruction framework that directly learns a neural implicit representation in k-space for ECG-triggered non-Cartesian Cardiac Magnetic Resonance Imaging (CMR). While existing methods bin acquired data from neighboring time points to reconstruct one phase of the cardiac motion, our framework allows for a continuous, binning-free, and subject-specific k-space representation.We assign a unique coordinate that consists of time, coil index, and frequency domain location to each sampled k-space point. We then learn the subject-specific mapping from these unique coordinates to k-space intensities using a multi-layer perceptron with frequency domain regularization. During inference, we obtain a complete k-space for Cartesian coordinates and an arbitrary temporal resolution. A simple inverse Fourier transform recovers the image, eliminating the need for density compensation and costly non-uniform Fourier transforms for non-Cartesian data. This novel imaging framework was tested on 42 radially sampled datasets from 6 subjects. The proposed method outperforms other techniques qualitatively and quantitatively using data from four and one heartbeat(s) and 30 cardiac phases. Our results for one heartbeat reconstruction of 50 cardiac phases show improved artifact removal and spatio-temporal resolution, leveraging the potential for real-time CMR.
    A Learning Based Hypothesis Test for Harmful Covariate Shift. (arXiv:2212.02742v3 [cs.LG] UPDATED)
    The ability to quickly and accurately identify covariate shift at test time is a critical and often overlooked component of safe machine learning systems deployed in high-risk domains. While methods exist for detecting when predictions should not be made on out-of-distribution test examples, identifying distributional level differences between training and test time can help determine when a model should be removed from the deployment setting and retrained. In this work, we define harmful covariate shift (HCS) as a change in distribution that may weaken the generalization of a predictive model. To detect HCS, we use the discordance between an ensemble of classifiers trained to agree on training data and disagree on test data. We derive a loss function for training this ensemble and show that the disagreement rate and entropy represent powerful discriminative statistics for HCS. Empirically, we demonstrate the ability of our method to detect harmful covariate shift with statistical certainty on a variety of high-dimensional datasets. Across numerous domains and modalities, we show state-of-the-art performance compared to existing methods, particularly when the number of observed test samples is small.
    Tempo: Accelerating Transformer-Based Model Training through Memory Footprint Reduction. (arXiv:2210.10246v2 [cs.LG] UPDATED)
    Training deep learning models can be computationally expensive. Prior works have shown that increasing the batch size can potentially lead to better overall throughput. However, the batch size is frequently limited by the accelerator memory capacity due to the activations/feature maps stored for the training backward pass, as larger batch sizes require larger feature maps to be stored. Transformer-based models, which have recently seen a surge in popularity due to their good performance and applicability to a variety of tasks, have a similar problem. To remedy this issue, we propose Tempo, a new approach to efficiently use accelerator (e.g., GPU) memory resources for training Transformer-based models. Our approach provides drop-in replacements for the GELU, LayerNorm, and Attention layers, reducing the memory usage and ultimately leading to more efficient training. We implement Tempo and evaluate the throughput, memory usage, and accuracy/loss on the BERT Large pre-training task. We demonstrate that Tempo enables up to 2x higher batch sizes and 16% higher training throughput over the state-of-the-art baseline. We also evaluate Tempo on GPT2 and RoBERTa models, showing 19% and 26% speedup over the baseline.
    Quadruple-star systems are not always nested triples: a machine learning approach to dynamical stability. (arXiv:2301.09930v1 [cs.LG])
    The dynamical stability of quadruple-star systems has traditionally been treated as a problem involving two `nested' triples which constitute a quadruple. In this novel study, we employed a machine learning algorithm, the multi-layer perceptron (MLP), to directly classify 2+2 and 3+1 quadruples based on their stability (or long-term boundedness). The training data sets for the classification, comprised of $5\times10^5$ quadruples each, were integrated using the highly accurate direct $N$-body code MSTAR. We also carried out a limited parameter space study of zero-inclination systems to directly compare quadruples to triples. We found that both our quadruple MLP models perform better than a `nested' triple MLP approach, which is especially significant for 3+1 quadruples. The classification accuracies for the 2+2 MLP and 3+1 MLP models are 94% and 93% respectively, while the scores for the `nested' triple approach are 88% and 66% respectively. This is a crucial implication for quadruple population synthesis studies. Our MLP models, which are very simple and almost instantaneous to implement, are available on GitHub, along with Python3 scripts to access them.
    Exploring Effects of Computational Parameter Changes to Image Recognition Systems. (arXiv:2211.00471v3 [cs.LG] UPDATED)
    Image recognition tasks typically use deep learning and require enormous processing power, thus relying on hardware accelerators like GPUs and FPGAs for fast, timely processing. Failure in real-time image recognition tasks can occur due to incorrect mapping on hardware accelerators, which may lead to timing uncertainty and incorrect behavior. Owing to the increased use of image recognition tasks in safety-critical applications like autonomous driving and medical imaging, it is imperative to assess their robustness to changes in the computational environment as parameters like deep learning frameworks, compiler optimizations for code generation, and hardware devices are not regulated with varying impact on model performance and correctness. In this paper we conduct robustness analysis of four popular image recognition models (MobileNetV2, ResNet101V2, DenseNet121 and InceptionV3) with the ImageNet dataset, assessing the impact of the following parameters in the model's computational environment: (1) deep learning frameworks; (2) compiler optimizations; and (3) hardware devices. We report sensitivity of model performance in terms of output label and inference time for changes in each of these environment parameters. We find that output label predictions for all four models are sensitive to choice of deep learning framework (by up to 57%) and insensitive to other parameters. On the other hand, model inference time was affected by all environment parameters with changes in hardware device having the most effect. The extent of effect was not uniform across models.
    Autoencoded sparse Bayesian in-IRT factorization, calibration, and amortized inference for the Work Disability Functional Assessment Battery. (arXiv:2210.10952v2 [stat.ME] UPDATED)
    The Work Disability Functional Assessment Battery (WD-FAB) is a multidimensional item response theory (IRT) instrument designed for assessing work-related mental and physical function based on responses to an item bank. In prior iterations it was developed using traditional means -- linear factorization and null hypothesis statistical testing for item partitioning/selection, and finally, posthoc calibration of disjoint unidimensional IRT models. As a result, the WD-FAB, like many other IRT instruments, is a posthoc model. Its item partitioning, based on exploratory factor analysis, is blind to the final nonlinear IRT model and is not performed in a manner consistent with goodness of fit to the final model. In this manuscript, we develop a Bayesian hierarchical model for self-consistently performing the following simultaneous tasks: scale factorization, item selection, parameter identification, and response scoring. This method uses sparsity-based shrinkage to obviate the linear factorization and null hypothesis statistical tests that are usually required for developing multidimensional IRT models, so that item partitioning is consistent with the ultimate nonlinear factor model. We also analogize our multidimensional IRT model to probabilistic autoencoders, specifying an encoder function that amortizes the inference of ability parameters from item responses. The encoder function is equivalent to the "VBE" step in a stochastic variational Bayesian expectation maximization (VBEM) procedure that we use for approxiamte Bayesian inference on the entire model. We use the method on a sample of WD-FAB item responses and compare the resulting item discriminations to those obtained using the traditional posthoc method.
    Unsupervised Model Selection for Time-series Anomaly Detection. (arXiv:2210.01078v3 [cs.LG] UPDATED)
    Anomaly detection in time-series has a wide range of practical applications. While numerous anomaly detection methods have been proposed in the literature, a recent survey concluded that no single method is the most accurate across various datasets. To make matters worse, anomaly labels are scarce and rarely available in practice. The practical problem of selecting the most accurate model for a given dataset without labels has received little attention in the literature. This paper answers this question i.e. Given an unlabeled dataset and a set of candidate anomaly detectors, how can we select the most accurate model? To this end, we identify three classes of surrogate (unsupervised) metrics, namely, prediction error, model centrality, and performance on injected synthetic anomalies, and show that some metrics are highly correlated with standard supervised anomaly detection performance metrics such as the $F_1$ score, but to varying degrees. We formulate metric combination with multiple imperfect surrogate metrics as a robust rank aggregation problem. We then provide theoretical justification behind the proposed approach. Large-scale experiments on multiple real-world datasets demonstrate that our proposed unsupervised approach is as effective as selecting the most accurate model based on partially labeled data.
    Broken Neural Scaling Laws. (arXiv:2210.14891v5 [cs.LG] UPDATED)
    We present a smoothly broken power law functional form that accurately models and extrapolates the scaling behaviors of deep neural networks (i.e. how the evaluation metric of interest varies as the amount of compute used for training, number of model parameters, training dataset size, or upstream performance varies) for various architectures and for each of various tasks within a large and diverse set of upstream and downstream tasks, in zero-shot, prompted, and fine-tuned settings. This set includes large-scale vision, language, audio, video, diffusion generative modeling, multimodal learning, contrastive learning, AI alignment, robotics, arithmetic, unsupervised/self-supervised learning, and reinforcement learning (single agent and multi-agent). When compared to other functional forms for neural scaling behavior, this functional form yields extrapolations of scaling behavior that are considerably more accurate on this set. Moreover, this functional form accurately models and extrapolates scaling behavior that other functional forms are incapable of expressing such as the non-monotonic transitions present in the scaling behavior of phenomena such as double descent and the delayed, sharp inflection points present in the scaling behavior of tasks such as arithmetic. Lastly, we use this functional form to glean insights about the limit of the predictability of scaling behavior. Code is available at https://github.com/ethancaballero/broken_neural_scaling_laws
    Tracking the industrial growth of modern China with high-resolution panchromatic imagery: A sequential convolutional approach. (arXiv:2301.09620v1 [cs.CV] CROSS LISTED)
    Due to insufficient or difficult to obtain data on development in inaccessible regions, remote sensing data is an important tool for interested stakeholders to collect information on economic growth. To date, no studies have utilized deep learning to estimate industrial growth at the level of individual sites. In this study, we harness high-resolution panchromatic imagery to estimate development over time at 419 industrial sites in the People's Republic of China using a multi-tier computer vision framework. We present two methods for approximating development: (1) structural area coverage estimated through a Mask R-CNN segmentation algorithm, and (2) imputing development directly with visible & infrared radiance from the Visible Infrared Imaging Radiometer Suite (VIIRS). Labels generated from these methods are comparatively evaluated and tested. On a dataset of 2,078 50 cm resolution images spanning 19 years, the results indicate that two dimensions of industrial development can be estimated using high-resolution daytime imagery, including (a) the total square meters of industrial development (average error of 0.021 $\textrm{km}^2$), and (b) the radiance of lights (average error of 9.8 $\mathrm{\frac{nW}{cm^{2}sr}}$). Trend analysis of the techniques reveal estimates from a Mask R-CNN-labeled CNN-LSTM track ground truth measurements most closely. The Mask R-CNN estimates positive growth at every site from the oldest image to the most recent, with an average change of 4,084 $\textrm{m}^2$.
    Self-Supervised Learning Through Efference Copies. (arXiv:2210.09224v2 [cs.LG] UPDATED)
    Self-supervised learning (SSL) methods aim to exploit the abundance of unlabelled data for machine learning (ML), however the underlying principles are often method-specific. An SSL framework derived from biological first principles of embodied learning could unify the various SSL methods, help elucidate learning in the brain, and possibly improve ML. SSL commonly transforms each training datapoint into a pair of views, uses the knowledge of this pairing as a positive (i.e. non-contrastive) self-supervisory sign, and potentially opposes it to unrelated, (i.e. contrastive) negative examples. Here, we show that this type of self-supervision is an incomplete implementation of a concept from neuroscience, the Efference Copy (EC). Specifically, the brain also transforms the environment through efference, i.e. motor commands, however it sends to itself an EC of the full commands, i.e. more than a mere SSL sign. In addition, its action representations are likely egocentric. From such a principled foundation we formally recover and extend SSL methods such as SimCLR, BYOL, and ReLIC under a common theoretical framework, i.e. Self-supervision Through Efference Copies (S-TEC). Empirically, S-TEC restructures meaningfully the within- and between-class representations. This manifests as improvement in recent strong SSL baselines in image classification, segmentation, object detection, and in audio. These results hypothesize a testable positive influence from the brain's motor outputs onto its sensory representations.
    Foresight -- Generative Pretrained Transformer (GPT) for Modelling of Patient Timelines using EHRs. (arXiv:2212.08072v2 [cs.CL] UPDATED)
    Background: Electronic Health Records hold detailed longitudinal information about each patient's health status and general clinical history, a large portion of which is stored within the unstructured text. Existing approaches focus mostly on structured data and a subset of single-domain outcomes. We explore how temporal modelling of patients from free text and structured data, using deep generative transformers can be used to forecast a wide range of future disorders, substances, procedures or findings. Methods: We present Foresight, a novel transformer-based pipeline that uses named entity recognition and linking tools to convert document text into structured, coded concepts, followed by providing probabilistic forecasts for future medical events such as disorders, substances, procedures and findings. We processed the entire free-text portion from three different hospital datasets totalling 811336 patients covering both physical and mental health. Findings: On tests in two UK hospitals (King's College Hospital, South London and Maudsley) and the US MIMIC-III dataset precision@10 0.68, 0.76 and 0.88 was achieved for forecasting the next disorder in a patient timeline, while precision@10 of 0.80, 0.81 and 0.91 was achieved for forecasting the next biomedical concept. Foresight was also validated on 34 synthetic patient timelines by five clinicians and achieved relevancy of 97% for the top forecasted candidate disorder. As a generative model, it can forecast follow-on biomedical concepts for as many steps as required. Interpretation: Foresight is a general-purpose model for biomedical concept modelling that can be used for real-world risk forecasting, virtual trials and clinical research to study the progression of disorders, simulate interventions and counterfactuals, and educational purposes.
    ESTAS: Effective and Stable Trojan Attacks in Self-supervised Encoders with One Target Unlabelled Sample. (arXiv:2211.10908v2 [cs.CV] UPDATED)
    Emerging self-supervised learning (SSL) has become a popular image representation encoding method to obviate the reliance on labeled data and learn rich representations from large-scale, ubiquitous unlabelled data. Then one can train a downstream classifier on top of the pre-trained SSL image encoder with few or no labeled downstream data. Although extensive works show that SSL has achieved remarkable and competitive performance on different downstream tasks, its security concerns, e.g, Trojan attacks in SSL encoders, are still not well-studied. In this work, we present a novel Trojan Attack method, denoted by ESTAS, that can enable an effective and stable attack in SSL encoders with only one target unlabeled sample. In particular, we propose consistent trigger poisoning and cascade optimization in ESTAS to improve attack efficacy and model accuracy, and eliminate the expensive target-class data sample extraction from large-scale disordered unlabelled data. Our substantial experiments on multiple datasets show that ESTAS stably achieves > 99% attacks success rate (ASR) with one target-class sample. Compared to prior works, ESTAS attains > 30% ASR increase and > 8.3% accuracy improvement on average.
    Visual Simulation Software Demonstration for Quantum Multi-Drone Reinforcement Learning. (arXiv:2211.15375v2 [quant-ph] UPDATED)
    Quantum computing (QC) has received a lot of attention according to its light training parameter numbers and computational speeds by qubits. Moreover, various researchers have tried to enable quantum machine learning (QML) using QC, where there are also multifarious efforts to use QC to implement quantum multi-agent reinforcement learning (QMARL). Existing classical multi-agent reinforcement learning (MARL) using neural network features non-stationarity and uncertain properties due to its large number of parameters. Therefore, this paper presents a visual simulation software framework for a novel QMARL algorithm to control autonomous multi-drone systems to take advantage of QC. Our proposed QMARL framework accomplishes reasonable reward convergence and service quality performance with fewer trainable parameters than the classical MARL. Furthermore, QMARL shows more stable training results than existing MARL algorithms. Lastly, our proposed visual simulation software allows us to analyze the agents' training process and results.
    Green, Quantized Federated Learning over Wireless Networks: An Energy-Efficient Design. (arXiv:2207.09387v2 [cs.LG] UPDATED)
    In this paper, a green-quantized FL framework, which represents data with a finite precision level in both local training and uplink transmission, is proposed. Here, the finite precision level is captured through the use of quantized neural networks (QNNs) that quantize weights and activations in fixed-precision format. In the considered FL model, each device trains its QNN and transmits a quantized training result to the base station. Energy models for the local training and the transmission with quantization are rigorously derived. To minimize the energy consumption and the number of communication rounds simultaneously, a multi-objective optimization problem is formulated with respect to the number of local iterations, the number of selected devices, and the precision levels for both local training and transmission while ensuring convergence under a target accuracy constraint. To solve this problem, the convergence rate of the proposed FL system is analytically derived with respect to the system control variables. Then, the Pareto boundary of the problem is characterized to provide efficient solutions using the normal boundary inspection method. Design insights on balancing the tradeoff between the two objectives while achieving a target accuracy are drawn from using the Nash bargaining solution and analyzing the derived convergence rate. Simulation results show that the proposed FL framework can reduce energy consumption until convergence by up to 70\% compared to a baseline FL algorithm that represents data with full precision without damaging the convergence rate.
    SIAN: Style-Guided Instance-Adaptive Normalization for Multi-Organ Histopathology Image Synthesis. (arXiv:2209.02412v2 [eess.IV] UPDATED)
    Existing deep neural networks for histopathology image synthesis cannot generate image styles that align with different organs, and cannot produce accurate boundaries of clustered nuclei. To address these issues, we propose a style-guided instance-adaptive normalization (SIAN) approach to synthesize realistic color distributions and textures for histopathology images from different organs. SIAN contains four phases, semantization, stylization, instantiation, and modulation. The first two phases synthesize image semantics and styles by using semantic maps and learned image style vectors. The instantiation module integrates geometrical and topological information and generates accurate nuclei boundaries. We validate the proposed approach on a multiple-organ dataset, Extensive experimental results demonstrate that the proposed method generates more realistic histopathology images than four state-of-the-art approaches for five organs. By incorporating synthetic images from the proposed approach to model training, an instance segmentation network can achieve state-of-the-art performance.
    RAIN: RegulArization on Input and Network for Black-Box Domain Adaptation. (arXiv:2208.10531v2 [cs.CV] UPDATED)
    Source-Free domain adaptation transits the source-trained model towards target domain without exposing the source data, trying to dispel these concerns about data privacy and security. However, this paradigm is still at risk of data leakage due to adversarial attacks on the source model. Hence, the Black-Box setting only allows to use the outputs of source model, but still suffers from overfitting on the source domain more severely due to source model's unseen weights. In this paper, we propose a novel approach named RAIN (RegulArization on Input and Network) for Black-Box domain adaptation from both input-level and network-level regularization. For the input-level, we design a new data augmentation technique as Phase MixUp, which highlights task-relevant objects in the interpolations, thus enhancing input-level regularization and class consistency for target models. For network-level, we develop a Subnetwork Distillation mechanism to transfer knowledge from the target subnetwork to the full target network via knowledge distillation, which thus alleviates overfitting on the source domain by learning diverse target representations. Extensive experiments show that our method achieves state-of-the-art performance on several cross-domain benchmarks under both single- and multi-source black-box domain adaptation.
    A computational framework for physics-informed symbolic regression with straightforward integration of domain knowledge. (arXiv:2209.06257v3 [cs.LG] UPDATED)
    Discovering a meaningful symbolic expression that explains experimental data is a fundamental challenge in many scientific fields. We present a novel, open-source computational framework called Scientist-Machine Equation Detector (SciMED), which integrates scientific discipline wisdom in a scientist-in-the-loop approach, with state-of-the-art symbolic regression (SR) methods. SciMED combines a wrapper selection method, that is based on a genetic algorithm, with automatic machine learning and two levels of SR methods. We test SciMED on five configurations of a settling sphere, with and without aerodynamic non-linear drag force, and with excessive noise in the measurements. We show that SciMED is sufficiently robust to discover the correct physically meaningful symbolic expressions from the data, and demonstrate how the integration of domain knowledge enhances its performance. Our results indicate better performance on these tasks than the state-of-the-art SR software packages , even in cases where no knowledge is integrated. Moreover, we demonstrate how SciMED can alert the user about possible missing features, unlike the majority of current SR systems.
    Measuring Fairness Under Unawareness of Sensitive Attributes: A Quantification-Based Approach. (arXiv:2109.08549v4 [cs.CY] UPDATED)
    Algorithms and models are increasingly deployed to inform decisions about people, inevitably affecting their lives. As a consequence, those in charge of developing these models must carefully evaluate their impact on different groups of people and favour group fairness, that is, ensure that groups determined by sensitive demographic attributes, such as race or sex, are not treated unjustly. To achieve this goal, the availability (awareness) of these demographic attributes to those evaluating the impact of these models is fundamental. Unfortunately, collecting and storing these attributes is often in conflict with industry practices and legislation on data minimisation and privacy. For this reason, it can be hard to measure the group fairness of trained models, even from within the companies developing them. In this work, we tackle the problem of measuring group fairness under unawareness of sensitive attributes, by using techniques from quantification, a supervised learning task concerned with directly providing group-level prevalence estimates (rather than individual-level class labels). We show that quantification approaches are particularly suited to tackle the fairness-under-unawareness problem, as they are robust to inevitable distribution shifts while at the same time decoupling the (desirable) objective of measuring group fairness from the (undesirable) side effect of allowing the inference of sensitive attributes of individuals. More in detail, we show that fairness under unawareness can be cast as a quantification problem and solved with proven methods from the quantification literature. We show that these methods outperform previous approaches to measure demographic parity in five experimental protocols, corresponding to important challenges that complicate the estimation of classifier fairness under unawareness.
    Learning to Counter: Stochastic Feature-based Learning for Diverse Counterfactual Explanations. (arXiv:2209.13446v2 [cs.AI] UPDATED)
    Interpretable machine learning seeks to understand the reasoning process of complex black-box systems that are long notorious for lack of explainability. One flourishing approach is through counterfactual explanations, which provide suggestions on what a user can do to alter an outcome. Not only must a counterfactual example counter the original prediction from the black-box classifier but it should also satisfy various constraints for practical applications. Diversity is one of the critical constraints that however remains less discussed. While diverse counterfactuals are ideal, it is computationally challenging to simultaneously address some other constraints. Furthermore, there is a growing privacy concern over the released counterfactual data. To this end, we propose a feature-based learning framework that effectively handles the counterfactual constraints and contributes itself to the limited pool of private explanation models. We demonstrate the flexibility and effectiveness of our method in generating diverse counterfactuals of actionability and plausibility. Our counterfactual engine is more efficient than counterparts of the same capacity while yielding the lowest re-identification risks.
    Incorporating functional summary information in Bayesian neural networks using a Dirichlet process likelihood approach. (arXiv:2207.01234v2 [cs.LG] UPDATED)
    Bayesian neural networks (BNNs) can account for both aleatoric and epistemic uncertainty. However, in BNNs the priors are often specified over the weights which rarely reflects true prior knowledge in large and complex neural network architectures. We present a simple approach to incorporate prior knowledge in BNNs based on external summary information about the predicted classification probabilities for a given dataset. The available summary information is incorporated as augmented data and modeled with a Dirichlet process, and we derive the corresponding \emph{Summary Evidence Lower BOund}. The approach is founded on Bayesian principles, and all hyperparameters have a proper probabilistic interpretation. We show how the method can inform the model about task difficulty and class imbalance. Extensive experiments show that, with negligible computational overhead, our method parallels and in many cases outperforms popular alternatives in accuracy, uncertainty calibration, and robustness against corruptions with both balanced and imbalanced data.
    Can large language models reason about medical questions?. (arXiv:2207.08143v3 [cs.CL] UPDATED)
    Although large language models (LLMs) often produce impressive outputs, it remains unclear how they perform in real-world scenarios requiring strong reasoning skills and expert domain knowledge. We set out to investigate whether GPT-3.5 (Codex and InstructGPT) can be applied to answer and reason about difficult real-world-based questions. We utilize two multiple-choice medical exam questions (USMLE and MedMCQA) and a medical reading comprehension dataset (PubMedQA). We investigate multiple prompting scenarios: Chain-of-Thought (CoT, think step-by-step), zero- and few-shot (prepending the question with question-answer exemplars) and retrieval augmentation (injecting Wikipedia passages into the prompt). For a subset of the USMLE questions, a medical expert reviewed and annotated the model's CoT. We found that InstructGPT can often read, reason and recall expert knowledge. Failure are primarily due to lack of knowledge and reasoning errors and trivial guessing heuristics are observed, e.g.\ too often predicting labels A and D on USMLE. Sampling and combining many completions overcome some of these limitations. Using 100 samples, Codex 5-shot CoT not only gives close to well-calibrated predictive probability but also achieves human-level performances on the three datasets. USMLE: 60.2%, MedMCQA: 62.7% and PubMedQA: 78.2%.
    To Trust or Not To Trust Prediction Scores for Membership Inference Attacks. (arXiv:2111.09076v3 [cs.LG] UPDATED)
    Membership inference attacks (MIAs) aim to determine whether a specific sample was used to train a predictive model. Knowing this may indeed lead to a privacy breach. Most MIAs, however, make use of the model's prediction scores - the probability of each output given some input - following the intuition that the trained model tends to behave differently on its training data. We argue that this is a fallacy for many modern deep network architectures. Consequently, MIAs will miserably fail since overconfidence leads to high false-positive rates not only on known domains but also on out-of-distribution data and implicitly acts as a defense against MIAs. Specifically, using generative adversarial networks, we are able to produce a potentially infinite number of samples falsely classified as part of the training data. In other words, the threat of MIAs is overestimated, and less information is leaked than previously assumed. Moreover, there is actually a trade-off between the overconfidence of models and their susceptibility to MIAs: the more classifiers know when they do not know, making low confidence predictions, the more they reveal the training data.
    Topogivity: A Machine-Learned Chemical Rule for Discovering Topological Materials. (arXiv:2202.05255v3 [cond-mat.mtrl-sci] UPDATED)
    Topological materials present unconventional electronic properties that make them attractive for both basic science and next-generation technological applications. The majority of currently known topological materials have been discovered using methods that involve symmetry-based analysis of the quantum wavefunction. Here we use machine learning to develop a simple-to-use heuristic chemical rule that diagnoses with a high accuracy whether a material is topological using only its chemical formula. This heuristic rule is based on a notion that we term topogivity, a machine-learned numerical value for each element that loosely captures its tendency to form topological materials. We next implement a high-throughput procedure for discovering topological materials based on the heuristic topogivity-rule prediction followed by ab initio validation. This way, we discover new topological materials that are not diagnosable using symmetry indicators, including several that may be promising for experimental observation.
    Gaze-based Object Detection in the Wild. (arXiv:2203.15651v2 [cs.RO] UPDATED)
    In human-robot collaboration, one challenging task is to teach a robot new yet unknown objects enabling it to interact with them. Thereby, gaze can contain valuable information. We investigate if it is possible to detect objects (object or no object) merely from gaze data and determine their bounding box parameters. For this purpose, we explore different sizes of temporal windows, which serve as a basis for the computation of heatmaps, i.e., the spatial distribution of the gaze data. Additionally, we analyze different grid sizes of these heatmaps, and demonstrate the functionality in a proof of concept using different machine learning techniques. Our method is characterized by its speed and resource efficiency compared to conventional object detectors. In order to generate the required data, we conducted a study with five subjects who could move freely and thus, turn towards arbitrary objects. This way, we chose a scenario for our data collection that is as realistic as possible. Since the subjects move while facing objects, the heatmaps also contain gaze data trajectories, complicating the detection and parameter regression. We make our data set publicly available to the research community for download.
    Mixed Effects Random Forests for Personalised Predictions of Clinical Depression Severity. (arXiv:2301.09815v1 [cs.LG])
    This work demonstrates how mixed effects random forests enable accurate predictions of depression severity using multimodal physiological and digital activity data collected from an 8-week study involving 31 patients with major depressive disorder. We show that mixed effects random forests outperform standard random forests and personal average baselines when predicting clinical Hamilton Depression Rating Scale scores (HDRS_17). Compared to the latter baseline, accuracy is significantly improved for each patient by an average of 0.199-0.276 in terms of mean absolute error (p<0.05). This is noteworthy as these simple baselines frequently outperform machine learning methods in mental health prediction tasks. We suggest that this improved performance results from the ability of the mixed effects random forest to personalise model parameters to individuals in the dataset. However, we find that these improvements pertain exclusively to scenarios where labelled patient data are available to the model at training time. Investigating methods that improve accuracy when generalising to new patients is left as important future work.
    Integrating Reward Maximization and Population Estimation: Sequential Decision-Making for Internal Revenue Service Audit Selection. (arXiv:2204.11910v3 [cs.LG] UPDATED)
    We introduce a new setting, optimize-and-estimate structured bandits. Here, a policy must select a batch of arms, each characterized by its own context, that would allow it to both maximize reward and maintain an accurate (ideally unbiased) population estimate of the reward. This setting is inherent to many public and private sector applications and often requires handling delayed feedback, small data, and distribution shifts. We demonstrate its importance on real data from the United States Internal Revenue Service (IRS). The IRS performs yearly audits of the tax base. Two of its most important objectives are to identify suspected misreporting and to estimate the "tax gap" -- the global difference between the amount paid and true amount owed. Based on a unique collaboration with the IRS, we cast these two processes as a unified optimize-and-estimate structured bandit. We analyze optimize-and-estimate approaches to the IRS problem and propose a novel mechanism for unbiased population estimation that achieves rewards comparable to baseline approaches. This approach has the potential to improve audit efficacy, while maintaining policy-relevant estimates of the tax gap. This has important social consequences given that the current tax gap is estimated at nearly half a trillion dollars. We suggest that this problem setting is fertile ground for further research and we highlight its interesting challenges. The results of this and related research are currently being incorporated into the continual improvement of the IRS audit selection methods.
    Efficient Planning in a Compact Latent Action Space. (arXiv:2208.10291v3 [cs.LG] UPDATED)
    Planning-based reinforcement learning has shown strong performance in tasks in discrete and low-dimensional continuous action spaces. However, planning usually brings significant computational overhead for decision-making, and scaling such methods to high-dimensional action spaces remains challenging. To advance efficient planning for high-dimensional continuous control, we propose Trajectory Autoencoding Planner (TAP), which learns low-dimensional latent action codes with a state-conditional VQ-VAE. The decoder of the VQ-VAE thus serves as a novel dynamics model that takes latent actions and current state as input and reconstructs long-horizon trajectories. During inference time, given a starting state, TAP searches over discrete latent actions to find trajectories that have both high probability under the training distribution and high predicted cumulative reward. Empirical evaluation in the offline RL setting demonstrates low decision latency which is indifferent to the growing raw action dimensionality. For Adroit robotic hand manipulation tasks with high-dimensional continuous action space, TAP surpasses existing model-based methods by a large margin and also beats strong model-free actor-critic baselines.
    Context-specific kernel-based hidden Markov model for time series analysis. (arXiv:2301.09870v1 [stat.ML])
    Traditional hidden Markov models have been a useful tool to understand and model stochastic dynamic linear data; in the case of non-Gaussian data or not linear in mean data, models such as mixture of Gaussian hidden Markov models suffer from the computation of precision matrices and have a lot of unnecessary parameters. As a consequence, such models often perform better when it is assumed that all variables are independent, a hypothesis that may be unrealistic. Hidden Markov models based on kernel density estimation is also capable of modeling non Gaussian data, but they assume independence between variables. In this article, we introduce a new hidden Markov model based on kernel density estimation, which is capable of introducing kernel dependencies using context-specific Bayesian networks. The proposed model is described, together with a learning algorithm based on the expectation-maximization algorithm. Additionally, the model is compared with related HMMs using synthetic and real data. From the results, the benefits in likelihood and classification accuracy from the proposed model are quantified and analyzed.
    A DNN Optimizer that Improves over AdaBelief by Suppression of the Adaptive Stepsize Range. (arXiv:2203.13273v5 [cs.LG] UPDATED)
    We make contributions towards improving adaptive-optimizer performance. Our improvements are based on suppression of the range of adaptive stepsizes in the AdaBelief optimizer. Firstly, we show that the particular placement of the parameter epsilon within the update expressions of AdaBelief reduces the range of the adaptive stepsizes, making AdaBelief closer to SGD with momentum. Secondly, we extend AdaBelief by further suppressing the range of the adaptive stepsizes. To achieve the above goal, we perform mutual layerwise vector projections between the gradient g_t and its first momentum m_t before using them to estimate the second momentum. The new optimization method is referred to as Aida. Thirdly, extensive experimental results show that Aida outperforms nine optimizers when training transformers and LSTMs for NLP, and VGG and ResNet for image classification over CIAF10 and CIFAR100 while matching the best performance of the nine methods when training WGAN-GP models for image generation tasks. Furthermore, Aida produces higher validation accuracies than AdaBelief for training ResNet18 over ImageNet. Code is available at this URL
    Planckian Jitter: countering the color-crippling effects of color jitter on self-supervised training. (arXiv:2202.07993v2 [cs.CV] UPDATED)
    Several recent works on self-supervised learning are trained by mapping different augmentations of the same image to the same feature representation. The data augmentations used are of crucial importance to the quality of learned feature representations. In this paper, we analyze how the color jitter traditionally used in data augmentation negatively impacts the quality of the color features in learned feature representations. To address this problem, we propose a more realistic, physics-based color data augmentation - which we call Planckian Jitter - that creates realistic variations in chromaticity and produces a model robust to illumination changes that can be commonly observed in real life, while maintaining the ability to discriminate image content based on color information. Experiments confirm that such a representation is complementary to the representations learned with the currently-used color jitter augmentation and that a simple concatenation leads to significant performance gains on a wide range of downstream datasets. In addition, we present a color sensitivity analysis that documents the impact of different training methods on model neurons and shows that the performance of the learned features is robust with respect to illuminant variations.
    Model-Agnostic Confidence Intervals for Feature Importance: A Fast and Powerful Approach Using Minipatch Ensembles. (arXiv:2206.02088v2 [stat.ML] UPDATED)
    To promote new scientific discoveries from complex data sets, feature importance inference has been a long-standing statistical problem. Instead of testing for parameters that are only interpretable for specific models, there has been increasing interest in model-agnostic methods, often in the form of feature occlusion or leave-one-covariate-out (LOCO) inference. Existing approaches often make distributional assumptions, which can be difficult to verify in practice, or require model refitting and data splitting, which are computationally intensive and lead to losses in power. In this work, we develop a novel, mostly model-agnostic and distribution-free inference framework for feature importance that is computationally efficient and statistically powerful. Our approach is fast as we avoid model refitting by leveraging a form of random observation and feature subsampling called minipatch ensembles; this approach also improves statistical power by avoiding data splitting. Our framework can be applied on tabular data and with any machine learning algorithm, together with minipatch ensembles, for regression and classification tasks. Despite the dependencies induced by using minipatch ensembles, we show that our approach provides asymptotic coverage for the feature importance score of any model under mild assumptions. Finally, our same procedure can also be leveraged to provide valid confidence intervals for predictions, hence providing fast, simultaneous quantification of the uncertainty of both predictions and feature importance. We validate our intervals on a series of synthetic and real data examples, including non-linear settings, showing that our approach detects the correct important features and exhibits many computational and statistical advantages over existing methods.
    MTTN: Multi-Pair Text to Text Narratives for Prompt Generation. (arXiv:2301.10172v1 [cs.CL])
    The explosive popularity of diffusion models[ 1][ 2][ 3 ] has provided a huge stage for further development in generative-text modelling. As prompt based models are very nuanced, such that a carefully generated prompt can produce truely breath taking images, on the contrary producing powerful or even meaningful prompt is a hit or a miss. To lavish on this we have introduced a large scale derived and synthesized dataset built with on real prompts and indexed with popular image-text datasets like MS-COCO[4 ], Flickr[ 5], etc. We have also introduced staging for these sentences that sequentially reduce the context and increase the complexity, that will further strengthen the output because of the complex annotations that are being created. MTTN consists of over 2.4M sentences that are divided over 5 stages creating a combination amounting to over 12M pairs, along with a vocab size of consisting more than 300 thousands unique words that creates an abundance of variations. The original 2.4M million pairs are broken down in such a manner that it produces a true scenario of internet lingo that is used globally thereby heightening the robustness of the dataset, and any model trained on it.
    RangeViT: Towards Vision Transformers for 3D Semantic Segmentation in Autonomous Driving. (arXiv:2301.10222v1 [cs.CV])
    Casting semantic segmentation of outdoor LiDAR point clouds as a 2D problem, e.g., via range projection, is an effective and popular approach. These projection-based methods usually benefit from fast computations and, when combined with techniques which use other point cloud representations, achieve state-of-the-art results. Today, projection-based methods leverage 2D CNNs but recent advances in computer vision show that vision transformers (ViTs) have achieved state-of-the-art results in many image-based benchmarks. In this work, we question if projection-based methods for 3D semantic segmentation can benefit from these latest improvements on ViTs. We answer positively but only after combining them with three key ingredients: (a) ViTs are notoriously hard to train and require a lot of training data to learn powerful representations. By preserving the same backbone architecture as for RGB images, we can exploit the knowledge from long training on large image collections that are much cheaper to acquire and annotate than point clouds. We reach our best results with pre-trained ViTs on large image datasets. (b) We compensate ViTs' lack of inductive bias by substituting a tailored convolutional stem for the classical linear embedding layer. (c) We refine pixel-wise predictions with a convolutional decoder and a skip connection from the convolutional stem to combine low-level but fine-grained features of the the convolutional stem with the high-level but coarse predictions of the ViT encoder. With these ingredients, we show that our method, called RangeViT, outperforms existing projection-based methods on nuScenes and SemanticKITTI. We provide the implementation code at https://github.com/valeoai/rangevit.
    A Watermark for Large Language Models. (arXiv:2301.10226v1 [cs.LG])
    Potential harms of large language models can be mitigated by watermarking model output, i.e., embedding signals into generated text that are invisible to humans but algorithmically detectable from a short span of tokens. We propose a watermarking framework for proprietary language models. The watermark can be embedded with negligible impact on text quality, and can be detected using an efficient open-source algorithm without access to the language model API or parameters. The watermark works by selecting a randomized set of whitelist tokens before a word is generated, and then softly promoting use of whitelist tokens during sampling. We propose a statistical test for detecting the watermark with interpretable p-values, and derive an information-theoretic framework for analyzing the sensitivity of the watermark. We test the watermark using a multi-billion parameter model from the Open Pretrained Transformer (OPT) family, and discuss robustness and security.
    HTMOT : Hierarchical Topic Modelling Over Time. (arXiv:2112.03104v2 [cs.IR] UPDATED)
    Over the years, topic models have provided an efficient way of extracting insights from text. However, while many models have been proposed, none are able to model topic temporality and hierarchy jointly. Modelling time provide more precise topics by separating lexically close but temporally distinct topics while modelling hierarchy provides a more detailed view of the content of a document corpus. In this study, we therefore propose a novel method, HTMOT, to perform Hierarchical Topic Modelling Over Time. We train HTMOT using a new implementation of Gibbs sampling, which is more efficient. Specifically, we show that only applying time modelling to deep sub-topics provides a way to extract specific stories or events while high level topics extract larger themes in the corpus. Our results show that our training procedure is fast and can extract accurate high-level topics and temporally precise sub-topics. We measured our model's performance using the Word Intrusion task and outlined some limitations of this evaluation method, especially for hierarchical models. As a case study, we focused on the various developments in the space industry in 2020.
    Leveraging Vision-Language Models for Granular Market Change Prediction. (arXiv:2301.10166v1 [q-fin.ST])
    Predicting future direction of stock markets using the historical data has been a fundamental component in financial forecasting. This historical data contains the information of a stock in each specific time span, such as the opening, closing, lowest, and highest price. Leveraging this data, the future direction of the market is commonly predicted using various time-series models such as Long-Short Term Memory networks. This work proposes modeling and predicting market movements with a fundamentally new approach, namely by utilizing image and byte-based number representation of the stock data processed with the recently introduced Vision-Language models. We conduct a large set of experiments on the hourly stock data of the German share index and evaluate various architectures on stock price prediction using historical stock data. We conduct a comprehensive evaluation of the results with various metrics to accurately depict the actual performance of various approaches. Our evaluation results show that our novel approach based on representation of stock data as text (bytes) and image significantly outperforms strong deep learning-based baselines.
    WEASEL 2.0 -- A Random Dilated Dictionary Transform for Fast, Accurate and Memory Constrained Time Series Classification. (arXiv:2301.10194v1 [cs.LG])
    A time series is a sequence of sequentially ordered real values in time. Time series classification (TSC) is the task of assigning a time series to one of a set of predefined classes, usually based on a model learned from examples. Dictionary-based methods for TSC rely on counting the frequency of certain patterns in time series and are important components of the currently most accurate TSC ensembles. One of the early dictionary-based methods was WEASEL, which at its time achieved SotA results while also being very fast. However, it is outperformed both in terms of speed and accuracy by other methods. Furthermore, its design leads to an unpredictably large memory footprint, making it inapplicable for many applications. In this paper, we present WEASEL 2.0, a complete overhaul of WEASEL based on two recent advancements in TSC: Dilation and ensembling of randomized hyper-parameter settings. These two techniques allow WEASEL 2.0 to work with a fixed-size memory footprint while at the same time improving accuracy. Compared to 15 other SotA methods on the UCR benchmark set, WEASEL 2.0 is significantly more accurate than other dictionary methods and not significantly worse than the currently best methods. Actually, it achieves the highest median accuracy over all data sets, and it performs best in 5 out of 12 problem classes. We thus believe that WEASEL 2.0 is a viable alternative for current TSC and also a potentially interesting input for future ensembles.
    A Wholistic View of Continual Learning with Deep Neural Networks: Forgotten Lessons and the Bridge to Active and Open World Learning. (arXiv:2009.01797v3 [cs.LG] UPDATED)
    Current deep learning methods are regarded as favorable if they empirically perform well on dedicated test sets. This mentality is seamlessly reflected in the resurfacing area of continual learning, where consecutively arriving data is investigated. The core challenge is framed as protecting previously acquired representations from being catastrophically forgotten. However, comparison of individual methods is nevertheless performed in isolation from the real world by monitoring accumulated benchmark test set performance. The closed world assumption remains predominant, i.e. models are evaluated on data that is guaranteed to originate from the same distribution as used for training. This poses a massive challenge as neural networks are well known to provide overconfident false predictions on unknown and corrupted instances. In this work we critically survey the literature and argue that notable lessons from open set recognition, identifying unknown examples outside of the observed set, and the adjacent field of active learning, querying data to maximize the expected performance gain, are frequently overlooked in the deep learning era. Hence, we propose a consolidated view to bridge continual learning, active learning and open set recognition in deep neural networks. Finally, the established synergies are supported empirically, showing joint improvement in alleviating catastrophic forgetting, querying data, selecting task orders, while exhibiting robust open world application.
    Fine-grained Early Frequency Attention for Deep Speaker Representation Learning. (arXiv:2009.01822v2 [eess.AS] UPDATED)
    Deep learning techniques have considerably improved speech processing in recent years. Speaker representations extracted by deep learning models are being used in a wide range of tasks such as speaker recognition and speech emotion recognition. Attention mechanisms have started to play an important role in improving deep learning models in the field of speech processing. Nonetheless, despite the fact that important speaker-related information can be embedded in individual frequency-bins of the input spectral representations, current attention models are unable to attend to fine-grained information items in spectral representations. In this paper we propose Fine-grained Early Frequency Attention (FEFA) for speaker representation learning. Our model is a simple and lightweight model that can be integrated into various CNN pipelines and is capable of focusing on information items as small as frequency-bins. We evaluate the proposed model on three tasks of speaker recognition, speech emotion recognition, and spoken digit recognition. We use Three widely used public datasets, namely VoxCeleb, IEMOCAP, and Free Spoken Digit Dataset for our experiments. We attach FEFA to several prominent deep learning models and evaluate its impact on the final performance. We also compare our work with other related works in the area. Our experiments show that by adding FEFA to different CNN architectures, performance is consistently improved by substantial margins, and the models equipped with FEFA outperform all the other attentive models. We also test our model against different levels of added noise showing improvements in robustness and less sensitivity compared to the backbone networks.
    Improving Open-Set Semi-Supervised Learning with Self-Supervision. (arXiv:2301.10127v1 [cs.LG])
    Open-set semi-supervised learning (OSSL) is a realistic setting of semi-supervised learning where the unlabeled training set contains classes that are not present in the labeled set. Many existing OSSL methods assume that these out-of-distribution data are harmful and put effort into excluding data from unknown classes from the training objective. In contrast, we propose an OSSL framework that facilitates learning from all unlabeled data through self-supervision. Additionally, we utilize an energy-based score to accurately recognize data belonging to the known classes, making our method well-suited for handling uncurated data in deployment. We show through extensive experimental evaluations on several datasets that our method shows overall unmatched robustness and performance in terms of closed-set accuracy and open-set recognition compared with state-of-the-art for OSSL. Our code will be released upon publication.  ( 2 min )
    Topological Understanding of Neural Networks, a survey. (arXiv:2301.09742v1 [cs.LG])
    We look at the internal structure of neural networks which is usually treated as a black box. The easiest and the most comprehensible thing to do is to look at a binary classification and try to understand the approach a neural network takes. We review the significance of different activation functions, types of network architectures associated to them, and some empirical data. We find some interesting observations and a possibility to build upon the ideas to verify the process for real datasets. We suggest some possible experiments to look forward to in three different directions.
    A predictive physics-aware hybrid reduced order model for reacting flows. (arXiv:2301.09860v1 [cs.LG])
    In this work, a new hybrid predictive Reduced Order Model (ROM) is proposed to solve reacting flow problems. This algorithm is based on a dimensionality reduction using Proper Orthogonal Decomposition (POD) combined with deep learning architectures. The number of degrees of freedom is reduced from thousands of temporal points to a few POD modes with their corresponding temporal coefficients. Two different deep learning architectures have been tested to predict the temporal coefficients, based on recursive (RNN) and convolutional (CNN) neural networks. From each architecture, different models have been created to understand the behavior of each parameter of the neural network. Results show that these architectures are able to predict the temporal coefficients of the POD modes, as well as the whole snapshots. The RNN shows lower prediction error for all the variables analyzed. The model was also found capable of predicting more complex simulations showing transfer learning capabilities.
    Towards Modular Machine Learning Solution Development: Benefits and Trade-offs. (arXiv:2301.09753v1 [cs.LG])
    Machine learning technologies have demonstrated immense capabilities in various domains. They play a key role in the success of modern businesses. However, adoption of machine learning technologies has a lot of untouched potential. Cost of developing custom machine learning solutions that solve unique business problems is a major inhibitor to far-reaching adoption of machine learning technologies. We recognize that the monolithic nature prevalent in today's machine learning applications stands in the way of efficient and cost effective customized machine learning solution development. In this work we explore the benefits of modular machine learning solutions and discuss how modular machine learning solutions can overcome some of the major solution engineering limitations of monolithic machine learning solutions. We analyze the trade-offs between modular and monolithic machine learning solutions through three deep learning problems; one text based and the two image based. Our experimental results show that modular machine learning solutions have a promising potential to reap the solution engineering advantages of modularity while gaining performance and data advantages in a way the monolithic machine learning solutions do not permit.
    Upper and Lower Bounds on the Performance of Kernel PCA. (arXiv:2012.10369v2 [cs.LG] UPDATED)
    Principal Component Analysis (PCA) is a popular method for dimension reduction and has attracted an unfailing interest for decades. More recently, kernel PCA (KPCA) has emerged as an extension of PCA but, despite its use in practice, a sound theoretical understanding of KPCA is missing. We contribute several lower and upper bounds on the efficiency of KPCA, involving the empirical eigenvalues of the kernel Gram matrix and new quantities involving a notion of variance. These bounds show how much information is captured by KPCA on average and contribute a better theoretical understanding of its efficiency. We demonstrate that fast convergence rates are achievable for a widely used class of kernels and we highlight the importance of some desirable properties of datasets to ensure KPCA efficiency.
    Double Matching Under Complementary Preferences. (arXiv:2301.10230v1 [stat.ML])
    In this paper, we propose a new algorithm for addressing the problem of matching markets with complementary preferences, where agents' preferences are unknown a priori and must be learned from data. The presence of complementary preferences can lead to instability in the matching process, making this problem challenging to solve. To overcome this challenge, we formulate the problem as a bandit learning framework and propose the Multi-agent Multi-type Thompson Sampling (MMTS) algorithm. The algorithm combines the strengths of Thompson Sampling for exploration with a double matching technique to achieve a stable matching outcome. Our theoretical analysis demonstrates the effectiveness of MMTS as it is able to achieve stability at every matching step, satisfies the incentive-compatibility property, and has a sublinear Bayesian regret over time. Our approach provides a useful method for addressing complementary preferences in real-world scenarios.
    A Robust Hypothesis Test for Tree Ensemble Pruning. (arXiv:2301.10115v1 [cs.LG])
    Gradient boosted decision trees are some of the most popular algorithms in applied machine learning. They are a flexible and powerful tool that can robustly fit to any tabular dataset in a scalable and computationally efficient way. One of the most critical parameters to tune when fitting these models are the various penalty terms used to distinguish signal from noise in the current model. These penalties are effective in practice, but are lacking in robust theoretical justifications. In this paper we develop and present a novel theoretically justified hypothesis test of split quality for gradient boosted tree ensembles and demonstrate that using this method instead of the common penalty terms leads to a significant reduction in out of sample loss. Additionally, this method provides a theoretically well-justified stopping condition for the tree growing algorithm. We also present several innovative extensions to the method, opening the door for a wide variety of novel tree pruning algorithms.  ( 2 min )
    Interpretable Tsetlin Machine-based Premature Ventricular Contraction Identification. (arXiv:2301.10181v1 [eess.SP])
    Neural network-based models have found wide use in automatic long-term electrocardiogram (ECG) analysis. However, such black box models are inadequate for analysing physiological signals where credibility and interpretability are crucial. Indeed, how to make ECG analysis transparent is still an open problem. In this study, we develop a Tsetlin machine (TM) based architecture for premature ventricular contraction (PVC) identification by analysing long-term ECG signals. The architecture is transparent by describing patterns directly with logical AND rules. To validate the accuracy of our approach, we compare the TM performance with those of convolutional neural networks (CNNs). Our numerical results demonstrate that TM provides comparable performance with CNNs on the MIT-BIH database. To validate interpretability, we provide explanatory diagrams that show how TM makes the PVC identification from confirming and invalidating patterns. We argue that these are compatible with medical knowledge so that they can be readily understood and verified by a medical doctor. Accordingly, we believe this study paves the way for machine learning (ML) for ECG analysis in clinical practice.
    Inference of Continuous Linear Systems from Data with Guaranteed Stability. (arXiv:2301.10060v1 [cs.LG])
    Machine-learning technologies for learning dynamical systems from data play an important role in engineering design. This research focuses on learning continuous linear models from data. Stability, a key feature of dynamic systems, is especially important in design tasks such as prediction and control. Thus, there is a need to develop methodologies that provide stability guarantees. To that end, we leverage the parameterization of stable matrices proposed in [Gillis/Sharma, Automatica, 2017] to realize the desired models. Furthermore, to avoid the estimation of derivative information to learn continuous systems, we formulate the inference problem in an integral form. We also discuss a few extensions, including those related to control systems. Numerical experiments show that the combination of a stable matrix parameterization and an integral form of differential equations allows us to learn stable systems without requiring derivative information, which can be challenging to obtain in situations with noisy or limited data.
    Minimal Value-Equivalent Partial Models for Scalable and Robust Planning in Lifelong Reinforcement Learning. (arXiv:2301.10119v1 [cs.LG])
    Learning models of the environment from pure interaction is often considered an essential component of building lifelong reinforcement learning agents. However, the common practice in model-based reinforcement learning is to learn models that model every aspect of the agent's environment, regardless of whether they are important in coming up with optimal decisions or not. In this paper, we argue that such models are not particularly well-suited for performing scalable and robust planning in lifelong reinforcement learning scenarios and we propose new kinds of models that only model the relevant aspects of the environment, which we call "minimal value-equivalent partial models". After providing a formal definition for these models, we provide theoretical results demonstrating the scalability advantages of performing planning with such models and then perform experiments to empirically illustrate our theoretical results. Then, we provide some useful heuristics on how to learn these kinds of models with deep learning architectures and empirically demonstrate that models learned in such a way can allow for performing planning that is robust to distribution shifts and compounding model errors. Overall, both our theoretical and empirical results suggest that minimal value-equivalent partial models can provide significant benefits to performing scalable and robust planning in lifelong reinforcement learning scenarios.  ( 2 min )
    Autonomous particles. (arXiv:2301.10077v1 [cs.LG])
    Consider a reinforcement learning problem where an agent has access to a very large amount of information about the environment, but it can only take very few actions to accomplish its task and to maximize its reward. Evidently, the main problem for the agent is to learn a map from a very high-dimensional space (which represents its environment) to a very low-dimensional space (which represents its actions). The high-to-low dimensional map implies that most of the information about the environment is irrelevant for the actions to be taken, and only a small fraction of information is relevant. In this paper we argue that the relevant information need not be learned by brute force (which is the standard approach), but can be identified from the intrinsic symmetries of the system. We analyze in details a reinforcement learning problem of autonomous driving, where the corresponding symmetry is the Galilean symmetry, and argue that the learning task can be accomplished with very few relevant parameters, or, more precisely, invariants. For a numerical demonstration, we show that the autonomous vehicles (which we call autonomous particles since they describe very primitive vehicles) need only four relevant invariants to learn how to drive very well without colliding with other particles. The simple model can be easily generalized to include different types of particles (e.g. for cars, for pedestrians, for buildings, for road signs, etc.) with different types of relevant invariants describing interactions between them. We also argue that there must exist a field theory description of the learning system where autonomous particles would be described by fermionic degrees of freedom and interactions mediated by the relevant invariants would be described by bosonic degrees of freedom.  ( 2 min )
    Differentiable bit-rate estimation for neural-based video codec enhancement. (arXiv:2301.09776v1 [eess.IV])
    Neural networks (NN) can improve standard video compression by pre- and post-processing the encoded video. For optimal NN training, the standard codec needs to be replaced with a codec proxy that can provide derivatives of estimated bit-rate and distortion, which are used for gradient back-propagation. Since entropy coding of standard codecs is designed to take into account non-linear dependencies between transform coefficients, bit-rates cannot be well approximated with simple per-coefficient estimators. This paper presents a new approach for bit-rate estimation that is similar to the type employed in training end-to-end neural codecs, and able to efficiently take into account those statistical dependencies. It is defined from a mathematical model that provides closed-form formulas for the estimates and their gradients, reducing the computational complexity. Experimental results demonstrate the method's accuracy in estimating HEVC/H.265 codec bit-rates.  ( 2 min )
    Explainable Data-Driven Optimization: From Context to Decision and Back Again. (arXiv:2301.10074v1 [cs.LG])
    Data-driven optimization uses contextual information and machine learning algorithms to find solutions to decision problems with uncertain parameters. While a vast body of work is dedicated to interpreting machine learning models in the classification setting, explaining decision pipelines involving learning algorithms remains unaddressed. This lack of interpretability can block the adoption of data-driven solutions as practitioners may not understand or trust the recommended decisions. We bridge this gap by introducing a counterfactual explanation methodology tailored to explain solutions to data-driven problems. We introduce two classes of explanations and develop methods to find nearest explanations of random forest and nearest-neighbor predictors. We demonstrate our approach by explaining key problems in operations management such as inventory management and routing.  ( 2 min )
    PolarAir: A Compressed Sensing Scheme for Over-the-Air Federated Learning. (arXiv:2301.10110v1 [cs.IT])
    We explore a scheme that enables the training of a deep neural network in a Federated Learning configuration over an additive white Gaussian noise channel. The goal is to create a low complexity, linear compression strategy, called PolarAir, that reduces the size of the gradient at the user side to lower the number of channel uses needed to transmit it. The suggested approach belongs to the family of compressed sensing techniques, yet it constructs the sensing matrix and the recovery procedure using multiple access techniques. Simulations show that it can reduce the number of channel uses by ~30% when compared to conveying the gradient without compression. The main advantage of the proposed scheme over other schemes in the literature is its low time complexity. We also investigate the behavior of gradient updates and the performance of PolarAir throughout the training process to obtain insight on how best to construct this compression scheme based on compressed sensing.  ( 2 min )
    Intrinsic Motivation in Model-based Reinforcement Learning: A Brief Review. (arXiv:2301.10067v1 [cs.LG])
    The reinforcement learning research area contains a wide range of methods for solving the problems of intelligent agent control. Despite the progress that has been made, the task of creating a highly autonomous agent is still a significant challenge. One potential solution to this problem is intrinsic motivation, a concept derived from developmental psychology. This review considers the existing methods for determining intrinsic motivation based on the world model obtained by the agent. We propose a systematic approach to current research in this field, which consists of three categories of methods, distinguished by the way they utilize a world model in the agent's components: complementary intrinsic reward, exploration policy, and intrinsically motivated goals. The proposed unified framework describes the architecture of agents using a world model and intrinsic motivation to improve learning. The potential for developing new techniques in this area of research is also examined.  ( 2 min )
    A Linear Reconstruction Approach for Attribute Inference Attacks against Synthetic Data. (arXiv:2301.10053v1 [cs.LG])
    Personal data collected at scale from surveys or digital devices offers important insights for statistical analysis and scientific research. Safely sharing such data while protecting privacy is however challenging. Anonymization allows data to be shared while minimizing privacy risks, but traditional anonymization techniques have been repeatedly shown to provide limited protection against re-identification attacks in practice. Among modern anonymization techniques, synthetic data generation (SDG) has emerged as a potential solution to find a good tradeoff between privacy and statistical utility. Synthetic data is typically generated using algorithms that learn the statistical distribution of the original records, to then generate "artificial" records that are structurally and statistically similar to the original ones. Yet, the fact that synthetic records are "artificial" does not, per se, guarantee that privacy is protected. In this work, we systematically evaluate the tradeoffs between protecting privacy and preserving statistical utility for a wide range of synthetic data generation algorithms. Modeling privacy as protection against attribute inference attacks (AIAs), we extend and adapt linear reconstruction attacks, which have not been previously studied in the context of synthetic data. While prior work suggests that AIAs may be effective only on few outlier records, we show they can be very effective even on randomly selected records. We evaluate attacks on synthetic datasets ranging from 10^3 to 10^6 records, showing that even for the same generative model, the attack effectiveness can drastically increase when a larger number of synthetic records is generated. Overall, our findings prove that synthetic data is subject to privacy-utility tradeoffs just like other anonymization techniques: when good utility is preserved, attribute inference can be a risk for many data subjects.  ( 2 min )
    SMART: Self-supervised Multi-task pretrAining with contRol Transformers. (arXiv:2301.09816v1 [cs.LG])
    Self-supervised pretraining has been extensively studied in language and vision domains, where a unified model can be easily adapted to various downstream tasks by pretraining representations without explicit labels. When it comes to sequential decision-making tasks, however, it is difficult to properly design such a pretraining approach that can cope with both high-dimensional perceptual information and the complexity of sequential control over long interaction horizons. The challenge becomes combinatorially more complex if we want to pretrain representations amenable to a large variety of tasks. To tackle this problem, in this work, we formulate a general pretraining-finetuning pipeline for sequential decision making, under which we propose a generic pretraining framework \textit{Self-supervised Multi-task pretrAining with contRol Transformer (SMART)}. By systematically investigating pretraining regimes, we carefully design a Control Transformer (CT) coupled with a novel control-centric pretraining objective in a self-supervised manner. SMART encourages the representation to capture the common essential information relevant to short-term control and long-term control, which is transferrable across tasks. We show by extensive experiments in DeepMind Control Suite that SMART significantly improves the learning efficiency among seen and unseen downstream tasks and domains under different learning scenarios including Imitation Learning (IL) and Reinforcement Learning (RL). Benefiting from the proposed control-centric objective, SMART is resilient to distribution shift between pretraining and finetuning, and even works well with low-quality pretraining datasets that are randomly collected.  ( 2 min )
    Koopman neural operator as a mesh-free solver of non-linear partial differential equations. (arXiv:2301.10022v1 [cs.LG])
    The lacking of analytic solutions of diverse partial differential equations (PDEs) gives birth to series of computational techniques for numerical solutions. In machine learning, numerous latest advances of solver designs are accomplished in developing neural operators, a kind of mesh-free approximators of the infinite-dimensional operators that map between different parameterization spaces of equation solutions. Although neural operators exhibit generalization capacities for learning an entire PDE family simultaneously, they become less accurate and explainable while learning long-term behaviours of non-linear PDE families. In this paper, we propose Koopman neural operator (KNO), a new neural operator, to overcome these challenges. With the same objective of learning an infinite-dimensional mapping between Banach spaces that serves as the solution operator of target PDE family, our approach differs from existing models by formulating a non-linear dynamic system of equation solution. By approximating the Koopman operator, an infinite-dimensional linear operator governing all possible observations of the dynamic system, to act on the flow mapping of dynamic system, we can equivalently learn the solution of an entire non-linear PDE family by solving simple linear prediction problems. In zero-shot prediction and long-term prediction experiments on representative PDEs (e.g., the Navier-Stokes equation), KNO exhibits notable advantages in breaking the tradeoff between accuracy and efficiency (e.g., model size) while previous state-of-the-art models are limited. These results suggest that more efficient PDE solvers can be developed by the joint efforts from physics and machine learning.  ( 2 min )
    Robust Fair Clustering: A Novel Fairness Attack and Defense Framework. (arXiv:2210.01953v2 [cs.LG] UPDATED)
    Clustering algorithms are widely used in many societal resource allocation applications, such as loan approvals and candidate recruitment, among others, and hence, biased or unfair model outputs can adversely impact individuals that rely on these applications. To this end, many fair clustering approaches have been recently proposed to counteract this issue. Due to the potential for significant harm, it is essential to ensure that fair clustering algorithms provide consistently fair outputs even under adversarial influence. However, fair clustering algorithms have not been studied from an adversarial attack perspective. In contrast to previous research, we seek to bridge this gap and conduct a robustness analysis against fair clustering by proposing a novel black-box fairness attack. Through comprehensive experiments, we find that state-of-the-art models are highly susceptible to our attack as it can reduce their fairness performance significantly. Finally, we propose Consensus Fair Clustering (CFC), the first robust fair clustering approach that transforms consensus clustering into a fair graph partitioning problem, and iteratively learns to generate fair cluster outputs. Experimentally, we observe that CFC is highly robust to the proposed attack and is thus a truly robust fair clustering alternative.  ( 2 min )
    3D UX-Net: A Large Kernel Volumetric ConvNet Modernizing Hierarchical Transformer for Medical Image Segmentation. (arXiv:2209.15076v3 [cs.CV] UPDATED)
    The recent 3D medical ViTs (e.g., SwinUNETR) achieve the state-of-the-art performances on several 3D volumetric data benchmarks, including 3D medical image segmentation. Hierarchical transformers (e.g., Swin Transformers) reintroduced several ConvNet priors and further enhanced the practical viability of adapting volumetric segmentation in 3D medical datasets. The effectiveness of hybrid approaches is largely credited to the large receptive field for non-local self-attention and the large number of model parameters. In this work, we propose a lightweight volumetric ConvNet, termed 3D UX-Net, which adapts the hierarchical transformer using ConvNet modules for robust volumetric segmentation. Specifically, we revisit volumetric depth-wise convolutions with large kernel size (e.g. starting from $7\times7\times7$) to enable the larger global receptive fields, inspired by Swin Transformer. We further substitute the multi-layer perceptron (MLP) in Swin Transformer blocks with pointwise depth convolutions and enhance model performances with fewer normalization and activation layers, thus reducing the number of model parameters. 3D UX-Net competes favorably with current SOTA transformers (e.g. SwinUNETR) using three challenging public datasets on volumetric brain and abdominal imaging: 1) MICCAI Challenge 2021 FLARE, 2) MICCAI Challenge 2021 FeTA, and 3) MICCAI Challenge 2022 AMOS. 3D UX-Net consistently outperforms SwinUNETR with improvement from 0.929 to 0.938 Dice (FLARE2021) and 0.867 to 0.874 Dice (Feta2021). We further evaluate the transfer learning capability of 3D UX-Net with AMOS2022 and demonstrates another improvement of $2.27\%$ Dice (from 0.880 to 0.900). The source code with our proposed model are available at https://github.com/MASILab/3DUX-Net.  ( 2 min )
    Optimus-CC: Efficient Large NLP Model Training with 3D Parallelism Aware Communication Compression. (arXiv:2301.09830v1 [cs.LG])
    In training of modern large natural language processing (NLP) models, it has become a common practice to split models using 3D parallelism to multiple GPUs. Such technique, however, suffers from a high overhead of inter-node communication. Compressing the communication is one way to mitigate the overhead by reducing the inter-node traffic volume; however, the existing compression techniques have critical limitations to be applied for NLP models with 3D parallelism in that 1) only the data parallelism traffic is targeted, and 2) the existing compression schemes already harm the model quality too much. In this paper, we present Optimus-CC, a fast and scalable distributed training framework for large NLP models with aggressive communication compression. Optimus-CC differs from existing communication compression frameworks in the following ways: First, we compress pipeline parallel (inter-stage) traffic. In specific, we compress the inter-stage backpropagation and the embedding synchronization in addition to the existing data-parallel traffic compression methods. Second, we propose techniques to avoid the model quality drop that comes from the compression. We further provide mathematical and empirical analyses to show that our techniques can successfully suppress the compression error. Lastly, we analyze the pipeline and opt to selectively compress those traffic lying on the critical path. This further helps reduce the compression error. We demonstrate our solution on a GPU cluster, and achieve superior speedup from the baseline state-of-the-art solutions for distributed training without sacrificing the model quality.  ( 2 min )
    When does the student surpass the teacher? Federated Semi-supervised Learning with Teacher-Student EMA. (arXiv:2301.10114v1 [cs.LG])
    Semi-Supervised Learning (SSL) has received extensive attention in the domain of computer vision, leading to development of promising approaches such as FixMatch. In scenarios where training data is decentralized and resides on client devices, SSL must be integrated with privacy-aware training techniques such as Federated Learning. We consider the problem of federated image classification and study the performance and privacy challenges with existing federated SSL (FSSL) approaches. Firstly, we note that even state-of-the-art FSSL algorithms can trivially compromise client privacy and other real-world constraints such as client statelessness and communication cost. Secondly, we observe that it is challenging to integrate EMA (Exponential Moving Average) updates into the federated setting, which comes at a trade-off between performance and communication cost. We propose a novel approach FedSwitch, that improves privacy as well as generalization performance through Exponential Moving Average (EMA) updates. FedSwitch utilizes a federated semi-supervised teacher-student EMA framework with two features - local teacher adaptation and adaptive switching between teacher and student for pseudo-label generation. Our proposed approach outperforms the state-of-the-art on federated image classification, can be adapted to real-world constraints, and achieves good generalization performance with minimal communication cost overhead.  ( 2 min )
    Proceedings of the 1st International Workshop on Reading Music Systems. (arXiv:2301.10062v1 [cs.CV])
    The International Workshop on Reading Music Systems (WoRMS) is a workshop that tries to connect researchers who develop systems for reading music, such as in the field of Optical Music Recognition, with other researchers and practitioners that could benefit from such systems, like librarians or musicologists. The relevant topics of interest for the workshop include, but are not limited to: Music reading systems; Optical music recognition; Datasets and performance evaluation; Image processing on music scores; Writer identification; Authoring, editing, storing and presentation systems for music scores; Multi-modal systems; Novel input-methods for music to produce written music; Web-based Music Information Retrieval services; Applications and projects; Use-cases related to written music. These are the proceedings of the 1st International Workshop on Reading Music Systems, held in Paris on the 20th of September 2018.  ( 2 min )
    Quantum Heavy-tailed Bandits. (arXiv:2301.09680v1 [cs.LG])
    In this paper, we study multi-armed bandits (MAB) and stochastic linear bandits (SLB) with heavy-tailed rewards and quantum reward oracle. Unlike the previous work on quantum bandits that assumes bounded/sub-Gaussian distributions for rewards, here we investigate the quantum bandits problem under a weaker assumption that the distributions of rewards only have bounded $(1+v)$-th moment for some $v\in (0,1]$. In order to achieve regret improvements for heavy-tailed bandits, we first propose a new quantum mean estimator for heavy-tailed distributions, which is based on the Quantum Monte Carlo Mean Estimator and achieves a quadratic improvement of estimation error compared to the classical one. Based on our quantum mean estimator, we focus on quantum heavy-tailed MAB and SLB and propose quantum algorithms based on the Upper Confidence Bound (UCB) framework for both problems with $\Tilde{O}(T^{\frac{1-v}{1+v}})$ regrets, polynomially improving the dependence in terms of $T$ as compared to classical (near) optimal regrets of $\Tilde{O}(T^{\frac{1}{1+v}})$, where $T$ is the number of rounds. Finally, experiments also support our theoretical results and show the effectiveness of our proposed methods.  ( 2 min )
    From Robots to Books: An Introduction to Smart Applications of AI in Education (AIEd). (arXiv:2301.10026v1 [cs.CY])
    The world around us has undergone a radical transformation due to rapid technological advancement in recent decades. The industry of the future generation is evolving, and artificial intelligence is the following change in the making popularly known as Industry 4.0. Indeed, experts predict that artificial intelligence(AI) will be the main force behind the following significant virtual shift in the way we stay, converse, study, live, communicate and conduct business. All facets of our social connection are being transformed by this growing technology. One of the newest areas of educational technology is Artificial Intelligence in the field of Education(AIEd).This study emphasizes the different applications of artificial intelligence in education from both an industrial and academic standpoint. It highlights the most recent contextualized learning novel transformative evaluations and advancements in sophisticated tutoring systems. It analyses the AIEd's ethical component and the influence of the transition on people, particularly students and instructors as well. Finally, this article touches on AIEd's potential future research and practices. The goal of this study is to introduce the present-day applications to its intended audience.  ( 2 min )
    FedPrompt: Communication-Efficient and Privacy Preserving Prompt Tuning in Federated Learning. (arXiv:2208.12268v3 [cs.LG] UPDATED)
    Federated learning (FL) has enabled global model training on decentralized data in a privacy-preserving way by aggregating model updates. However, for many natural language processing (NLP) tasks that utilize pre-trained language models (PLMs) with large numbers of parameters, there are considerable communication costs associated with FL. Recently, prompt tuning, which tunes some soft prompts without modifying PLMs, has achieved excellent performance as a new learning paradigm. Therefore we want to combine the two methods and explore the effect of prompt tuning under FL. In this paper, we propose "FedPrompt" to study prompt tuning in a model split aggregation way using FL, and prove that split aggregation greatly reduces the communication cost, only 0.01% of the PLMs' parameters, with little decrease on accuracy both on IID and Non-IID data distribution. This improves the efficiency of FL method while also protecting the data privacy in prompt tuning. In addition, like PLMs, prompts are uploaded and downloaded between public platforms and personal users, so we try to figure out whether there is still a backdoor threat using only soft prompts in FL scenarios. We further conduct backdoor attacks by data poisoning on FedPrompt. Our experiments show that normal backdoor attack can not achieve a high attack success rate, proving the robustness of FedPrompt. We hope this work can promote the application of prompt in FL and raise the awareness of the possible security threats.  ( 2 min )
    Efficient learning of large sets of locally optimal classification rules. (arXiv:2301.09936v1 [cs.LG])
    Conventional rule learning algorithms aim at finding a set of simple rules, where each rule covers as many examples as possible. In this paper, we argue that the rules found in this way may not be the optimal explanations for each of the examples they cover. Instead, we propose an efficient algorithm that aims at finding the best rule covering each training example in a greedy optimization consisting of one specialization and one generalization loop. These locally optimal rules are collected and then filtered for a final rule set, which is much larger than the sets learned by conventional rule learning algorithms. A new example is classified by selecting the best among the rules that cover this example. In our experiments on small to very large datasets, the approach's average classification accuracy is higher than that of state-of-the-art rule learning algorithms. Moreover, the algorithm is highly efficient and can inherently be processed in parallel without affecting the learned rule set and so the classification accuracy. We thus believe that it closes an important gap for large-scale classification rule induction.  ( 2 min )
    Model Agnostic Sample Reweighting for Out-of-Distribution Learning. (arXiv:2301.09819v1 [cs.LG])
    Distributionally robust optimization (DRO) and invariant risk minimization (IRM) are two popular methods proposed to improve out-of-distribution (OOD) generalization performance of machine learning models. While effective for small models, it has been observed that these methods can be vulnerable to overfitting with large overparameterized models. This work proposes a principled method, \textbf{M}odel \textbf{A}gnostic sam\textbf{PL}e r\textbf{E}weighting (\textbf{MAPLE}), to effectively address OOD problem, especially in overparameterized scenarios. Our key idea is to find an effective reweighting of the training samples so that the standard empirical risk minimization training of a large model on the weighted training data leads to superior OOD generalization performance. The overfitting issue is addressed by considering a bilevel formulation to search for the sample reweighting, in which the generalization complexity depends on the search space of sample weights instead of the model size. We present theoretical analysis in linear case to prove the insensitivity of MAPLE to model size, and empirically verify its superiority in surpassing state-of-the-art methods by a large margin. Code is available at \url{https://github.com/x-zho14/MAPLE}.  ( 2 min )
    Spectral Cross-Domain Neural Network with Soft-adaptive Threshold Spectral Enhancement. (arXiv:2301.10171v1 [cs.LG])
    Electrocardiography (ECG) signals can be considered as multi-variable time-series. The state-of-the-art ECG data classification approaches, based on either feature engineering or deep learning techniques, treat separately spectral and time domains in machine learning systems. No spectral-time domain communication mechanism inside the classifier model can be found in current approaches, leading to difficulties in identifying complex ECG forms. In this paper, we proposed a novel deep learning model named Spectral Cross-domain neural network (SCDNN) with a new block called Soft-adaptive threshold spectral enhancement (SATSE), to simultaneously reveal the key information embedded in spectral and time domains inside the neural network. More precisely, the domain-cross information is captured by a general Convolutional neural network (CNN) backbone, and different information sources are merged by a self-adaptive mechanism to mine the connection between time and spectral domains. In SATSE, the knowledge from time and spectral domains is extracted via the Fast Fourier Transformation (FFT) with soft trainable thresholds in modified Sigmoid functions. The proposed SCDNN is tested with several classification tasks implemented on the public ECG databases \textit{PTB-XL} and \textit{MIT-BIH}. SCDNN outperforms the state-of-the-art approaches with a low computational cost regarding a variety of metrics in all classification tasks on both databases, by finding appropriate domains from the infinite spectral mapping. The convergence of the trainable thresholds in the spectral domain is also numerically investigated in this paper. The robust performance of SCDNN provides a new perspective to exploit knowledge across deep learning models from time and spectral domains. The repository can be found: https://github.com/DL-WG/SCDNN-TS  ( 2 min )
    Multi-view Kernel PCA for Time series Forecasting. (arXiv:2301.09811v1 [cs.LG])
    In this paper, we propose a kernel principal component analysis model for multi-variate time series forecasting, where the training and prediction schemes are derived from the multi-view formulation of Restricted Kernel Machines. The training problem is simply an eigenvalue decomposition of the summation of two kernel matrices corresponding to the views of the input and output data. When a linear kernel is used for the output view, it is shown that the forecasting equation takes the form of kernel ridge regression. When that kernel is non-linear, a pre-image problem has to be solved to forecast a point in the input space. We evaluate the model on several standard time series datasets, perform ablation studies, benchmark with closely related models and discuss its results.  ( 2 min )
    Quantification of Damage Using Indirect Structural Health Monitoring. (arXiv:2301.09791v1 [cs.LG])
    Structural health monitoring is important to make sure bridges do not fail. Since direct monitoring can be complicated and expensive, indirect methods have been a focus on research. Indirect monitoring can be much cheaper and easier to conduct, however there are challenges with getting accurate results. This work focuses on damage quantification by using accelerometers. Tests were conducted on a model bridge and car with four accelerometers attached to to the vehicle. Different weights were placed on the bridge to simulate different levels of damage, and 31 tests were run for 20 different damage levels. The acceleration data collected was normalized and a Fast-Fourier Transform (FFT) was performed on that data. Both the normalized acceleration data and the normalized FFT data were inputted into a Non-Linear Principal Component Analysis (separately) and three principal components were extracted for each data set. Support Vector Regression (SVR) and Gaussian Process Regression (GPR) were used as the supervised machine learning methods to develop models. Multiple models were created so that the best one could be selected, and the models were compared by looking at their Mean Squared Errors (MSE). This methodology should be applied in the field to measure how effective it can be in real world applications.  ( 2 min )
    Feature-based Image Matching for Identifying Individual K\=ak\=a. (arXiv:2301.06678v2 [cs.CV] UPDATED)
    This report investigates an unsupervised, feature-based image matching pipeline for the novel application of identifying individual k\=ak\=a. Applied with a similarity network for clustering, this addresses a weakness of current supervised approaches to identifying individual birds which struggle to handle the introduction of new individuals to the population. Our approach uses object localisation to locate k\=ak\=a within images and then extracts local features that are invariant to rotation and scale. These features are matched between images with nearest neighbour matching techniques and mismatch removal to produce a similarity score for image match comparison. The results show that matches obtained via the image matching pipeline achieve high accuracy of true matches. We conclude that feature-based image matching could be used with a similarity network to provide a viable alternative to existing supervised approaches.  ( 2 min )
    Accurate Detection of Paroxysmal Atrial Fibrillation with Certified-GAN and Neural Architecture Search. (arXiv:2301.10173v1 [cs.LG])
    This paper presents a novel machine learning framework for detecting Paroxysmal Atrial Fibrillation (PxAF), a pathological characteristic of Electrocardiogram (ECG) that can lead to fatal conditions such as heart attack. To enhance the learning process, the framework involves a Generative Adversarial Network (GAN) along with a Neural Architecture Search (NAS) in the data preparation and classifier optimization phases. The GAN is innovatively invoked to overcome the class imbalance of the training data by producing the synthetic ECG for PxAF class in a certified manner. The effect of the certified GAN is statistically validated. Instead of using a general-purpose classifier, the NAS automatically designs a highly accurate convolutional neural network architecture customized for the PxAF classification task. Experimental results show that the accuracy of the proposed framework exhibits a high value of 99% which not only enhances state-of-the-art by up to 5.1%, but also improves the classification performance of the two widely-accepted baseline methods, ResNet-18, and Auto-Sklearn, by 2.2% and 6.1%.  ( 2 min )
    Dataset Bias in Human Activity Recognition. (arXiv:2301.10161v1 [eess.SP])
    When creating multi-channel time-series datasets for Human Activity Recognition (HAR), researchers are faced with the issue of subject selection criteria. It is unknown what physical characteristics and/or soft-biometrics, such as age, height, and weight, need to be taken into account to train a classifier to achieve robustness towards heterogeneous populations in the training and testing data. This contribution statistically curates the training data to assess to what degree the physical characteristics of humans influence HAR performance. We evaluate the performance of a state-of-the-art convolutional neural network on two HAR datasets that vary in the sensors, activities, and recording for time-series HAR. The training data is intentionally biased with respect to human characteristics to determine the features that impact motion behaviour. The evaluations brought forth the impact of the subjects' characteristics on HAR. Thus, providing insights regarding the robustness of the classifier with respect to heterogeneous populations. The study is a step forward in the direction of fair and trustworthy artificial intelligence by attempting to quantify representation bias in multi-channel time series HAR data.  ( 2 min )
    Domain generalization in deep learning-based mass detection in mammography: A large-scale multi-center study. (arXiv:2201.11620v2 [eess.IV] CROSS LISTED)
    Computer-aided detection systems based on deep learning have shown great potential in breast cancer detection. However, the lack of domain generalization of artificial neural networks is an important obstacle to their deployment in changing clinical environments. In this work, we explore the domain generalization of deep learning methods for mass detection in digital mammography and analyze in-depth the sources of domain shift in a large-scale multi-center setting. To this end, we compare the performance of eight state-of-the-art detection methods, including Transformer-based models, trained in a single domain and tested in five unseen domains. Moreover, a single-source mass detection training pipeline is designed to improve the domain generalization without requiring images from the new domain. The results show that our workflow generalizes better than state-of-the-art transfer learning-based approaches in four out of five domains while reducing the domain shift caused by the different acquisition protocols and scanner manufacturers. Subsequently, an extensive analysis is performed to identify the covariate shifts with bigger effects on the detection performance, such as due to differences in patient age, breast density, mass size, and mass malignancy. Ultimately, this comprehensive study provides key insights and best practices for future research on domain generalization in deep learning-based breast cancer detection.  ( 2 min )
    On Dynamic Regret and Constraint Violations in Constrained Online Convex Optimization. (arXiv:2301.09808v1 [cs.LG])
    A constrained version of the online convex optimization (OCO) problem is considered. With slotted time, for each slot, first an action is chosen. Subsequently the loss function and the constraint violation penalty evaluated at the chosen action point is revealed. For each slot, both the loss function as well as the function defining the constraint set is assumed to be smooth and strongly convex. In addition, once an action is chosen, local information about a feasible set within a small neighborhood of the current action is also revealed. An algorithm is allowed to compute at most one gradient at its point of choice given the described feedback to choose the next action. The goal of an algorithm is to simultaneously minimize the dynamic regret (loss incurred compared to the oracle's loss) and the constraint violation penalty (penalty accrued compared to the oracle's penalty). We propose an algorithm that follows projected gradient descent over a suitably chosen set around the current action. We show that both the dynamic regret and the constraint violation is order-wise bounded by the {\it path-length}, the sum of the distances between the consecutive optimal actions. Moreover, we show that the derived bounds are the best possible.
    LDMIC: Learning-based Distributed Multi-view Image Coding. (arXiv:2301.09799v1 [eess.IV])
    Multi-view image compression plays a critical role in 3D-related applications. Existing methods adopt a predictive coding architecture, which requires joint encoding to compress the corresponding disparity as well as residual information. This demands collaboration among cameras and enforces the epipolar geometric constraint between different views, which makes it challenging to deploy these methods in distributed camera systems with randomly overlapping fields of view. Meanwhile, distributed source coding theory indicates that efficient data compression of correlated sources can be achieved by independent encoding and joint decoding, which motivates us to design a learning-based distributed multi-view image coding (LDMIC) framework. With independent encoders, LDMIC introduces a simple yet effective joint context transfer module based on the cross-attention mechanism at the decoder to effectively capture the global inter-view correlations, which is insensitive to the geometric relationships between images. Experimental results show that LDMIC significantly outperforms both traditional and learning-based MIC methods while enjoying fast encoding speed. Code will be released at https://github.com/Xinjie-Q/LDMIC.  ( 2 min )
    Explainable Deep Reinforcement Learning: State of the Art and Challenges. (arXiv:2301.09937v1 [cs.LG])
    Interpretability, explainability and transparency are key issues to introducing Artificial Intelligence methods in many critical domains: This is important due to ethical concerns and trust issues strongly connected to reliability, robustness, auditability and fairness, and has important consequences towards keeping the human in the loop in high levels of automation, especially in critical cases for decision making, where both (human and the machine) play important roles. While the research community has given much attention to explainability of closed (or black) prediction boxes, there are tremendous needs for explainability of closed-box methods that support agents to act autonomously in the real world. Reinforcement learning methods, and especially their deep versions, are such closed-box methods. In this article we aim to provide a review of state of the art methods for explainable deep reinforcement learning methods, taking also into account the needs of human operators - i.e., of those that take the actual and critical decisions in solving real-world problems. We provide a formal specification of the deep reinforcement learning explainability problems, and we identify the necessary components of a general explainable reinforcement learning framework. Based on these, we provide a comprehensive review of state of the art methods, categorizing them in classes according to the paradigm they follow, the interpretable models they use, and the surface representation of explanations provided. The article concludes identifying open questions and important challenges.  ( 2 min )
    Predicting Socio-Economic Well-being Using Mobile Apps Data: A Case Study of France. (arXiv:2301.09986v1 [cs.CY])
    Socio-economic indicators provide context for assessing a country's overall condition. These indicators contain information about education, gender, poverty, employment, and other factors. Therefore, reliable and accurate information is critical for social research and government policing. Most data sources available today, such as censuses, have sparse population coverage or are updated infrequently. Nonetheless, alternative data sources, such as call data records (CDR) and mobile app usage, can serve as cost-effective and up-to-date sources for identifying socio-economic indicators. This work investigates mobile app data to predict socio-economic features. We present a large-scale study using data that captures the traffic of thousands of mobile applications by approximately 30 million users distributed over 550,000 km square and served by over 25,000 base stations. The dataset covers the whole France territory and spans more than 2.5 months, starting from 16th March 2019 to 6th June 2019. Using the app usage patterns, our best model can estimate socio-economic indicators (attaining an R-squared score upto 0.66). Furthermore, using models' explainability, we discover that mobile app usage patterns have the potential to reveal socio-economic disparities in IRIS. Insights of this study provide several avenues for future interventions, including users' temporal network analysis and exploration of alternative data sources.  ( 2 min )
    Investigating Labeler Bias in Face Annotation for Machine Learning. (arXiv:2301.09902v1 [cs.LG])
    In a world increasingly reliant on artificial intelligence, it is more important than ever to consider the ethical implications of artificial intelligence on humanity. One key under-explored challenge is labeler bias, which can create inherently biased datasets for training and subsequently lead to inaccurate or unfair decisions in healthcare, employment, education, and law enforcement. Hence, we conducted a study to investigate and measure the existence of labeler bias using images of people from different ethnicities and sexes in a labeling task. Our results show that participants possess stereotypes that influence their decision-making process and that labeler demographics impact assigned labels. We also discuss how labeler bias influences datasets and, subsequently, the models trained on them. Overall, a high degree of transparency must be maintained throughout the entire artificial intelligence training process to identify and correct biases in the data as early as possible.  ( 2 min )
    Membership Inference of Diffusion Models. (arXiv:2301.09956v1 [cs.CR])
    Recent years have witnessed the tremendous success of diffusion models in data synthesis. However, when diffusion models are applied to sensitive data, they also give rise to severe privacy concerns. In this paper, we systematically present the first study about membership inference attacks against diffusion models, which aims to infer whether a sample was used to train the model. Two attack methods are proposed, namely loss-based and likelihood-based attacks. Our attack methods are evaluated on several state-of-the-art diffusion models, over different datasets in relation to privacy-sensitive data. Extensive experimental evaluations show that our attacks can achieve remarkable performance. Furthermore, we exhaustively investigate various factors which can affect attack performance. Finally, we also evaluate the performance of our attack methods on diffusion models trained with differential privacy.  ( 2 min )
    Solving the Discretised Neutron Diffusion Equations using Neural Networks: Applications in neutron transport. (arXiv:2301.09991v1 [cs.CE])
    In this paper we solve the Boltzmann transport equation using AI libraries. The reason why this is attractive is because it enables one to use the highly optimised software within AI libraries, enabling one to run on different computer architectures and enables one to tap into the vast quantity of community based software that has been developed for AI and ML applications e.g. mixed arithmetic precision or model parallelism. Here we take the first steps towards developing this approach for the Boltzmann transport equation and develop the necessary methods in order to do that effectively. This includes: 1) A space-angle multigrid solution method that can extract the level of parallelism necessary to run efficiently on GPUs or new AI computers. 2) A new Convolutional Finite Element Method (ConvFEM) that greatly simplifies the implementation of high order finite elements (quadratic to quintic, say). 3) A new non-linear Petrov-Galerkin method that introduces dissipation anisotropically.  ( 2 min )
    Fair and skill-diverse student group formation via constrained k-way graph partitioning. (arXiv:2301.09984v1 [cs.LG])
    Forming the right combination of students in a group promises to enable a powerful and effective environment for learning and collaboration. However, defining a group of students is a complex task which has to satisfy multiple constraints. This work introduces an unsupervised algorithm for fair and skill-diverse student group formation. This is achieved by taking account of student course marks and sensitive attributes provided by the education office. The skill sets of students are determined using unsupervised dimensionality reduction of course mark data via the Laplacian eigenmap. The problem is formulated as a constrained graph partitioning problem, whereby the diversity of skill sets in each group are maximised, group sizes are upper and lower bounded according to available resources, and `balance' of a sensitive attribute is lower bounded to enforce fairness in group formation. This optimisation problem is solved using integer programming and its effectiveness is demonstrated on a dataset of student course marks from Imperial College London.  ( 2 min )
    The Backpropagation algorithm for a math student. (arXiv:2301.09977v1 [cs.LG])
    A Deep Neural Network (DNN) is a composite function of vector-valued functions, and in order to train a DNN, it is necessary to calculate the gradient of the loss function with respect to all parameters. This calculation can be a non-trivial task because the loss function of a DNN is a composition of several nonlinear functions, each with numerous parameters. The Backpropagation (BP) algorithm leverages the composite structure of the DNN to efficiently compute the gradient. As a result, the number of layers in the network does not significantly impact the complexity of the calculation. The objective of this paper is to express the gradient of the loss function in terms of a matrix multiplication using the Jacobian operator. This can be achieved by considering the total derivative of each layer with respect to its parameters and expressing it as a Jacobian matrix. The gradient can then be represented as the matrix product of these Jacobian matrices. This approach is valid because the chain rule can be applied to a composition of vector-valued functions, and the use of Jacobian matrices allows for the incorporation of multiple inputs and outputs. By providing concise mathematical justifications, the results can be made understandable and useful to a broad audience from various disciplines.  ( 2 min )
    Probabilistic Bilevel Coreset Selection. (arXiv:2301.09880v1 [cs.LG])
    The goal of coreset selection in supervised learning is to produce a weighted subset of data, so that training only on the subset achieves similar performance as training on the entire dataset. Existing methods achieved promising results in resource-constrained scenarios such as continual learning and streaming. However, most of the existing algorithms are limited to traditional machine learning models. A few algorithms that can handle large models adopt greedy search approaches due to the difficulty in solving the discrete subset selection problem, which is computationally costly when coreset becomes larger and often produces suboptimal results. In this work, for the first time we propose a continuous probabilistic bilevel formulation of coreset selection by learning a probablistic weight for each training sample. The overall objective is posed as a bilevel optimization problem, where 1) the inner loop samples coresets and train the model to convergence and 2) the outer loop updates the sample probability progressively according to the model's performance. Importantly, we develop an efficient solver to the bilevel optimization problem via unbiased policy gradient without trouble of implicit differentiation. We provide the convergence property of our training procedure and demonstrate the superiority of our algorithm against various coreset selection methods in various tasks, especially in more challenging label-noise and class-imbalance scenarios.  ( 2 min )
    A two stages Deep Learning Architecture for Model Reduction of Parametric Time-Dependent Problems. (arXiv:2301.09926v1 [math.NA])
    Parametric time-dependent systems are of a crucial importance in modeling real phenomena, often characterized by non-linear behaviors too. Those solutions are typically difficult to generalize in a sufficiently wide parameter space while counting on limited computational resources available. As such, we present a general two-stages deep learning framework able to perform that generalization with low computational effort in time. It consists in a separated training of two pipe-lined predictive models. At first, a certain number of independent neural networks are trained with data-sets taken from different subsets of the parameter space. Successively, a second predictive model is specialized to properly combine the first-stage guesses and compute the right predictions. Promising results are obtained applying the framework to incompressible Navier-Stokes equations in a cavity (Rayleigh-Bernard cavity), obtaining a 97% reduction in the computational time comparing with its numerical resolution for a new value of the Grashof number.  ( 2 min )
    Neighborhood Homophily-Guided Graph Convolutional Network. (arXiv:2301.09851v1 [cs.LG])
    Graph neural networks (GNNs) have achieved remarkable advances in graph-oriented tasks. However, many real-world graphs contain heterophily or low homophily, challenging the homophily assumption of classical GNNs and resulting in low performance. Although many studies have emerged to improve the universality of GNNs, they rarely consider the label reuse and the correlation of their proposed metrics and models. In this paper, we first design a new metric, named Neighborhood Homophily (\textit{NH}), to measure the label complexity or purity in the neighborhood of nodes. Furthermore, we incorporate this metric into the classical graph convolutional network (GCN) architecture and propose \textbf{N}eighborhood \textbf{H}omophily-\textbf{G}uided \textbf{G}raph \textbf{C}onvolutional \textbf{N}etwork (\textbf{NHGCN}). In this framework, nodes are grouped by estimated \textit{NH} values to achieve intra-group weight sharing during message propagation and aggregation. Then the generated node predictions are used to estimate and update new \textit{NH} values. The two processes of metric estimation and model inference are alternately optimized to achieve better node classification. Extensive experiments on both homophilous and heterophilous benchmarks demonstrate that \textbf{NHGCN} achieves state-of-the-art overall performance on semi-supervised node classification for the universality problem.  ( 2 min )
    Learning To Dive In Branch And Bound. (arXiv:2301.09943v1 [cs.LG])
    Primal heuristics are important for solving mixed integer linear programs, because they find feasible solutions that facilitate branch and bound search. A prominent group of primal heuristics are diving heuristics. They iteratively modify and resolve linear programs to conduct a depth-first search from any node in the search tree. Existing divers rely on generic decision rules that fail to exploit structural commonality between similar problem instances that often arise in practice. Therefore, we propose L2Dive to learn specific diving heuristics with graph neural networks: We train generative models to predict variable assignments and leverage the duality of linear programs to make diving decisions based on the model's predictions. L2Dive is fully integrated into the open-source solver SCIP. We find that L2Dive outperforms standard divers to find better feasible solutions on a range of combinatorial optimization problems. For real-world applications from server load balancing and neural network verification, L2Dive improves the primal-dual integral by up to 7% (35%) on average over a tuned (default) solver baseline and reduces average solving time by 20% (29%).  ( 2 min )
    Same or Different? Diff-Vectors for Authorship Analysis. (arXiv:2301.09862v1 [cs.LG])
    We investigate the effects on authorship identification tasks of a fundamental shift in how to conceive the vectorial representations of documents that are given as input to a supervised learner. In ``classic'' authorship analysis a feature vector represents a document, the value of a feature represents (an increasing function of) the relative frequency of the feature in the document, and the class label represents the author of the document. We instead investigate the situation in which a feature vector represents an unordered pair of documents, the value of a feature represents the absolute difference in the relative frequencies (or increasing functions thereof) of the feature in the two documents, and the class label indicates whether the two documents are from the same author or not. This latter (learner-independent) type of representation has been occasionally used before, but has never been studied systematically. We argue that it is advantageous, and that in some cases (e.g., authorship verification) it provides a much larger quantity of information to the training process than the standard representation. The experiments that we carry out on several publicly available datasets (among which one that we here make available for the first time) show that feature vectors representing pairs of documents (that we here call Diff-Vectors) bring about systematic improvements in the effectiveness of authorship identification tasks, and especially so when training data are scarce (as it is often the case in real-life authorship identification scenarios). Our experiments tackle same-author verification, authorship verification, and closed-set authorship attribution; while DVs are naturally geared for solving the 1st, we also provide two novel methods for solving the 2nd and 3rd that use a solver for the 1st as a building block.  ( 2 min )
    Solving the Discretised Neutron Diffusion Equations using Neural Networks. (arXiv:2301.09939v1 [cs.CE])
    This paper presents a new approach which uses the tools within Artificial Intelligence (AI) software libraries as an alternative way of solving partial differential equations (PDEs) that have been discretised using standard numerical methods. In particular, we describe how to represent numerical discretisations arising from the finite volume and finite element methods by pre-determining the weights of convolutional layers within a neural network. As the weights are defined by the discretisation scheme, no training of the network is required and the solutions obtained are identical (accounting for solver tolerances) to those obtained with standard codes often written in Fortran or C++. We also explain how to implement the Jacobi method and a multigrid solver using the functions available in AI libraries. For the latter, we use a U-Net architecture which is able to represent a sawtooth multigrid method. A benefit of using AI libraries in this way is that one can exploit their power and their built-in technologies. For example, their executions are already optimised for different computer architectures, whether it be CPUs, GPUs or new-generation AI processors. In this article, we apply the proposed approach to eigenvalue problems in reactor physics where neutron transport is described by diffusion theory. For a fuel assembly benchmark, we demonstrate that the solution obtained from our new approach is the same (accounting for solver tolerances) as that obtained from the same discretisation coded in a standard way using Fortran. We then proceed to solve a reactor core benchmark using the new approach.  ( 2 min )
    A Stability Analysis of Fine-Tuning a Pre-Trained Model. (arXiv:2301.09820v1 [cs.LG])
    Fine-tuning a pre-trained model (such as BERT, ALBERT, RoBERTa, T5, GPT, etc.) has proven to be one of the most promising paradigms in recent NLP research. However, numerous recent works indicate that fine-tuning suffers from the instability problem, i.e., tuning the same model under the same setting results in significantly different performance. Many recent works have proposed different methods to solve this problem, but there is no theoretical understanding of why and how these methods work. In this paper, we propose a novel theoretical stability analysis of fine-tuning that focuses on two commonly used settings, namely, full fine-tuning and head tuning. We define the stability under each setting and prove the corresponding stability bounds. The theoretical bounds explain why and how several existing methods can stabilize the fine-tuning procedure. In addition to being able to explain most of the observed empirical discoveries, our proposed theoretical analysis framework can also help in the design of effective and provable methods. Based on our theory, we propose three novel strategies to stabilize the fine-tuning procedure, namely, Maximal Margin Regularizer (MMR), Multi-Head Loss (MHLoss), and Self Unsupervised Re-Training (SURT). We extensively evaluate our proposed approaches on 11 widely used real-world benchmark datasets, as well as hundreds of synthetic classification datasets. The experiment results show that our proposed methods significantly stabilize the fine-tuning procedure and also corroborate our theoretical analysis.  ( 2 min )
    Optimizing the Noise in Self-Supervised Learning: from Importance Sampling to Noise-Contrastive Estimation. (arXiv:2301.09696v1 [stat.ML])
    Self-supervised learning is an increasingly popular approach to unsupervised learning, achieving state-of-the-art results. A prevalent approach consists in contrasting data points and noise points within a classification task: this requires a good noise distribution which is notoriously hard to specify. While a comprehensive theory is missing, it is widely assumed that the optimal noise distribution should in practice be made equal to the data distribution, as in Generative Adversarial Networks (GANs). We here empirically and theoretically challenge this assumption. We turn to Noise-Contrastive Estimation (NCE) which grounds this self-supervised task as an estimation problem of an energy-based model of the data. This ties the optimality of the noise distribution to the sample efficiency of the estimator, which is rigorously defined as its asymptotic variance, or mean-squared error. In the special case where the normalization constant only is unknown, we show that NCE recovers a family of Importance Sampling estimators for which the optimal noise is indeed equal to the data distribution. However, in the general case where the energy is also unknown, we prove that the optimal noise density is the data density multiplied by a correction term based on the Fisher score. In particular, the optimal noise distribution is different from the data distribution, and is even from a different family. Nevertheless, we soberly conclude that the optimal noise may be hard to sample from, and the gain in efficiency can be modest compared to choosing the noise distribution equal to the data's.  ( 2 min )
    Noisy Parallel Data Alignment. (arXiv:2301.09685v1 [cs.CL])
    An ongoing challenge in current natural language processing is how its major advancements tend to disproportionately favor resource-rich languages, leaving a significant number of under-resourced languages behind. Due to the lack of resources required to train and evaluate models, most modern language technologies are either nonexistent or unreliable to process endangered, local, and non-standardized languages. Optical character recognition (OCR) is often used to convert endangered language documents into machine-readable data. However, such OCR output is typically noisy, and most word alignment models are not built to work under such noisy conditions. In this work, we study the existing word-level alignment models under noisy settings and aim to make them more robust to noisy data. Our noise simulation and structural biasing method, tested on multiple language pairs, manages to reduce the alignment error rate on a state-of-the-art neural-based alignment model up to 59.6%.  ( 2 min )
    Data Augmentation Alone Can Improve Adversarial Training. (arXiv:2301.09879v1 [cs.CV])
    Adversarial training suffers from the issue of robust overfitting, which seriously impairs its generalization performance. Data augmentation, which is effective at preventing overfitting in standard training, has been observed by many previous works to be ineffective in mitigating overfitting in adversarial training. This work proves that, contrary to previous findings, data augmentation alone can significantly boost accuracy and robustness in adversarial training. We find that the hardness and the diversity of data augmentation are important factors in combating robust overfitting. In general, diversity can improve both accuracy and robustness, while hardness can boost robustness at the cost of accuracy within a certain limit and degrade them both over that limit. To mitigate robust overfitting, we first propose a new crop transformation, Cropshift, which has improved diversity compared to the conventional one (Padcrop). We then propose a new data augmentation scheme, based on Cropshift, with much improved diversity and well-balanced hardness. Empirically, our augmentation method achieves the state-of-the-art accuracy and robustness for data augmentations in adversarial training. Furthermore, when combined with weight averaging it matches, or even exceeds, the performance of the best contemporary regularization methods for alleviating robust overfitting. Code is available at: https://github.com/TreeLLi/DA-Alone-Improves-AT.  ( 2 min )
    Topological Structure is Predictive of Deep Neural Network Success in Learning. (arXiv:2301.09734v1 [cs.LG])
    Machine learning has become a fundamental tool in modern science, yet its limitations are still not fully understood. Using a simple children's game, we show that the topological structure of the underlying training data can have a dramatic effect on the ability of a deep neural network (DNN) classifier to learn to classify data. We then take insights obtained from this toy model and apply them to two physical data sets (one from particle physics and one from acoustics), which are known to be amenable to classification by DNN's. We show that the simplicity in their topological structure explains the majority of the DNN's ability to operate on these data sets by showing that fully interpretable topological classifiers are able to perform nearly as well as their DNN counterparts.  ( 2 min )
    Slice-and-Forge: Making Better Use of Caches for Graph Convolutional Network Accelerators. (arXiv:2301.09813v1 [cs.LG])
    Graph convolutional networks (GCNs) are becoming increasingly popular as they can process a wide variety of data formats that prior deep neural networks cannot easily support. One key challenge in designing hardware accelerators for GCNs is the vast size and randomness in their data access patterns which greatly reduces the effectiveness of the limited on-chip cache. Aimed at improving the effectiveness of the cache by mitigating the irregular data accesses, prior studies often employ the vertex tiling techniques used in traditional graph processing applications. While being effective at enhancing the cache efficiency, those approaches are often sensitive to the tiling configurations where the optimal setting heavily depends on target input datasets. Furthermore, the existing solutions require manual tuning through trial-and-error or rely on sub-optimal analytical models. In this paper, we propose Slice-and-Forge (SnF), an efficient hardware accelerator for GCNs which greatly improves the effectiveness of the limited on-chip cache. SnF chooses a tiling strategy named feature slicing that splits the features into vertical slices and processes them in the outermost loop of the execution. This particular choice results in a repetition of the identical computational patterns over irregular graph data over multiple rounds. Taking advantage of such repetitions, SnF dynamically tunes its tile size. Our experimental results reveal that SnF can achieve 1.73x higher performance in geomean compared to prior work on multi-engine settings, and 1.46x higher performance in geomean on small scale settings, without the need for off-line analyses.  ( 2 min )
    Gossiped and Quantized Online Multi-Kernel Learning. (arXiv:2301.09848v1 [cs.LG])
    In instances of online kernel learning where little prior information is available and centralized learning is unfeasible, past research has shown that distributed and online multi-kernel learning provides sub-linear regret as long as every pair of nodes in the network can communicate (i.e., the communications network is a complete graph). In addition, to manage the communication load, which is often a performance bottleneck, communications between nodes can be quantized. This letter expands on these results to non-fully connected graphs, which is often the case in wireless sensor networks. To address this challenge, we propose a gossip algorithm and provide a proof that it achieves sub-linear regret. Experiments with real datasets confirm our findings.  ( 2 min )
    Heterogeneous Domain Adaptation for IoT Intrusion Detection: A Geometric Graph Alignment Approach. (arXiv:2301.09801v1 [cs.CR])
    Data scarcity hinders the usability of data-dependent algorithms when tackling IoT intrusion detection (IID). To address this, we utilise the data rich network intrusion detection (NID) domain to facilitate more accurate intrusion detection for IID domains. In this paper, a Geometric Graph Alignment (GGA) approach is leveraged to mask the geometric heterogeneities between domains for better intrusion knowledge transfer. Specifically, each intrusion domain is formulated as a graph where vertices and edges represent intrusion categories and category-wise interrelationships, respectively. The overall shape is preserved via a confused discriminator incapable to identify adjacency matrices between different intrusion domain graphs. A rotation avoidance mechanism and a centre point matching mechanism is used to avoid graph misalignment due to rotation and symmetry, respectively. Besides, category-wise semantic knowledge is transferred to act as vertex-level alignment. To exploit the target data, a pseudo-label election mechanism that jointly considers network prediction, geometric property and neighbourhood information is used to produce fine-grained pseudo-label assignment. Upon aligning the intrusion graphs geometrically from different granularities, the transferred intrusion knowledge can boost IID performance. Comprehensive experiments on several intrusion datasets demonstrate state-of-the-art performance of the GGA approach and validate the usefulness of GGA constituting components.  ( 2 min )
    Backdoor Attacks in Peer-to-Peer Federated Learning. (arXiv:2301.09732v1 [cs.LG])
    We study backdoor attacks in peer-to-peer federated learning systems on different graph topologies and datasets. We show that only 5% attacker nodes are sufficient to perform a backdoor attack with 42% attack success without decreasing the accuracy on clean data by more than 2%. We also demonstrate that the attack can be amplified by the attacker crashing a small number of nodes. We evaluate defenses proposed in the context of centralized federated learning and show they are ineffective in peer-to-peer settings. Finally, we propose a defense that mitigates the attacks by applying different clipping norms to the model updates received from peers and local model trained by a node.  ( 2 min )
    Truveta Mapper: A Zero-shot Ontology Alignment Framework. (arXiv:2301.09767v1 [cs.LG])
    In this paper, a new perspective is suggested for unsupervised Ontology Matching (OM) or Ontology Alignment (OA) by treating it as a translation task. Ontologies are represented as graphs, and the translation is performed from a node in the source ontology graph to a path in the target ontology graph. The proposed framework, Truveta Mapper (TM), leverages a multi-task sequence-to-sequence transformer model to perform alignment across multiple ontologies in a zero-shot, unified and end-to-end manner. Multi-tasking enables the model to implicitly learn the relationship between different ontologies via transfer-learning without requiring any explicit cross-ontology manually labeled data. This also enables the formulated framework to outperform existing solutions for both runtime latency and alignment quality. The model is pre-trained and fine-tuned only on publicly available text corpus and inner-ontologies data. The proposed solution outperforms state-of-the-art approaches, Edit-Similarity, LogMap, AML, BERTMap, and the recently presented new OM frameworks in Ontology Alignment Evaluation Initiative (OAEI22), offers log-linear complexity in contrast to quadratic in the existing end-to-end methods, and overall makes the OM task efficient and more straightforward without much post-processing involving mapping extension or mapping repair.  ( 2 min )
    Constrained Reinforcement Learning for Dexterous Manipulation. (arXiv:2301.09766v1 [cs.RO])
    Existing learning approaches to dexterous manipulation use demonstrations or interactions with the environment to train black-box neural networks that provide little control over how the robot learns the skills or how it would perform post training. These approaches pose significant challenges when implemented on physical platforms given that, during initial stages of training, the robot's behavior could be erratic and potentially harmful to its own hardware, the environment, or any humans in the vicinity. A potential way to address these limitations is to add constraints during learning that restrict and guide the robot's behavior during training as well as roll outs. Inspired by the success of constrained approaches in other domains, we investigate the effects of adding position-based constraints to a 24-DOF robot hand learning to perform object relocation using Constrained Policy Optimization. We find that a simple geometric constraint can ensure the robot learns to move towards the object sooner than without constraints. Further, training with this constraint requires a similar number of samples as its unconstrained counterpart to master the skill. These findings shed light on how simple constraints can help robots achieve sensible and safe behavior quickly and ease concerns surrounding hardware deployment. We also investigate the effects of the strictness of these constraints and report findings that provide insights into how different degrees of strictness affect learning outcomes. Our code is available at https://github.com/GT-STAR-Lab/constrained-rl-dexterous-manipulation.  ( 2 min )
    DODEM: DOuble DEfense Mechanism Against Adversarial Attacks Towards Secure Industrial Internet of Things Analytics. (arXiv:2301.09740v1 [cs.CR])
    Industrial Internet of Things (I-IoT) is a collaboration of devices, sensors, and networking equipment to monitor and collect data from industrial operations. Machine learning (ML) methods use this data to make high-level decisions with minimal human intervention. Data-driven predictive maintenance (PDM) is a crucial ML-based I-IoT application to find an optimal maintenance schedule for industrial assets. The performance of these ML methods can seriously be threatened by adversarial attacks where an adversary crafts perturbed data and sends it to the ML model to deteriorate its prediction performance. The models should be able to stay robust against these attacks where robustness is measured by how much perturbation in input data affects model performance. Hence, there is a need for effective defense mechanisms that can protect these models against adversarial attacks. In this work, we propose a double defense mechanism to detect and mitigate adversarial attacks in I-IoT environments. We first detect if there is an adversarial attack on a given sample using novelty detection algorithms. Then, based on the outcome of our algorithm, marking an instance as attack or normal, we select adversarial retraining or standard training to provide a secondary defense layer. If there is an attack, adversarial retraining provides a more robust model, while we apply standard training for regular samples. Since we may not know if an attack will take place, our adaptive mechanism allows us to consider irregular changes in data. The results show that our double defense strategy is highly efficient where we can improve model robustness by up to 64.6% and 52% compared to standard and adversarial retraining, respectively.  ( 2 min )
    Long-term stable Electromyography classification using Canonical Correlation Analysis. (arXiv:2301.09729v1 [cs.LG])
    Discrimination of hand gestures based on the decoding of surface electromyography (sEMG) signals is a well-establish approach for controlling prosthetic devices and for Human-Machine Interfaces (HMI). However, despite the promising results achieved by this approach in well-controlled experimental conditions, its deployment in long-term real-world application scenarios is still hindered by several challenges. One of the most critical challenges is maintaining high EMG data classification performance across multiple days without retraining the decoding system. The drop in performance is mostly due to the high EMG variability caused by electrodes shift, muscle artifacts, fatigue, user adaptation, or skin-electrode interfacing issues. Here we propose a novel statistical method based on canonical correlation analysis (CCA) that stabilizes EMG classification performance across multiple days for long-term control of prosthetic devices. We show how CCA can dramatically decrease the performance drop of standard classifiers observed across days, by maximizing the correlation among multiple-day acquisition data sets. Our results show how the performance of a classifier trained on EMG data acquired only of the first day of the experiment maintains 90% relative accuracy across multiple days, compensating for the EMG data variability that occurs over long-term periods, using the CCA transformation on data obtained from a small number of gestures. This approach eliminates the need for large data sets and multiple or periodic training sessions, which currently hamper the usability of conventional pattern recognition based approaches  ( 2 min )
    Earthquake Magnitude and b value prediction model using Extreme Learning Machine. (arXiv:2301.09756v1 [physics.geo-ph])
    Earthquake prediction has been a challenging research area for many decades, where the future occurrence of this highly uncertain calamity is predicted. In this paper, several parametric and non-parametric features were calculated, where the non-parametric features were calculated using the parametric features. $8$ seismic features were calculated using Gutenberg-Richter law, the total recurrence, and the seismic energy release. Additionally, criterions such as Maximum Relevance and Maximum Redundancy were applied to choose the pertinent features. These features along with others were used as input for an Extreme Learning Machine (ELM) Regression Model. Magnitude and time data of $5$ decades from the Assam-Guwahati region were used to create this model for magnitude prediction. The Testing Accuracy and Testing Speed were computed taking the Root Mean Squared Error (RMSE) as the parameter for evaluating the mode. As confirmed by the results, ELM shows better scalability with much faster training and testing speed (up to a thousand times faster) than traditional Support Vector Machines. The testing RMSE came out to be around $0.097$. To further test the model's robustness -- magnitude-time data from California was used to calculate the seismic indicators which were then fed into an ELM and then tested on the Assam-Guwahati region. The model proves to be robust and can be implemented in early warning systems as it continues to be a major part of Disaster Response and management.  ( 2 min )
    Two-Stage Learning For the Flexible Job Shop Scheduling Problem. (arXiv:2301.09703v1 [cs.AI])
    The Flexible Job-shop Scheduling Problem (FJSP) is an important combinatorial optimization problem that arises in manufacturing and service settings. FJSP is composed of two subproblems, an assignment problem that assigns tasks to machines, and a scheduling problem that determines the starting times of tasks on their chosen machines. Solving FJSP instances of realistic size and composition is an ongoing challenge even under simplified, deterministic assumptions. Motivated by the inevitable randomness and uncertainties in supply chains, manufacturing, and service operations, this paper investigates the potential of using a deep learning framework to generate fast and accurate approximations for FJSP. In particular, this paper proposes a two-stage learning framework 2SLFJSP that explicitly models the hierarchical nature of FJSP decisions, uses a confidence-aware branching scheme to generate appropriate instances for the scheduling stage from the assignment predictions and leverages a novel symmetry-breaking formulation to improve learnability. 2SL-FJSP is evaluated on instances from the FJSP benchmark library. Results show that 2SL-FJSP can generate high-quality solutions in milliseconds, outperforming a state-of-the-art reinforcement learning approach recently proposed in the literature, and other heuristics commonly used in practice.  ( 2 min )
    Implementation of the Critical Wave Groups Method with Computational Fluid Dynamics and Neural Networks. (arXiv:2301.09834v1 [physics.flu-dyn])
    Accurate and efficient prediction of extreme ship responses continues to be a challenging problem in ship hydrodynamics. Probabilistic frameworks in conjunction with computationally efficient numerical hydrodynamic tools have been developed that allow researchers and designers to better understand extremes. However, the ability of these hydrodynamic tools to represent the physics quantitatively during extreme events is limited. Previous research successfully implemented the critical wave groups (CWG) probabilistic method with computational fluid dynamics (CFD). Although the CWG method allows for less simulation time than a Monte Carlo approach, the large quantity of simulations required is cost prohibitive. The objective of the present paper is to reduce the computational cost of implementing CWG with CFD, through the construction of long short-term memory (LSTM) neural networks. After training the models with a limited quantity of simulations, the models can provide a larger quantity of predictions to calculate the probability. The new framework is demonstrated with a 2-D midship section of the Office of Naval Research Tumblehome (ONRT) hull in Sea State 7 and beam seas at zero speed. The new framework is able to produce predictions that are representative of a purely CFD-driven CWG framework, with two orders of magnitude of computational cost savings.  ( 2 min )
    Long-tail Detection with Effective Class-Margins. (arXiv:2301.09724v1 [cs.CV])
    Large-scale object detection and instance segmentation face a severe data imbalance. The finer-grained object classes become, the less frequent they appear in our datasets. However, at test-time, we expect a detector that performs well for all classes and not just the most frequent ones. In this paper, we provide a theoretical understanding of the long-trail detection problem. We show how the commonly used mean average precision evaluation metric on an unknown test set is bound by a margin-based binary classification error on a long-tailed object detection training set. We optimize margin-based binary classification error with a novel surrogate objective called \textbf{Effective Class-Margin Loss} (ECM). The ECM loss is simple, theoretically well-motivated, and outperforms other heuristic counterparts on LVIS v1 benchmark over a wide range of architecture and detectors. Code is available at \url{https://github.com/janghyuncho/ECM-Loss}.  ( 2 min )
    Graph Neural Networks for Decentralized Multi-Agent Perimeter Defense. (arXiv:2301.09689v1 [cs.MA])
    In this work, we study the problem of decentralized multi-agent perimeter defense that asks for computing actions for defenders with local perceptions and communications to maximize the capture of intruders. One major challenge for practical implementations is to make perimeter defense strategies scalable for large-scale problem instances. To this end, we leverage graph neural networks (GNNs) to develop an imitation learning framework that learns a mapping from defenders' local perceptions and their communication graph to their actions. The proposed GNN-based learning network is trained by imitating a centralized expert algorithm such that the learned actions are close to that generated by the expert algorithm. We demonstrate that our proposed network performs closer to the expert algorithm and is superior to other baseline algorithms by capturing more intruders. Our GNN-based network is trained at a small scale and can be generalized to large-scale cases. We run perimeter defense games in scenarios with different team sizes and configurations to demonstrate the performance of the learned network.  ( 2 min )
    Illumination Variation Correction Using Image Synthesis For Unsupervised Domain Adaptive Person Re-Identification. (arXiv:2301.09702v1 [eess.IV])
    Unsupervised domain adaptive (UDA) person re-identification (re-ID) aims to learn identity information from labeled images in source domains and apply it to unlabeled images in a target domain. One major issue with many unsupervised re-identification methods is that they do not perform well relative to large domain variations such as illumination, viewpoint, and occlusions. In this paper, we propose a Synthesis Model Bank (SMB) to deal with illumination variation in unsupervised person re-ID. The proposed SMB consists of several convolutional neural networks (CNN) for feature extraction and Mahalanobis matrices for distance metrics. They are trained using synthetic data with different illumination conditions such that their synergistic effect makes the SMB robust against illumination variation. To better quantify the illumination intensity and improve the quality of synthetic images, we introduce a new 3D virtual-human dataset for GAN-based image synthesis. From our experiments, the proposed SMB outperforms other synthesis methods on several re-ID benchmarks.  ( 2 min )
    PRIMEQA: The Prime Repository for State-of-the-Art MultilingualQuestion Answering Research and Development. (arXiv:2301.09715v1 [cs.CL])
    The field of Question Answering (QA) has made remarkable progress in recent years, thanks to the advent of large pre-trained language models, newer realistic benchmark datasets with leaderboards, and novel algorithms for key components such as retrievers and readers. In this paper, we introduce PRIMEQA: a one-stop and open-source QA repository with an aim to democratize QA re-search and facilitate easy replication of state-of-the-art (SOTA) QA methods. PRIMEQA supports core QA functionalities like retrieval and reading comprehension as well as auxiliary capabilities such as question generation.It has been designed as an end-to-end toolkit for various use cases: building front-end applications, replicating SOTA methods on pub-lic benchmarks, and expanding pre-existing methods. PRIMEQA is available at : https://github.com/primeqa.  ( 2 min )
    On The Convergence Of Policy Iteration-Based Reinforcement Learning With Monte Carlo Policy Evaluation. (arXiv:2301.09709v1 [cs.LG])
    A common technique in reinforcement learning is to evaluate the value function from Monte Carlo simulations of a given policy, and use the estimated value function to obtain a new policy which is greedy with respect to the estimated value function. A well-known longstanding open problem in this context is to prove the convergence of such a scheme when the value function of a policy is estimated from data collected from a single sample path obtained from implementing the policy (see page 99 of [Sutton and Barto, 2018], page 8 of [Tsitsiklis, 2002]). We present a solution to the open problem by showing that a first-visit version of such a policy iteration scheme indeed converges to the optimal policy provided that the policy improvement step uses lookahead [Silver et al., 2016, Mnih et al., 2016, Silver et al., 2017b] rather than a simple greedy policy improvement. We provide results both for the original open problem in the tabular setting and also present extensions to the function approximation setting, where we show that the policy resulting from the algorithm performs close to the optimal policy within a function approximation error.  ( 2 min )
    Weakly-Supervised Questions for Zero-Shot Relation Extraction. (arXiv:2301.09640v1 [cs.CL])
    Zero-Shot Relation Extraction (ZRE) is the task of Relation Extraction where the training and test sets have no shared relation types. This very challenging domain is a good test of a model's ability to generalize. Previous approaches to ZRE reframed relation extraction as Question Answering (QA), allowing for the use of pre-trained QA models. However, this method required manually creating gold question templates for each new relation. Here, we do away with these gold templates and instead learn a model that can generate questions for unseen relations. Our technique can successfully translate relation descriptions into relevant questions, which are then leveraged to generate the correct tail entity. On tail entity extraction, we outperform the previous state-of-the-art by more than 16 F1 points without using gold question templates. On the RE-QA dataset where no previous baseline for relation extraction exists, our proposed algorithm comes within 0.7 F1 points of a system that uses gold question templates. Our model also outperforms the state-of-the-art ZRE baselines on the FewRel and WikiZSL datasets, showing that QA models no longer need template questions to match the performance of models specifically tailored to the ZRE task. Our implementation is available at https://github.com/fyshelab/QA-ZRE.  ( 2 min )
    Selective Explanations: Leveraging Human Input to Align Explainable AI. (arXiv:2301.09656v1 [cs.AI])
    While a vast collection of explainable AI (XAI) algorithms have been developed in recent years, they are often criticized for significant gaps with how humans produce and consume explanations. As a result, current XAI techniques are often found to be hard to use and lack effectiveness. In this work, we attempt to close these gaps by making AI explanations selective -- a fundamental property of human explanations -- by selectively presenting a subset from a large set of model reasons based on what aligns with the recipient's preferences. We propose a general framework for generating selective explanations by leveraging human input on a small sample. This framework opens up a rich design space that accounts for different selectivity goals, types of input, and more. As a showcase, we use a decision-support task to explore selective explanations based on what the decision-maker would consider relevant to the decision task. We conducted two experimental studies to examine three out of a broader possible set of paradigms based on our proposed framework: in Study 1, we ask the participants to provide their own input to generate selective explanations, with either open-ended or critique-based input. In Study 2, we show participants selective explanations based on input from a panel of similar users (annotators). Our experiments demonstrate the promise of selective explanations in reducing over-reliance on AI and improving decision outcomes and subjective perceptions of the AI, but also paint a nuanced picture that attributes some of these positive effects to the opportunity to provide one's own input to augment AI explanations. Overall, our work proposes a novel XAI framework inspired by human communication behaviors and demonstrates its potentials to encourage future work to better align AI explanations with human production and consumption of explanations.  ( 2 min )
    Flexible conditional density estimation for time series. (arXiv:2301.09671v1 [stat.ME])
    This paper introduces FlexCodeTS, a new conditional density estimator for time series. FlexCodeTS is a flexible nonparametric conditional density estimator, which can be based on an arbitrary regression method. It is shown that FlexCodeTS inherits the rate of convergence of the chosen regression method. Hence, FlexCodeTS can adapt its convergence by employing the regression method that best fits the structure of data. From an empirical perspective, FlexCodeTS is compared to NNKCDE and GARCH in both simulated and real data. FlexCodeTS is shown to generally obtain the best performance among the selected methods according to either the CDE loss or the pinball loss.  ( 2 min )
    DiffSDS: A language diffusion model for protein backbone inpainting under geometric conditions and constraints. (arXiv:2301.09642v1 [q-bio.QM])
    Have you ever been troubled by the complexity and computational cost of SE(3) protein structure modeling and been amazed by the simplicity and power of language modeling? Recent work has shown promise in simplifying protein structures as sequences of protein angles; therefore, language models could be used for unconstrained protein backbone generation. Unfortunately, such simplification is unsuitable for the constrained protein inpainting problem, where the model needs to recover masked structures conditioned on unmasked ones, as it dramatically increases the computing cost of geometric constraints. To overcome this dilemma, we suggest inserting a hidden \textbf{a}tomic \textbf{d}irection \textbf{s}pace (\textbf{ADS}) upon the language model, converting invariant backbone angles into equivalent direction vectors and preserving the simplicity, called Seq2Direct encoder ($\text{Enc}_{s2d}$). Geometric constraints could be efficiently imposed on the newly introduced direction space. A Direct2Seq decoder ($\text{Dec}_{d2s}$) with mathematical guarantees is also introduced to develop a \textbf{SDS} ($\text{Enc}_{s2d}$+$\text{Dec}_{d2s}$) model. We apply the SDS model as the denoising neural network during the conditional diffusion process, resulting in a constrained generative model--\textbf{DiffSDS}. Extensive experiments show that the plug-and-play ADS could transform the language model into a strong structural model without loss of simplicity. More importantly, the proposed DiffSDS outperforms previous strong baselines by a large margin on the task of protein inpainting.  ( 2 min )
  • Open

    Incorporating functional summary information in Bayesian neural networks using a Dirichlet process likelihood approach. (arXiv:2207.01234v2 [cs.LG] UPDATED)
    Bayesian neural networks (BNNs) can account for both aleatoric and epistemic uncertainty. However, in BNNs the priors are often specified over the weights which rarely reflects true prior knowledge in large and complex neural network architectures. We present a simple approach to incorporate prior knowledge in BNNs based on external summary information about the predicted classification probabilities for a given dataset. The available summary information is incorporated as augmented data and modeled with a Dirichlet process, and we derive the corresponding \emph{Summary Evidence Lower BOund}. The approach is founded on Bayesian principles, and all hyperparameters have a proper probabilistic interpretation. We show how the method can inform the model about task difficulty and class imbalance. Extensive experiments show that, with negligible computational overhead, our method parallels and in many cases outperforms popular alternatives in accuracy, uncertainty calibration, and robustness against corruptions with both balanced and imbalanced data.  ( 2 min )
    Multiway Spherical Clustering via Degree-Corrected Tensor Block Models. (arXiv:2201.07401v2 [math.ST] UPDATED)
    We consider the problem of multiway clustering in the presence of unknown degree heterogeneity. Such data problems arise commonly in applications such as recommendation system, neuroimaging, community detection, and hypergraph partitions in social networks. The allowance of degree heterogeneity provides great flexibility in clustering models, but the extra complexity poses significant challenges in both statistics and computation. Here, we develop a degree-corrected tensor block model with estimation accuracy guarantees. We present the phase transition of clustering performance based on the notion of angle separability, and we characterize three signal-to-noise regimes corresponding to different statistical-computational behaviors. In particular, we demonstrate that an intrinsic statistical-to-computational gap emerges only for tensors of order three or greater. Further, we develop an efficient polynomial-time algorithm that provably achieves exact clustering under mild signal conditions. The efficacy of our procedure is demonstrated through two data applications, one on human brain connectome project, and another on Peru Legislation network dataset.  ( 2 min )
    Proportional Fairness in Federated Learning. (arXiv:2202.01666v3 [cs.LG] UPDATED)
    With the increasingly broad deployment of federated learning (FL) systems in the real world, it is critical but challenging to ensure fairness in FL, i.e. reasonably satisfactory performances for each of the numerous diverse clients. In this work, we introduce and study a new fairness notion in FL, called proportional fairness (PF), which is based on the relative change of each client's performance. From its connection with the bargaining games, we propose PropFair, a novel and easy-to-implement algorithm for finding proportionally fair solutions in FL and study its convergence properties. Through extensive experiments on vision and language datasets, we demonstrate that PropFair can approximately find PF solutions, and it achieves a good balance between the average performances of all clients and of the worst 10% clients.  ( 2 min )
    Model-Agnostic Confidence Intervals for Feature Importance: A Fast and Powerful Approach Using Minipatch Ensembles. (arXiv:2206.02088v2 [stat.ML] UPDATED)
    To promote new scientific discoveries from complex data sets, feature importance inference has been a long-standing statistical problem. Instead of testing for parameters that are only interpretable for specific models, there has been increasing interest in model-agnostic methods, often in the form of feature occlusion or leave-one-covariate-out (LOCO) inference. Existing approaches often make distributional assumptions, which can be difficult to verify in practice, or require model refitting and data splitting, which are computationally intensive and lead to losses in power. In this work, we develop a novel, mostly model-agnostic and distribution-free inference framework for feature importance that is computationally efficient and statistically powerful. Our approach is fast as we avoid model refitting by leveraging a form of random observation and feature subsampling called minipatch ensembles; this approach also improves statistical power by avoiding data splitting. Our framework can be applied on tabular data and with any machine learning algorithm, together with minipatch ensembles, for regression and classification tasks. Despite the dependencies induced by using minipatch ensembles, we show that our approach provides asymptotic coverage for the feature importance score of any model under mild assumptions. Finally, our same procedure can also be leveraged to provide valid confidence intervals for predictions, hence providing fast, simultaneous quantification of the uncertainty of both predictions and feature importance. We validate our intervals on a series of synthetic and real data examples, including non-linear settings, showing that our approach detects the correct important features and exhibits many computational and statistical advantages over existing methods.  ( 2 min )
    A Wholistic View of Continual Learning with Deep Neural Networks: Forgotten Lessons and the Bridge to Active and Open World Learning. (arXiv:2009.01797v3 [cs.LG] UPDATED)
    Current deep learning methods are regarded as favorable if they empirically perform well on dedicated test sets. This mentality is seamlessly reflected in the resurfacing area of continual learning, where consecutively arriving data is investigated. The core challenge is framed as protecting previously acquired representations from being catastrophically forgotten. However, comparison of individual methods is nevertheless performed in isolation from the real world by monitoring accumulated benchmark test set performance. The closed world assumption remains predominant, i.e. models are evaluated on data that is guaranteed to originate from the same distribution as used for training. This poses a massive challenge as neural networks are well known to provide overconfident false predictions on unknown and corrupted instances. In this work we critically survey the literature and argue that notable lessons from open set recognition, identifying unknown examples outside of the observed set, and the adjacent field of active learning, querying data to maximize the expected performance gain, are frequently overlooked in the deep learning era. Hence, we propose a consolidated view to bridge continual learning, active learning and open set recognition in deep neural networks. Finally, the established synergies are supported empirically, showing joint improvement in alleviating catastrophic forgetting, querying data, selecting task orders, while exhibiting robust open world application.  ( 2 min )
    Neyman-Pearson Multi-class Classification via Cost-sensitive Learning. (arXiv:2111.04597v2 [stat.ML] UPDATED)
    Most existing classification methods aim to minimize the overall misclassification error rate. However, in applications, different types of errors can have different consequences. Two popular paradigms have been developed to account for this asymmetry issue: the Neyman-Pearson (NP) paradigm and the cost-sensitive (CS) paradigm. Compared to the CS paradigm, the NP paradigm does not require a specification of costs. Most previous works on the NP paradigm focused on the binary case. In this work, we study the multi-class NP problem by connecting it to the CS problem and propose two algorithms. We extend the NP oracle inequalities and consistency from the binary case to the multi-class case, showing that our two algorithms enjoy these properties under certain conditions. The simulation and real data studies demonstrate the effectiveness of our algorithms. To our knowledge, this is the first work to solve the multi-class NP problem via cost-sensitive learning techniques with theoretical guarantees. The proposed algorithms are implemented in the R package npcs on CRAN.  ( 2 min )
    Concentration Inequalities for Two-Sample Rank Processes with Application to Bipartite Ranking. (arXiv:2104.02943v3 [math.ST] UPDATED)
    The ROC curve is the gold standard for measuring the performance of a test/scoring statistic regarding its capacity to discriminate between two statistical populations in a wide variety of applications, ranging from anomaly detection in signal processing to information retrieval, through medical diagnosis. Most practical performance measures used in scoring/ranking applications such as the AUC, the local AUC, the p-norm push, the DCG and others, can be viewed as summaries of the ROC curve. In this paper, the fact that most of these empirical criteria can be expressed as two-sample linear rank statistics is highlighted and concentration inequalities for collections of such random variables, referred to as two-sample rank processes here, are proved, when indexed by VC classes of scoring functions. Based on these nonasymptotic bounds, the generalization capacity of empirical maximizers of a wide class of ranking performance criteria is next investigated from a theoretical perspective. It is also supported by empirical evidence through convincing numerical experiments.  ( 2 min )
    Upper and Lower Bounds on the Performance of Kernel PCA. (arXiv:2012.10369v2 [cs.LG] UPDATED)
    Principal Component Analysis (PCA) is a popular method for dimension reduction and has attracted an unfailing interest for decades. More recently, kernel PCA (KPCA) has emerged as an extension of PCA but, despite its use in practice, a sound theoretical understanding of KPCA is missing. We contribute several lower and upper bounds on the efficiency of KPCA, involving the empirical eigenvalues of the kernel Gram matrix and new quantities involving a notion of variance. These bounds show how much information is captured by KPCA on average and contribute a better theoretical understanding of its efficiency. We demonstrate that fast convergence rates are achievable for a widely used class of kernels and we highlight the importance of some desirable properties of datasets to ensure KPCA efficiency.  ( 2 min )
    Federated Learning Meets Multi-objective Optimization. (arXiv:2006.11489v2 [cs.LG] UPDATED)
    Federated learning has emerged as a promising, massively distributed way to train a joint deep model over large amounts of edge devices while keeping private user data strictly on device. In this work, motivated from ensuring fairness among users and robustness against malicious adversaries, we formulate federated learning as multi-objective optimization and propose a new algorithm FedMGDA+ that is guaranteed to converge to Pareto stationary solutions. FedMGDA+ is simple to implement, has fewer hyperparameters to tune, and refrains from sacrificing the performance of any participating user. We establish the convergence properties of FedMGDA+ and point out its connections to existing approaches. Extensive experiments on a variety of datasets confirm that FedMGDA+ compares favorably against state-of-the-art.  ( 2 min )
    Improving Open-Set Semi-Supervised Learning with Self-Supervision. (arXiv:2301.10127v1 [cs.LG])
    Open-set semi-supervised learning (OSSL) is a realistic setting of semi-supervised learning where the unlabeled training set contains classes that are not present in the labeled set. Many existing OSSL methods assume that these out-of-distribution data are harmful and put effort into excluding data from unknown classes from the training objective. In contrast, we propose an OSSL framework that facilitates learning from all unlabeled data through self-supervision. Additionally, we utilize an energy-based score to accurately recognize data belonging to the known classes, making our method well-suited for handling uncurated data in deployment. We show through extensive experimental evaluations on several datasets that our method shows overall unmatched robustness and performance in terms of closed-set accuracy and open-set recognition compared with state-of-the-art for OSSL. Our code will be released upon publication.  ( 2 min )
    Double Matching Under Complementary Preferences. (arXiv:2301.10230v1 [stat.ML])
    In this paper, we propose a new algorithm for addressing the problem of matching markets with complementary preferences, where agents' preferences are unknown a priori and must be learned from data. The presence of complementary preferences can lead to instability in the matching process, making this problem challenging to solve. To overcome this challenge, we formulate the problem as a bandit learning framework and propose the Multi-agent Multi-type Thompson Sampling (MMTS) algorithm. The algorithm combines the strengths of Thompson Sampling for exploration with a double matching technique to achieve a stable matching outcome. Our theoretical analysis demonstrates the effectiveness of MMTS as it is able to achieve stability at every matching step, satisfies the incentive-compatibility property, and has a sublinear Bayesian regret over time. Our approach provides a useful method for addressing complementary preferences in real-world scenarios.  ( 2 min )
    How Jellyfish Characterise Alternating Group Equivariant Neural Networks. (arXiv:2301.10152v1 [cs.LG])
    We provide a full characterisation of all of the possible alternating group ($A_n$) equivariant neural networks whose layers are some tensor power of $\mathbb{R}^{n}$. In particular, we find a basis of matrices for the learnable, linear, $A_n$-equivariant layer functions between such tensor power spaces in the standard basis of $\mathbb{R}^{n}$. We also describe how our approach generalises to the construction of neural networks that are equivariant to local symmetries.  ( 2 min )
    A Robust Hypothesis Test for Tree Ensemble Pruning. (arXiv:2301.10115v1 [cs.LG])
    Gradient boosted decision trees are some of the most popular algorithms in applied machine learning. They are a flexible and powerful tool that can robustly fit to any tabular dataset in a scalable and computationally efficient way. One of the most critical parameters to tune when fitting these models are the various penalty terms used to distinguish signal from noise in the current model. These penalties are effective in practice, but are lacking in robust theoretical justifications. In this paper we develop and present a novel theoretically justified hypothesis test of split quality for gradient boosted tree ensembles and demonstrate that using this method instead of the common penalty terms leads to a significant reduction in out of sample loss. Additionally, this method provides a theoretically well-justified stopping condition for the tree growing algorithm. We also present several innovative extensions to the method, opening the door for a wide variety of novel tree pruning algorithms.  ( 2 min )
    Inducing Point Allocation for Sparse Gaussian Processes in High-Throughput Bayesian Optimisation. (arXiv:2301.10123v1 [cs.LG])
    Sparse Gaussian Processes are a key component of high-throughput Bayesian Optimisation (BO) loops; however, we show that existing methods for allocating their inducing points severely hamper optimisation performance. By exploiting the quality-diversity decomposition of Determinantal Point Processes, we propose the first inducing point allocation strategy designed specifically for use in BO. Unlike existing methods which seek only to reduce global uncertainty in the objective function, our approach provides the local high-fidelity modelling of promising regions required for precise optimisation. More generally, we demonstrate that our proposed framework provides a flexible way to allocate modelling capacity in sparse models and so is suitable broad range of downstream sequential decision making tasks.  ( 2 min )
    Forecasting the 2016-2017 Central Apennines Earthquake Sequence with a Neural Point Process. (arXiv:2301.09948v1 [physics.geo-ph])
    Point processes have been dominant in modeling the evolution of seismicity for decades, with the Epidemic Type Aftershock Sequence (ETAS) model being most popular. Recent advances in machine learning have constructed highly flexible point process models using neural networks to improve upon existing parametric models. We investigate whether these flexible point process models can be applied to short-term seismicity forecasting by extending an existing temporal neural model to the magnitude domain and we show how this model can forecast earthquakes above a target magnitude threshold. We first demonstrate that the neural model can fit synthetic ETAS data, however, requiring less computational time because it is not dependent on the full history of the sequence. By artificially emulating short-term aftershock incompleteness in the synthetic dataset, we find that the neural model outperforms ETAS. Using a new enhanced catalog from the 2016-2017 Central Apennines earthquake sequence, we investigate the predictive skill of ETAS and the neural model with respect to the lowest input magnitude. Constructing multiple forecasting experiments using the Visso, Norcia and Campotosto earthquakes to partition training and testing data, we target M3+ events. We find both models perform similarly at previously explored thresholds (e.g., above M3), but lowering the threshold to M1.2 reduces the performance of ETAS unlike the neural model. We argue that some of these gains are due to the neural model's ability to handle incomplete data. The robustness to missing data and speed to train the neural model present it as an encouraging competitor in earthquake forecasting.  ( 2 min )
    Adaptive Probabilistic Forecasting of Electricity (Net-)Load. (arXiv:2301.10090v1 [stat.AP])
    We focus on electricity load forecasting under three important specificities. First, our setting is adaptive; we use models taking into account the most recent observations available, yielding a forecasting strategy able to automatically respond to regime changes. Second, we consider probabilistic rather than point forecasting; indeed, uncertainty quantification is required to operate electricity systems efficiently and reliably. Third, we consider both conventional load (consumption only) and netload (consumption less embedded generation). Our methodology relies on the Kalman filter, previously used successfully for adaptive point load forecasting. The probabilistic forecasts are obtained by quantile regressions on the residuals of the point forecasting model. We achieve adaptive quantile regressions using the online gradient descent; we avoid the choice of the gradient step size considering multiple learning rates and aggregation of experts. We apply the method to two data sets: the regional net-load in Great Britain and the demand of seven large cities in the United States. Adaptive procedures improve forecast performance substantially in both use cases and for both point and probabilistic forecasting.  ( 2 min )
    Heterogeneous Domain Adaptation for IoT Intrusion Detection: A Geometric Graph Alignment Approach. (arXiv:2301.09801v1 [cs.CR])
    Data scarcity hinders the usability of data-dependent algorithms when tackling IoT intrusion detection (IID). To address this, we utilise the data rich network intrusion detection (NID) domain to facilitate more accurate intrusion detection for IID domains. In this paper, a Geometric Graph Alignment (GGA) approach is leveraged to mask the geometric heterogeneities between domains for better intrusion knowledge transfer. Specifically, each intrusion domain is formulated as a graph where vertices and edges represent intrusion categories and category-wise interrelationships, respectively. The overall shape is preserved via a confused discriminator incapable to identify adjacency matrices between different intrusion domain graphs. A rotation avoidance mechanism and a centre point matching mechanism is used to avoid graph misalignment due to rotation and symmetry, respectively. Besides, category-wise semantic knowledge is transferred to act as vertex-level alignment. To exploit the target data, a pseudo-label election mechanism that jointly considers network prediction, geometric property and neighbourhood information is used to produce fine-grained pseudo-label assignment. Upon aligning the intrusion graphs geometrically from different granularities, the transferred intrusion knowledge can boost IID performance. Comprehensive experiments on several intrusion datasets demonstrate state-of-the-art performance of the GGA approach and validate the usefulness of GGA constituting components.  ( 2 min )
    Context-specific kernel-based hidden Markov model for time series analysis. (arXiv:2301.09870v1 [stat.ML])
    Traditional hidden Markov models have been a useful tool to understand and model stochastic dynamic linear data; in the case of non-Gaussian data or not linear in mean data, models such as mixture of Gaussian hidden Markov models suffer from the computation of precision matrices and have a lot of unnecessary parameters. As a consequence, such models often perform better when it is assumed that all variables are independent, a hypothesis that may be unrealistic. Hidden Markov models based on kernel density estimation is also capable of modeling non Gaussian data, but they assume independence between variables. In this article, we introduce a new hidden Markov model based on kernel density estimation, which is capable of introducing kernel dependencies using context-specific Bayesian networks. The proposed model is described, together with a learning algorithm based on the expectation-maximization algorithm. Additionally, the model is compared with related HMMs using synthetic and real data. From the results, the benefits in likelihood and classification accuracy from the proposed model are quantified and analyzed.  ( 2 min )
    On Dynamic Regret and Constraint Violations in Constrained Online Convex Optimization. (arXiv:2301.09808v1 [cs.LG])
    A constrained version of the online convex optimization (OCO) problem is considered. With slotted time, for each slot, first an action is chosen. Subsequently the loss function and the constraint violation penalty evaluated at the chosen action point is revealed. For each slot, both the loss function as well as the function defining the constraint set is assumed to be smooth and strongly convex. In addition, once an action is chosen, local information about a feasible set within a small neighborhood of the current action is also revealed. An algorithm is allowed to compute at most one gradient at its point of choice given the described feedback to choose the next action. The goal of an algorithm is to simultaneously minimize the dynamic regret (loss incurred compared to the oracle's loss) and the constraint violation penalty (penalty accrued compared to the oracle's penalty). We propose an algorithm that follows projected gradient descent over a suitably chosen set around the current action. We show that both the dynamic regret and the constraint violation is order-wise bounded by the {\it path-length}, the sum of the distances between the consecutive optimal actions. Moreover, we show that the derived bounds are the best possible.  ( 2 min )
    Optimizing the Noise in Self-Supervised Learning: from Importance Sampling to Noise-Contrastive Estimation. (arXiv:2301.09696v1 [stat.ML])
    Self-supervised learning is an increasingly popular approach to unsupervised learning, achieving state-of-the-art results. A prevalent approach consists in contrasting data points and noise points within a classification task: this requires a good noise distribution which is notoriously hard to specify. While a comprehensive theory is missing, it is widely assumed that the optimal noise distribution should in practice be made equal to the data distribution, as in Generative Adversarial Networks (GANs). We here empirically and theoretically challenge this assumption. We turn to Noise-Contrastive Estimation (NCE) which grounds this self-supervised task as an estimation problem of an energy-based model of the data. This ties the optimality of the noise distribution to the sample efficiency of the estimator, which is rigorously defined as its asymptotic variance, or mean-squared error. In the special case where the normalization constant only is unknown, we show that NCE recovers a family of Importance Sampling estimators for which the optimal noise is indeed equal to the data distribution. However, in the general case where the energy is also unknown, we prove that the optimal noise density is the data density multiplied by a correction term based on the Fisher score. In particular, the optimal noise distribution is different from the data distribution, and is even from a different family. Nevertheless, we soberly conclude that the optimal noise may be hard to sample from, and the gain in efficiency can be modest compared to choosing the noise distribution equal to the data's.  ( 2 min )
    Flexible conditional density estimation for time series. (arXiv:2301.09671v1 [stat.ME])
    This paper introduces FlexCodeTS, a new conditional density estimator for time series. FlexCodeTS is a flexible nonparametric conditional density estimator, which can be based on an arbitrary regression method. It is shown that FlexCodeTS inherits the rate of convergence of the chosen regression method. Hence, FlexCodeTS can adapt its convergence by employing the regression method that best fits the structure of data. From an empirical perspective, FlexCodeTS is compared to NNKCDE and GARCH in both simulated and real data. FlexCodeTS is shown to generally obtain the best performance among the selected methods according to either the CDE loss or the pinball loss.  ( 2 min )
    Improved Rate of First Order Algorithms for Entropic Optimal Transport. (arXiv:2301.09675v1 [math.OC])
    This paper improves the state-of-the-art rate of a first-order algorithm for solving entropy regularized optimal transport. The resulting rate for approximating the optimal transport (OT) has been improved from $\widetilde{{O}}({n^{2.5}}/{\epsilon})$ to $\widetilde{{O}}({n^2}/{\epsilon})$, where $n$ is the problem size and $\epsilon$ is the accuracy level. In particular, we propose an accelerated primal-dual stochastic mirror descent algorithm with variance reduction. Such special design helps us improve the rate compared to other accelerated primal-dual algorithms. We further propose a batch version of our stochastic algorithm, which improves the computational performance through parallel computing. To compare, we prove that the computational complexity of the Stochastic Sinkhorn algorithm is $\widetilde{{O}}({n^2}/{\epsilon^2})$, which is slower than our accelerated primal-dual stochastic mirror algorithm. Experiments are done using synthetic and real data, and the results match our theoretical rates. Our algorithm may inspire more research to develop accelerated primal-dual algorithms that have rate $\widetilde{{O}}({n^2}/{\epsilon})$ for solving OT.  ( 2 min )

  • Open

    Just found a new chrome extension called IntelliMail that uses AI to write emails. Its super easy to use and can be used to land internships, jobs and up your email game.
    submitted by /u/bobsandalex [link] [comments]  ( 40 min )
    I created an 'AI Tools' series on YouTube and I'd love your feedback! Today is About Jasper AI
    I would love to know your opinion about what I should improve on mt videos and of course if you don't know about Jasper AI give a look, it's a great AI tool for Content Creation https://youtu.be/x_6rzsBVABg submitted by /u/sigmabruuh [link] [comments]  ( 6 min )
    OpenAi's breakthrough
    https://twitter.com/make_mhe/status/1618255363580755968 submitted by /u/bradasm [link] [comments]  ( 40 min )
    I asked an AI image generator to show me a "typical Discord user"
    submitted by /u/MSAPW [link] [comments]  ( 40 min )
    AI Dream 150 - SUPERNOVA IMMINENT Part1 TEASER - AI Video vqgan clip
    submitted by /u/LordPewPew777 [link] [comments]  ( 40 min )
    I am recommending you something
    It's an amazing app called Simplywrite. It gives you 20 credits for free. It lets you generate articles and outlines. Totally a free tool to use when you bored. LINK TO APP: https://simplywrite.com submitted by /u/benignkirby [link] [comments]  ( 40 min )
    Anonymize tabular data to meet GDPR privacy requirements
    submitted by /u/Repeat-or [link] [comments]  ( 40 min )
    I made a list of tools powered by AI
    submitted by /u/Alen0tv [link] [comments]  ( 40 min )
    How to automatically generate test cases in NLP?
    submitted by /u/Nazma2015 [link] [comments]  ( 40 min )
    Best TTS for cheesy movie trailer guy voice?
    Looking for a TTS that can generate that really deep movie trailer male voice. Any ideas? submitted by /u/infinitycurvature [link] [comments]  ( 40 min )
    Best video style transfer program as of now?
    Looking something where you can capture a face speaking and replace it with a completely different character (including neck, face, shoulders). Looking to replace a human with an animated character. What options are there as of now? I know ebsynth, but it's not too good. There is vtuber software, but they look too "anime". Looking for something that would give a very "photorealistic" result, think Avatar quality output or a Pixar movie. What is the best that's out there right now? submitted by /u/UpperStruggle2421 [link] [comments]  ( 41 min )
    Beware Loab, the digital cryptid lurking in AI's forgotten space
    submitted by /u/Phishstixxx [link] [comments]  ( 40 min )
    Will coders and writers just be doing QA of AI output?
    So 5 years ago, the main career and hobby advice I had was "look to see if programming is a thing for you, it's at least a fun hobby and maybe you can find a niche as one of the last of the custom craftspeople" But now... where do you go, in a society where wealth tends to get sucked upward, to find a good living? Do you learn to roll with being a bot manager? (Assuming the AIs don't just manage themselves?) Do you look for trades of manipulating physical stuff that haven't yet lent themselves to cybernetics? submitted by /u/kirk_is_ [link] [comments]  ( 44 min )
    merge photo with ai
    hi everyone now i have two different photos (not image) and how can i merge thats with using ai. submitted by /u/Aigerim_D40 [link] [comments]  ( 40 min )
    What is your preferred Image Generation API / App?
    It is really difficult to benchmark Text-to-Image AIs, it relies on so many aspects: speed, styles, precision of the prompt, interface, fine-tuning, etc. So I think the best approach is to see which are the most prefered by the people who use Stable Diffusion API. Do not hesitate to explain your choice in comments, and also mention APIs that are not in the Poll, I am limited to 6 options... I know that I did not put Midjourney, Artbreeder, Stable Diffusion, NightCafe, Crayion, Starry AI and many other but I am interested in those which provide API only. PS: this isn't promotional at all, I am not working for any of those companies. View Poll submitted by /u/JerLam2762 [link] [comments]  ( 42 min )
    I asked an AI image generator to show me a "typical Reddit user"
    submitted by /u/MrsChenHW [link] [comments]  ( 41 min )
    Yann LeCun, Meta’s Chief AI Scientist, Has Some Harsh Criticism Of ChatGPT.
    submitted by /u/liquidocelotYT [link] [comments]  ( 40 min )
    A New Free AI Model InstructPix2Pix To Transform Images By Plain English Instructions
    submitted by /u/CeFurkan [link] [comments]  ( 40 min )
    Fearing ChatGPT, Google enlists founders Brin and Page in AI fight
    submitted by /u/SAT0725 [link] [comments]  ( 40 min )
    Being really humorous under the pressure of billions of prompt requests
    submitted by /u/Imagine-your-success [link] [comments]  ( 41 min )
    Humanity's Quest to Decode Animal Languages Through AI
    submitted by /u/lambolifeofficial [link] [comments]  ( 40 min )
    The Connection Between Science Fiction and Artificial Intelligence: A Survey Study
    Hello everyone, I am a student in AP Research. For my project, I am conducting a survey to analyze the connection between science fiction and technology (specifically Artificial Intelligence). This survey (linked) asks a few questions about your knowledge of Sci-fi, Artificial Intelligence, and the connection between the two. It should not take more than 10 minutes of your time. If you are interested, the link to the form is below: https://docs.google.com/forms/d/e/1FAIpQLScY_VaNI-CEtTiJiLHgYCCguEZ7m9DUdQoxvFTjXFFLOGu2KA/viewform If you have additional questions, my email is in the linked google form. Thank you for your participation, it is deeply appreciated! submitted by /u/rsantos05 [link] [comments]  ( 41 min )
  • Open

    [D] Alphatensor benchmark code in Colab
    Hello everybody I was wondering if anybody tried to run the main factorisation code https://github.com/deepmind/alphatensor/blob/main/benchmarking/factorizations.py from alpha tensor on Google Colab, with Colab's GPUs ( Tesla T4). I know that Tesla T4 is not as the same as the V100 used in Deep Mind's paper, however, I can see that the tensor formulation for the matrix multiplication is highly inefficient, compared to standard JAX matrix multiplication. Any suggestion where am I wrong? submitted by /u/IndependentIce4553 [link] [comments]  ( 42 min )
    [N] Upcoming talk: "Open Problems in Deep Neural Networks: An Optimal Control Perspective"
    Open Problems in Deep Neural Networks: An Optimal Control Perspective Feb 13, 6:30 ET About the talk: Backpropagation is a widely used algorithm for training neural networks. Its key step, Stochastic Gradient Descent (SGD) has become one of the bedrocks of deep learning. Despite wide adoption, mathematically rigorous study of SGD's convergence for deep neural networks is still ongoing. Join us as graduate student Amoolya Tirumalai discusses an approach to the convergence problem inspired by optimal control theory. Following the Pontryargin maximum principle, an alternative forward-backward iterative system will be described. Toy examples will be shown, and problems in robustness and security will be discussed. About the speaker: Amoolya Tirumalai is a 4th year PhD Candidate in Electrical Engineering at the University of Maryland, College Park. His interests are (robust) optimal control, partial differential equations, differential games, mean-field games, safety-critical control, and (robust) machine learning. His thesis titled 'Multi-agent inference, decision-making and control: models, structure and performance evaluation' will be defended in August 2023. Mr. Tirumalai was conferred a BS in Biomedical Engineering from the Georgia Institute of Technology in 2018. submitted by /u/what_comes_next [link] [comments]  ( 43 min )
    [D] Publication Resume
    If we submit a publication to ICML and it is under anonymous review, can I list the title and authors on my resume which will be on my personal webpage? submitted by /u/BigDreamx [link] [comments]  ( 43 min )
    [R] Blogpost on comparing Chatbots like ChatGPT, LaMDA, Sparrow, BlenderBot 3, and Claude
    https://huggingface.co/blog/dialog-agents breaks down the techniques behind ChatGPT -- instruction fine-tuning, supervised fine-tuning, chain-of-thought, read teaming, and more. https://preview.redd.it/fv16fsemd9ea1.png?width=889&format=png&auto=webp&s=a8f24de27c40a946fec64eaa674f81ddef0d0cc3 submitted by /u/emailnazneen [link] [comments]  ( 42 min )
    [D] Efficient retrieval of research information for graduate research
    I have lot of notes about research papers in a particular directory and the number of files has started to become larger than what I can remember off the top of my head. It will continue to keep growing and I have begun to wonder the most efficient way to retrieve the information. I could use ripgrep and regular expressions to find the notes efficiently, but I imagine that if the database is very huge and I don't have the correct regular expression in use, then I might not retrieve the correct files. Inspired by chatGPT, I was impressed at how it presents info from the internet and speeds up my time for finding information even when I do not know the correct keywords. I figured a NLP model primarily trained on my database would be an easier task and I was wondering if someone had already created something like this as open source or how would they go about it? submitted by /u/waterstrider123 [link] [comments]  ( 46 min )
    [D] Accurate data or more data?
    If you are building a model and had the choice, would you prefer more accurate (~99%) but less data or a lot more data but less accurate (~90%)? submitted by /u/NoSympathy9787 [link] [comments]  ( 43 min )
    [R] Best service for scientific paper correction
    Hello, Anyone ever used a paper revision service and can recommend one ? I’m publishing my first paper next month and I want to have feedback from an expert on this domain. Thanks ! submitted by /u/Meddhouib10 [link] [comments]  ( 43 min )
    [D] Self-Supervised Contrastive Approaches that don’t use large batch size.
    This thread is dedicated to exploring the various techniques used in self-supervised contrastive learning that utilize standard batch sizes. I am seeking information on the current methods in this field, specifically those that do not rely on large batch sizes. I am familiar with the SimSiam paper published by META research, which utilizes 256 batch size for 8-GPUs. However, for individuals with limited resources such as myself, access to a large number of GPUs may not be feasible. As a result, I am interested in learning about other methods that can be used with smaller batch sizes and a single GPU, such as those that would be suitable for training on 1024x1024 input images. Additionally, I am curious about any more efficient architectures that have been developed in this field. This includes, but is not limited to, techniques used in natural language processing that may have applications in other areas of artificial intelligence. ***posted the same question in PyTorch forums, reposting here for wider reach. submitted by /u/shingekichan1996 [link] [comments]  ( 46 min )
    [R] Tsetlin Machine in Medical Research - Striking Differences Between Tsetlin Machine Interpretability and Deep Learning Attention
    ​ Tsetlin machine interpretability vs deep learning attention. Researchers at West China Hospital, Sichuan University, NORCE, and UiA have developed a Tsetlin machine-based architecture for premature ventricular contraction identification by analyzing long-term ECG signals. The experiments show that the Tsetlin machine is capable of producing human-interpretable rules, consistent with the clinical standard and medical knowledge. Simultaneously, the accuracy was comparable with deep CNN-based models. Paper: https://arxiv.org/abs/2301.10181 submitted by /u/olegranmo [link] [comments]  ( 43 min )
    [R] INSTRUCTOR One Embedder , Any Task: Instruction-Finetuned Text Embeddings Paper Explanation and Collab Demo
    In this video I explain about INSTRUCTOR, an instruction-finetuned text embedding model that can generate text embeddings tailored to any task (e.g., classification, retrieval, clustering, text evaluation, etc.) and domains (e.g., science, finance, etc.) by simply providing the task instruction, without any finetuning. Instructor achieves sota on 70 diverse embedding tasks! I also show a google collab demo of instructor https://youtu.be/vg38cq3KJ6M submitted by /u/Sea-Photo5230 [link] [comments]  ( 42 min )
    [R] Easiest way to train RNN's in MATLAB or Julia?
    I work as as a researcher and am kind of new to neural networks. I have an RNN (1e4 x 1e4 network) that I would like to train in either MATLAB or Julia. One option I considered is writing my own code for Hessian-free optimization, but the implementational details are really, really hard to figure out. I am aware there is a Theano or TF implementation of HFO but I I am primarily interested in having the code in MATLAB/Julia. Also, are there better/alternative techniques than Hessian-free optimization for training RNN's ? submitted by /u/NadaBrothers [link] [comments]  ( 44 min )
    Can an AI model licensed under the BigScience RAIL License v1.0 such as BLOOM be used in a program that is useful for any domain? [D]
    Example: the AI model BLOOM) is licensed under the BigScience RAIL License v1.0. The BigScience RAIL License v1.0 forbids that some types of usages: You agree not to use the Model or Derivatives of the Model: [...] To provide medical advice and medical results interpretation; To generate or disseminate information for the purpose to be used for administration of justice, law enforcement, immigration or asylum processes, such as predicting an individual will commit fraud/crime commitment (e.g. by text profiling, drawing causal relationships between assertions made in documents, indiscriminate and arbitrarily-targeted use). Am I allowed to use BLOOM in a program that is useful for any domain (e.g., a program to summarize or paraphrase some text, or perform question-answer on a text, or generate questions and their answers based on the text)? Since people could use the program for any domain, they could technically, for example, use the program to summarize a medical report or generate questions and their answers based on some asylum process to distribute to potential applicants. submitted by /u/Franck_Dernoncourt [link] [comments]  ( 44 min )
  • Open

    DM Control Suite vs. Original Environments
    I’m testing out DM control suite as I’d ideally like to do some stuff with the MuJoCo environments, e.g. Hopper. However, it seems as though they’ve changed Hopper from the OpenAI version? For instance, the action space is now 4-dimensional, and the bigger concern for me is that the reward seems to be specified differently. Per the gym documentation, the reward was healthy_reward + forward_reward - ctrl_cost, but when I’ve just started using the control suite version all rewards seem to be 0. The documentation for the control suite is quite poor, it says that for the hop task it is rewarded for torso heigh and forward velocity. It also doesn’t explain what the action dimensions correspond to (including the new dimension), so I can’t even manually test it! submitted by /u/DefinitelyNot4Burner [link] [comments]  ( 41 min )
    Does action masking reduce the ability of the agent to learn game rules?
    I recently experimented with training an sb3 PPO agent on a pretty complicated board game environment (just for fun). At first, I did regular PPO with an invalid action penalty, but it was making a lot of invalid moves and thus getting penalized and terminated early. It very slowly picked up on the signal and started to learn, but much too slowly to get any good results. After days of training, it could usually only play a handful of opening moves. On the other hand, I trained a Masked PPO in the same environment and it rapidly became quite good and was able to play relatively competitively after a few days of training. However, when I examined the outputs in an unmasked setting, it had little-to-no understanding of the game rules. It could still play OK but did not rank valid moves as the highest. This is a problem because I wanted to use it in a non-simulator setting without having to explicitly manually mask the moves by hand (or else convert a game state to a mask, both of which are tedious in my situation). Is this behavior expected? I have read some analyses that suggest that 1) MaskedPPO is much more sample efficient and should converge to a stronger agent MUCH faster, which makes sense, but also that 2) Even despite the invalid action masking, the agent should still learn game mechanics by proxy. If it's only being rewarded for making valid moves, it should learn to not make invalid moves implicitly since it never gets a reward signal for them (rather than being explicitly penalized). Thoughts? I only have a weak background in RL so apologies if this is naive. TLDR: Does action masking make the policy (or reward) network lazy? submitted by /u/TobusFire [link] [comments]  ( 44 min )
    Any List of Videogames With Reinforced Learning Agents Developed?
    Is there any list of videogames for which agents using reinforcement learning have been developed? Enquiring minds wanna know. submitted by /u/sanman [link] [comments]  ( 42 min )
    Learning to Exploit Elastic Actuators for Quadruped Locomotion
    submitted by /u/araffin2 [link] [comments]  ( 40 min )
    What is the limit on parallel environments?
    Is there some sort of hard or practical limit on the number of parallel environments that can be used? In Rllib when I try to use more 7 or 8 I get a scheduling error but yet I see people talking about 32 or 512 environments. What’s the limit? Is there some way for me to increase the amount I can train on? For example, my GPU seems under utilised but my CPU is very stressed, can I incrrrase GPU usage in Rllib? I have already set the number of GPUs to one. submitted by /u/centripetalstranger [link] [comments]  ( 42 min )
    Weird convergence of PPO reward when reducing number of envs
    Hi all, I am using Isaac Gym which enables the usage of multi environments. However, the reward value from the best environment has a huge difference, when training the agent with 512 environment (green) and 32 environment (orange), see below. I understand that the training should be slower when using less environments at the same time, but this difference tells me that I am missing something here... Does anyone have some hints? https://preview.redd.it/3rf0ax8653ea1.png?width=1589&format=png&auto=webp&s=b12defff668f186381b32c9f0385499b4413a3cd Below you can see the configs that I used for the PPO algorithm: config: name: ${resolve_default:CustomTask,${....experiment}} full_experiment_name: ${.name} env_name: rlgpu ppo: True mixed_precision: False normalize_input: True normalize_value: True value_bootstrap: True num_actors: ${....task.env.numEnvs} reward_shaper: scale_value: 1.0 normalize_advantage: True gamma: 0.99 tau: 0.95 learning_rate: 5e-4 lr_schedule: adaptive kl_threshold: 0.008 score_to_win: 10000000 max_epochs: ${resolve_default:5000,${....max_iterations}} save_best_after: 200 save_frequency: 100 print_stats: False use_action_masks: False grad_norm: 1.0 entropy_coef: 0.0001 truncate_grads: True e_clip: 0.2 horizon_length: 32 # num_envs * horizon length % minibatch_size minibatch_size: 1024 mini_epochs: 8 critic_coef: 4 clip_value: True seq_len: 4 bounds_loss_coef: 0.0001 submitted by /u/Fun-Moose-3841 [link] [comments]  ( 41 min )
  • Open

    Ten Productivity Hacks using ChaptGPT Generative AI Prompts
    Generative AI is suddenly everywhere. Because of this, the future of AI looks very bright indeed. There are many opportunities for generative AI to impact life and business in both positive and negative ways in the near future. Because the consequence of negative human impacts can easily far outweigh the benefits of positive human impacts, the… Read More »Ten Productivity Hacks using ChaptGPT Generative AI Prompts The post Ten Productivity Hacks using ChaptGPT Generative AI Prompts appeared first on Data Science Central.  ( 21 min )
    Innovation at the Convergence of Emerging Technologies: Business at the Edge
    In the context of digital transformation and innovation, there is no lack of “hot topics” to discuss. Emerging technologies are truly emerging everywhere. What is most exciting – and what demonstrates their greatest promise – is that these new technologies are converging to produce innovative new businesses, products, and services. Over the past decade, we… Read More »Innovation at the Convergence of Emerging Technologies: Business at the Edge The post Innovation at the Convergence of Emerging Technologies: Business at the Edge appeared first on Data Science Central.  ( 22 min )
    Five Principles of Safe Driving in AIS (Autonomous Intelligent Systems)
    In a recent article on Autonomous Intelligent Systems (AIS) [1], Ajit Joakar described various features and characteristics of such systems, including associated technologies and research areas, building blocks and core elements, critical factors for success, and cross-cutting enablers. He introduces AIS as an “emerging interdisciplinary field that deals with situations where humans interact with AI systems… Read More »Five Principles of Safe Driving in AIS (Autonomous Intelligent Systems) The post Five Principles of Safe Driving in AIS (Autonomous Intelligent Systems) appeared first on Data Science Central.  ( 23 min )
  • Open

    Learning with Queried Hints
    Posted by Sreenivas Gollapudi, Senior Staff Research Scientist, and Kostas Kollias, Staff Research Scientist, Google Research, Algorithms & Optimization Team In many computing applications the system needs to make decisions to serve requests that arrive in an online fashion. Consider, for instance, the example of a navigation app that responds to driver requests. In such settings there is inherent uncertainty about important aspects of the problem. For example, the preferences of the driver with respect to features of the route are often unknown and the delays of road segments can be uncertain. The field of online machine learning studies such settings and provides various techniques for decision-making problems under uncertainty. A navigation engine has to decide how to route thi…  ( 92 min )
  • Open

    Newbie in need – very bad NLP text generator needed
    Hell y'all, after spending an hour typing various combinations of "AI", "TEXT GENERATOR", "DATA FEEDING" and such I come to you with a humble request; can someone recommend me an AI text generator that needs to be fed actual, existing text(s) instead of giving it a prompt? I need something that will create a text based on an essay I will upload, and I really don't need the result to be super great. I will accept and in fact warmly welcome any nonsensical output, as I need AI to spew out trash, actually. I just need that trash to resemble things that actually exist. Grammatical errors are great, idiotic sentences like "actually, when the orange fades into original edition of Alexander, then how do we expect the succession to leave Canada?" is what I need, the less the final product sounds like what a human could write, the better. I don't have the proper lingo to google what I need to find, so I am deeply grateful for any suggestions. submitted by /u/lindybopperette [link] [comments]  ( 42 min )
    Computer Vision Development
    Hey!! I'm new to this side of things. I studied psych research, always had an interest in data visualization and neuroscience but didn't realize I should be piecing the two interests together and I have been too intimidated to take on the task of learning computer science. But I can't help myself any longer! I'm so fascinated and think reddit could be a great place to learn and chat about concepts. Any who... YA! I've started watching https://www.youtube.com/watch?v=vT1JzLTH4G4&list=PL3FW7Lu3i5JvHM8ljYj-zLfQRF3EO8sYv&index=1 and already can't believe we haven't solved the process of vision. Have we? Can we? The meta is getting to me. submitted by /u/angelacarolei [link] [comments]  ( 41 min )
  • Open

    Research Focus: Week of January 23, 2023
    Welcome to Research Focus, a new series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft. Revolutionizing Document AI with multimodal document foundation models   Organizations must digitize various documents, many with charts and images, to manage and streamline essential functions. Yet manually […] The post Research Focus: Week of January 23, 2023 appeared first on Microsoft Research.  ( 8 min )
    Biomedical Research Platform Terra Now Available on Microsoft Azure
    We stand at the threshold of a new era of precision medicine, where health and life sciences data hold the potential to dramatically propel and expand our understanding and treatment of human disease. One of the tools that we believe will help to enable precision medicine is Terra, the secure biomedical research platform co-developed by […] The post Biomedical Research Platform Terra Now Available on Microsoft Azure appeared first on Microsoft Research.  ( 9 min )
  • Open

    Build a loyalty points anomaly detector using Amazon Lookout for Metrics
    Today, gaining customer loyalty cannot be a one-off thing. A brand needs a focused and integrated plan to retain its best customers—put simply, it needs a customer loyalty program. Earn and burn programs are one of the main paradigms. A typical earn and burn program rewards customers after a certain number of visits or spend. […]  ( 7 min )
    Explain text classification model predictions using Amazon SageMaker Clarify
    Model explainability refers to the process of relating the prediction of a machine learning (ML) model to the input feature values of an instance in humanly understandable terms. This field is often referred to as explainable artificial intelligence (XAI). Amazon SageMaker Clarify is a feature of Amazon SageMaker that enables data scientists and ML engineers […]  ( 10 min )
    Upscale images with Stable Diffusion in Amazon SageMaker JumpStart
    In November 2022, we announced that AWS customers can generate images from text with Stable Diffusion models in Amazon SageMaker JumpStart. Today, we announce a new feature that lets you upscale images (resize images without losing quality) with Stable Diffusion models in JumpStart. An image that is low resolution, blurry, and pixelated can be converted […]  ( 10 min )
    Cohere brings language AI to Amazon SageMaker
    It’s an exciting day for the development community. Cohere’s state-of-the-art language AI is now available through Amazon SageMaker. This makes it easier for developers to deploy Cohere’s pre-trained generation language model to Amazon SageMaker, an end-to-end machine learning (ML) service. Developers, data scientists, and business analysts use Amazon SageMaker to build, train, and deploy ML models quickly and easily using its fully managed infrastructure, tools, and workflows.  ( 6 min )
  • Open

    Braced From Space: Startup Keeps Watchful Eye on Gas Pipeline Leaks Across the Globe
    As its name suggests, Orbital Sidekick is creating technology that acts as a buddy in outer space, keeping an eye on the globe using satellites to help keep it safe and sustainable. The San Francisco-based startup, a member of the NVIDIA Inception program, enables commercial and government users to optimize sustainable operations and security with Read article >  ( 6 min )
    NVIDIA CEO Ignites AI Conversation in Stockholm
    Jensen Huang headlines Stockholm AI confab, Berzelius supercomputer upgraded to 94 NVIDIA DGX A100 systems.  ( 6 min )
  • Open

    Number of bits in a particular integer
    When I think of bit twiddling, I think of C. So I was surprised to read Paul Khuong saying he thinks of Common Lisp (“CL”). As always when working with bits, I first doodled in SLIME/SBCL: CL’s bit manipulation functions are more expressive than C’s, and a REPL helps exploration. I would not have thought […] Number of bits in a particular integer first appeared on John D. Cook.  ( 5 min )
  • Open

    Deep Learning-Based Assessment of Cerebral Microbleeds in COVID-19. (arXiv:2301.09322v1 [eess.IV])
    Cerebral Microbleeds (CMBs), typically captured as hypointensities from susceptibility-weighted imaging (SWI), are particularly important for the study of dementia, cerebrovascular disease, and normal aging. Recent studies on COVID-19 have shown an increase in CMBs of coronavirus cases. Automatic detection of CMBs is challenging due to the small size and amount of CMBs making the classes highly imbalanced, lack of publicly available annotated data, and similarity with CMB mimics such as calcifications, irons, and veins. Hence, the existing deep learning methods are mostly trained on very limited research data and fail to generalize to unseen data with high variability and cannot be used in clinical setups. To this end, we propose an efficient 3D deep learning framework that is actively trained on multi-domain data. Two public datasets assigned for normal aging, stroke, and Alzheimer's disease analysis as well as an in-house dataset for COVID-19 assessment are used to train and evaluate the models. The obtained results show that the proposed method is robust to low-resolution images and achieves 78% recall and 80% precision on the entire test set with an average false positive of 1.6 per scan.  ( 2 min )
    Large-scale fine-grained semantic indexing of biomedical literature based on weakly-supervised deep learning. (arXiv:2301.09350v1 [cs.CL])
    Semantic indexing of biomedical literature is usually done at the level of MeSH descriptors, representing topics of interest for the biomedical community. Several related but distinct biomedical concepts are often grouped together in a single coarse-grained descriptor and are treated as a single topic for semantic indexing. This study proposes a new method for the automated refinement of subject annotations at the level of concepts, investigating deep learning approaches. Lacking labelled data for this task, our method relies on weak supervision based on concept occurrence in the abstract of an article. The proposed approach is evaluated on an extended large-scale retrospective scenario, taking advantage of concepts that eventually become MeSH descriptors, for which annotations become available in MEDLINE/PubMed. The results suggest that concept occurrence is a strong heuristic for automated subject annotation refinement and can be further enhanced when combined with dictionary-based heuristics. In addition, such heuristics can be useful as weak supervision for developing deep learning models that can achieve further improvement in some cases.  ( 2 min )
    Counterfactual (Non-)identifiability of Learned Structural Causal Models. (arXiv:2301.09031v1 [stat.ML])
    Recent advances in probabilistic generative modeling have motivated learning Structural Causal Models (SCM) from observational datasets using deep conditional generative models, also known as Deep Structural Causal Models (DSCM). If successful, DSCMs can be utilized for causal estimation tasks, e.g., for answering counterfactual queries. In this work, we warn practitioners about non-identifiability of counterfactual inference from observational data, even in the absence of unobserved confounding and assuming known causal structure. We prove counterfactual identifiability of monotonic generation mechanisms with single dimensional exogenous variables. For general generation mechanisms with multi-dimensional exogenous variables, we provide an impossibility result for counterfactual identifiability, motivating the need for parametric assumptions. As a practical approach, we propose a method for estimating worst-case errors of learned DSCMs' counterfactual predictions. The size of this error can be an essential metric for deciding whether or not DSCMs are a viable approach for counterfactual inference in a specific problem setting. In evaluation, our method confirms negligible counterfactual errors for an identifiable SCM from prior work, and also provides informative error bounds on counterfactual errors for a non-identifiable synthetic SCM.  ( 2 min )
    Parallel Approaches to Accelerate Bayesian Decision Trees. (arXiv:2301.09090v1 [stat.CO])
    Markov Chain Monte Carlo (MCMC) is a well-established family of algorithms primarily used in Bayesian statistics to sample from a target distribution when direct sampling is challenging. Existing work on Bayesian decision trees uses MCMC. Unfortunately, this can be slow, especially when considering large volumes of data. It is hard to parallelise the accept-reject component of the MCMC. None-the-less, we propose two methods for exploiting parallelism in the MCMC: in the first, we replace the MCMC with another numerical Bayesian approach, the Sequential Monte Carlo (SMC) sampler, which has the appealing property that it is an inherently parallel algorithm; in the second, we consider data partitioning. Both methods use multi-core processing with a HighPerformance Computing (HPC) resource. We test the two methods in various study settings to determine which method is the most beneficial for each test case. Experiments show that data partitioning has limited utility in the settings we consider and that the use of the SMC sampler can improve run-time (compared to the sequential implementation) by up to a factor of 343.  ( 2 min )
    Learning Reservoir Dynamics with Temporal Self-Modulation. (arXiv:2301.09235v1 [cs.LG])
    Reservoir computing (RC) can efficiently process time-series data by transferring the input signal to randomly connected recurrent neural networks (RNNs), which are referred to as a reservoir. The high-dimensional representation of time-series data in the reservoir significantly simplifies subsequent learning tasks. Although this simple architecture allows fast learning and facile physical implementation, the learning performance is inferior to that of other state-of-the-art RNN models. In this paper, to improve the learning ability of RC, we propose self-modulated RC (SM-RC), which extends RC by adding a self-modulation mechanism. The self-modulation mechanism is realized with two gating variables: an input gate and a reservoir gate. The input gate modulates the input signal, and the reservoir gate modulates the dynamical properties of the reservoir. We demonstrated that SM-RC can perform attention tasks where input information is retained or discarded depending on the input signal. We also found that a chaotic state emerged as a result of learning in SM-RC. This indicates that self-modulation mechanisms provide RC with qualitatively different information-processing capabilities. Furthermore, SM-RC outperformed RC in NARMA and Lorentz model tasks. In particular, SM-RC achieved a higher prediction accuracy than RC with a reservoir 10 times larger in the Lorentz model tasks. Because the SM-RC architecture only requires two additional gates, it is physically implementable as RC, providing a new direction for realizing edge AI.  ( 2 min )
    Dataset Distillation: A Comprehensive Review. (arXiv:2301.07014v2 [cs.LG] UPDATED)
    Recent success of deep learning is largely attributed to the sheer amount of data used for training deep neural networks.Despite the unprecedented success, the massive data, unfortunately, significantly increases the burden on storage and transmission and further gives rise to a cumbersome model training process. Besides, relying on the raw data for training \emph{per se} yields concerns about privacy and copyright. To alleviate these shortcomings, dataset distillation~(DD), also known as dataset condensation (DC), was introduced and has recently attracted much research attention in the community. Given an original dataset, DD aims to derive a much smaller dataset containing synthetic samples, based on which the trained models yield performance comparable with those trained on the original dataset. In this paper, we give a comprehensive review and summary of recent advances in DD and its application. We first introduce the task formally and propose an overall algorithmic framework followed by all existing DD methods. Next, we provide a systematic taxonomy of current methodologies in this area, and discuss their theoretical interconnections. We also present current challenges in DD through extensive experiments and envision possible directions for future works.  ( 2 min )
    Talking About Large Language Models. (arXiv:2212.03551v3 [cs.CL] UPDATED)
    Thanks to rapid progress in artificial intelligence, we have entered an era when technology and philosophy intersect in interesting ways. Sitting squarely at the centre of this intersection are large language models (LLMs). The more adept LLMs become at mimicking human language, the more vulnerable we become to anthropomorphism, to seeing the systems in which they are embedded as more human-like than they really are. This trend is amplified by the natural tendency to use philosophically loaded terms, such as "knows", "believes", and "thinks", when describing these systems. To mitigate this trend, this paper advocates the practice of repeatedly stepping back to remind ourselves of how LLMs, and the systems of which they form a part, actually work. The hope is that increased scientific precision will encourage more philosophical nuance in the discourse around artificial intelligence, both within the field and in the public sphere.  ( 2 min )
    Generative Adversarial Networks to infer velocity components in rotating turbulent flows. (arXiv:2301.07541v1 [physics.flu-dyn] CROSS LISTED)
    Inference problems for two-dimensional snapshots of rotating turbulent flows are studied. We perform a systematic quantitative benchmark of point-wise and statistical reconstruction capabilities of the linear Extended Proper Orthogonal Decomposition (EPOD) method, a non-linear Convolutional Neural Network (CNN) and a Generative Adversarial Network (GAN). We attack the important task of inferring one velocity component out of the measurement of a second one, and two cases are studied: (I) both components lay in the plane orthogonal to the rotation axis and (II) one of the two is parallel to the rotation axis. We show that EPOD method works well only for the former case where both components are strongly correlated, while CNN and GAN always outperform EPOD both concerning point-wise and statistical reconstructions. For case (II), when the input and output data are weakly correlated, all methods fail to reconstruct faithfully the point-wise information. In this case, only GAN is able to reconstruct the field in a statistical sense. The analysis is performed using both standard validation tools based on L2 spatial distance between the prediction and the ground truth and more sophisticated multi-scale analysis using wavelet decomposition. Statistical validation is based on standard Jensen-Shannon divergence between the probability density functions, spectral properties and multi-scale flatness.  ( 2 min )
    Reconstructing Rayleigh-Benard flows out of temperature-only measurements using Physics-Informed Neural Networks. (arXiv:2301.07769v1 [physics.flu-dyn] CROSS LISTED)
    We investigate the capabilities of Physics-Informed Neural Networks (PINNs) to reconstruct turbulent Rayleigh-Benard flows using only temperature information. We perform a quantitative analysis of the quality of the reconstructions at various amounts of low-passed-filtered information and turbulent intensities. We compare our results with those obtained via nudging, a classical equation-informed data assimilation technique. At low Rayleigh numbers, PINNs are able to reconstruct with high precision, comparable to the one achieved with nudging. At high Rayleigh numbers, PINNs outperform nudging and are able to achieve satisfactory reconstruction of the velocity fields only when data for temperature is provided with high spatial and temporal density. When data becomes sparse, the PINNs performance worsens, not only in a point-to-point error sense but also, and contrary to nudging, in a statistical sense, as can be seen in the probability density functions and energy spectra.  ( 2 min )
    Clustering Categorical Data: Soft Rounding k-modes. (arXiv:2210.09640v2 [cs.LG] UPDATED)
    Over the last three decades, researchers have intensively explored various clustering tools for categorical data analysis. Despite the proposal of various clustering algorithms, the classical k-modes algorithm remains a popular choice for unsupervised learning of categorical data. Surprisingly, our first insight is that in a natural generative block model, the k-modes algorithm performs poorly for a large range of parameters. We remedy this issue by proposing a soft rounding variant of the k-modes algorithm (SoftModes) and theoretically prove that our variant addresses the drawbacks of the k-modes algorithm in the generative model. Finally, we empirically verify that SoftModes performs well on both synthetic and real-world datasets.  ( 2 min )
    Indirect Active Learning. (arXiv:2206.01454v3 [math.ST] UPDATED)
    Traditional models of active learning assume a learner can directly manipulate or query a covariate $X$ in order to study its relationship with a response $Y$. However, if $X$ is a feature of a complex system, it may be possible only to indirectly influence $X$ by manipulating a control variable $Z$, a scenario we refer to as Indirect Active Learning. Under a nonparametric model of Indirect Active Learning with a fixed budget, we study minimax convergence rates for estimating the relationship between $X$ and $Y$ locally at a point, obtaining different rates depending on the complexities and noise levels of the relationships between $Z$ and $X$ and between $X$ and $Y$. We also identify minimax rates for passive learning under comparable assumptions. In many cases, our results show that, while there is an asymptotic benefit to active learning, this benefit is fully realized by a simple two-stage learner that runs two passive experiments in sequence.  ( 2 min )
    A Multi-Phase Approach for Product Hierarchy Forecasting in Supply Chain Management: Application to MonarchFx Inc. (arXiv:2006.08931v2 [stat.ML] UPDATED)
    Hierarchical time series demands exist in many industries and are often associated with the product, time frame, or geographic aggregations. Traditionally, these hierarchies have been forecasted using top-down, bottom-up, or middle-out approaches. The question we aim to answer is how to utilize child-level forecasts to improve parent-level forecasts in a hierarchical supply chain. Improved forecasts can be used to considerably reduce logistics costs, especially in e-commerce. We propose a novel multi-phase hierarchical (MPH) approach. Our method involves forecasting each series in the hierarchy independently using machine learning models, then combining all forecasts to allow a second phase model estimation at the parent level. Sales data from MonarchFx Inc. (a logistics solutions provider) is used to evaluate our approach and compare it to bottom-up and top-down methods. Our results demonstrate an 82-90% improvement in forecast accuracy using the proposed approach. Using the proposed method, supply chain planners can derive more accurate forecasting models to exploit the benefit of multivariate data.
    Learning-Based Data Storage [Vision] (Technical Report). (arXiv:2206.05778v3 [cs.DB] UPDATED)
    Deep neural network (DNN) and its variants have been extensively used for a wide spectrum of real applications such as image classification, face/speech recognition, fraud detection, and so on. In addition to many important machine learning tasks, as artificial networks emulating the way brain cells function, DNNs also show the capability of storing non-linear relationships between input and output data, which exhibits the potential of storing data via DNNs. We envision a new paradigm of data storage, "DNN-as-a-Database", where data are encoded in well-trained machine learning models. Compared with conventional data storage that directly records data in raw formats, learning-based structures (e.g., DNN) can implicitly encode data pairs of inputs and outputs and compute/materialize actual output data of different resolutions only if input data are provided. This new paradigm can greatly enhance the data security by allowing flexible data privacy settings on different levels, achieve low space consumption and fast computation with the acceleration of new hardware (e.g., Diffractive Neural Network and AI chips), and can be generalized to distributed DNN-based storage/computing. In this paper, we propose this novel concept of learning-based data storage, which utilizes a learning structure called learning-based memory unit (LMU), to store, organize, and retrieve data. As a case study, we use DNNs as the engine in the LMU, and study the data capacity and accuracy of the DNN-based data storage. Our preliminary experimental results show the feasibility of the learning-based data storage by achieving high (100%) accuracy of the DNN storage. We explore and design effective solutions to utilize the DNN-based data storage to manage and query relational tables. We discuss how to generalize our solutions to other data types (e.g., graphs) and environments such as distributed DNN storage/computing.
    Estimating individual treatment effects under unobserved confounding using binary instruments. (arXiv:2208.08544v3 [stat.ME] UPDATED)
    Estimating conditional average treatment effects (CATEs) from observational data is relevant in many fields such as personalized medicine. However, in practice, the treatment assignment is usually confounded by unobserved variables and thus introduces bias. A remedy to remove the bias is the use of instrumental variables (IVs). Such settings are widespread in medicine (e.g., trials where the treatment assignment is used as binary IV). In this paper, we propose a novel, multiply robust machine learning framework, called MRIV, for estimating CATEs using binary IVs and thus yield an unbiased CATE estimator. Different from previous work for binary IVs, our framework estimates the CATE directly via a pseudo outcome regression. (1)~We provide a theoretical analysis where we show that our framework yields multiple robust convergence rates: our CATE estimator achieves fast convergence even if several nuisance estimators converge slowly. (2)~We further show that our framework asymptotically outperforms state-of-the-art plug-in IV methods for CATE estimation, in the sense that it achieves a faster rate of convergence if the CATE is smoother than the individual outcome surfaces. (3)~We build upon our theoretical results and propose a tailored deep neural network architecture called MRIV-Net for CATE estimation using binary IVs. Across various computational experiments, we demonstrate empirically that our MRIV-Net achieves state-of-the-art performance. To the best of our knowledge, our MRIV is the first multiply robust machine learning framework tailored to estimating CATEs in the binary IV setting.
    Computationally-efficient initialisation of GPs: The generalised variogram method. (arXiv:2210.05394v2 [cs.LG] UPDATED)
    We present a computationally-efficient strategy to find the hyperparameters of a Gaussian process (GP) avoiding the computation of the likelihood function. The found hyperparameters can then be used directly for regression or passed as initial conditions to maximum-likelihood (ML) training. Motivated by the fact that training a GP via ML is equivalent (on average) to minimising the KL-divergence between the true and learnt model, we set to explore different metrics/divergences among GPs that are computationally inexpensive and provide estimates close to those of ML. In particular, we identify the GP hyperparameters by projecting the empirical covariance or (Fourier) power spectrum onto a parametric family, thus proposing and studying various measures of discrepancy operating on the temporal or frequency domains. Our contribution extends the Variogram method developed by the geostatistics literature and, accordingly, it is referred to as the Generalised Variogram method (GVM). In addition to the theoretical presentation of GVM, we provide experimental validation in terms of accuracy, consistency with ML and computational complexity for different kernels using synthetic and real-world data.
    FedRolex: Model-Heterogeneous Federated Learning with Rolling Sub-Model Extraction. (arXiv:2212.01548v2 [cs.LG] UPDATED)
    Most cross-device federated learning (FL) studies focus on the model-homogeneous setting where the global server model and local client models are identical. However, such constraint not only excludes low-end clients who would otherwise make unique contributions to model training but also restrains clients from training large models due to on-device resource bottlenecks. In this work, we propose FedRolex, a partial training (PT)-based approach that enables model-heterogeneous FL and can train a global server model larger than the largest client model. At its core, FedRolex employs a rolling sub-model extraction scheme that allows different parts of the global server model to be evenly trained, which mitigates the client drift induced by the inconsistency between individual client models and server model architectures. We show that FedRolex outperforms state-of-the-art PT-based model-heterogeneous FL methods (e.g. Federated Dropout) and reduces the gap between model-heterogeneous and model-homogeneous FL, especially under the large-model large-dataset regime. In addition, we provide theoretical statistical analysis on its advantage over Federated Dropout and evaluate FedRolex on an emulated real-world device distribution to show that FedRolex can enhance the inclusiveness of FL and boost the performance of low-end devices that would otherwise not benefit from FL. Our code is available at: https://github.com/AIoT-MLSys-Lab/FedRolex
    Incorporating Task-specific Concept Knowledge into Script Learning. (arXiv:2209.00068v2 [cs.CL] UPDATED)
    In this paper, we present Tetris, a new task of Goal-Oriented Script Completion. Unlike previous work, it considers a more realistic and general setting, where the input includes not only the goal but also additional user context, including preferences and history. To address this problem, we propose a novel approach, which uses two techniques to improve performance: (1) concept prompting, and (2) script-oriented contrastive learning that addresses step repetition and hallucination problems. On our WikiHow-based dataset, we find that both methods improve performance. The dataset, repository, and models will be publicly available to facilitate further research on this new task.
    Concept-level Debugging of Part-Prototype Networks. (arXiv:2205.15769v2 [cs.LG] UPDATED)
    Part-prototype Networks (ProtoPNets) are concept-based classifiers designed to achieve the same performance as black-box models without compromising transparency. ProtoPNets compute predictions based on similarity to class-specific part-prototypes learned to recognize parts of training examples, making it easy to faithfully determine what examples are responsible for any target prediction and why. However, like other models, they are prone to picking up confounders and shortcuts from the data, thus suffering from compromised prediction accuracy and limited generalization. We propose ProtoPDebug, an effective concept-level debugger for ProtoPNets in which a human supervisor, guided by the model's explanations, supplies feedback in the form of what part-prototypes must be forgotten or kept, and the model is fine-tuned to align with this supervision. Our experimental evaluation shows that ProtoPDebug outperforms state-of-the-art debuggers for a fraction of the annotation cost. An online experiment with laypeople confirms the simplicity of the feedback requested to the users and the effectiveness of the collected feedback for learning confounder-free part-prototypes. ProtoPDebug is a promising tool for trustworthy interactive learning in critical applications, as suggested by a preliminary evaluation on a medical decision making task.
    A Comprehensive Survey on Enterprise Financial Risk Analysis: Problems, Methods, Spotlights and Applications. (arXiv:2211.14997v2 [q-fin.RM] UPDATED)
    Enterprise financial risk analysis aims at predicting the enterprises' future financial risk.Due to the wide application, enterprise financial risk analysis has always been a core research issue in finance. Although there are already some valuable and impressive surveys on risk management, these surveys introduce approaches in a relatively isolated way and lack the recent advances in enterprise financial risk analysis. Due to the rapid expansion of the enterprise financial risk analysis, especially from the computer science and big data perspective, it is both necessary and challenging to comprehensively review the relevant studies. This survey attempts to connect and systematize the existing enterprise financial risk researches, as well as to summarize and interpret the mechanisms and the strategies of enterprise financial risk analysis in a comprehensive way, which may help readers have a better understanding of the current research status and ideas. This paper provides a systematic literature review of over 300 articles published on enterprise risk analysis modelling over a 50-year period, 1968 to 2022. We first introduce the formal definition of enterprise risk as well as the related concepts. Then, we categorized the representative works in terms of risk type and summarized the three aspects of risk analysis. Finally, we compared the analysis methods used to model the enterprise financial risk. Our goal is to clarify current cutting-edge research and its possible future directions to model enterprise risk, aiming to fully understand the mechanisms of enterprise risk communication and influence and its application on corporate governance, financial institution and government regulation.
    On Investigating the Conservative Property of Score-Based Generative Models. (arXiv:2209.12753v2 [cs.LG] UPDATED)
    Existing Score-based Generative Models (SGMs) can be categorized into constrained SGMs (CSGMs) or unconstrained SGMs (USGMs) according to their parameterization approaches. CSGMs model probability density functions as Boltzmann distributions, and assign their predictions as the negative gradients of some scalar-valued energy functions. On the other hand, USGMs employ flexible architectures capable of directly estimating scores without the need to explicitly model energy functions. In this paper, we demonstrate that the architectural constraints of CSGMs may limit their modeling ability. In addition, we show that USGMs' inability to preserve the property of conservativeness may lead to degraded sampling performance in practice. To address the above issues, we propose Quasi-Conservative Score-based Generative Models (QCSGMs) for keeping the advantages of both CSGMs and USGMs. Our theoretical derivations demonstrate that the training objective of QCSGMs can be efficiently integrated into the training processes by leveraging the Hutchinson trace estimator. In addition, our experimental results on the CIFAR-10, CIFAR-100, ImageNet, and SVHN datasets validate the effectiveness of QCSGMs. Finally, we justify the advantage of QCSGMs using an example of a one-layered autoencoder.
    Learning from Long-Tailed Noisy Data with Sample Selection and Balanced Loss. (arXiv:2211.10906v2 [cs.LG] UPDATED)
    The success of deep learning depends on large-scale and well-curated training data, while data in real-world applications are commonly long-tailed and noisy. Many methods have been proposed to deal with long-tailed data or noisy data, while a few methods are developed to tackle long-tailed noisy data. To solve this, we propose a robust method for learning from long-tailed noisy data with sample selection and balanced loss. Specifically, we separate the noisy training data into clean labeled set and unlabeled set with sample selection, and train the deep neural network in a semi-supervised manner with a balanced loss based on model bias. Extensive experiments on benchmarks demonstrate that our method outperforms existing state-of-the-art methods.
    FairGBM: Gradient Boosting with Fairness Constraints. (arXiv:2209.07850v3 [cs.LG] UPDATED)
    Tabular data is prevalent in many high stakes domains, such as financial services or public policy. Gradient boosted decision trees (GBDT) are popular in these settings due to performance guarantees and low cost. However, in consequential decision-making fairness is a foremost concern. Despite GBDT's popularity, existing in-processing Fair ML methods are either inapplicable to GBDT, or incur in significant train time overhead, or are inadequate for problems with high class imbalance -- a typical issue in these domains. We present FairGBM, a dual ascent learning framework for training GBDT under fairness constraints, with little to no impact on predictive performance when compared to unconstrained GBDT. Since observational fairness metrics are non-differentiable, we have to employ a "proxy-Lagrangian" formulation using smooth convex error rate proxies to enable gradient-based optimization. Our implementation shows an order of magnitude speedup in training time when compared with related work, a pivotal aspect to foster the widespread adoption of FairGBM by real-world practitioners.
    STaSy: Score-based Tabular data Synthesis. (arXiv:2210.04018v2 [cs.LG] UPDATED)
    Tabular data synthesis is a long-standing research topic in machine learning. Many different methods have been proposed over the past decades, ranging from statistical methods to deep generative methods. However, it has not always been successful due to the complicated nature of real-world tabular data. In this paper, we present a new model named Score-based Tabular data Synthesis (STaSy) and its training strategy based on the paradigm of score-based generative modeling. Despite the fact that score-based generative models have resolved many issues in generative models, there still exists room for improvement in tabular data synthesis. Our proposed training strategy includes a self-paced learning technique and a fine-tuning strategy, which further increases the sampling quality and diversity by stabilizing the denoising score matching training. Furthermore, we also conduct rigorous experimental studies in terms of the generative task trilemma: sampling quality, diversity, and time. In our experiments with 15 benchmark tabular datasets and 7 baselines, our method outperforms existing methods in terms of task-dependant evaluations and diversity.
    Evaluating Synthetically Generated Data from Small Sample Sizes: An Experimental Study. (arXiv:2211.10760v3 [cs.LG] UPDATED)
    In this paper, we propose a method for measuring the similarity low sample tabular data with synthetically generated data with a larger number of samples than original. This process is also known as data augmentation. But significance levels obtained from non-parametric tests are suspect when sample size is small. Our method uses a combination of geometry, topology and robust statistics for hypothesis testing in order to compare the validity of generated data. We also compare the results with common global metric methods available in the literature for large sample size data.
    On the power of foundation models. (arXiv:2211.16327v2 [cs.AI] UPDATED)
    With infinitely many high-quality data points, infinite computational power, an infinitely large foundation model with a perfect training algorithm and guaranteed zero generalization error on the pretext task, can the model be used for everything? This question cannot be answered by the existing theory of representation, optimization or generalization, because the issues they mainly investigate are assumed to be nonexistent here. In this paper, we show that category theory provides powerful machinery to answer this question. We have proved three results. The first one limits the power of prompt-based learning, saying that the model can solve a downstream task with prompts if and only if the task is representable. The second one says fine tuning does not have this limit, as a foundation model with the minimum required power (up to symmetry) can theoretically solve downstream tasks with fine tuning and enough resources. Our final result can be seen as a new type of generalization theorem, showing that the foundation model can generate unseen objects from the target category (e.g., images) using the structural information from the source category (e.g., texts). Along the way, we provide a categorical framework for supervised and self-supervised learning, which might be of independent interest.
    SlotFormer: Unsupervised Visual Dynamics Simulation with Object-Centric Models. (arXiv:2210.05861v2 [cs.CV] UPDATED)
    Understanding dynamics from visual observations is a challenging problem that requires disentangling individual objects from the scene and learning their interactions. While recent object-centric models can successfully decompose a scene into objects, modeling their dynamics effectively still remains a challenge. We address this problem by introducing SlotFormer -- a Transformer-based autoregressive model operating on learned object-centric representations. Given a video clip, our approach reasons over object features to model spatio-temporal relationships and predicts accurate future object states. In this paper, we successfully apply SlotFormer to perform video prediction on datasets with complex object interactions. Moreover, the unsupervised SlotFormer's dynamics model can be used to improve the performance on supervised downstream tasks, such as Visual Question Answering (VQA), and goal-conditioned planning. Compared to past works on dynamics modeling, our method achieves significantly better long-term synthesis of object dynamics, while retaining high quality visual generation. Besides, SlotFormer enables VQA models to reason about the future without object-level labels, even outperforming counterparts that use ground-truth annotations. Finally, we show its ability to serve as a world model for model-based planning, which is competitive with methods designed specifically for such tasks.
    Shortest Path Networks for Graph Property Prediction. (arXiv:2206.01003v4 [cs.LG] UPDATED)
    Most graph neural network models rely on a particular message passing paradigm, where the idea is to iteratively propagate node representations of a graph to each node in the direct neighborhood. While very prominent, this paradigm leads to information propagation bottlenecks, as information is repeatedly compressed at intermediary node representations, which causes loss of information, making it practically impossible to gather meaningful signals from distant nodes. To address this, we propose shortest path message passing neural networks, where the node representations of a graph are propagated to each node in the shortest path neighborhoods. In this setting, nodes can directly communicate between each other even if they are not neighbors, breaking the information bottleneck and hence leading to more adequately learned representations. Our framework generalizes message passing neural networks, resulting in a class of more expressive models, including some recent state-of-the-art models. We verify the capacity of a basic model of this framework on dedicated synthetic experiments, and on real-world graph classification and regression benchmarks, and obtain state-of-the art results.
    Optimism in Face of a Context: Regret Guarantees for Stochastic Contextual MDP. (arXiv:2207.11126v2 [cs.LG] UPDATED)
    We present regret minimization algorithms for stochastic contextual MDPs under minimum reachability assumption, using an access to an offline least square regression oracle. We analyze three different settings: where the dynamics is known, where the dynamics is unknown but independent of the context and the most challenging setting where the dynamics is unknown and context-dependent. For the latter, our algorithm obtains regret bound of $\widetilde{O}( (H+{1}/{p_{min}})H|S|^{3/2}\sqrt{|A|T\log(\max\{|\mathcal{G}|,|\mathcal{P}|\}/\delta)})$ with probability $1-\delta$, where $\mathcal{P}$ and $\mathcal{G}$ are finite and realizable function classes used to approximate the dynamics and rewards respectively, $p_{min}$ is the minimum reachability parameter, $S$ is the set of states, $A$ the set of actions, $H$ the horizon, and $T$ the number of episodes. To our knowledge, our approach is the first optimistic approach applied to contextual MDPs with general function approximation (i.e., without additional knowledge regarding the function class, such as it being linear and etc.). We present a lower bound of $\Omega(\sqrt{T H |S| |A| \ln(|\mathcal{G}|)/\ln(|A|)})$, on the expected regret which holds even in the case of known dynamics. Lastly, we discuss an extension of our results to CMDPs without minimum reachability, that obtains $\widetilde{O}(T^{3/4})$ regret.
    DIVISION: Memory Efficient Training via Dual Activation Precision. (arXiv:2208.04187v3 [cs.LG] UPDATED)
    Existing work of activation compressed training relies on searching for optimal bit-width during DNN training to reduce the quantization noise, which makes the procedure complicated and less transparent. To this end, we propose a simple and effective method to compress DNN training. Our method is motivated by an instructive observation: DNN backward propagation mainly utilizes the low-frequency component (LFC) of the activation maps, while the majority of memory is for caching the high-frequency component (HFC) during the training. This indicates the HFC of activation maps is highly redundant and compressible during DNN training, which inspires our proposed Dual Activation Precision (DIVISION). During the training, DIVISION preserves the high-precision copy of LFC and compresses the HFC into a light-weight copy with low numerical precision. This can significantly reduce the memory cost without negatively affecting the precision of backward propagation such that DIVISION maintains competitive model accuracy. Experimental results show DIVISION achieves over 10x compression of activation maps, and significantly higher training throughput than state-of-the-art ACT methods, without loss of model accuracy.
    Explainable Image Quality Assessments in Teledermatological Photography. (arXiv:2209.04699v2 [cs.CV] UPDATED)
    Image quality is a crucial factor in the effectiveness and efficiency of teledermatological consultations. However, up to 50% of images sent by patients have quality issues, thus increasing the time to diagnosis and treatment. An automated, easily deployable, explainable method for assessing image quality is necessary to improve the current teledermatological consultation flow. We introduce ImageQX, a convolutional neural network for image quality assessment with a learning mechanism for identifying the most common poor image quality explanations: bad framing, bad lighting, blur, low resolution, and distance issues. ImageQX was trained on 26,635 photographs and validated on 9,874 photographs, each annotated with image quality labels and poor image quality explanations by up to 12 board-certified dermatologists. The photographic images were taken between 2017 and 2019 using a mobile skin disease tracking application accessible worldwide. Our method achieves expert-level performance for both image quality assessment and poor image quality explanation. For image quality assessment, ImageQX obtains a macro F1-score of 0.73 +- 0.01, which places it within standard deviation of the pairwise inter-rater F1-score of 0.77 +- 0.07. For poor image quality explanations, our method obtains F1-scores of between 0.37 +- 0.01 and 0.70 +- 0.01, similar to the inter-rater pairwise F1-score of between 0.24 +- 0.15 and 0.83 +- 0.06. Moreover, with a size of only 15 MB, ImageQX is easily deployable on mobile devices. With an image quality detection performance similar to that of dermatologists, incorporating ImageQX into the teledermatology flow can enable a better, faster flow for remote consultations.
    Pseudo-Hamiltonian Neural Networks with State-Dependent External Forces. (arXiv:2206.02660v4 [cs.LG] UPDATED)
    Hybrid machine learning based on Hamiltonian formulations has recently been successfully demonstrated for simple mechanical systems, both energy conserving and not energy conserving. We introduce a pseudo-Hamiltonian formulation that is a generalization of the Hamiltonian formulation via the port-Hamiltonian formulation, and show that pseudo-Hamiltonian neural network models can be used to learn external forces acting on a system. We argue that this property is particularly useful when the external forces are state dependent, in which case it is the pseudo-Hamiltonian structure that facilitates the separation of internal and external forces. Numerical results are provided for a forced and damped mass-spring system and a tank system of higher complexity, and a symmetric fourth-order integration scheme is introduced for improved training on sparse and noisy data.
    Overfitting in quantum machine learning and entangling dropout. (arXiv:2205.11446v2 [quant-ph] UPDATED)
    The ultimate goal in machine learning is to construct a model function that has a generalization capability for unseen dataset, based on given training dataset. If the model function has too much expressibility power, then it may overfit to the training data and as a result lose the generalization capability. To avoid such overfitting issue, several techniques have been developed in the classical machine learning regime, and the dropout is one such effective method. This paper proposes a straightforward analogue of this technique in the quantum machine learning regime, the entangling dropout, meaning that some entangling gates in a given parametrized quantum circuit are randomly removed during the training process to reduce the expressibility of the circuit. Some simple case studies are given to show that this technique actually suppresses the overfitting.
    On A Mallows-type Model For (Ranked) Choices. (arXiv:2207.01783v2 [cs.LG] UPDATED)
    We consider a preference learning setting where every participant chooses an ordered list of $k$ most preferred items among a displayed set of candidates. (The set can be different for every participant.) We identify a distance-based ranking model for the population's preferences and their (ranked) choice behavior. The ranking model resembles the Mallows model but uses a new distance function called Reverse Major Index (RMJ). We find that despite the need to sum over all permutations, the RMJ-based ranking distribution aggregates into (ranked) choice probabilities with simple closed-form expression. We develop effective methods to estimate the model parameters and showcase their generalization power using real data, especially when there is a limited variety of display sets.
    Predicting highway lane-changing maneuvers: A benchmark analysis of machine and ensemble learning algorithms. (arXiv:2204.10807v3 [cs.LG] UPDATED)
    Understanding and predicting highway lane-change maneuvers is essential for driving modeling and its automation. The development of data-based lane-changing decision-making algorithms is nowadays in full expansion. We compare empirically in this article different machine and ensemble learning classification techniques to the MOBIL rule-based model using trajectory data of European two-lane highways. The analysis relies on instantaneous measurements of up to twenty-four spatial-temporal variables with the four neighboring vehicles on current and adjacent lanes. Preliminary descriptive investigations by principal component and logistic analyses allow identifying main variables intending a driver to change lanes. We predict two types of discretionary lane-change maneuvers: overtaking (from the slow to the fast lane) and fold-down (from the fast to the slow lane). The prediction accuracy is quantified using total, lane-changing and lane-keeping errors and associated receiver operating characteristic curves. The benchmark analysis includes logistic model, linear discriminant, decision tree, na\"ive Bayes classifier, support vector machine, neural network machine learning algorithms, and up to ten bagging and stacking ensemble learning meta-heuristics. If the rule-based model provides limited predicting accuracy, the data-based algorithms, devoid of modeling bias, allow significant prediction improvements. Cross validations show that selected neural networks and stacking algorithms allow predicting from a single observation both fold-down and overtaking maneuvers up to four seconds in advance with high accuracy.
    EvenNet: Ignoring Odd-Hop Neighbors Improves Robustness of Graph Neural Networks. (arXiv:2205.13892v2 [cs.LG] UPDATED)
    Graph Neural Networks (GNNs) have received extensive research attention for their promising performance in graph machine learning. Despite their extraordinary predictive accuracy, existing approaches, such as GCN and GPRGNN, are not robust in the face of homophily changes on test graphs, rendering these models vulnerable to graph structural attacks and with limited capacity in generalizing to graphs of varied homophily levels. Although many methods have been proposed to improve the robustness of GNN models, most of these techniques are restricted to the spatial domain and employ complicated defense mechanisms, such as learning new graph structures or calculating edge attentions. In this paper, we study the problem of designing simple and robust GNN models in the spectral domain. We propose EvenNet, a spectral GNN corresponding to an even-polynomial graph filter. Based on our theoretical analysis in both spatial and spectral domains, we demonstrate that EvenNet outperforms full-order models in generalizing across homophilic and heterophilic graphs, implying that ignoring odd-hop neighbors improves the robustness of GNNs. We conduct experiments on both synthetic and real-world datasets to demonstrate the effectiveness of EvenNet. Notably, EvenNet outperforms existing defense models against structural attacks without introducing additional computational costs and maintains competitiveness in traditional node classification tasks on homophilic and heterophilic graphs.
    Contrastive Learning for Unsupervised Domain Adaptation of Time Series. (arXiv:2206.06243v3 [cs.LG] UPDATED)
    Unsupervised domain adaptation (UDA) aims at learning a machine learning model using a labeled source domain that performs well on a similar yet different, unlabeled target domain. UDA is important in many applications such as medicine, where it is used to adapt risk scores across different patient cohorts. In this paper, we develop a novel framework for UDA of time series data, called CLUDA. Specifically, we propose a contrastive learning framework to learn contextual representations in multivariate time series, so that these preserve label information for the prediction task. In our framework, we further capture the variation in the contextual representations between source and target domain via a custom nearest-neighbor contrastive learning. To the best of our knowledge, ours is the first framework to learn domain-invariant, contextual representation for UDA of time series data. We evaluate our framework using a wide range of time series datasets to demonstrate its effectiveness and show that it achieves state-of-the-art performance for time series UDA.
    An Explainable-AI approach for Diagnosis of COVID-19 using MALDI-ToF Mass Spectrometry. (arXiv:2109.14099v2 [cs.LG] UPDATED)
    The severe acute respiratory syndrome coronavirus type-2 (SARS-CoV-2) caused a global pandemic and imposed immense effects on the global economy. Accurate, cost-effective, and quick tests have proven substantial in identifying infected people and mitigating the spread. Recently, multiple alternative platforms for testing coronavirus disease 2019 (COVID-19) have been published that show high agreement with current gold standard real-time polymerase chain reaction (RT-PCR) results. These new methods do away with nasopharyngeal (NP) swabs, eliminate the need for complicated reagents, and reduce the burden on RT-PCR test reagent supply. In the present work, we have designed an artificial intelligence-based (AI) testing method to provide confidence in the results. Current AI applications to COVID-19 studies often lack a biological foundation in the decision-making process, and our AI approach is one of the earliest to leverage explainable-AI (X-AI) algorithms for COVID-19 diagnosis using mass spectrometry. Here, we have employed X-AI to explain the decision-making process on a local (per-sample) and global (all samples) basis underscored by biologically relevant features. We evaluated our technique with data extracted from human gargle samples and achieved a testing accuracy of 94.44%. Such techniques would strengthen the relationship between AI and clinical diagnostics by providing biomedical researchers and healthcare workers with trustworthy and, most importantly, explainable test results.
    Dealing with Unknown Variances in Best-Arm Identification. (arXiv:2210.00974v2 [stat.ML] UPDATED)
    The problem of identifying the best arm among a collection of items having Gaussian rewards distribution is well understood when the variances are known. Despite its practical relevance for many applications, few works studied it for unknown variances. In this paper we introduce and analyze two approaches to deal with unknown variances, either by plugging in the empirical variance or by adapting the transportation costs. In order to calibrate our two stopping rules, we derive new time-uniform concentration inequalities, which are of independent interest. Then, we illustrate the theoretical and empirical performances of our two sampling rule wrappers on Track-and-Stop and on a Top Two algorithm. Moreover, by quantifying the impact on the sample complexity of not knowing the variances, we reveal that it is rather small.
    A Survey on Distributed Online Optimization and Game. (arXiv:2205.00473v2 [cs.LG] UPDATED)
    Distributed online optimization and game have been increasingly researched in the last decade, mostly motivated by its wide applications in sensor networks, robotics (e.g., distributed target tracking and formation control), smart grids, deep learning, and so forth. In these problems, there is a network of agents who may be cooperative (i.e., distributed online optimization) or noncooperative (i.e., online game) through local information exchanges. And the local cost function of each agent is often time-varying in dynamic and even adversarial environments. At each time, a decision must be made by each agent based on historical information at hand without knowing future information on cost functions. For these problems, a comprehensive survey is still lacking. This paper aims to provide a thorough overview of distributed online optimization and game from the perspective of problem settings, communication, computation, algorithms, and performances. In addition, some potential future directions are also discussed.
    Stochastic Second-Order Methods Improve Best-Known Sample Complexity of SGD for Gradient-Dominated Function. (arXiv:2205.12856v2 [cs.LG] UPDATED)
    We study the performance of Stochastic Cubic Regularized Newton (SCRN) on a class of functions satisfying gradient dominance property with $1\le\alpha\le2$ which holds in a wide range of applications in machine learning and signal processing. This condition ensures that any first-order stationary point is a global optimum. We prove that the total sample complexity of SCRN in achieving $\epsilon$-global optimum is $\mathcal{O}(\epsilon^{-7/(2\alpha)+1})$ for $1\le\alpha< 3/2$ and $\mathcal{\tilde{O}}(\epsilon^{-2/(\alpha)})$ for $3/2\le\alpha\le 2$. SCRN improves the best-known sample complexity of stochastic gradient descent. Even under a weak version of gradient dominance property, which is applicable to policy-based reinforcement learning (RL), SCRN achieves the same improvement over stochastic policy gradient methods. Additionally, we show that the average sample complexity of SCRN can be reduced to ${\mathcal{O}}(\epsilon^{-2})$ for $\alpha=1$ using a variance reduction method with time-varying batch sizes. Experimental results in various RL settings showcase the remarkable performance of SCRN compared to first-order methods.
    Sustaining Fairness via Incremental Learning. (arXiv:2208.12212v2 [cs.LG] UPDATED)
    Machine learning systems are often deployed for making critical decisions like credit lending, hiring, etc. While making decisions, such systems often encode the user's demographic information (like gender, age) in their intermediate representations. This can lead to decisions that are biased towards specific demographics. Prior work has focused on debiasing intermediate representations to ensure fair decisions. However, these approaches fail to remain fair with changes in the task or demographic distribution. To ensure fairness in the wild, it is important for a system to adapt to such changes as it accesses new data in an incremental fashion. In this work, we propose to address this issue by introducing the problem of learning fair representations in an incremental learning setting. To this end, we present Fairness-aware Incremental Representation Learning (FaIRL), a representation learning system that can sustain fairness while incrementally learning new tasks. FaIRL is able to achieve fairness and learn new tasks by controlling the rate-distortion function of the learned representations. Our empirical evaluations show that FaIRL is able to make fair decisions while achieving high performance on the target task, outperforming several baselines.
    Generating Diverse Teammates to Train Robust Agents For Ad Hoc Teamwork. (arXiv:2207.14138v2 [cs.LG] UPDATED)
    Ad hoc teamwork (AHT) is the challenge of designing a learner that effectively collaborates with unknown teammates without prior coordination mechanisms. Early approaches address the AHT challenge by training the learner with a diverse set of handcrafted teammate policies, usually designed based on an expert's domain knowledge about the policies the learner may encounter. However, implementing teammate policies for training based on domain knowledge is not always feasible. In such cases, recent approaches attempted to improve the robustness of the learner by training it with teammate policies generated by optimising information-theoretic diversity metrics. However, optimising information-theoretic diversity metrics may generate teammates with superficially different behaviours, which does not necessarily result in a robust learner that can effectively collaborate with unknown teammates. In this paper, we present an automated teammate policy generation method optimising the Best-Response Diversity (BRDiv) metric, which measures diversity based on the compatibility of teammate policies in terms of returns. We evaluate our approach in environments with multiple valid coordination strategies, comparing against methods optimising information-theoretic diversity metrics and an ablation not optimising any diversity metric. Our experiments indicate that optimising BRDiv yields a diverse set of training teammate policies that improve the learner's performance relative to previous teammate generation approaches when collaborating with near-optimal previously unseen teammate policies.
    Do Gradient Inversion Attacks Make Federated Learning Unsafe?. (arXiv:2202.06924v2 [cs.LG] UPDATED)
    Federated learning (FL) allows the collaborative training of AI models without needing to share raw data. This capability makes it especially interesting for healthcare applications where patient and data privacy is of utmost concern. However, recent works on the inversion of deep neural networks from model gradients raised concerns about the security of FL in preventing the leakage of training data. In this work, we show that these attacks presented in the literature are impractical in FL use-cases where the clients' training involves updating the Batch Normalization (BN) statistics and provide a new baseline attack that works for such scenarios. Furthermore, we present new ways to measure and visualize potential data leakage in FL. Our work is a step towards establishing reproducible methods of measuring data leakage in FL and could help determine the optimal tradeoffs between privacy-preserving techniques, such as differential privacy, and model accuracy based on quantifiable metrics. Code is available at https://nvidia.github.io/NVFlare/research/quantifying-data-leakage.
    GANs and Closures: Micro-Macro Consistency in Multiscale Modeling. (arXiv:2208.10715v2 [cs.LG] UPDATED)
    Sampling the phase space of molecular systems -- and, more generally, of complex systems effectively modeled by stochastic differential equations -- is a crucial modeling step in many fields, from protein folding to materials discovery. These problems are often multiscale in nature: they can be described in terms of low-dimensional effective free energy surfaces parametrized by a small number of "slow" reaction coordinates; the remaining "fast" degrees of freedom populate an equilibrium measure on the reaction coordinate values. Sampling procedures for such problems are used to estimate effective free energy differences as well as ensemble averages with respect to the conditional equilibrium distributions; these latter averages lead to closures for effective reduced dynamic models. Over the years, enhanced sampling techniques coupled with molecular simulation have been developed. An intriguing analogy arises with the field of Machine Learning (ML), where Generative Adversarial Networks can produce high dimensional samples from low dimensional probability distributions. This sample generation returns plausible high dimensional space realizations of a model state, from information about its low-dimensional representation. In this work, we present an approach that couples physics-based simulations and biasing methods for sampling conditional distributions with ML-based conditional generative adversarial networks for the same task. The "coarse descriptors" on which we condition the fine scale realizations can either be known a priori, or learned through nonlinear dimensionality reduction. We suggest that this may bring out the best features of both approaches: we demonstrate that a framework that couples cGANs with physics-based enhanced sampling techniques can improve multiscale SDE dynamical systems sampling, and even shows promise for systems of increasing complexity.
    Critic Sequential Monte Carlo. (arXiv:2205.15460v2 [stat.ML] UPDATED)
    We introduce CriticSMC, a new algorithm for planning as inference built from a composition of sequential Monte Carlo with learned Soft-Q function heuristic factors. These heuristic factors, obtained from parametric approximations of the marginal likelihood ahead, more effectively guide SMC towards the desired target distribution, which is particularly helpful for planning in environments with hard constraints placed sparsely in time. Compared with previous work, we modify the placement of such heuristic factors, which allows us to cheaply propose and evaluate large numbers of putative action particles, greatly increasing inference and planning efficiency. CriticSMC is compatible with informative priors, whose density function need not be known, and can be used as a model-free control algorithm. Our experiments on collision avoidance in a high-dimensional simulated driving task show that CriticSMC significantly reduces collision rates at a low computational cost while maintaining realism and diversity of driving behaviors across vehicles and environment scenarios.
    Prediction Errors for Penalized Regressions based on Generalized Approximate Message Passing. (arXiv:2206.12832v3 [stat.ML] UPDATED)
    We discuss the prediction accuracy of assumed statistical models in terms of prediction errors for the generalized linear model and penalized maximum likelihood methods. We derive the forms of estimators for the prediction errors, such as $C_p$ criterion, information criteria, and leave-one-out cross validation (LOOCV) error, using the generalized approximate message passing (GAMP) algorithm and replica method. These estimators coincide with each other when the number of model parameters is sufficiently small; however, there is a discrepancy between them in particular in the parameter region where the number of model parameters is larger than the data dimension. In this paper, we review the prediction errors and corresponding estimators, and discuss their differences. In the framework of GAMP, we show that the information criteria can be expressed by using the variance of the estimates. Further, we demonstrate how to approach LOOCV error from the information criteria by utilizing the expression provided by GAMP.
    Analyzing Data-Centric Properties for Graph Contrastive Learning. (arXiv:2208.02810v3 [cs.LG] UPDATED)
    Recent analyses of self-supervised learning (SSL) find the following data-centric properties to be critical for learning good representations: invariance to task-irrelevant semantics, separability of classes in some latent space, and recoverability of labels from augmented samples. However, given their discrete, non-Euclidean nature, graph datasets and graph SSL methods are unlikely to satisfy these properties. This raises the question: how do graph SSL methods, such as contrastive learning (CL), work well? To systematically probe this question, we perform a generalization analysis for CL when using generic graph augmentations (GGAs), with a focus on data-centric properties. Our analysis yields formal insights into the limitations of GGAs and the necessity of task-relevant augmentations. As we empirically show, GGAs do not induce task-relevant invariances on common benchmark datasets, leading to only marginal gains over naive, untrained baselines. Our theory motivates a synthetic data generation process that enables control over task-relevant information and boasts pre-defined optimal augmentations. This flexible benchmark helps us identify yet unrecognized limitations in advanced augmentation techniques (e.g., automated methods). Overall, our work rigorously contextualizes, both empirically and theoretically, the effects of data-centric properties on augmentation strategies and learning paradigms for graph SSL.
    Particle algorithms for maximum likelihood training of latent variable models. (arXiv:2204.12965v4 [stat.CO] UPDATED)
    (Neal and Hinton, 1998) recast maximum likelihood estimation of any given latent variable model as the minimization of a free energy functional $F$, and the EM algorithm as coordinate descent applied to $F$. Here, we explore alternative ways to optimize the functional. In particular, we identify various gradient flows associated with $F$ and show that their limits coincide with $F$'s stationary points. By discretizing the flows, we obtain practical particle-based algorithms for maximum likelihood estimation in broad classes of latent variable models. The novel algorithms scale to high-dimensional settings and perform well in numerical experiments.
    Stability of Image-Reconstruction Algorithms. (arXiv:2206.07128v3 [math.OC] UPDATED)
    Robustness and stability of image-reconstruction algorithms have recently come under scrutiny. Their importance to medical imaging cannot be overstated. We review the known results for the topical variational regularization strategies ($\ell_2$ and $\ell_1$ regularization) and present novel stability results for $\ell_p$-regularized linear inverse problems for $p\in(1,\infty)$. Our results guarantee Lipschitz continuity for small $p$ and H\"{o}lder continuity for larger $p$. They generalize well to the $L_p(\Omega)$ function spaces.
    Tailoring to the Tails: Risk Measures for Fine-Grained Tail Sensitivity. (arXiv:2208.03066v2 [cs.LG] UPDATED)
    Expected risk minimization (ERM) is at the core of many machine learning systems. This means that the risk inherent in a loss distribution is summarized using a single number - its average. In this paper, we propose a general approach to construct risk measures which exhibit a desired tail sensitivity and may replace the expectation operator in ERM. Our method relies on the specification of a reference distribution with a desired tail behaviour, which is in a one-to-one correspondence to a coherent upper probability. Any risk measure, which is compatible with this upper probability, displays a tail sensitivity which is finely tuned to the reference distribution. As a concrete example, we focus on divergence risk measures based on f-divergence ambiguity sets, which are a widespread tool used to foster distributional robustness of machine learning systems. For instance, we show how ambiguity sets based on the Kullback-Leibler divergence are intricately tied to the class of subexponential random variables. We elaborate the connection of divergence risk measures and rearrangement invariant Banach norms.
    Predictive Model for Gross Community Production Rate of Coral Reefs using Ensemble Learning Methodologies. (arXiv:2111.04003v2 [cs.LG] UPDATED)
    Coral reefs play a vital role in maintaining the ecological balance of the marine ecosystem. Various marine organisms depend on coral reefs for their existence and their natural processes. Coral reefs provide the necessary habitat for reproduction and growth for various exotic species of the marine ecosystem. In this article, we discuss the most important parameters which influence the lifecycle of coral and coral reefs such as ocean acidification, deoxygenation and other physical parameters such as flow rate and surface area. Ocean acidification depends on the amount of dissolved Carbon dioxide (CO2). This is due to the release of H+ ions upon the reaction of the dissolved CO2 gases with the calcium carbonate compounds in the ocean. Deoxygenation is another problem that leads to hypoxia which is characterized by a lesser amount of dissolved oxygen in water than the required amount for the existence of marine organisms. In this article, we highlight the importance of physical parameters such as flow rate which influence gas exchange, heat dissipation, bleaching sensitivity, nutrient supply, feeding, waste and sediment removal, growth and reproduction. In this paper, we also bring out these important parameters and propose an ensemble machine learning-based model for analyzing these parameters and provide better rates that can help us to understand and suitably improve the ocean composition which in turn can eminently improve the sustainability of the marine ecosystem, mainly the coral reefs
    Autoencoding Hyperbolic Representation for Adversarial Generation. (arXiv:2201.12825v3 [cs.LG] UPDATED)
    With the recent advance of geometric deep learning, neural networks have been extensively used for data in non-Euclidean domains. In particular, hyperbolic neural networks have proved successful in processing hierarchical information of data. However, many hyperbolic neural networks are numerically unstable during training, which precludes using complex architectures. This crucial problem makes it difficult to build hyperbolic generative models for real and complex data. In this work, we propose a hyperbolic generative network in which we design novel architecture and layers to improve stability in training. Our proposed network contains three parts: first, a hyperbolic autoencoder (AE) that produces hyperbolic embedding for input data; second, a hyperbolic generative adversarial network (GAN) for generating the hyperbolic latent embedding of the AE from simple noise; third, a generator that inherits the decoder from the AE and the generator from the GAN. We call this network the hyperbolic AE-GAN, or HAEGAN for short. The architecture of HAEGAN fosters expressive representation in the hyperbolic space, and the specific design of layers ensures numerical stability. Experiments show that HAEGAN is able to generate complex data with state-of-the-art structure-related performance.
    Adversarial Examples in Random Neural Networks with General Activations. (arXiv:2203.17209v2 [cs.LG] UPDATED)
    A substantial body of empirical work documents the lack of robustness in deep learning models to adversarial examples. Recent theoretical work proved that adversarial examples are ubiquitous in two-layers networks with sub-exponential width and ReLU or smooth activations, and multi-layer ReLU networks with sub-exponential width. We present a result of the same type, with no restriction on width and for general locally Lipschitz continuous activations. More precisely, given a neural network $f(\,\cdot\,;{\boldsymbol \theta})$ with random weights ${\boldsymbol \theta}$, and feature vector ${\boldsymbol x}$, we show that an adversarial example ${\boldsymbol x}'$ can be found with high probability along the direction of the gradient $\nabla_{{\boldsymbol x}}f({\boldsymbol x};{\boldsymbol \theta})$. Our proof is based on a Gaussian conditioning technique. Instead of proving that $f$ is approximately linear in a neighborhood of ${\boldsymbol x}$, we characterize the joint distribution of $f({\boldsymbol x};{\boldsymbol \theta})$ and $f({\boldsymbol x}';{\boldsymbol \theta})$ for ${\boldsymbol x}' = {\boldsymbol x}-s({\boldsymbol x})\nabla_{{\boldsymbol x}}f({\boldsymbol x};{\boldsymbol \theta})$.
    MetaQA: Combining Expert Agents for Multi-Skill Question Answering. (arXiv:2112.01922v3 [cs.CL] UPDATED)
    The recent explosion of question answering (QA) datasets and models has increased the interest in the generalization of models across multiple domains and formats by either training on multiple datasets or by combining multiple models. Despite the promising results of multi-dataset models, some domains or QA formats may require specific architectures, and thus the adaptability of these models might be limited. In addition, current approaches for combining models disregard cues such as question-answer compatibility. In this work, we propose to combine expert agents with a novel, flexible, and training-efficient architecture that considers questions, answer predictions, and answer-prediction confidence scores to select the best answer among a list of answer candidates. Through quantitative and qualitative experiments we show that our model i) creates a collaboration between agents that outperforms previous multi-agent and multi-dataset approaches in both in-domain and out-of-domain scenarios, ii) is highly data-efficient to train, and iii) can be adapted to any QA format. We release our code and a dataset of answer predictions from expert agents for 16 QA datasets to foster future developments of multi-agent systems on https://github.com/UKPLab/MetaQA.
    Discriminative Multimodal Learning via Conditional Priors in Generative Models. (arXiv:2110.04616v3 [cs.LG] UPDATED)
    Deep generative models with latent variables have been used lately to learn joint representations and generative processes from multi-modal data. These two learning mechanisms can, however, conflict with each other and representations can fail to embed information on the data modalities. This research studies the realistic scenario in which all modalities and class labels are available for model training, but where some modalities and labels required for downstream tasks are missing. We show, in this scenario, that the variational lower bound limits mutual information between joint representations and missing modalities. We, to counteract these problems, introduce a novel conditional multi-modal discriminative model that uses an informative prior distribution and optimizes a likelihood-free objective function that maximizes mutual information between joint representations and missing modalities. Extensive experimentation demonstrates the benefits of our proposed model, empirical results show that our model achieves state-of-the-art results in representative problems such as downstream classification, acoustic inversion, and image and annotation generation.
    Augmenting Pre-trained Language Models with QA-Memory for Open-Domain Question Answering. (arXiv:2204.04581v3 [cs.CL] UPDATED)
    Retrieval augmented language models have recently become the standard for knowledge intensive tasks. Rather than relying purely on latent semantics within the parameters of large neural models, these methods enlist a semi-parametric memory to encode an index of knowledge for the model to retrieve over. Most prior work has employed text passages as the unit of knowledge, which has high coverage at the cost of interpretability, controllability, and efficiency. The opposite properties arise in other methods which have instead relied on knowledge base (KB) facts. At the same time, more recent work has demonstrated the effectiveness of storing and retrieving from an index of Q-A pairs derived from text \citep{lewis2021paq}. This approach yields a high coverage knowledge representation that maintains KB-like properties due to its representations being more atomic units of information. In this work we push this line of research further by proposing a question-answer augmented encoder-decoder model and accompanying pretraining strategy. This yields an end-to-end system that not only outperforms prior QA retrieval methods on single-hop QA tasks but also enables compositional reasoning, as demonstrated by strong performance on two multi-hop QA datasets. Together, these methods improve the ability to interpret and control the model while narrowing the performance gap with passage retrieval systems.
    Learning Regionally Decentralized AC Optimal Power Flows with ADMM. (arXiv:2205.03787v3 [eess.SY] UPDATED)
    One potential future for the next generation of smart grids is the use of decentralized optimization algorithms and secured communications for coordinating renewable generation (e.g., wind/solar), dispatchable devices (e.g., coal/gas/nuclear generations), demand response, battery & storage facilities, and topology optimization. The Alternating Direction Method of Multipliers (ADMM) has been widely used in the community to address such decentralized optimization problems and, in particular, the AC Optimal Power Flow (AC-OPF). This paper studies how machine learning may help in speeding up the convergence of ADMM for solving AC-OPF. It proposes a novel decentralized machine-learning approach, namely ML-ADMM, where each agent uses deep learning to learn the consensus parameters on the coupling branches. The paper also explores the idea of learning only from ADMM runs that exhibit high-quality convergence properties, and proposes filtering mechanisms to select these runs. Experimental results on test cases based on the French system demonstrate the potential of the approach in speeding up the convergence of ADMM significantly.
    Short Blocklength Wiretap Channel Codes via Deep Learning: Design and Performance Evaluation. (arXiv:2206.03477v2 [cs.IT] UPDATED)
    We design short blocklength codes for the Gaussian wiretap channel under information-theoretic security guarantees. Our approach consists in decoupling the reliability and secrecy constraints in our code design. Specifically, we handle the reliability constraint via an autoencoder, and handle the secrecy constraint with hash functions. For blocklengths smaller than or equal to 128, we evaluate through simulations the probability of error at the legitimate receiver and the leakage at the eavesdropper for our code construction. This leakage is defined as the mutual information between the confidential message and the eavesdropper's channel observations, and is empirically measured via a neural network-based mutual information estimator. Our simulation results provide examples of codes with positive secrecy rates that outperform the best known achievable secrecy rates obtained non-constructively for the Gaussian wiretap channel. Additionally, we show that our code design is suitable for the compound and arbitrarily varying Gaussian wiretap channels, for which the channel statistics are not perfectly known but only known to belong to a pre-specified uncertainty set. These models not only capture uncertainty related to channel statistics estimation, but also scenarios where the eavesdropper jams the legitimate transmission or influences its own channel statistics by changing its location.
    Improving Spectral Clustering Using Spectrum-Preserving Node Aggregation. (arXiv:2110.12328v6 [cs.LG] UPDATED)
    Spectral clustering is one of the most popular clustering methods. However, the high computational cost due to the involved eigen-decomposition procedure can immediately hinder its applications in large-scale tasks. In this paper we use spectrum-preserving node reduction to accelerate eigen-decomposition and generate concise representations of data sets. Specifically, we create a small number of pseudonodes based on spectral similarity. Then, standard spectral clustering algorithm is performed on the smaller node set. Finally, each data point in the original data set is assigned to the cluster as its representative pseudo-node. The proposed framework run in nearly-linear time. Meanwhile, the clustering accuracy can be significantly improved by mining concise representations. The experimental results show dramatically improved clustering performance when compared with state-of-the-art methods.
    Near-Optimal Regret for Adversarial MDP with Delayed Bandit Feedback. (arXiv:2201.13172v2 [cs.LG] UPDATED)
    The standard assumption in reinforcement learning (RL) is that agents observe feedback for their actions immediately. However, in practice feedback is often observed in delay. This paper studies online learning in episodic Markov decision process (MDP) with unknown transitions, adversarially changing costs, and unrestricted delayed bandit feedback. More precisely, the feedback for the agent in episode $k$ is revealed only in the end of episode $k + d^k$, where the delay $d^k$ can be changing over episodes and chosen by an oblivious adversary. We present the first algorithms that achieve near-optimal $\sqrt{K + D}$ regret, where $K$ is the number of episodes and $D = \sum_{k=1}^K d^k$ is the total delay, significantly improving upon the best known regret bound of $(K + D)^{2/3}$.
    Explainable deep learning for insights in El Ni\~no and river flows. (arXiv:2201.02596v3 [physics.ao-ph] UPDATED)
    The El Ni\~no Southern Oscillation (ENSO) is a semi-periodic fluctuation in sea surface temperature (SST) over the tropical central and eastern Pacific Ocean that influences interannual variability in regional hydrology across the world through long-range dependence or teleconnections. Recent research has demonstrated the value of Deep Learning (DL) methods for improving ENSO prediction as well as Complex Networks (CN) for understanding teleconnections. However, gaps in predictive understanding of ENSO-driven river flows include the black box nature of DL, the use of simple ENSO indices to describe a complex phenomenon and translating DL-based ENSO predictions to river flow predictions. Here we show that eXplainable DL (XDL) methods, based on saliency maps, can extract interpretable predictive information contained in global SST and discover SST information regions and dependence structures relevant for river flows which, in tandem with climate network constructions, enable improved predictive understanding. Our results reveal additional information content in global SST beyond ENSO indices, develop understanding of how SSTs influence river flows, and generate improved river flow prediction, including uncertainty estimation. Observations, reanalysis data, and earth system model simulations are used to demonstrate the value of the XDL-CN based methods for future interannual and decadal scale climate projections.
    A Context-Integrated Transformer-Based Neural Network for Auction Design. (arXiv:2201.12489v3 [cs.GT] UPDATED)
    One of the central problems in auction design is developing an incentive-compatible mechanism that maximizes the auctioneer's expected revenue. While theoretical approaches have encountered bottlenecks in multi-item auctions, recently, there has been much progress on finding the optimal mechanism through deep learning. However, these works either focus on a fixed set of bidders and items, or restrict the auction to be symmetric. In this work, we overcome such limitations by factoring \emph{public} contextual information of bidders and items into the auction learning framework. We propose $\mathtt{CITransNet}$, a context-integrated transformer-based neural network for optimal auction design, which maintains permutation-equivariance over bids and contexts while being able to find asymmetric solutions. We show by extensive experiments that $\mathtt{CITransNet}$ can recover the known optimal solutions in single-item settings, outperform strong baselines in multi-item auctions, and generalize well to cases other than those in training.
    Linear Connectivity Reveals Generalization Strategies. (arXiv:2205.12411v5 [cs.LG] UPDATED)
    It is widely accepted in the mode connectivity literature that when two neural networks are trained similarly on the same data, they are connected by a path through parameter space over which test set accuracy is maintained. Under some circumstances, including transfer learning from pretrained models, these paths are presumed to be linear. In contrast to existing results, we find that among text classifiers (trained on MNLI, QQP, and CoLA), some pairs of finetuned models have large barriers of increasing loss on the linear paths between them. On each task, we find distinct clusters of models which are linearly connected on the test loss surface, but are disconnected from models outside the cluster -- models that occupy separate basins on the surface. By measuring performance on specially-crafted diagnostic datasets, we find that these clusters correspond to different generalization strategies: one cluster behaves like a bag of words model under domain shift, while another cluster uses syntactic heuristics. Our work demonstrates how the geometry of the loss surface can guide models towards different heuristic functions.
    Convergence and Implicit Regularization Properties of Gradient Descent for Deep Residual Networks. (arXiv:2204.07261v3 [cs.LG] UPDATED)
    We prove linear convergence of gradient descent to a global optimum for the training of deep residual networks with constant layer width and smooth activation function. We show that if the trained weights, as a function of the layer index, admit a scaling limit as the depth increases, then the limit has finite $p-$variation with $p=2$. Proofs are based on non-asymptotic estimates for the loss function and for norms of the network weights along the gradient descent path. We illustrate the relevance of our theoretical results to practical settings using detailed numerical experiments on supervised learning problems.
    A Unification Framework for Euclidean and Hyperbolic Graph Neural Networks. (arXiv:2206.04285v2 [cs.LG] UPDATED)
    Hyperbolic neural networks are able to capture the inherent hierarchy of graph datasets, and consequently a powerful choice of GNNs. However, they entangle multiple incongruent (gyro-)vector spaces within a layer, which makes them limited in terms of generalization and scalability. In this work, we propose to use Poincar\'e disk model as our search space, and apply all approximations on the disk (as if the disk is a tangent space derived from the origin), and thus getting rid of all inter-space transformations. Such an approach enables us to propose a hyperbolic normalization layer, and to further simplify the entire hyperbolic model to a Euclidean model cascaded with our hyperbolic normalization layer. We applied our proposed nonlinear hyperbolic normalization to the current state-of-the-art homogeneous and multi-relational graph networks. We demonstrate that not only does the model leverage the power of Euclidean networks such as interpretability and efficient execution of various model components, but also it outperforms both Euclidean and hyperbolic counterparts in our benchmarks.
    From Kepler to Newton: Explainable AI for Science. (arXiv:2111.12210v7 [cs.AI] UPDATED)
    The Observation--Hypothesis--Prediction--Experimentation loop paradigm for scientific research has been practiced by researchers for years towards scientific discoveries. However, with data explosion in both mega-scale and milli-scale scientific research, it has been sometimes very difficult to manually analyze the data and propose new hypotheses to drive the cycle for scientific discovery. In this paper, we discuss the role of Explainable AI in scientific discovery process by demonstrating an Explainable AI-based paradigm for science discovery. The key is to use Explainable AI to help derive data or model interpretations, hypotheses, as well as scientific discoveries or insights. We show how computational and data-intensive methodology -- together with experimental and theoretical methodology -- can be seamlessly integrated for scientific research. To demonstrate the AI-based science discovery process, and to pay our respect to some of the greatest minds in human history, we show how Kepler's laws of planetary motion and Newton's law of universal gravitation can be rediscovered by (Explainable) AI based on Tycho Brahe's astronomical observation data, whose works were leading the scientific revolution in the 16-17th century. This work also highlights the important role of Explainable AI (as compared to Blackbox AI) in science discovery to help humans prevent or better prepare for the possible technological singularity that may happen in the future, since science is not only about the know how, but also the know why. Presentation of the work is available at https://slideslive.com/38986142/from-kepler-to-newton-explainable-ai-for-science-discovery.
    Explicit Regularization in Overparametrized Models via Noise Injection. (arXiv:2206.04613v3 [cs.LG] UPDATED)
    Injecting noise within gradient descent has several desirable features, such as smoothing and regularizing properties. In this paper, we investigate the effects of injecting noise before computing a gradient step. We demonstrate that small perturbations can induce explicit regularization for simple models based on the L1-norm, group L1-norms, or nuclear norms. However, when applied to overparametrized neural networks with large widths, we show that the same perturbations can cause variance explosion. To overcome this, we propose using independent layer-wise perturbations, which provably allow for explicit regularization without variance explosion. Our empirical results show that these small perturbations lead to improved generalization performance compared to vanilla gradient descent.
    Huber-Robust Confidence Sequences. (arXiv:2301.09573v1 [math.ST])
    Confidence sequences are confidence intervals that can be sequentially tracked, and are valid at arbitrary data-dependent stopping times. This paper presents confidence sequences for a univariate mean of an unknown distribution with a known upper bound on the p-th central moment (p > 1), but allowing for (at most) {\epsilon} fraction of arbitrary distribution corruption, as in Huber's contamination model. We do this by designing new robust exponential supermartingales, and show that the resulting confidence sequences attain the optimal width achieved in the nonsequential setting. Perhaps surprisingly, the constant margin between our sequential result and the lower bound is smaller than even fixed-time robust confidence intervals based on the trimmed mean, for example. Since confidence sequences are a common tool used within A/B/n testing and bandits, these results open the door to sequential experimentation that is robust to outliers and adversarial corruptions.
    A proof that artificial neural networks overcome the curse of dimensionality in the numerical approximation of Black-Scholes partial differential equations. (arXiv:1809.02362v2 [math.NA] UPDATED)
    Artificial neural networks (ANNs) have very successfully been used in numerical simulations for a series of computational problems ranging from image classification/image recognition, speech recognition, time series analysis, game intelligence, and computational advertising to numerical approximations of partial differential equations (PDEs). Such numerical simulations suggest that ANNs have the capacity to very efficiently approximate high-dimensional functions and, especially, indicate that ANNs seem to admit the fundamental power to overcome the curse of dimensionality when approximating the high-dimensional functions appearing in the above named computational problems. There are a series of rigorous mathematical approximation results for ANNs in the scientific literature. Some of them prove convergence without convergence rates and some even rigorously establish convergence rates but there are only a few special cases where mathematical results can rigorously explain the empirical success of ANNs when approximating high-dimensional functions. The key contribution of this article is to disclose that ANNs can efficiently approximate high-dimensional functions in the case of numerical approximations of Black-Scholes PDEs. More precisely, this work reveals that the number of required parameters of an ANN to approximate the solution of the Black-Scholes PDE grows at most polynomially in both the reciprocal of the prescribed approximation accuracy $\varepsilon > 0$ and the PDE dimension $d \in \mathbb{N}$. We thereby prove, for the first time, that ANNs do indeed overcome the curse of dimensionality in the numerical approximation of Black-Scholes PDEs.
    SUPER-Net: Trustworthy Medical Image Segmentation with Uncertainty Propagation in Encoder-Decoder Networks. (arXiv:2111.05978v3 [eess.IV] UPDATED)
    Deep Learning (DL) holds great promise in reshaping the healthcare industry owing to its precision, efficiency, and objectivity. However, the brittleness of DL models to noisy and out-of-distribution inputs is ailing their deployment in the clinic. Most models produce point estimates without further information about model uncertainty or confidence. This paper introduces a new Bayesian DL framework for uncertainty quantification in segmentation neural networks: SUPER-Net: trustworthy medical image Segmentation with Uncertainty Propagation in Encoder-decodeR Networks. SUPER-Net analytically propagates, using Taylor series approximations, the first two moments (mean and covariance) of the posterior distribution of the model parameters across the nonlinear layers. In particular, SUPER-Net simultaneously learns the mean and covariance without expensive post-hoc Monte Carlo sampling or model ensembling. The output consists of two simultaneous maps: the segmented image and its pixelwise uncertainty map, which corresponds to the covariance matrix of the predictive distribution. We conduct an extensive evaluation of SUPER-Net on medical image segmentation of Magnetic Resonances Imaging and Computed Tomography scans under various noisy and adversarial conditions. Our experiments on multiple benchmark datasets demonstrate that SUPER-Net is more robust to noise and adversarial attacks than state-of-the-art segmentation models. Moreover, the uncertainty map of the proposed SUPER-Net associates low confidence (or equivalently high uncertainty) to patches in the test input images that are corrupted with noise, artifacts, or adversarial attacks. Perhaps more importantly, the model exhibits the ability of self-assessment of its segmentation decisions, notably when making erroneous predictions due to noise or adversarial examples.
    Toward Foundation Models for Earth Monitoring: Generalizable Deep Learning Models for Natural Hazard Segmentation. (arXiv:2301.09318v1 [cs.CV])
    Climate change results in an increased probability of extreme weather events that put societies and businesses at risk on a global scale. Therefore, near real-time mapping of natural hazards is an emerging priority for the support of natural disaster relief, risk management, and informing governmental policy decisions. Recent methods to achieve near real-time mapping increasingly leverage deep learning (DL). However, DL-based approaches are designed for one specific task in a single geographic region based on specific frequency bands of satellite data. Therefore, DL models used to map specific natural hazards struggle with their generalization to other types of natural hazards in unseen regions. In this work, we propose a methodology to significantly improve the generalizability of DL natural hazards mappers based on pre-training on a suitable pre-task. Without access to any data from the target domain, we demonstrate this improved generalizability across four U-Net architectures for the segmentation of unseen natural hazards. Importantly, our method is invariant to geographic differences and differences in the type of frequency bands of satellite data. By leveraging characteristics of unlabeled images from the target domain that are publicly available, our approach is able to further improve the generalization behavior without fine-tuning. Thereby, our approach supports the development of foundation models for earth monitoring with the objective of directly segmenting unseen natural hazards across novel geographic regions given different sources of satellite imagery.
    Synthesis of Compositional Animations from Textual Descriptions. (arXiv:2103.14675v6 [cs.CV] UPDATED)
    "How can we animate 3D-characters from a movie script or move robots by simply telling them what we would like them to do?" "How unstructured and complex can we make a sentence and still generate plausible movements from it?" These are questions that need to be answered in the long-run, as the field is still in its infancy. Inspired by these problems, we present a new technique for generating compositional actions, which handles complex input sentences. Our output is a 3D pose sequence depicting the actions in the input sentence. We propose a hierarchical two-stream sequential model to explore a finer joint-level mapping between natural language sentences and 3D pose sequences corresponding to the given motion. We learn two manifold representations of the motion -- one each for the upper body and the lower body movements. Our model can generate plausible pose sequences for short sentences describing single actions as well as long compositional sentences describing multiple sequential and superimposed actions. We evaluate our proposed model on the publicly available KIT Motion-Language Dataset containing 3D pose data with human-annotated sentences. Experimental results show that our model advances the state-of-the-art on text-based motion synthesis in objective evaluations by a margin of 50%. Qualitative evaluations based on a user study indicate that our synthesized motions are perceived to be the closest to the ground-truth motion captures for both short and compositional sentences.
    Rig Inversion by Training a Differentiable Rig Function. (arXiv:2301.09567v1 [cs.GR])
    Rig inversion is the problem of creating a method that can find the rig parameter vector that best approximates a given input mesh. In this paper we propose to solve this problem by first obtaining a differentiable rig function by training a multi layer perceptron to approximate the rig function. This differentiable rig function can then be used to train a deep learning model of rig inversion.
    Dataset Structural Index: Leveraging a machine's perspective towards visual data. (arXiv:2110.04070v3 [cs.CV] UPDATED)
    With advances in vision and perception architectures, we have realized that working with data is equally crucial, if not more, than the algorithms. Till today, we have trained machines based on our knowledge and perspective of the world. The entire concept of Dataset Structural Index(DSI) revolves around understanding a machine`s perspective of the dataset. With DSI, I show two meta values with which we can get more information over a visual dataset and use it to optimize data, create better architectures, and have an ability to guess which model would work best. These two values are the Variety contribution ratio and Similarity matrix. In the paper, I show many applications of DSI, one of which is how the same level of accuracy can be achieved with the same model architectures trained over less amount of data.
    Estimating average causal effects from patient trajectories. (arXiv:2203.01228v2 [stat.ML] UPDATED)
    In medical practice, treatments are selected based on the expected causal effects on patient outcomes. Here, the gold standard for estimating causal effects are randomized controlled trials; however, such trials are costly and sometimes even unethical. Instead, medical practice is increasingly interested in estimating causal effects among patient (sub)groups from electronic health records, that is, observational data. In this paper, we aim at estimating the average causal effect (ACE) from observational data (patient trajectories) that are collected over time. For this, we propose DeepACE: an end-to-end deep learning model. DeepACE leverages the iterative G-computation formula to adjust for the bias induced by time-varying confounders. Moreover, we develop a novel sequential targeting procedure which ensures that DeepACE has favorable theoretical properties, i.e., is doubly robust and asymptotically efficient. To the best of our knowledge, this is the first work that proposes an end-to-end deep learning model tailored for estimating time-varying ACEs. We compare DeepACE in an extensive number of experiments, confirming that it achieves state-of-the-art performance. We further provide a case study for patients suffering from low back pain to demonstrate that DeepACE generates important and meaningful findings for clinical practice. Our work enables practitioners to develop effective treatment recommendations based on population effects.
    WDC Products: A Multi-Dimensional Entity Matching Benchmark. (arXiv:2301.09521v1 [cs.LG])
    The difficulty of an entity matching task depends on a combination of multiple factors such as the amount of corner-case pairs, the fraction of entities in the test set that have not been seen during training, and the size of the development set. Current entity matching benchmarks usually represent single points in the space along such dimensions or they provide for the evaluation of matching methods along a single dimension, for instance the amount of training data. This paper presents WDC Products, an entity matching benchmark which provides for the systematic evaluation of matching systems along combinations of three dimensions while relying on real-word data. The three dimensions are (i) amount of corner-cases (ii) generalization to unseen entities, and (iii) development set size. Generalization to unseen entities is a dimension not covered by any of the existing benchmarks yet but is crucial for evaluating the robustness of entity matching systems. WDC Products is based on heterogeneous product data from thousands of e-shops which mark-up products offers using schema.org annotations. Instead of learning how to match entity pairs, entity matching can also be formulated as a multi-class classification task that requires the matcher to recognize individual entities. WDC Products is the first benchmark that provides a pair-wise and a multi-class formulation of the same tasks and thus allows to directly compare the two alternatives. We evaluate WDC Products using several state-of-the-art matching systems, including Ditto, HierGAT, and R-SupCon. The evaluation shows that all matching systems struggle with unseen entities to varying degrees. It also shows that some systems are more training data efficient than others.
    A Time Series Approach to Parkinson's Disease Classification from EEG. (arXiv:2301.09568v1 [q-bio.NC])
    Firstly, we present a novel representation for EEG data, a 7-variate series of band power coefficients, which enables the use of (previously inaccessible) time series classification methods. Specifically, we implement the multi-resolution representation-based time series classification method MrSQL. This is deployed on a challenging early-stage Parkinson's dataset that includes wakeful and sleep EEG. Initial results are promising with over 90% accuracy achieved on all EEG data types used. Secondly, we present a framework that enables high-importance data types and brain regions for classification to be identified. Using our framework, we find that, across different EEG data types, it is the Prefrontal brain region that has the most predictive power for the presence of Parkinson's Disease. This outperformance was statistically significant versus ten of the twelve other brain regions (not significant versus adjacent Left Frontal and Right Frontal regions). The Prefrontal region of the brain is important for higher-order cognitive processes and our results align with studies that have shown neural dysfunction in the prefrontal cortex in Parkinson's Disease.
    BayBFed: Bayesian Backdoor Defense for Federated Learning. (arXiv:2301.09508v1 [cs.LG])
    Federated learning (FL) allows participants to jointly train a machine learning model without sharing their private data with others. However, FL is vulnerable to poisoning attacks such as backdoor attacks. Consequently, a variety of defenses have recently been proposed, which have primarily utilized intermediary states of the global model (i.e., logits) or distance of the local models (i.e., L2-norm) from the global model to detect malicious backdoors. However, as these approaches directly operate on client updates, their effectiveness depends on factors such as clients' data distribution or the adversary's attack strategies. In this paper, we introduce a novel and more generic backdoor defense framework, called BayBFed, which proposes to utilize probability distributions over client updates to detect malicious updates in FL: it computes a probabilistic measure over the clients' updates to keep track of any adjustments made in the updates, and uses a novel detection algorithm that can leverage this probabilistic measure to efficiently detect and filter out malicious updates. Thus, it overcomes the shortcomings of previous approaches that arise due to the direct usage of client updates; as our probabilistic measure will include all aspects of the local client training strategies. BayBFed utilizes two Bayesian Non-Parametric extensions: (i) a Hierarchical Beta-Bernoulli process to draw a probabilistic measure given the clients' updates, and (ii) an adaptation of the Chinese Restaurant Process (CRP), referred by us as CRP-Jensen, which leverages this probabilistic measure to detect and filter out malicious updates. We extensively evaluate our defense approach on five benchmark datasets: CIFAR10, Reddit, IoT intrusion detection, MNIST, and FMNIST, and show that it can effectively detect and eliminate malicious updates in FL without deteriorating the benign performance of the global model.
    On the Convergence of the Gradient Descent Method with Stochastic Fixed-point Rounding Errors under the Polyak-Lojasiewicz Inequality. (arXiv:2301.09511v1 [stat.ML])
    When training neural networks with low-precision computation, rounding errors often cause stagnation or are detrimental to the convergence of the optimizers; in this paper we study the influence of rounding errors on the convergence of the gradient descent method for problems satisfying the Polyak-Lojasiewicz inequality. Within this context, we show that, in contrast, biased stochastic rounding errors may be beneficial since choosing a proper rounding strategy eliminates the vanishing gradient problem and forces the rounding bias in a descent direction. Furthermore, we obtain a bound on the convergence rate that is stricter than the one achieved by unbiased stochastic rounding. The theoretical analysis is validated by comparing the performances of various rounding strategies when optimizing several examples using low-precision fixed-point number formats.
    FInC Flow: Fast and Invertible $k \times k$ Convolutions for Normalizing Flows. (arXiv:2301.09266v1 [cs.CV])
    Invertible convolutions have been an essential element for building expressive normalizing flow-based generative models since their introduction in Glow. Several attempts have been made to design invertible $k \times k$ convolutions that are efficient in training and sampling passes. Though these attempts have improved the expressivity and sampling efficiency, they severely lagged behind Glow which used only $1 \times 1$ convolutions in terms of sampling time. Also, many of the approaches mask a large number of parameters of the underlying convolution, resulting in lower expressivity on a fixed run-time budget. We propose a $k \times k$ convolutional layer and Deep Normalizing Flow architecture which i.) has a fast parallel inversion algorithm with running time O$(n k^2)$ ($n$ is height and width of the input image and k is kernel size), ii.) masks the minimal amount of learnable parameters in a layer. iii.) gives better forward pass and sampling times comparable to other $k \times k$ convolution-based models on real-world benchmarks. We provide an implementation of the proposed parallel algorithm for sampling using our invertible convolutions on GPUs. Benchmarks on CIFAR-10, ImageNet, and CelebA datasets show comparable performance to previous works regarding bits per dimension while significantly improving the sampling time.
    Estimating the energy requirements for long term memory formation. (arXiv:2301.09565v1 [q-bio.NC])
    Brains consume metabolic energy to process information, but also to store memories. The energy required for memory formation can be substantial, for instance in fruit flies memory formation leads to a shorter lifespan upon subsequent starvation (Mery and Kawecki, 2005). Here we estimate that the energy required corresponds to about 10mJ/bit and compare this to biophysical estimates as well as energy requirements in computer hardware. We conclude that biological memory storage is expensive, but the reason behind it is not known.
    Time-Conditioned Generative Modeling of Object-Centric Representations for Video Decomposition and Prediction. (arXiv:2301.08951v1 [cs.CV])
    When perceiving the world from multiple viewpoints, humans have the ability to reason about the complete objects in a compositional manner even when the object is completely occluded from partial viewpoints. Meanwhile, humans can imagine the novel views after observing multiple viewpoints. The remarkable recent advance in multi-view object-centric learning leaves some problems: 1) the partially or completely occluded shape of objects can not be well reconstructed. 2) the novel viewpoint prediction depends on expensive viewpoint annotations rather than implicit view rules. This makes the agent fail to perform like humans. In this paper, we introduce a time-conditioned generative model for videos. To reconstruct the complete shape of the object accurately, we enhance the disentanglement between different latent representations: view latent representations are jointly inferred based on the Transformer and then cooperate with the sequential extension of Slot Attention to learn object-centric representations. The model also achieves the new ability: Gaussian processes are employed as priors of view latent variables for generation and novel-view prediction without viewpoint annotations. Experiments on multiple specifically designed synthetic datasets have shown that the proposed model can 1) make the video decomposition, 2) reconstruct the complete shapes of objects, and 3) make the novel viewpoint prediction without viewpoint annotations.
    Federated Sufficient Dimension Reduction Through High-Dimensional Sparse Sliced Inverse Regression. (arXiv:2301.09500v1 [stat.ML])
    Federated learning has become a popular tool in the big data era nowadays. It trains a centralized model based on data from different clients while keeping data decentralized. In this paper, we propose a federated sparse sliced inverse regression algorithm for the first time. Our method can simultaneously estimate the central dimension reduction subspace and perform variable selection in a federated setting. We transform this federated high-dimensional sparse sliced inverse regression problem into a convex optimization problem by constructing the covariance matrix safely and losslessly. We then use a linearized alternating direction method of multipliers algorithm to estimate the central subspace. We also give approaches of Bayesian information criterion and hold-out validation to ascertain the dimension of the central subspace and the hyper-parameter of the algorithm. We establish an upper bound of the statistical error rate of our estimator under the heterogeneous setting. We demonstrate the effectiveness of our method through simulations and real world applications.
    The Entoptic Field Camera as Metaphor-Driven Research-through-Design with AI Technologies. (arXiv:2301.09545v1 [cs.HC])
    Artificial intelligence (AI) technologies are widely deployed in smartphone photography; and prompt-based image synthesis models have rapidly become commonplace. In this paper, we describe a Research-through-Design (RtD) project which explores this shift in the means and modes of image production via the creation and use of the Entoptic Field Camera. Entoptic phenomena usually refer to perceptions of floaters or bright blue dots stemming from the physiological interplay of the eye and brain. We use the term entoptic as a metaphor to investigate how the material interplay of data and models in AI technologies shapes human experiences of reality. Through our case study using first-person design and a field study, we offer implications for critical, reflective, more-than-human and ludic design to engage AI technologies; the conceptualisation of an RtD research space which contributes to AI literacy discourses; and outline a research trajectory concerning materiality and design affordances of AI technologies.
    Deep Learning Meets Sparse Regularization: A Signal Processing Perspective. (arXiv:2301.09554v1 [stat.ML])
    Deep learning has been widely successful in practice and most state-of-the-art machine learning methods are based on neural networks. Lacking, however, is a rigorous mathematical theory that adequately explains the amazing performance of deep neural networks. In this article, we present a relatively new mathematical framework that provides the beginning of a deeper understanding of deep learning. This framework precisely characterizes the functional properties of neural networks that are trained to fit to data. The key mathematical tools which support this framework include transform-domain sparse regularization, the Radon transform of computed tomography, and approximation theory, which are all techniques deeply rooted in signal processing. This framework explains the effect of weight decay regularization in neural network training, the use of skip connections and low-rank weight matrices in network architectures, the role of sparsity in neural networks, and explains why neural networks can perform well in high-dimensional problems.
    ECGAN: Self-supervised generative adversarial network for electrocardiography. (arXiv:2301.09496v1 [cs.LG])
    High-quality synthetic data can support the development of effective predictive models for biomedical tasks, especially in rare diseases or when subject to compelling privacy constraints. These limitations, for instance, negatively impact open access to electrocardiography datasets about arrhythmias. This work introduces a self-supervised approach to the generation of synthetic electrocardiography time series which is shown to promote morphological plausibility. Our model (ECGAN) allows conditioning the generative process for specific rhythm abnormalities, enhancing synchronization and diversity across samples with respect to literature models. A dedicated sample quality assessment framework is also defined, leveraging arrhythmia classifiers. The empirical results highlight a substantial improvement against state-of-the-art generative models for sequences and audio synthesis.
    Sampling-based Nystr\"om Approximation and Kernel Quadrature. (arXiv:2301.09517v1 [math.NA])
    We analyze the Nystr\"om approximation of a positive definite kernel associated with a probability measure. We first prove an improved error bound for the conventional Nystr\"om approximation with i.i.d. sampling and singular-value decomposition in the continuous regime; the proof techniques are borrowed from statistical learning theory. We further introduce a refined selection of subspaces in Nystr\"om approximation with theoretical guarantees that is applicable to non-i.i.d. landmark points. Finally, we discuss their application to convex kernel quadrature and give novel theoretical guarantees as well as numerical observations.
    DIFFormer: Scalable (Graph) Transformers Induced by Energy Constrained Diffusion. (arXiv:2301.09474v1 [cs.LG])
    Real-world data generation often involves complex inter-dependencies among instances, violating the IID-data hypothesis of standard learning paradigms and posing a challenge for uncovering the geometric structures for learning desired instance representations. To this end, we introduce an energy constrained diffusion model which encodes a batch of instances from a dataset into evolutionary states that progressively incorporate other instances' information by their interactions. The diffusion process is constrained by descent criteria w.r.t.~a principled energy function that characterizes the global consistency of instance representations over latent structures. We provide rigorous theory that implies closed-form optimal estimates for the pairwise diffusion strength among arbitrary instance pairs, which gives rise to a new class of neural encoders, dubbed as DIFFormer (diffusion-based Transformers), with two instantiations: a simple version with linear complexity for prohibitive instance numbers, and an advanced version for learning complex structures. Experiments highlight the wide applicability of our model as a general-purpose encoder backbone with superior performance in various tasks, such as node classification on large graphs, semi-supervised image/text classification, and spatial-temporal dynamics prediction.
    Rethinking the Expressive Power of GNNs via Graph Biconnectivity. (arXiv:2301.09505v1 [cs.LG])
    Designing expressive Graph Neural Networks (GNNs) is a central topic in learning graph-structured data. While numerous approaches have been proposed to improve GNNs in terms of the Weisfeiler-Lehman (WL) test, generally there is still a lack of deep understanding of what additional power they can systematically and provably gain. In this paper, we take a fundamentally different perspective to study the expressive power of GNNs beyond the WL test. Specifically, we introduce a novel class of expressivity metrics via graph biconnectivity and highlight their importance in both theory and practice. As biconnectivity can be easily calculated using simple algorithms that have linear computational costs, it is natural to expect that popular GNNs can learn it easily as well. However, after a thorough review of prior GNN architectures, we surprisingly find that most of them are not expressive for any of these metrics. The only exception is the ESAN framework (Bevilacqua et al., 2022), for which we give a theoretical justification of its power. We proceed to introduce a principled and more efficient approach, called the Generalized Distance Weisfeiler-Lehman (GD-WL), which is provably expressive for all biconnectivity metrics. Practically, we show GD-WL can be implemented by a Transformer-like architecture that preserves expressiveness and enjoys full parallelizability. A set of experiments on both synthetic and real datasets demonstrates that our approach can consistently outperform prior GNN architectures.
    Digital Twins for Marine Operations: A Brief Review on Their Implementation. (arXiv:2301.09574v1 [eess.SY])
    While the concept of a digital twin to support maritime operations is gaining attention for predictive maintenance, real-time monitoring, control, and overall process optimization, clarity on its implementation is missing in the literature. Therefore, in this review we show how different authors implemented their digital twins, discuss our findings, and finally give insights on future research directions.
    Characterizing Polarization in Social Networks using the Signed Relational Latent Distance Model. (arXiv:2301.09507v1 [stat.ML])
    Graph representation learning has become a prominent tool for the characterization and understanding of the structure of networks in general and social networks in particular. Typically, these representation learning approaches embed the networks into a low-dimensional space in which the role of each individual can be characterized in terms of their latent position. A major current concern in social networks is the emergence of polarization and filter bubbles promoting a mindset of "us-versus-them" that may be defined by extreme positions believed to ultimately lead to political violence and the erosion of democracy. Such polarized networks are typically characterized in terms of signed links reflecting likes and dislikes. We propose the latent Signed relational Latent dIstance Model (SLIM) utilizing for the first time the Skellam distribution as a likelihood function for signed networks and extend the modeling to the characterization of distinct extreme positions by constraining the embedding space to polytopes. On four real social signed networks of polarization, we demonstrate that the model extracts low-dimensional characterizations that well predict friendships and animosity while providing interpretable visualizations defined by extreme positions when endowing the model with an embedding space restricted to polytopes.
    A New Approach to Learning Linear Dynamical Systems. (arXiv:2301.09519v1 [math.OC])
    Linear dynamical systems are the foundational statistical model upon which control theory is built. Both the celebrated Kalman filter and the linear quadratic regulator require knowledge of the system dynamics to provide analytic guarantees. Naturally, learning the dynamics of a linear dynamical system from linear measurements has been intensively studied since Rudolph Kalman's pioneering work in the 1960's. Towards these ends, we provide the first polynomial time algorithm for learning a linear dynamical system from a polynomial length trajectory up to polynomial error in the system parameters under essentially minimal assumptions: observability, controllability, and marginal stability. Our algorithm is built on a method of moments estimator to directly estimate Markov parameters from which the dynamics can be extracted. Furthermore, we provide statistical lower bounds when our observability and controllability assumptions are violated.
    M22: A Communication-Efficient Algorithm for Federated Learning Inspired by Rate-Distortion. (arXiv:2301.09269v1 [cs.LG])
    In federated learning (FL), the communication constraint between the remote learners and the Parameter Server (PS) is a crucial bottleneck. For this reason, model updates must be compressed so as to minimize the loss in accuracy resulting from the communication constraint. This paper proposes ``\emph{${\bf M}$-magnitude weighted $L_{\bf 2}$ distortion + $\bf 2$ degrees of freedom''} (M22) algorithm, a rate-distortion inspired approach to gradient compression for federated training of deep neural networks (DNNs). In particular, we propose a family of distortion measures between the original gradient and the reconstruction we referred to as ``$M$-magnitude weighted $L_2$'' distortion, and we assume that gradient updates follow an i.i.d. distribution -- generalized normal or Weibull, which have two degrees of freedom. In both the distortion measure and the gradient, there is one free parameter for each that can be fitted as a function of the iteration number. Given a choice of gradient distribution and distortion measure, we design the quantizer minimizing the expected distortion in gradient reconstruction. To measure the gradient compression performance under a communication constraint, we define the \emph{per-bit accuracy} as the optimal improvement in accuracy that one bit of communication brings to the centralized model over the training period. Using this performance measure, we systematically benchmark the choice of gradient distribution and distortion measure. We provide substantial insights on the role of these choices and argue that significant performance improvements can be attained using such a rate-distortion inspired compressor.
    Explaining the effects of non-convergent sampling in the training of Energy-Based Models. (arXiv:2301.09428v1 [cs.LG])
    In this paper, we quantify the impact of using non-convergent Markov chains to train Energy-Based models (EBMs). In particular, we show analytically that EBMs trained with non-persistent short runs to estimate the gradient can perfectly reproduce a set of empirical statistics of the data, not at the level of the equilibrium measure, but through a precise dynamical process. Our results provide a first-principles explanation for the observations of recent works proposing the strategy of using short runs starting from random initial conditions as an efficient way to generate high-quality samples in EBMs, and lay the groundwork for using EBMs as diffusion models. After explaining this effect in generic EBMs, we analyze two solvable models in which the effect of the non-convergent sampling in the trained parameters can be described in detail. Finally, we test these predictions numerically on the Boltzmann machine.
    StyleGAN-T: Unlocking the Power of GANs for Fast Large-Scale Text-to-Image Synthesis. (arXiv:2301.09515v1 [cs.LG])
    Text-to-image synthesis has recently seen significant progress thanks to large pretrained language models, large-scale training data, and the introduction of scalable model families such as diffusion and autoregressive models. However, the best-performing models require iterative evaluation to generate a single sample. In contrast, generative adversarial networks (GANs) only need a single forward pass. They are thus much faster, but they currently remain far behind the state-of-the-art in large-scale text-to-image synthesis. This paper aims to identify the necessary steps to regain competitiveness. Our proposed model, StyleGAN-T, addresses the specific requirements of large-scale text-to-image synthesis, such as large capacity, stable training on diverse datasets, strong text alignment, and controllable variation vs. text alignment tradeoff. StyleGAN-T significantly improves over previous GANs and outperforms distilled diffusion models - the previous state-of-the-art in fast text-to-image synthesis - in terms of sample quality and speed.
    SpArX: Sparse Argumentative Explanations for Neural Networks. (arXiv:2301.09559v1 [cs.AI])
    Neural networks (NNs) have various applications in AI, but explaining their decision process remains challenging. Existing approaches often focus on explaining how changing individual inputs affects NNs' outputs. However, an explanation that is consistent with the input-output behaviour of an NN is not necessarily faithful to the actual mechanics thereof. In this paper, we exploit relationships between multi-layer perceptrons (MLPs) and quantitative argumentation frameworks (QAFs) to create argumentative explanations for the mechanics of MLPs. Our SpArX method first sparsifies the MLP while maintaining as much of the original mechanics as possible. It then translates the sparse MLP into an equivalent QAF to shed light on the underlying decision process of the MLP, producing global and/or local explanations. We demonstrate experimentally that SpArX can give more faithful explanations than existing approaches, while simultaneously providing deeper insights into the actual reasoning process of MLPs.
    Modality-Agnostic Variational Compression of Implicit Neural Representations. (arXiv:2301.09479v1 [stat.ML])
    We introduce a modality-agnostic neural data compression algorithm based on a functional view of data and parameterised as an Implicit Neural Representation (INR). Bridging the gap between latent coding and sparsity, we obtain compact latent representations which are non-linearly mapped to a soft gating mechanism capable of specialising a shared INR base network to each data item through subnetwork selection. After obtaining a dataset of such compact latent representations, we directly optimise the rate/distortion trade-off in this modality-agnostic space using non-linear transform coding. We term this method Variational Compression of Implicit Neural Representation (VC-INR) and show both improved performance given the same representational capacity pre quantisation while also outperforming previous quantisation schemes used for other INR-based techniques. Our experiments demonstrate strong results over a large set of diverse data modalities using the same algorithm without any modality-specific inductive biases. We show results on images, climate data, 3D shapes and scenes as well as audio and video, introducing VC-INR as the first INR-based method to outperform codecs as well-known and diverse as JPEG 2000, MP3 and AVC/HEVC on their respective modalities.
    DASTSiam: Spatio-Temporal Fusion and Discriminative Augmentation for Improved Siamese Tracking. (arXiv:2301.09063v1 [cs.CV])
    Tracking tasks based on deep neural networks have greatly improved with the emergence of Siamese trackers. However, the appearance of targets often changes during tracking, which can reduce the robustness of the tracker when facing challenges such as aspect ratio change, occlusion, and scale variation. In addition, cluttered backgrounds can lead to multiple high response points in the response map, leading to incorrect target positioning. In this paper, we introduce two transformer-based modules to improve Siamese tracking called DASTSiam: the spatio-temporal (ST) fusion module and the Discriminative Augmentation (DA) module. The ST module uses cross-attention based accumulation of historical cues to improve robustness against object appearance changes, while the DA module associates semantic information between the template and search region to improve target discrimination. Moreover, Modifying the label assignment of anchors also improves the reliability of the object location. Our modules can be used with all Siamese trackers and show improved performance on several public datasets through comparative and ablation experiments.
    Modeling Non-deterministic Human Behaviors in Discrete Food Choices. (arXiv:2301.09454v1 [stat.ML])
    We establish a non-deterministic model that predicts a user's food preferences from their demographic information. Our simulator is based on NHANES dataset and domain expert knowledge in the form of established behavioral studies. Our model can be used to generate an arbitrary amount of synthetic datapoints that are similar in distribution to the original dataset and align with behavioral science expectations. Such a simulator can be used in a variety of machine learning tasks and especially in applications requiring human behavior prediction.
    A Framework for Evaluating the Impact of Food Security Scenarios. (arXiv:2301.09320v1 [cs.LG])
    This study proposes an approach for predicting the impacts of scenarios on food security and demonstrates its application in a case study. The approach involves two main steps: (1) scenario definition, in which the end user specifies the assumptions and impacts of the scenario using a scenario template, and (2) scenario evaluation, in which a Vector Autoregression (VAR) model is used in combination with Monte Carlo simulation to generate predictions for the impacts of the scenario based on the defined assumptions and impacts. The case study is based on a proprietary time series food security database created using data from the Food and Agriculture Organization of the United Nations (FAOSTAT), the World Bank, and the United States Department of Agriculture (USDA). The database contains a wide range of data on various indicators of food security, such as production, trade, consumption, prices, availability, access, and nutritional value. The results show that the proposed approach can be used to predict the potential impacts of scenarios on food security and that the proprietary time series food security database can be used to support this approach. The study provides specific insights on how this approach can inform decision-making processes related to food security such as food prices and availability in the case study region.
    An iterative multi-fidelity approach for model order reduction of multi-dimensional input parametric PDE systems. (arXiv:2301.09483v1 [math.NA])
    We propose a parametric sampling strategy for the reduction of large-scale PDE systems with multidimensional input parametric spaces by leveraging models of different fidelity. The design of this methodology allows a user to adaptively sample points ad hoc from a discrete training set with no prior requirement of error estimators. It is achieved by exploiting low-fidelity models throughout the parametric space to sample points using an efficient sampling strategy, and at the sampled parametric points, high-fidelity models are evaluated to recover the reduced basis functions. The low-fidelity models are then adapted with the reduced order models ( ROMs) built by projection onto the subspace spanned by the recovered basis functions. The process continues until the low-fidelity model can represent the high-fidelity model adequately for all the parameters in the parametric space. Since the proposed methodology leverages the use of low-fidelity models to assimilate the solution database, it significantly reduces the computational cost in the offline stage. The highlight of this article is to present the construction of the initial low-fidelity model, and a sampling strategy based on the discrete empirical interpolation method (DEIM). We test this approach on a 2D steady-state heat conduction problem for two different input parameters and make a qualitative comparison with the classical greedy reduced basis method (RBM), and further test on a 9-dimensional parametric non-coercive elliptic problem and analyze the computational performance based on different tuning of greedy selection of points.
    Speeding Up BatchBALD: A k-BALD Family of Approximations for Active Learning. (arXiv:2301.09490v1 [cs.LG])
    Active learning is a powerful method for training machine learning models with limited labeled data. One commonly used technique for active learning is BatchBALD, which uses Bayesian neural networks to find the most informative points to label in a pool set. However, BatchBALD can be very slow to compute, especially for larger datasets. In this paper, we propose a new approximation, k-BALD, which uses k-wise mutual information terms to approximate BatchBALD, making it much less expensive to compute. Results on the MNIST dataset show that k-BALD is significantly faster than BatchBALD while maintaining similar performance. Additionally, we also propose a dynamic approach for choosing k based on the quality of the approximation, making it more efficient for larger datasets.
    A Simple Recipe for Competitive Low-compute Self supervised Vision Models. (arXiv:2301.09451v1 [cs.CV])
    Self-supervised methods in vision have been mostly focused on large architectures as they seem to suffer from a significant performance drop for smaller architectures. In this paper, we propose a simple self-supervised distillation technique that can train high performance low-compute neural networks. Our main insight is that existing joint-embedding based SSL methods can be repurposed for knowledge distillation from a large self-supervised teacher to a small student model. Thus, we call our method Replace one Branch (RoB) as it simply replaces one branch of the joint-embedding training with a large teacher model. RoB is widely applicable to a number of architectures such as small ResNets, MobileNets and ViT, and pretrained models such as DINO, SwAV or iBOT. When pretraining on the ImageNet dataset, RoB yields models that compete with supervised knowledge distillation. When applied to MSN, RoB produces students with strong semi-supervised capabilities. Finally, our best ViT-Tiny models improve over prior SSL state-of-the-art on ImageNet by $2.3\%$ and are on par or better than a supervised distilled DeiT on five downstream transfer tasks (iNaturalist, CIFAR, Clevr/Count, Clevr/Dist and Places). We hope RoB enables practical self-supervision at smaller scale.
    New Insights into Multi-Calibration. (arXiv:2301.08837v1 [cs.LG])
    We identify a novel connection between the recent literature on multi-group fairness for prediction algorithms and well-established notions of graph regularity from extremal graph theory. We frame our investigation using new, statistical distance-based variants of multi-calibration that are closely related to the concept of outcome indistinguishability. Adopting this perspective leads us naturally not only to our graph theoretic results, but also to new multi-calibration algorithms with improved complexity in certain parameter regimes, and to a generalization of a state-of-the-art result on omniprediction. Along the way, we also unify several prior algorithms for achieving multi-group fairness, as well as their analyses, through the lens of no-regret learning.
    LSTM and CNN application for core-collapse supernova search in gravitational wave real data. (arXiv:2301.09387v1 [astro-ph.IM])
    $Context.$ Core-collapse supernovae (CCSNe) are expected to emit gravitational wave signals that could be detected by current and future generation interferometers within the Milky Way and nearby galaxies. The stochastic nature of the signal arising from CCSNe requires alternative detection methods to matched filtering. $Aims.$ We aim to show the potential of machine learning (ML) for multi-label classification of different CCSNe simulated signals and noise transients using real data. We compared the performance of 1D and 2D convolutional neural networks (CNNs) on single and multiple detector data. For the first time, we tested multi-label classification also with long short-term memory (LSTM) networks. $Methods.$ We applied a search and classification procedure for CCSNe signals, using an event trigger generator, the Wavelet Detection Filter (WDF), coupled with ML. We used time series and time-frequency representations of the data as inputs to the ML models. To compute classification accuracies, we simultaneously injected, at detectable distance of 1\,kpc, CCSN waveforms, obtained from recent hydrodynamical simulations of neutrino-driven core-collapse, onto interferometer noise from the O2 LIGO and Virgo science run. $Results.$ We compared the performance of the three models on single detector data. We then merged the output of the models for single detector classification of noise and astrophysical transients, obtaining overall accuracies for LIGO ($\sim99\%$) and ($\sim80\%$) for Virgo. We extended our analysis to the multi-detector case using triggers coincident among the three ITFs and achieved an accuracy of $\sim98\%$.
    Ordinal Regression for Difficulty Estimation of StepMania Levels. (arXiv:2301.09485v1 [cs.LG])
    StepMania is a popular open-source clone of a rhythm-based video game. As is common in popular games, there is a large number of community-designed levels. It is often difficult for players and level authors to determine the difficulty level of such community contributions. In this work, we formalize and analyze the difficulty prediction task on StepMania levels as an ordinal regression (OR) task. We standardize a more extensive and diverse selection of this data resulting in five data sets, two of which are extensions of previous work. We evaluate many competitive OR and non-OR models, demonstrating that neural network-based models significantly outperform the state of the art and that StepMania-level data makes for an excellent test bed for deep OR models. We conclude with a user experiment showing our trained models' superiority over human labeling.
    Hierarchically branched diffusion models for efficient and interpretable multi-class conditional generation. (arXiv:2212.10777v2 [cs.LG] UPDATED)
    Diffusion models have achieved justifiable popularity by attaining state-of-the-art performance in generating realistic objects, including when conditioning generation on labels. Current diffusion models are universally linear in nature, modeling diffusion identically for objects of all classes. For the multi-class conditional generation problem, we propose a novel, structurally unique framework of diffusion models which are hierarchically branched according to the inherent relationships between classes. In this work, we showcase several advantages of branched diffusion models. We demonstrate that branched models generate samples more efficiently, and are more easily extended to novel classes in a continual-learning setting. We also show that branched models enjoy a unique interpretability that offers insight into the modeled data distribution. Branched diffusion models represent an alternative paradigm to their traditional linear counterparts, and can have large impacts in how we use diffusion models for efficient generation, online learning, and scientific discovery.
    LF-checker: Machine Learning Acceleration of Bounded Model Checking for Concurrency Verification (Competition Contribution). (arXiv:2301.09142v1 [cs.LG])
    We describe and evaluate LF-checker, a metaverifier tool based on machine learning. It extracts multiple features of the program under test and predicts the optimal configuration (flags) of a bounded model checker with a decision tree. Our current work is specialised in concurrency verification and employs ESBMC as a back-end verification engine. In the paper, we demonstrate that LF-checker achieves better results than the default configuration of the underlying verification engine.
    GP-NAS-ensemble: a model for NAS Performance Prediction. (arXiv:2301.09231v1 [cs.LG])
    It is of great significance to estimate the performance of a given model architecture without training in the application of Neural Architecture Search (NAS) as it may take a lot of time to evaluate the performance of an architecture. In this paper, a novel NAS framework called GP-NAS-ensemble is proposed to predict the performance of a neural network architecture with a small training dataset. We make several improvements on the GP-NAS model to make it share the advantage of ensemble learning methods. Our method ranks second in the CVPR2022 second lightweight NAS challenge performance prediction track.
    SMDDH: Singleton Mention detection using Deep Learning in Hindi Text. (arXiv:2301.09361v1 [cs.CL])
    Mention detection is an important component of coreference resolution system, where mentions such as name, nominal, and pronominals are identified. These mentions can be purely coreferential mentions or singleton mentions (non-coreferential mentions). Coreferential mentions are those mentions in a text that refer to the same entities in a real world. Whereas, singleton mentions are mentioned only once in the text and do not participate in the coreference as they are not mentioned again in the following text. Filtering of these singleton mentions can substantially improve the performance of a coreference resolution process. This paper proposes a singleton mention detection module based on a fully connected network and a Convolutional neural network for Hindi text. This model utilizes a few hand-crafted features and context information, and word embedding for words. The coreference annotated Hindi dataset comprising of 3.6K sentences, and 78K tokens are used for the task. In terms of Precision, Recall, and F-measure, the experimental findings obtained are excellent.
    A Comprehensive Survey on Heart Sound Analysis in the Deep Learning Era. (arXiv:2301.09362v1 [cs.SD])
    Heart sound auscultation has been demonstrated to be beneficial in clinical usage for early screening of cardiovascular diseases. Due to the high requirement of well-trained professionals for auscultation, automatic auscultation benefiting from signal processing and machine learning can help auxiliary diagnosis and reduce the burdens of training professional clinicians. Nevertheless, classic machine learning is limited to performance improvement in the era of big data. Deep learning has achieved better performance than classic machine learning in many research fields, as it employs more complex model architectures with stronger capability of extracting effective representations. Deep learning has been successfully applied to heart sound analysis in the past years. As most review works about heart sound analysis were given before 2017, the present survey is the first to work on a comprehensive overview to summarise papers on heart sound analysis with deep learning in the past six years 2017--2022. We introduce both classic machine learning and deep learning for comparison, and further offer insights about the advances and future research directions in deep learning for heart sound analysis.
    MATT: Multimodal Attention Level Estimation for e-learning Platforms. (arXiv:2301.09174v1 [cs.CV])
    This work presents a new multimodal system for remote attention level estimation based on multimodal face analysis. Our multimodal approach uses different parameters and signals obtained from the behavior and physiological processes that have been related to modeling cognitive load such as faces gestures (e.g., blink rate, facial actions units) and user actions (e.g., head pose, distance to the camera). The multimodal system uses the following modules based on Convolutional Neural Networks (CNNs): Eye blink detection, head pose estimation, facial landmark detection, and facial expression features. First, we individually evaluate the proposed modules in the task of estimating the student's attention level captured during online e-learning sessions. For that we trained binary classifiers (high or low attention) based on Support Vector Machines (SVM) for each module. Secondly, we find out to what extent multimodal score level fusion improves the attention level estimation. The mEBAL database is used in the experimental framework, a public multi-modal database for attention level estimation obtained in an e-learning environment that contains data from 38 users while conducting several e-learning tasks of variable difficulty (creating changes in student cognitive loads).
    A Tale of Two Latent Flows: Learning Latent Space Normalizing Flow with Short-run Langevin Flow for Approximate Inference. (arXiv:2301.09300v1 [stat.ML])
    We study a normalizing flow in the latent space of a top-down generator model, in which the normalizing flow model plays the role of the informative prior model of the generator. We propose to jointly learn the latent space normalizing flow prior model and the top-down generator model by a Markov chain Monte Carlo (MCMC)-based maximum likelihood algorithm, where a short-run Langevin sampling from the intractable posterior distribution is performed to infer the latent variables for each observed example, so that the parameters of the normalizing flow prior and the generator can be updated with the inferred latent variables. We show that, under the scenario of non-convergent short-run MCMC, the finite step Langevin dynamics is a flow-like approximate inference model and the learning objective actually follows the perturbation of the maximum likelihood estimation (MLE). We further point out that the learning framework seeks to (i) match the latent space normalizing flow and the aggregated posterior produced by the short-run Langevin flow, and (ii) bias the model from MLE such that the short-run Langevin flow inference is close to the true posterior. Empirical results of extensive experiments validate the effectiveness of the proposed latent space normalizing flow model in the tasks of image generation, image reconstruction, anomaly detection, supervised image inpainting and unsupervised image recovery.
    Enabling Hard Constraints in Differentiable Neural Network and Accelerator Co-Exploration. (arXiv:2301.09312v1 [cs.LG])
    Co-exploration of an optimal neural architecture and its hardware accelerator is an approach of rising interest which addresses the computational cost problem, especially in low-profile systems. The large co-exploration space is often handled by adopting the idea of differentiable neural architecture search. However, despite the superior search efficiency of the differentiable co-exploration, it faces a critical challenge of not being able to systematically satisfy hard constraints such as frame rate. To handle the hard constraint problem of differentiable co-exploration, we propose HDX, which searches for hard-constrained solutions without compromising the global design objectives. By manipulating the gradients in the interest of the given hard constraint, high-quality solutions satisfying the constraint can be obtained.
    On Multi-Agent Deep Deterministic Policy Gradients and their Explainability for SMARTS Environment. (arXiv:2301.09420v1 [cs.LG])
    Multi-Agent RL or MARL is one of the complex problems in Autonomous Driving literature that hampers the release of fully-autonomous vehicles today. Several simulators have been in iteration after their inception to mitigate the problem of complex scenarios with multiple agents in Autonomous Driving. One such simulator--SMARTS, discusses the importance of cooperative multi-agent learning. For this problem, we discuss two approaches--MAPPO and MADDPG, which are based on-policy and off-policy RL approaches. We compare our results with the state-of-the-art results for this challenge and discuss the potential areas of improvement while discussing the explainability of these approaches in conjunction with waypoints in the SMARTS environment.
    Explainable Quantum Machine Learning. (arXiv:2301.09138v1 [quant-ph])
    Methods of artificial intelligence (AI) and especially machine learning (ML) have been growing ever more complex, and at the same time have more and more impact on people's lives. This leads to explainable AI (XAI) manifesting itself as an important research field that helps humans to better comprehend ML systems. In parallel, quantum machine learning (QML) is emerging with the ongoing improvement of quantum computing hardware combined with its increasing availability via cloud services. QML enables quantum-enhanced ML in which quantum mechanics is exploited to facilitate ML tasks, typically in form of quantum-classical hybrid algorithms that combine quantum and classical resources. Quantum gates constitute the building blocks of gate-based quantum hardware and form circuits that can be used for quantum computations. For QML applications, quantum circuits are typically parameterized and their parameters are optimized classically such that a suitably defined objective function is minimized. Inspired by XAI, we raise the question of explainability of such circuits by quantifying the importance of (groups of) gates for specific goals. To this end, we transfer and adapt the well-established concept of Shapley values to the quantum realm. The resulting attributions can be interpreted as explanations for why a specific circuit works well for a given task, improving the understanding of how to construct parameterized (or variational) quantum circuits, and fostering their human interpretability in general. An experimental evaluation on simulators and two superconducting quantum hardware devices demonstrates the benefits of the proposed framework for classification, generative modeling, transpilation, and optimization. Furthermore, our results shed some light on the role of specific gates in popular QML approaches.
    Prompt Federated Learning for Weather Forecasting: Toward Foundation Models on Meteorological Data. (arXiv:2301.09152v1 [cs.LG])
    To tackle the global climate challenge, it urgently needs to develop a collaborative platform for comprehensive weather forecasting on large-scale meteorological data. Despite urgency, heterogeneous meteorological sensors across countries and regions, inevitably causing multivariate heterogeneity and data exposure, become the main barrier. This paper develops a foundation model across regions capable of understanding complex meteorological data and providing weather forecasting. To relieve the data exposure concern across regions, a novel federated learning approach has been proposed to collaboratively learn a brand-new spatio-temporal Transformer-based foundation model across participants with heterogeneous meteorological data. Moreover, a novel prompt learning mechanism has been adopted to satisfy low-resourced sensors' communication and computational constraints. The effectiveness of the proposed method has been demonstrated on classical weather forecasting tasks using three meteorological datasets with multivariate time series.
    Lower Bounds on Learning Pauli Channels. (arXiv:2301.09192v1 [quant-ph])
    Understanding the noise affecting a quantum device is of fundamental importance for scaling quantum technologies. A particularly important class of noise models is that of Pauli channels, as randomized compiling techniques can effectively bring any quantum channel to this form and are significantly more structured than general quantum channels. In this paper, we show fundamental lower bounds on the sample complexity for learning Pauli channels in diamond norm with unentangled measurements. We consider both adaptive and non-adaptive strategies. In the non-adaptive setting, we show a lower bound of $\Omega(2^{3n}\epsilon^{-2})$ to learn an $n$-qubit Pauli channel. In particular, this shows that the recently introduced learning procedure by Flammia and Wallman is essentially optimal. In the adaptive setting, we show a lower bound of $\Omega(2^{2.5n}\epsilon^{-2})$ for $\epsilon=\mathcal{O}(2^{-n})$, and a lower bound of $\Omega(2^{2n}\epsilon^{-2} )$ for any $\epsilon > 0$. This last lower bound even applies for arbitrarily many sequential uses of the channel, as long as they are only interspersed with other unital operations.
    On the Expressive Power of Geometric Graph Neural Networks. (arXiv:2301.09308v1 [cs.LG])
    The expressive power of Graph Neural Networks (GNNs) has been studied extensively through the Weisfeiler-Leman (WL) graph isomorphism test. However, standard GNNs and the WL framework are inapplicable for geometric graphs embedded in Euclidean space, such as biomolecules, materials, and other physical systems. In this work, we propose a geometric version of the WL test (GWL) for discriminating geometric graphs while respecting the underlying physical symmetries: permutations, rotation, reflection, and translation. We use GWL to characterise the expressive power of geometric GNNs that are invariant or equivariant to physical symmetries in terms of distinguishing geometric graphs. GWL unpacks how key design choices influence geometric GNN expressivity: (1) Invariant layers have limited expressivity as they cannot distinguish one-hop identical geometric graphs; (2) Equivariant layers distinguish a larger class of graphs by propagating geometric information beyond local neighbourhoods; (3) Higher order tensors and scalarisation enable maximally powerful geometric GNNs; and (4) GWL's discrimination-based perspective is equivalent to universal approximation. Synthetic experiments supplementing our results are available at https://github.com/chaitjo/geometric-gnn-dojo
    Abstracting Imperfect Information Away from Two-Player Zero-Sum Games. (arXiv:2301.09159v1 [cs.GT])
    In their seminal work, Nayyar et al. (2013) showed that imperfect information can be abstracted away from common-payoff games by having players publicly announce their policies as they play. This insight underpins sound solvers and decision-time planning algorithms for common-payoff games. Unfortunately, a naive application of the same insight to two-player zero-sum games fails because Nash equilibria of the game with public policy announcements may not correspond to Nash equilibria of the original game. As a consequence, existing sound decision-time planning algorithms require complicated additional mechanisms that have unappealing properties. The main contribution of this work is showing that certain regularized equilibria do not possess the aforementioned non-correspondence problem -- thus, computing them can be treated as perfect information problems. Because these regularized equilibria can be made arbitrarily close to Nash equilibria, our result opens the door to a new perspective on solving two-player zero-sum games and, in particular, yields a simplified framework for decision-time planning in two-player zero-sum games, void of the unappealing properties that plague existing decision-time planning approaches.
    Optimising complexity of CNN models for resource constrained devices: QRS detection case study. (arXiv:2301.09232v1 [cs.LG])
    Traditional DL models are complex and resource hungry and thus, care needs to be taken in designing Internet of (medical) things (IoT, or IoMT) applications balancing efficiency-complexity trade-off. Recent IoT solutions tend to avoid using deep-learning methods due to such complexities, and rather classical filter-based methods are commonly used. We hypothesize that a shallow CNN model can offer satisfactory level of performance in combination by leveraging other essential solution-components, such as post-processing that is suitable for resource constrained environment. In an IoMT application context, QRS-detection and R-peak localisation from ECG signal as a case study, the complexities of CNN models and post-processing were varied to identify a set of combinations suitable for a range of target resource-limited environments. To the best of our knowledge, finding a deploy-able configuration, by incrementally increasing the CNN model complexity, as required to match the target's resource capacity, and leveraging the strength of post-processing, is the first of its kind. The results show that a shallow 2-layer CNN with a suitable post-processing can achieve $>$90\% F1-score, and the scores continue to improving for 8-32 layer CNNs, which can be used to profile target constraint environment. The outcome shows that it is possible to design an optimal DL solution with known target performance characteristics and resource (computing capacity, and memory) constraints.
    A Survey on Actionable Knowledge. (arXiv:2301.09317v1 [cs.LG])
    Actionable Knowledge Discovery (AKD) is a crucial aspect of data mining that is gaining popularity and being applied in a wide range of domains. This is because AKD can extract valuable insights and information, also known as knowledge, from large datasets. The goal of this paper is to examine different research studies that focus on various domains and have different objectives. The paper will review and discuss the methods used in these studies in detail. AKD is a process of identifying and extracting actionable insights from data, which can be used to make informed decisions and improve business outcomes. It is a powerful tool for uncovering patterns and trends in data that can be used for various applications such as customer relationship management, marketing, and fraud detection. The research studies reviewed in this paper will explore different techniques and approaches for AKD in different domains, such as healthcare, finance, and telecommunications. The paper will provide a thorough analysis of the current state of AKD in the field and will review the main methods used by various research studies. Additionally, the paper will evaluate the advantages and disadvantages of each method and will discuss any novel or new solutions presented in the field. Overall, this paper aims to provide a comprehensive overview of the methods and techniques used in AKD and the impact they have on different domains.
    Max-Quantile Grouped Infinite-Arm Bandits. (arXiv:2210.01295v2 [stat.ML] UPDATED)
    In this paper, we consider a bandit problem in which there are a number of groups each consisting of infinitely many arms. Whenever a new arm is requested from a given group, its mean reward is drawn from an unknown reservoir distribution (different for each group), and the uncertainty in the arm's mean reward can only be reduced via subsequent pulls of the arm. The goal is to identify the infinite-arm group whose reservoir distribution has the highest $(1-\alpha)$-quantile (e.g., median if $\alpha = \frac{1}{2}$), using as few total arm pulls as possible. We introduce a two-step algorithm that first requests a fixed number of arms from each group and then runs a finite-arm grouped max-quantile bandit algorithm. We characterize both the instance-dependent and worst-case regret, and provide a matching lower bound for the latter, while discussing various strengths, weaknesses, algorithmic improvements, and potential lower bounds associated with our instance-dependent upper bounds.
    Differentially Private Natural Language Models: Recent Advances and Future Directions. (arXiv:2301.09112v1 [cs.CL])
    Recent developments in deep learning have led to great success in various natural language processing (NLP) tasks. However, these applications may involve data that contain sensitive information. Therefore, how to achieve good performance while also protect privacy of sensitive data is a crucial challenge in NLP. To preserve privacy, Differential Privacy (DP), which can prevent reconstruction attacks and protect against potential side knowledge, is becoming a de facto technique for private data analysis. In recent years, NLP in DP models (DP-NLP) has been studied from different perspectives, which deserves a comprehensive review. In this paper, we provide the first systematic review of recent advances on DP deep learning models in NLP. In particular, we first discuss some differences and additional challenges of DP-NLP compared with the standard DP deep learning. Then we investigate some existing work on DP-NLP and present its recent developments from two aspects: gradient perturbation based methods and embedding vector perturbation based methods. We also discuss some challenges and future directions of this topic.
    Learning to Reject with a Fixed Predictor: Application to Decontextualization. (arXiv:2301.09044v1 [cs.LG])
    We study the problem of classification with a reject option for a fixed predictor, applicable in natural language processing. \ignore{where many correct labels are often possible} We introduce a new problem formulation for this scenario, and an algorithm minimizing a new surrogate loss function. We provide a complete theoretical analysis of the surrogate loss function with a strong $H$-consistency guarantee. For evaluation, we choose the \textit{decontextualization} task, and provide a manually-labelled dataset of $2\mathord,000$ examples. Our algorithm significantly outperforms the baselines considered, with a $\sim\!\!25\%$ improvement in coverage when halving the error rate, which is only $\sim\!\! 3 \%$ away from the theoretical limit.
    Provable Unrestricted Adversarial Training without Compromise with Generalizability. (arXiv:2301.09069v1 [cs.LG])
    Adversarial training (AT) is widely considered as the most promising strategy to defend against adversarial attacks and has drawn increasing interest from researchers. However, the existing AT methods still suffer from two challenges. First, they are unable to handle unrestricted adversarial examples (UAEs), which are built from scratch, as opposed to restricted adversarial examples (RAEs), which are created by adding perturbations bound by an $l_p$ norm to observed examples. Second, the existing AT methods often achieve adversarial robustness at the expense of standard generalizability (i.e., the accuracy on natural examples) because they make a tradeoff between them. To overcome these challenges, we propose a unique viewpoint that understands UAEs as imperceptibly perturbed unobserved examples. Also, we find that the tradeoff results from the separation of the distributions of adversarial examples and natural examples. Based on these ideas, we propose a novel AT approach called Provable Unrestricted Adversarial Training (PUAT), which can provide a target classifier with comprehensive adversarial robustness against both UAE and RAE, and simultaneously improve its standard generalizability. Particularly, PUAT utilizes partially labeled data to achieve effective UAE generation by accurately capturing the natural data distribution through a novel augmented triple-GAN. At the same time, PUAT extends the traditional AT by introducing the supervised loss of the target classifier into the adversarial loss and achieves the alignment between the UAE distribution, the natural data distribution, and the distribution learned by the classifier, with the collaboration of the augmented triple-GAN. Finally, the solid theoretical analysis and extensive experiments conducted on widely-used benchmarks demonstrate the superiority of PUAT.
    Relaxed Models for Adversarial Streaming: The Advice Model and the Bounded Interruptions Model. (arXiv:2301.09203v1 [cs.DS])
    Streaming algorithms are typically analyzed in the oblivious setting, where we assume that the input stream is fixed in advance. Recently, there is a growing interest in designing adversarially robust streaming algorithms that must maintain utility even when the input stream is chosen adaptively and adversarially as the execution progresses. While several fascinating results are known for the adversarial setting, in general, it comes at a very high cost in terms of the required space. Motivated by this, in this work we set out to explore intermediate models that allow us to interpolate between the oblivious and the adversarial models. Specifically, we put forward the following two models: (1) *The advice model*, in which the streaming algorithm may occasionally ask for one bit of advice. (2) *The bounded interruptions model*, in which we assume that the adversary is only partially adaptive. We present both positive and negative results for each of these two models. In particular, we present generic reductions from each of these models to the oblivious model. This allows us to design robust algorithms with significantly improved space complexity compared to what is known in the plain adversarial model.
    Congested Bandits: Optimal Routing via Short-term Resets. (arXiv:2301.09251v1 [cs.LG])
    For traffic routing platforms, the choice of which route to recommend to a user depends on the congestion on these routes -- indeed, an individual's utility depends on the number of people using the recommended route at that instance. Motivated by this, we introduce the problem of Congested Bandits where each arm's reward is allowed to depend on the number of times it was played in the past $\Delta$ timesteps. This dependence on past history of actions leads to a dynamical system where an algorithm's present choices also affect its future pay-offs, and requires an algorithm to plan for this. We study the congestion aware formulation in the multi-armed bandit (MAB) setup and in the contextual bandit setup with linear rewards. For the multi-armed setup, we propose a UCB style algorithm and show that its policy regret scales as $\tilde{O}(\sqrt{K \Delta T})$. For the linear contextual bandit setup, our algorithm, based on an iterative least squares planner, achieves policy regret $\tilde{O}(\sqrt{dT} + \Delta)$. From an experimental standpoint, we corroborate the no-regret properties of our algorithms via a simulation study.
    Energy Prediction using Federated Learning. (arXiv:2301.09165v1 [cs.LG])
    In this work, we demonstrate the viability of using federated learning to successfully predict energy consumption as well as solar production for all households within a certain network using low-power and low-space consuming embedded devices. We also demonstrate our prediction performance improving over time without the need for sharing private consumer energy data. We simulate a system with four nodes using data for one year to show this.
    Doubly Adversarial Federated Bandits. (arXiv:2301.09223v1 [stat.ML])
    We study a new non-stochastic federated multi-armed bandit problem with multiple agents collaborating via a communication network. The losses of the arms are assigned by an oblivious adversary that specifies the loss of each arm not only for each time step but also for each agent, which we call ``doubly adversarial". In this setting, different agents may choose the same arm in the same time step but observe different feedback. The goal of each agent is to find a globally best arm in hindsight that has the lowest cumulative loss averaged over all agents, which necessities the communication among agents. We provide regret lower bounds for any federated bandit algorithm under different settings, when agents have access to full-information feedback, or the bandit feedback. For the bandit feedback setting, we propose a near-optimal federated bandit algorithm called FEDEXP3. Our algorithm gives a positive answer to an open question proposed in Cesa-Bianchi et al. (2016): FEDEXP3 can guarantee a sub-linear regret without exchanging sequences of selected arm identities or loss sequences among agents. We also provide numerical evaluations of our algorithm to validate our theoretical results and demonstrate its effectiveness on synthetic and real-world datasets
    Deterministic Online Classification: Non-iteratively Reweighted Recursive Least-Squares for Binary Class Rebalancing. (arXiv:2301.09230v1 [cs.LG])
    Deterministic solutions are becoming more critical for interpretability. Weighted Least-Squares (WLS) has been widely used as a deterministic batch solution with a specific weight design. In the online settings of WLS, exact reweighting is necessary to converge to its batch settings. In order to comply with its necessity, the iteratively reweighted least-squares algorithm is mainly utilized with a linearly growing time complexity which is not attractive for online learning. Due to the high and growing computational costs, an efficient online formulation of reweighted least-squares is desired. We introduce a new deterministic online classification algorithm of WLS with a constant time complexity for binary class rebalancing. We demonstrate that our proposed online formulation exactly converges to its batch formulation and outperforms existing state-of-the-art stochastic online binary classification algorithms in real-world data sets empirically.
    Towards NeuroAI: Introducing Neuronal Diversity into Artificial Neural Networks. (arXiv:2301.09245v1 [cs.NE])
    Throughout history, the development of artificial intelligence, particularly artificial neural networks, has been open to and constantly inspired by the increasingly deepened understanding of the brain, such as the inspiration of neocognitron, which is the pioneering work of convolutional neural networks. Per the motives of the emerging field: NeuroAI, a great amount of neuroscience knowledge can help catalyze the next generation of AI by endowing a network with more powerful capabilities. As we know, the human brain has numerous morphologically and functionally different neurons, while artificial neural networks are almost exclusively built on a single neuron type. In the human brain, neuronal diversity is an enabling factor for all kinds of biological intelligent behaviors. Since an artificial network is a miniature of the human brain, introducing neuronal diversity should be valuable in terms of addressing those essential problems of artificial networks such as efficiency, interpretability, and memory. In this Primer, we first discuss the preliminaries of biological neuronal diversity and the characteristics of information transmission and processing in a biological neuron. Then, we review studies of designing new neurons for artificial networks. Next, we discuss what gains can neuronal diversity bring into artificial networks and exemplary applications in several important fields. Lastly, we discuss the challenges and future directions of neuronal diversity to explore the potential of NeuroAI.
    MEMO : Accelerating Transformers with Memoization on Big Memory Systems. (arXiv:2301.09262v1 [cs.PF])
    Transformers gain popularity because of their superior prediction accuracy and inference throughput. However, the transformer is computation-intensive, causing a long inference time. The existing work to accelerate transformer inferences has limitations because of the changes to transformer architectures or the need for specialized hardware. In this paper, we identify the opportunities of using memoization to accelerate the attention mechanism in transformers without the above limitation. Built upon a unique observation that there is a rich similarity in attention computation across inference sequences, we build an attention database upon the emerging big memory system. We introduce the embedding technique to find semantically similar inputs to identify computation similarity. We also introduce a series of techniques such as memory mapping and selective memoization to avoid memory copy and unnecessary overhead. We enable 21% performance improvement on average (up to 68%) with the TB-scale attention database and with ignorable loss in inference accuracy.
    Practical Adversarial Attacks Against AI-Driven Power Allocation in a Distributed MIMO Network. (arXiv:2301.09305v1 [eess.SP])
    In distributed multiple-input multiple-output (D-MIMO) networks, power control is crucial to optimize the spectral efficiencies of users and max-min fairness (MMF) power control is a commonly used strategy as it satisfies uniform quality-of-service to all users. The optimal solution of MMF power control requires high complexity operations and hence deep neural network based artificial intelligence (AI) solutions are proposed to decrease the complexity. Although quite accurate models can be achieved by using AI, these models have some intrinsic vulnerabilities against adversarial attacks where carefully crafted perturbations are applied to the input of the AI model. In this work, we show that threats against the target AI model which might be originated from malicious users or radio units can substantially decrease the network performance by applying a successful adversarial sample, even in the most constrained circumstances. We also demonstrate that the risk associated with these kinds of adversarial attacks is higher than the conventional attack threats. Detailed simulations reveal the effectiveness of adversarial attacks and the necessity of smart defense techniques.
    BallGAN: 3D-aware Image Synthesis with a Spherical Background. (arXiv:2301.09091v1 [cs.CV])
    3D-aware GANs aim to synthesize realistic 3D scenes such that they can be rendered in arbitrary perspectives to produce images. Although previous methods produce realistic images, they suffer from unstable training or degenerate solutions where the 3D geometry is unnatural. We hypothesize that the 3D geometry is underdetermined due to the insufficient constraint, i.e., being classified as real image to the discriminator is not enough. To solve this problem, we propose to approximate the background as a spherical surface and represent a scene as a union of the foreground placed in the sphere and the thin spherical background. It reduces the degree of freedom in the background field. Accordingly, we modify the volume rendering equation and incorporate dedicated constraints to design a novel 3D-aware GAN framework named BallGAN. BallGAN has multiple advantages as follows. 1) It produces more reasonable 3D geometry; the images of a scene across different viewpoints have better photometric consistency and fidelity than the state-of-the-art methods. 2) The training becomes much more stable. 3) The foreground can be separately rendered on top of different arbitrary backgrounds.
    Learning to Linearize Deep Neural Networks for Secure and Efficient Private Inference. (arXiv:2301.09254v1 [cs.CV])
    The large number of ReLU non-linearity operations in existing deep neural networks makes them ill-suited for latency-efficient private inference (PI). Existing techniques to reduce ReLU operations often involve manual effort and sacrifice significant accuracy. In this paper, we first present a novel measure of non-linearity layers' ReLU sensitivity, enabling mitigation of the time-consuming manual efforts in identifying the same. Based on this sensitivity, we then present SENet, a three-stage training method that for a given ReLU budget, automatically assigns per-layer ReLU counts, decides the ReLU locations for each layer's activation map, and trains a model with significantly fewer ReLUs to potentially yield latency and communication efficient PI. Experimental evaluations with multiple models on various datasets show SENet's superior performance both in terms of reduced ReLUs and improved classification accuracy compared to existing alternatives. In particular, SENet can yield models that require up to ~2x fewer ReLUs while yielding similar accuracy. For a similar ReLU budget SENet can yield models with ~2.32% improved classification accuracy, evaluated on CIFAR-100.
    Learning in Congestion Games with Bandit Feedback. (arXiv:2206.01880v3 [cs.GT] UPDATED)
    In this paper, we investigate Nash-regret minimization in congestion games, a class of games with benign theoretical structure and broad real-world applications. We first propose a centralized algorithm based on the optimism in the face of uncertainty principle for congestion games with (semi-)bandit feedback, and obtain finite-sample guarantees. Then we propose a decentralized algorithm via a novel combination of the Frank-Wolfe method and G-optimal design. By exploiting the structure of the congestion game, we show the sample complexity of both algorithms depends only polynomially on the number of players and the number of facilities, but not the size of the action set, which can be exponentially large in terms of the number of facilities. We further define a new problem class, Markov congestion games, which allows us to model the non-stationarity in congestion games. We propose a centralized algorithm for Markov congestion games, whose sample complexity again has only polynomial dependence on all relevant problem parameters, but not the size of the action set.
    Condition monitoring and anomaly detection in cyber-physical systems. (arXiv:2301.09030v1 [cs.LG])
    The modern industrial environment is equipping myriads of smart manufacturing machines where the state of each device can be monitored continuously. Such monitoring can help identify possible future failures and develop a cost-effective maintenance plan. However, it is a daunting task to perform early detection with low false positives and negatives from the huge volume of collected data. This requires developing a holistic machine learning framework to address the issues in condition monitoring of high-priority components and develop efficient techniques to detect anomalies that can detect and possibly localize the faulty components. This paper presents a comparative analysis of recent machine learning approaches for robust, cost-effective anomaly detection in cyber-physical systems. While detection has been extensively studied, very few researchers have analyzed the localization of the anomalies. We show that supervised learning outperforms unsupervised algorithms. For supervised cases, we achieve near-perfect accuracy of 98 percent (specifically for tree-based algorithms). In contrast, the best-case accuracy in the unsupervised cases was 63 percent :the area under the receiver operating characteristic curve (AUC) exhibits similar outcomes as an additional metric.
    Combined Use of Federated Learning and Image Encryption for Privacy-Preserving Image Classification with Vision Transformer. (arXiv:2301.09255v1 [cs.CV])
    In recent years, privacy-preserving methods for deep learning have become an urgent problem. Accordingly, we propose the combined use of federated learning (FL) and encrypted images for privacy-preserving image classification under the use of the vision transformer (ViT). The proposed method allows us not only to train models over multiple participants without directly sharing their raw data but to also protect the privacy of test (query) images for the first time. In addition, it can also maintain the same accuracy as normally trained models. In an experiment, the proposed method was demonstrated to well work without any performance degradation on the CIFAR-10 and CIFAR-100 datasets.
    Statistically Optimal Robust Mean and Covariance Estimation for Anisotropic Gaussians. (arXiv:2301.09024v1 [math.ST])
    Assume that $X_{1}, \ldots, X_{N}$ is an $\varepsilon$-contaminated sample of $N$ independent Gaussian vectors in $\mathbb{R}^d$ with mean $\mu$ and covariance $\Sigma$. In the strong $\varepsilon$-contamination model we assume that the adversary replaced an $\varepsilon$ fraction of vectors in the original Gaussian sample by any other vectors. We show that there is an estimator $\widehat \mu$ of the mean satisfying, with probability at least $1 - \delta$, a bound of the form \[ \|\widehat{\mu} - \mu\|_2 \le c\left(\sqrt{\frac{\operatorname{Tr}(\Sigma)}{N}} + \sqrt{\frac{\|\Sigma\|\log(1/\delta)}{N}} + \varepsilon\sqrt{\|\Sigma\|}\right), \] where $c > 0$ is an absolute constant and $\|\Sigma\|$ denotes the operator norm of $\Sigma$. In the same contaminated Gaussian setup, we construct an estimator $\widehat \Sigma$ of the covariance matrix $\Sigma$ that satisfies, with probability at least $1 - \delta$, \[ \left\|\widehat{\Sigma} - \Sigma\right\| \le c\left(\sqrt{\frac{\|\Sigma\|\operatorname{Tr}(\Sigma)}{N}} + \|\Sigma\|\sqrt{\frac{\log(1/\delta)}{N}} + \varepsilon\|\Sigma\|\right). \] Both results are optimal up to multiplicative constant factors. Despite the recent significant interest in robust statistics, achieving both dimension-free bounds in the canonical Gaussian case remained open. In fact, several previously known results were either dimension-dependent and required $\Sigma$ to be close to identity, or had a sub-optimal dependence on the contamination level $\varepsilon$. As a part of the analysis, we derive sharp concentration inequalities for central order statistics of Gaussian, folded normal, and chi-squared distributions.
    Self Reward Design with Fine-grained Interpretability. (arXiv:2112.15034v3 [cs.LG] UPDATED)
    The black-box nature of deep neural networks (DNN) has brought to attention the issues of transparency and fairness. Deep Reinforcement Learning (Deep RL or DRL), which uses DNN to learn its policy, value functions etc, is thus also subject to similar concerns. This paper proposes a way to circumvent the issues through the bottom-up design of neural networks with detailed interpretability, where each neuron or layer has its own meaning and utility that corresponds to humanly understandable concept. The framework introduced in this paper is called the Self Reward Design (SRD), inspired by the Inverse Reward Design, and this interpretable design can (1) solve the problem by pure design (although imperfectly) and (2) be optimized like a standard DNN. With deliberate human designs, we show that some RL problems such as lavaland and MuJoCo can be solved using a model constructed with standard NN components with few parameters. Furthermore, with our fish sale auction example, we demonstrate how SRD is used to address situations that will not make sense if black-box models are used, where humanly-understandable semantic-based decision is required.
    Unifying Synergies between Self-supervised Learning and Dynamic Computation. (arXiv:2301.09164v1 [cs.LG])
    Self-supervised learning (SSL) approaches have made major strides forward by emulating the performance of their supervised counterparts on several computer vision benchmarks. This, however, comes at a cost of substantially larger model sizes, and computationally expensive training strategies, which eventually lead to larger inference times making it impractical for resource constrained industrial settings. Techniques like knowledge distillation (KD), dynamic computation (DC), and pruning are often used to obtain a lightweight sub-network, which usually involves multiple epochs of fine-tuning of a large pre-trained model, making it more computationally challenging. In this work we propose a novel perspective on the interplay between SSL and DC paradigms that can be leveraged to simultaneously learn a dense and gated (sparse/lightweight) sub-network from scratch offering a good accuracy-efficiency trade-off, and therefore yielding a generic and multi-purpose architecture for application specific industrial settings. Our study overall conveys a constructive message: exhaustive experiments on several image classification benchmarks: CIFAR-10, STL-10, CIFAR-100, and ImageNet-100, demonstrates that the proposed training strategy provides a dense and corresponding sparse sub-network that achieves comparable (on-par) performance compared with the vanilla self-supervised setting, but at a significant reduction in computation in terms of FLOPs under a range of target budgets.
    Raw or Cooked? Object Detection on RAW Images. (arXiv:2301.08965v1 [cs.CV])
    Images fed to a deep neural network have in general undergone several handcrafted image signal processing (ISP) operations, all of which have been optimized to produce visually pleasing images. In this work, we investigate the hypothesis that the intermediate representation of visually pleasing images is sub-optimal for downstream computer vision tasks compared to the RAW image representation. We suggest that the operations of the ISP instead should be optimized towards the end task, by learning the parameters of the operations jointly during training. We extend previous works on this topic and propose a new learnable operation that enables an object detector to achieve superior performance when compared to both previous works and traditional RGB images. In experiments on the open PASCALRAW dataset, we empirically confirm our hypothesis.
    Self-Supervised Image Representation Learning: Transcending Masking with Paired Image Overlay. (arXiv:2301.09299v1 [cs.CV])
    Self-supervised learning has become a popular approach in recent years for its ability to learn meaningful representations without the need for data annotation. This paper proposes a novel image augmentation technique, overlaying images, which has not been widely applied in self-supervised learning. This method is designed to provide better guidance for the model to understand underlying information, resulting in more useful representations. The proposed method is evaluated using contrastive learning, a widely used self-supervised learning method that has shown solid performance in downstream tasks. The results demonstrate the effectiveness of the proposed augmentation technique in improving the performance of self-supervised models.
    Improving Deep Neural Network Classification Confidence using Heatmap-based eXplainable AI. (arXiv:2201.00009v3 [cs.LG] UPDATED)
    This paper quantifies the quality of heatmap-based eXplainable AI (XAI) methods w.r.t image classification problem. Here, a heatmap is considered desirable if it improves the probability of predicting the correct classes. Different XAI heatmap-based methods are empirically shown to improve classification confidence to different extents depending on the datasets, e.g. Saliency works best on ImageNet and Deconvolution on Chest X-Ray Pneumonia dataset. The novelty includes a new gap distribution that shows a stark difference between correct and wrong predictions. Finally, the generative augmentative explanation is introduced, a method to generate heatmaps capable of improving predictive confidence to a high level.
    Adapting a Language Model While Preserving its General Knowledge. (arXiv:2301.08986v1 [cs.CL])
    Domain-adaptive pre-training (or DA-training for short), also known as post-training, aims to train a pre-trained general-purpose language model (LM) using an unlabeled corpus of a particular domain to adapt the LM so that end-tasks in the domain can give improved performances. However, existing DA-training methods are in some sense blind as they do not explicitly identify what knowledge in the LM should be preserved and what should be changed by the domain corpus. This paper shows that the existing methods are suboptimal and proposes a novel method to perform a more informed adaptation of the knowledge in the LM by (1) soft-masking the attention heads based on their importance to best preserve the general knowledge in the LM and (2) contrasting the representations of the general and the full (both general and domain knowledge) to learn an integrated representation with both general and domain-specific knowledge. Experimental results will demonstrate the effectiveness of the proposed approach.
    The Shape of Explanations: A Topological Account of Rule-Based Explanations in Machine Learning. (arXiv:2301.09042v1 [cs.LG])
    Rule-based explanations provide simple reasons explaining the behavior of machine learning classifiers at given points in the feature space. Several recent methods (Anchors, LORE, etc.) purport to generate rule-based explanations for arbitrary or black-box classifiers. But what makes these methods work in general? We introduce a topological framework for rule-based explanation methods and provide a characterization of explainability in terms of the definability of a classifier relative to an explanation scheme. We employ this framework to consider various explanation schemes and argue that the preferred scheme depends on how much the user knows about the domain and the probability measure over the feature space.
    Efficient Training Under Limited Resources. (arXiv:2301.09264v1 [cs.LG])
    Training time budget and size of the dataset are among the factors affecting the performance of a Deep Neural Network (DNN). This paper shows that Neural Architecture Search (NAS), Hyper Parameters Optimization (HPO), and Data Augmentation help DNNs perform much better while these two factors are limited. However, searching for an optimal architecture and the best hyperparameter values besides a good combination of data augmentation techniques under low resources requires many experiments. We present our approach to achieving such a goal in three steps: reducing training epoch time by compressing the model while maintaining the performance compared to the original model, preventing model overfitting when the dataset is small, and performing the hyperparameter tuning. We used NOMAD, which is a blackbox optimization software based on a derivative-free algorithm to do NAS and HPO. Our work achieved an accuracy of 86.0 % on a tiny subset of Mini-ImageNet at the ICLR 2021 Hardware Aware Efficient Training (HAET) Challenge and won second place in the competition. The competition results can be found at haet2021.github.io/challenge and our source code can be found at github.com/DouniaLakhmiri/ICLR\_HAET2021.
    Pre-text Representation Transfer for Deep Learning with Limited Imbalanced Data : Application to CT-based COVID-19 Detection. (arXiv:2301.08888v1 [eess.IV])
    Annotating medical images for disease detection is often tedious and expensive. Moreover, the available training samples for a given task are generally scarce and imbalanced. These conditions are not conducive for learning effective deep neural models. Hence, it is common to 'transfer' neural networks trained on natural images to the medical image domain. However, this paradigm lacks in performance due to the large domain gap between the natural and medical image data. To address that, we propose a novel concept of Pre-text Representation Transfer (PRT). In contrast to the conventional transfer learning, which fine-tunes a source model after replacing its classification layers, PRT retains the original classification layers and updates the representation layers through an unsupervised pre-text task. The task is performed with (original, not synthetic) medical images, without utilizing any annotations. This enables representation transfer with a large amount of training data. This high-fidelity representation transfer allows us to use the resulting model as a more effective feature extractor. Moreover, we can also subsequently perform the traditional transfer learning with this model. We devise a collaborative representation based classification layer for the case when we leverage the model as a feature extractor. We fuse the output of this layer with the predictions of a model induced with the traditional transfer learning performed over our pre-text transferred model. The utility of our technique for limited and imbalanced data classification problem is demonstrated with an extensive five-fold evaluation for three large-scale models, tested for five different class-imbalance ratios for CT based COVID-19 detection. Our results show a consistent gain over the conventional transfer learning with the proposed method.
    Spatial Attention Kinetic Networks with E(n)-Equivariance. (arXiv:2301.08893v1 [cs.LG])
    Neural networks that are equivariant to rotations, translations, reflections, and permutations on n-dimensional geometric space have shown promise in physical modeling for tasks such as accurately but inexpensively modeling complex potential energy surfaces to guiding the sampling of complex dynamical systems or forecasting their time evolution. Current state-of-the-art methods employ spherical harmonics to encode higher-order interactions among particles, which are computationally expensive. In this paper, we propose a simple alternative functional form that uses neurally parametrized linear combinations of edge vectors to achieve equivariance while still universally approximating node environments. Incorporating this insight, we design spatial attention kinetic networks with E(n)-equivariance, or SAKE, which are competitive in many-body system modeling tasks while being significantly faster.
    The Best of Both Worlds: Accurate Global and Personalized Models through Federated Learning with Data-Free Hyper-Knowledge Distillation. (arXiv:2301.08968v1 [cs.LG])
    Heterogeneity of data distributed across clients limits the performance of global models trained through federated learning, especially in the settings with highly imbalanced class distributions of local datasets. In recent years, personalized federated learning (pFL) has emerged as a potential solution to the challenges presented by heterogeneous data. However, existing pFL methods typically enhance performance of local models at the expense of the global model's accuracy. We propose FedHKD (Federated Hyper-Knowledge Distillation), a novel FL algorithm in which clients rely on knowledge distillation (KD) to train local models. In particular, each client extracts and sends to the server the means of local data representations and the corresponding soft predictions -- information that we refer to as ``hyper-knowledge". The server aggregates this information and broadcasts it to the clients in support of local training. Notably, unlike other KD-based pFL methods, FedHKD does not rely on a public dataset nor it deploys a generative model at the server. We analyze convergence of FedHKD and conduct extensive experiments on visual datasets in a variety of scenarios, demonstrating that FedHKD provides significant improvement in both personalized as well as global model performance compared to state-of-the-art FL methods designed for heterogeneous data settings.
    The Conditional Cauchy-Schwarz Divergence with Applications to Time-Series Data and Sequential Decision Making. (arXiv:2301.08970v1 [cs.LG])
    The Cauchy-Schwarz (CS) divergence was developed by Pr\'{i}ncipe et al. in 2000. In this paper, we extend the classic CS divergence to quantify the closeness between two conditional distributions and show that the developed conditional CS divergence can be simply estimated by a kernel density estimator from given samples. We illustrate the advantages (e.g., the rigorous faithfulness guarantee, the lower computational complexity, the higher statistical power, and the much more flexibility in a wide range of applications) of our conditional CS divergence over previous proposals, such as the conditional KL divergence and the conditional maximum mean discrepancy. We also demonstrate the compelling performance of conditional CS divergence in two machine learning tasks related to time series data and sequential inference, namely the time series clustering and the uncertainty-guided exploration for sequential decision making.
    Classification of Luminal Subtypes in Full Mammogram Images Using Transfer Learning. (arXiv:2301.09282v1 [eess.IV])
    Automatic identification of patients with luminal and non-luminal subtypes during a routine mammography screening can support clinicians in streamlining breast cancer therapy planning. Recent machine learning techniques have shown promising results in molecular subtype classification in mammography; however, they are highly dependent on pixel-level annotations, handcrafted, and radiomic features. In this work, we provide initial insights into the luminal subtype classification in full mammogram images trained using only image-level labels. Transfer learning is applied from a breast abnormality classification task, to finetune a ResNet-18-based luminal versus non-luminal subtype classification task. We present and compare our results on the publicly available CMMD dataset and show that our approach significantly outperforms the baseline classifier by achieving a mean AUC score of 0.6688 and a mean F1 score of 0.6693 on the test dataset. The improvement over baseline is statistically significant, with a p-value of p<0.0001.
    A Semantic Modular Framework for Events Topic Modeling in Social Media. (arXiv:2301.09009v1 [cs.LG])
    The advancement of social media contributes to the growing amount of content they share frequently. This framework provides a sophisticated place for people to report various real-life events. Detecting these events with the help of natural language processing has received researchers' attention, and various algorithms have been developed for this goal. In this paper, we propose a Semantic Modular Model (SMM) consisting of 5 different modules, namely Distributional Denoising Autoencoder, Incremental Clustering, Semantic Denoising, Defragmentation, and Ranking and Processing. The proposed model aims to (1) cluster various documents and ignore the documents that might not contribute to the identification of events, (2) identify more important and descriptive keywords. Compared to the state-of-the-art methods, the results show that the proposed model has a higher performance in identifying events with lower ranks and extracting keywords for more important events in three English Twitter datasets: FACup, SuperTuesday, and USElection. The proposed method outperformed the best reported results in the mean keyword-precision metric by 7.9\%.
    Leveraging Speaker Embeddings with Adversarial Multi-task Learning for Age Group Classification. (arXiv:2301.09058v1 [eess.AS])
    Recently, researchers have utilized neural network-based speaker embedding techniques in speaker-recognition tasks to identify speakers accurately. However, speaker-discriminative embeddings do not always represent speech features such as age group well. In an embedding model that has been highly trained to capture speaker traits, the task of age group classification is closer to speech information leakage. Hence, to improve age group classification performance, we consider the use of speaker-discriminative embeddings derived from adversarial multi-task learning to align features and reduce the domain discrepancy in age subgroups. In addition, we investigated different types of speaker embeddings to learn and generalize the domain-invariant representations for age groups. Experimental results on the VoxCeleb Enrichment dataset verify the effectiveness of our proposed adaptive adversarial network in multi-objective scenarios and leveraging speaker embeddings for the domain adaptation task.
    Design-based individual prediction. (arXiv:2301.09117v1 [stat.ML])
    A design-based individual prediction approach is developed based on the expected cross-validation results, given the sampling design and the sample-splitting design for cross-validation. Whether the predictor is selected from an ensemble of models or a weighted average of them, valid inference of the unobserved prediction errors is defined and obtained with respect to the sampling design, while outcomes and features are treated as constants.
    Is Nash Equilibrium Approximator Learnable?. (arXiv:2108.07472v5 [cs.GT] UPDATED)
    In this paper, we investigate the learnability of the function approximator that approximates Nash equilibrium (NE) for games generated from a distribution. First, we offer a generalization bound using the Probably Approximately Correct (PAC) learning model. The bound describes the gap between the expected loss and empirical loss of the NE approximator. Afterward, we prove the agnostic PAC learnability of the Nash approximator. In addition to theoretical analysis, we demonstrate an application of NE approximator in experiments. The trained NE approximator can be used to warm-start and accelerate classical NE solvers. Together, our results show the practicability of approximating NE through function approximation.
    Regeneration Learning: A Learning Paradigm for Data Generation. (arXiv:2301.08846v1 [cs.LG])
    Machine learning methods for conditional data generation usually build a mapping from source conditional data X to target data Y. The target Y (e.g., text, speech, music, image, video) is usually high-dimensional and complex, and contains information that does not exist in source data, which hinders effective and efficient learning on the source-target mapping. In this paper, we present a learning paradigm called regeneration learning for data generation, which first generates Y' (an abstraction/representation of Y) from X and then generates Y from Y'. During training, Y' is obtained from Y through either handcrafted rules or self-supervised learning and is used to learn X-->Y' and Y'-->Y. Regeneration learning extends the concept of representation learning to data generation tasks, and can be regarded as a counterpart of traditional representation learning, since 1) regeneration learning handles the abstraction (Y') of the target data Y for data generation while traditional representation learning handles the abstraction (X') of source data X for data understanding; 2) both the processes of Y'-->Y in regeneration learning and X-->X' in representation learning can be learned in a self-supervised way (e.g., pre-training); 3) both the mappings from X to Y' in regeneration learning and from X' to Y in representation learning are simpler than the direct mapping from X to Y. We show that regeneration learning can be a widely-used paradigm for data generation (e.g., text generation, speech recognition, speech synthesis, music composition, image generation, and video generation) and can provide valuable insights into developing data generation methods.
    Debiasing the Cloze Task in Sequential Recommendation with Bidirectional Transformers. (arXiv:2301.09210v1 [cs.LG])
    Bidirectional Transformer architectures are state-of-the-art sequential recommendation models that use a bi-directional representation capacity based on the Cloze task, a.k.a. Masked Language Modeling. The latter aims to predict randomly masked items within the sequence. Because they assume that the true interacted item is the most relevant one, an exposure bias results, where non-interacted items with low exposure propensities are assumed to be irrelevant. The most common approach to mitigating exposure bias in recommendation has been Inverse Propensity Scoring (IPS), which consists of down-weighting the interacted predictions in the loss function in proportion to their propensities of exposure, yielding a theoretically unbiased learning. In this work, we argue and prove that IPS does not extend to sequential recommendation because it fails to account for the temporal nature of the problem. We then propose a novel propensity scoring mechanism, which can theoretically debias the Cloze task in sequential recommendation. Finally we empirically demonstrate the debiasing capabilities of our proposed approach and its robustness to the severity of exposure bias.
    Probabilistic Surrogate Networks for Simulators with Unbounded Randomness. (arXiv:1910.11950v3 [cs.LG] UPDATED)
    We present a framework for automatically structuring and training fast, approximate, deep neural surrogates of stochastic simulators. Unlike traditional approaches to surrogate modeling, our surrogates retain the interpretable structure and control flow of the reference simulator. Our surrogates target stochastic simulators where the number of random variables itself can be stochastic and potentially unbounded. Our framework further enables an automatic replacement of the reference simulator with the surrogate when undertaking amortized inference. The fidelity and speed of our surrogates allow for both faster stochastic simulation and accurate and substantially faster posterior inference. Using an illustrative yet non-trivial example we show our surrogates' ability to accurately model a probabilistic program with an unbounded number of random variables. We then proceed with an example that shows our surrogates are able to accurately model a complex structure like an unbounded stack in a program synthesis example. We further demonstrate how our surrogate modeling technique makes amortized inference in complex black-box simulators an order of magnitude faster. Specifically, we do simulator-based materials quality testing, inferring safety-critical latent internal temperature profiles of composite materials undergoing curing.
    DeepFEL: Deep Fastfood Ensemble Learning for Histopathology Image Analysis. (arXiv:2301.09525v1 [eess.IV])
    Computational pathology tasks have some unique characterises such as multi-gigapixel images, tedious and frequently uncertain annotations, and unavailability of large number of cases [13]. To address some of these issues, we present Deep Fastfood Ensembles - a simple, fast and yet effective method for combining deep features pooled from popular CNN models pre-trained on totally different source domains (e.g., natural image objects) and projected onto diverse dimensions using random projections, the so-called Fastfood [11]. The final ensemble output is obtained by a consensus of simple individual classifiers, each of which is trained on a different collection of random basis vectors. This offers extremely fast and yet effective solution, especially when training times and domain labels are of the essence. We demonstrate the effectiveness of the proposed deep fastfood ensemble learning as compared to the state-of-the-art methods for three different tasks in histopathology image analysis.
    Accelerating Fair Federated Learning: Adaptive Federated Adam. (arXiv:2301.09357v1 [cs.LG])
    Federated learning is a distributed and privacy-preserving approach to train a statistical model collaboratively from decentralized data of different parties. However, when datasets of participants are not independent and identically distributed (non-IID), models trained by naive federated algorithms may be biased towards certain participants, and model performance across participants is non-uniform. This is known as the fairness problem in federated learning. In this paper, we formulate fairness-controlled federated learning as a dynamical multi-objective optimization problem to ensure fair performance across all participants. To solve the problem efficiently, we study the convergence and bias of Adam as the server optimizer in federated learning, and propose Adaptive Federated Adam (AdaFedAdam) to accelerate fair federated learning with alleviated bias. We validated the effectiveness, Pareto optimality and robustness of AdaFedAdam in numerical experiments and show that AdaFedAdam outperforms existing algorithms, providing better convergence and fairness properties of the federated scheme.
    A Structural Approach to the Design of Domain Specific Neural Network Architectures. (arXiv:2301.09381v1 [cs.LG])
    This is a master's thesis concerning the theoretical ideas of geometric deep learning. Geometric deep learning aims to provide a structured characterization of neural network architectures, specifically focused on the ideas of invariance and equivariance of data with respect to given transformations. This thesis aims to provide a theoretical evaluation of geometric deep learning, compiling theoretical results that characterize the properties of invariant neural networks with respect to learning performance.
    HALOC: Hardware-Aware Automatic Low-Rank Compression for Compact Neural Networks. (arXiv:2301.09422v1 [cs.LG])
    Low-rank compression is an important model compression strategy for obtaining compact neural network models. In general, because the rank values directly determine the model complexity and model accuracy, proper selection of layer-wise rank is very critical and desired. To date, though many low-rank compression approaches, either selecting the ranks in a manual or automatic way, have been proposed, they suffer from costly manual trials or unsatisfied compression performance. In addition, all of the existing works are not designed in a hardware-aware way, limiting the practical performance of the compressed models on real-world hardware platforms. To address these challenges, in this paper we propose HALOC, a hardware-aware automatic low-rank compression framework. By interpreting automatic rank selection from an architecture search perspective, we develop an end-to-end solution to determine the suitable layer-wise ranks in a differentiable and hardware-aware way. We further propose design principles and mitigation strategy to efficiently explore the rank space and reduce the potential interference problem. Experimental results on different datasets and hardware platforms demonstrate the effectiveness of our proposed approach. On CIFAR-10 dataset, HALOC enables 0.07% and 0.38% accuracy increase over the uncompressed ResNet-20 and VGG-16 models with 72.20% and 86.44% fewer FLOPs, respectively. On ImageNet dataset, HALOC achieves 0.9% higher top-1 accuracy than the original ResNet-18 model with 66.16% fewer FLOPs. HALOC also shows 0.66% higher top-1 accuracy increase than the state-of-the-art automatic low-rank compression solution with fewer computational and memory costs. In addition, HALOC demonstrates the practical speedups on different hardware platforms, verified by the measurement results on desktop GPU, embedded GPU and ASIC accelerator.
    StockEmotions: Discover Investor Emotions for Financial Sentiment Analysis and Multivariate Time Series. (arXiv:2301.09279v1 [cs.CL])
    There has been growing interest in applying NLP techniques in the financial domain, however, resources are extremely limited. This paper introduces StockEmotions, a new dataset for detecting emotions in the stock market that consists of 10,000 English comments collected from StockTwits, a financial social media platform. Inspired by behavioral finance, it proposes 12 fine-grained emotion classes that span the roller coaster of investor emotion. Unlike existing financial sentiment datasets, StockEmotions presents granular features such as investor sentiment classes, fine-grained emotions, emojis, and time series data. To demonstrate the usability of the dataset, we perform a dataset analysis and conduct experimental downstream tasks. For financial sentiment/emotion classification tasks, DistilBERT outperforms other baselines, and for multivariate time series forecasting, a Temporal Attention LSTM model combining price index, text, and emotion features achieves the best performance than using a single feature.
    Logical Message Passing Networks with One-hop Inference on Atomic Formulas. (arXiv:2301.08859v1 [cs.LG])
    Complex Query Answering (CQA) over Knowledge Graphs (KGs) has attracted a lot of attention to potentially support many applications. Given that KGs are usually incomplete, neural models are proposed to answer logical queries by parameterizing set operators with complex neural networks. However, such methods usually train neural set operators with a large number of entity and relation embeddings from zero, where whether and how the embeddings or the neural set operators contribute to the performance remains not clear. In this paper, we propose a simple framework for complex query answering that decomposes the KG embeddings from neural set operators. We propose to represent the complex queries in the query graph. On top of the query graph, we propose the Logical Message Passing Neural Network (LMPNN) that connects the \textit{local} one-hop inferences on atomic formulas to the \textit{global} logical reasoning for complex query answering. We leverage existing effective KG embeddings to conduct one-hop inferences on atomic formulas, the results of which are regarded as the messages passed in LMPNN. The reasoning process over the overall logical formulas is turned into the forward pass of LMPNN that incrementally aggregates local information to predict the answers' embeddings finally. The complex logical inference across different types of queries will then be learned from training examples based on the LMPNN architecture. Theoretically, our query-graph representation is more general than the prevailing operator-tree formulation, so our approach applies to a broader range of complex KG queries. Empirically, our approach yields a new state-of-the-art neural CQA model. Our research bridges the gap between complex KG query answering tasks and the long-standing achievements of knowledge graph representation learning.  ( 2 min )
    Impact of PCA-based preprocessing and different CNN structures on deformable registration of sonograms. (arXiv:2301.08802v1 [cs.CV])
    Central venous catheters (CVC) are commonly inserted into the large veins of the neck, e.g. the internal jugular vein (IJV). CVC insertion may cause serious complications like misplacement into an artery or perforation of cervical vessels. Placing a CVC under sonographic guidance is an appropriate method to reduce such adverse events, if anatomical landmarks like venous and arterial vessels can be detected reliably. This task shall be solved by registration of patient individual images vs. an anatomically labelled reference image. In this work, a linear, affine transformation is performed on cervical sonograms, followed by a non-linear transformation to achieve a more precise registration. Voxelmorph (VM), a learning-based library for deformable image registration using a convolutional neural network (CNN) with U-Net structure was used for non-linear transformation. The impact of principal component analysis (PCA)-based pre-denoising of patient individual images, as well as the impact of modified net structures with differing complexities on registration results were examined visually and quantitatively, the latter using metrics for deformation and image similarity. Using the PCA-approximated cervical sonograms resulted in decreased mean deformation lengths between 18% and 66% compared to their original image counterparts, depending on net structure. In addition, reducing the number of convolutional layers led to improved image similarity with PCA images, while worsening in original images. Despite a large reduction of network parameters, no overall decrease in registration quality was observed, leading to the conclusion that the original net structure is oversized for the task at hand.  ( 2 min )
    Characterization and Learning of Causal Graphs with Small Conditioning Sets. (arXiv:2301.09028v1 [cs.AI])
    Constraint-based causal discovery algorithms learn part of the causal graph structure by systematically testing conditional independences observed in the data. These algorithms, such as the PC algorithm and its variants, rely on graphical characterizations of the so-called equivalence class of causal graphs proposed by Pearl. However, constraint-based causal discovery algorithms struggle when data is limited since conditional independence tests quickly lose their statistical power, especially when the conditioning set is large. To address this, we propose using conditional independence tests where the size of the conditioning set is upper bounded by some integer $k$ for robust causal discovery. The existing graphical characterizations of the equivalence classes of causal graphs are not applicable when we cannot leverage all the conditional independence statements. We first define the notion of $k$-Markov equivalence: Two causal graphs are $k$-Markov equivalent if they entail the same conditional independence constraints where the conditioning set size is upper bounded by $k$. We propose a novel representation that allows us to graphically characterize $k$-Markov equivalence between two causal graphs. We propose a sound constraint-based algorithm called the $k$-PC algorithm for learning this equivalence class. Finally, we conduct synthetic, and semi-synthetic experiments to demonstrate that the $k$-PC algorithm enables more robust causal discovery in the small sample regime compared to the baseline PC algorithm.  ( 2 min )
    Fast likelihood-based change point detection. (arXiv:2301.08892v1 [cs.LG])
    Change point detection plays a fundamental role in many real-world applications, where the goal is to analyze and monitor the behaviour of a data stream. In this paper, we study change detection in binary streams. To this end, we use a likelihood ratio between two models as a measure for indicating change. The first model is a single bernoulli variable while the second model divides the stored data in two segments, and models each segment with its own bernoulli variable. Finding the optimal split can be done in $O(n)$ time, where $n$ is the number of entries since the last change point. This is too expensive for large $n$. To combat this we propose an approximation scheme that yields $(1 - \epsilon)$ approximation in $O(\epsilon^{-1} \log^2 n)$ time. The speed-up consists of several steps: First we reduce the number of possible candidates by adopting a known result from segmentation problems. We then show that for fixed bernoulli parameters we can find the optimal change point in logarithmic time. Finally, we show how to construct a candidate list of size $O(\epsilon^{-1} \log n)$ for model parameters. We demonstrate empirically the approximation quality and the running time of our algorithm, showing that we can gain a significant speed-up with a minimal average loss in optimality.  ( 2 min )
    Quasi-optimal Learning with Continuous Treatments. (arXiv:2301.08940v1 [stat.ML])
    Many real-world applications of reinforcement learning (RL) require making decisions in continuous action environments. In particular, determining the optimal dose level plays a vital role in developing medical treatment regimes. One challenge in adapting existing RL algorithms to medical applications, however, is that the popular infinite support stochastic policies, e.g., Gaussian policy, may assign riskily high dosages and harm patients seriously. Hence, it is important to induce a policy class whose support only contains near-optimal actions, and shrink the action-searching area for effectiveness and reliability. To achieve this, we develop a novel \emph{quasi-optimal learning algorithm}, which can be easily optimized in off-policy settings with guaranteed convergence under general function approximations. Theoretically, we analyze the consistency, sample complexity, adaptability, and convergence of the proposed algorithm. We evaluate our algorithm with comprehensive simulated experiments and a dose suggestion real application to Ohio Type 1 diabetes dataset.  ( 2 min )
    Improving Deep Regression with Ordinal Entropy. (arXiv:2301.08915v1 [cs.CV])
    In computer vision, it is often observed that formulating regression problems as a classification task often yields better performance. We investigate this curious phenomenon and provide a derivation to show that classification, with the cross-entropy loss, outperforms regression with a mean squared error loss in its ability to learn high-entropy feature representations. Based on the analysis, we propose an ordinal entropy loss to encourage higher-entropy feature spaces while maintaining ordinal relationships to improve the performance of regression tasks. Experiments on synthetic and real-world regression tasks demonstrate the importance and benefits of increasing entropy for regression.  ( 2 min )
    Developing Hybrid Machine Learning Models to Assign Health Score to Railcar Fleets for Optimal Decision Making. (arXiv:2301.08877v1 [cs.LG])
    A large amount of data is generated during the operation of a railcar fleet, which can easily lead to dimensional disaster and reduce the resiliency of the railcar network. To solve these issues and offer predictive maintenance, this research introduces a hybrid fault diagnosis expert system method that combines density-based spatial clustering of applications with noise (DBSCAN) and principal component analysis (PCA). Firstly, the DBSCAN method is used to cluster categorical data that are similar to one another within the same group. Secondly, PCA algorithm is applied to reduce the dimensionality of the data and eliminate redundancy in order to improve the accuracy of fault diagnosis. Finally, we explain the engineered features and evaluate the selected models by using the Gain Chart and Area Under Curve (AUC) metrics. We use the hybrid expert system model to enhance maintenance planning decisions by assigning a health score to the railcar system of the North American Railcar Owner (NARO). According to the experimental results, our expert model can detect 96.4% of failures within 50% of the sample. This suggests that our method is effective at diagnosing failures in railcars fleet.  ( 2 min )
    Statistical Theory of Differentially Private Marginal-based Data Synthesis Algorithms. (arXiv:2301.08844v1 [cs.LG])
    Marginal-based methods achieve promising performance in the synthetic data competition hosted by the National Institute of Standards and Technology (NIST). To deal with high-dimensional data, the distribution of synthetic data is represented by a probabilistic graphical model (e.g., a Bayesian network), while the raw data distribution is approximated by a collection of low-dimensional marginals. Differential privacy (DP) is guaranteed by introducing random noise to each low-dimensional marginal distribution. Despite its promising performance in practice, the statistical properties of marginal-based methods are rarely studied in the literature. In this paper, we study DP data synthesis algorithms based on Bayesian networks (BN) from a statistical perspective. We establish a rigorous accuracy guarantee for BN-based algorithms, where the errors are measured by the total variation (TV) distance or the $L^2$ distance. Related to downstream machine learning tasks, an upper bound for the utility error of the DP synthetic data is also derived. To complete the picture, we establish a lower bound for TV accuracy that holds for every $\epsilon$-DP synthetic data generator.  ( 2 min )
    A Communication-Efficient Adaptive Algorithm for Federated Learning under Cumulative Regret. (arXiv:2301.08869v1 [cs.LG])
    We consider the problem of online stochastic optimization in a distributed setting with $M$ clients connected through a central server. We develop a distributed online learning algorithm that achieves order-optimal cumulative regret with low communication cost measured in the total number of bits transmitted over the entire learning horizon. This is in contrast to existing studies which focus on the offline measure of simple regret for learning efficiency. The holistic measure for communication cost also departs from the prevailing approach that \emph{separately} tackles the communication frequency and the number of bits in each communication round.  ( 2 min )
    Rationalization for Explainable NLP: A Survey. (arXiv:2301.08912v1 [cs.CL])
    Recent advances in deep learning have improved the performance of many Natural Language Processing (NLP) tasks such as translation, question-answering, and text classification. However, this improvement comes at the expense of model explainability. Black-box models make it difficult to understand the internals of a system and the process it takes to arrive at an output. Numerical (LIME, Shapley) and visualization (saliency heatmap) explainability techniques are helpful; however, they are insufficient because they require specialized knowledge. These factors led rationalization to emerge as a more accessible explainable technique in NLP. Rationalization justifies a model's output by providing a natural language explanation (rationale). Recent improvements in natural language generation have made rationalization an attractive technique because it is intuitive, human-comprehensible, and accessible to non-technical users. Since rationalization is a relatively new field, it is disorganized. As the first survey, rationalization literature in NLP from 2007-2022 is analyzed. This survey presents available methods, explainable evaluations, code, and datasets used across various NLP tasks that use rationalization. Further, a new subfield in Explainable AI (XAI), namely, Rational AI (RAI), is introduced to advance the current state of rationalization. A discussion on observed insights, challenges, and future directions is provided to point to promising research opportunities.  ( 2 min )
    Federated Recommendation with Additive Personalization. (arXiv:2301.09109v1 [cs.LG])
    With rising concerns about privacy, developing recommendation systems in a federated setting become a new paradigm to develop next-generation Internet service architecture. However, existing approaches are usually derived from a distributed recommendation framework with an additional mechanism for privacy protection, thus most of them fail to fully exploit personalization in the new context of federated recommendation settings. In this paper, we propose a novel approach called Federated Recommendation with Additive Personalization (FedRAP) to enhance recommendation by learning user embedding and the user's personal view of item embeddings. Specifically, the proposed additive personalization is to add a personalized item embedding to a sparse global item embedding aggregated from all users. Moreover, a curriculum learning mechanism has been applied for additive personalization on item embeddings by gradually increasing regularization weights to mitigate the performance degradation caused by large variances among client-specific item embeddings. A unified formulation has been proposed with a sparse regularization of global item embeddings for reducing communication overhead. Experimental results on four real-world recommendation datasets demonstrate the effectiveness of FedRAP.  ( 2 min )
    Slice Transformer and Self-supervised Learning for 6DoF Localization in 3D Point Cloud Maps. (arXiv:2301.08957v1 [cs.CV])
    Precise localization is critical for autonomous vehicles. We present a self-supervised learning method that employs Transformers for the first time for the task of outdoor localization using LiDAR data. We propose a pre-text task that reorganizes the slices of a $360^\circ$ LiDAR scan to leverage its axial properties. Our model, called Slice Transformer, employs multi-head attention while systematically processing the slices. To the best of our knowledge, this is the first instance of leveraging multi-head attention for outdoor point clouds. We additionally introduce the Perth-WA dataset, which provides a large-scale LiDAR map of Perth city in Western Australia, covering $\sim$4km$^2$ area. Localization annotations are provided for Perth-WA. The proposed localization method is thoroughly evaluated on Perth-WA and Appollo-SouthBay datasets. We also establish the efficacy of our self-supervised learning approach for the common downstream task of object classification using ModelNet40 and ScanNN datasets. The code and Perth-WA data will be publicly released.  ( 2 min )
    Estimation of Sea State Parameters from Ship Motion Responses Using Attention-based Neural Networks. (arXiv:2301.08949v1 [cs.LG])
    On-site estimation of sea state parameters is crucial for ship navigation systems' accuracy, stability, and efficiency. Extensive research has been conducted on model-based estimating methods utilizing only ship motion responses. Model-free approaches based on machine learning (ML) have recently gained popularity, and estimation from time-series of ship motion responses using deep learning (DL) methods has given promising results. Accordingly, in this study, we apply the novel, attention-based neural network (AT-NN) for estimating sea state parameters (wave height, zero-crossing period, and relative wave direction) from raw time-series data of ship pitch, heave, and roll motions. Despite using reduced input data, it has been successfully demonstrated that the proposed approaches by modified state-of-the-art techniques (based on convolutional neural networks (CNN) for regression, multivariate long short-term memory CNN, and sliding puzzle neural network) reduced estimation MSE by 23% and MAE by 16% compared to the original methods. Furthermore, the proposed technique based on AT-NN outperformed all tested methods (original and enhanced), reducing estimation MSE by up to 94% and MAE by up to 70%. Finally, we also proposed a novel approach for interpreting the uncertainty estimation of neural network outputs based on the Monte-Carlo dropout method to enhance the model's trustworthiness.  ( 2 min )
    Soft Sensing Regression Model: from Sensor to Wafer Metrology Forecasting. (arXiv:2301.08974v1 [cs.LG])
    The semiconductor industry is one of the most technology-evolving and capital-intensive market sectors. Effective inspection and metrology are necessary to improve product yield, increase product quality and reduce costs. In recent years, many semiconductor manufacturing equipments are equipped with sensors to facilitate real-time monitoring of the production process. These production-state and equipment-state sensor data provide an opportunity to practice machine-learning technologies in various domains, such as anomaly/fault detection, maintenance scheduling, quality prediction, etc. In this work, we focus on the task of soft sensing regression, which uses sensor data to predict impending inspection measurements that used to be measured in wafer inspection and metrology systems. We proposed an LSTM-based regressor and designed two loss functions for model training. Although engineers may look at our prediction errors in a subjective manner, a new piece-wise evaluation metric was proposed for assessing model accuracy in a mathematical way. The experimental results demonstrated that the proposed model can achieve accurate and early prediction of various types of inspections in complicated manufacturing processes.  ( 2 min )
    Bayesian Hierarchical Models for Counterfactual Estimation. (arXiv:2301.08833v1 [cs.LG])
    Counterfactual explanations utilize feature perturbations to analyze the outcome of an original decision and recommend an actionable recourse. We argue that it is beneficial to provide several alternative explanations rather than a single point solution and propose a probabilistic paradigm to estimate a diverse set of counterfactuals. Specifically, we treat the perturbations as random variables endowed with prior distribution functions. This allows sampling multiple counterfactuals from the posterior density, with the added benefit of incorporating inductive biases, preserving domain specific constraints and quantifying uncertainty in estimates. More importantly, we leverage Bayesian hierarchical modeling to share information across different subgroups of a population, which can both improve robustness and measure fairness. A gradient based sampler with superior convergence characteristics efficiently computes the posterior samples. Experiments across several datasets demonstrate that the counterfactuals estimated using our approach are valid, sparse, diverse and feasible.  ( 2 min )
    Dense RGB SLAM with Neural Implicit Maps. (arXiv:2301.08930v1 [cs.CV])
    There is an emerging trend of using neural implicit functions for map representation in Simultaneous Localization and Mapping (SLAM). Some pioneer works have achieved encouraging results on RGB-D SLAM. In this paper, we present a dense RGB SLAM method with neural implicit map representation. To reach this challenging goal without depth input, we introduce a hierarchical feature volume to facilitate the implicit map decoder. This design effectively fuses shape cues across different scales to facilitate map reconstruction. Our method simultaneously solves the camera motion and the neural implicit map by matching the rendered and input video frames. To facilitate optimization, we further propose a photometric warping loss in the spirit of multi-view stereo to better constrain the camera pose and scene geometry. We evaluate our method on commonly used benchmarks and compare it with modern RGB and RGB-D SLAM systems. Our method achieves favorable results than previous methods and even surpasses some recent RGB-D SLAM methods. Our source code will be publicly available.  ( 2 min )
    SPEC5G: A Dataset for 5G Cellular Network Protocol Analysis. (arXiv:2301.09201v1 [cs.IR])
    5G is the 5th generation cellular network protocol. It is the state-of-the-art global wireless standard that enables an advanced kind of network designed to connect virtually everyone and everything with increased speed and reduced latency. Therefore, its development, analysis, and security are critical. However, all approaches to the 5G protocol development and security analysis, e.g., property extraction, protocol summarization, and semantic analysis of the protocol specifications and implementations are completely manual. To reduce such manual effort, in this paper, we curate SPEC5G the first-ever public 5G dataset for NLP research. The dataset contains 3,547,586 sentences with 134M words, from 13094 cellular network specifications and 13 online websites. By leveraging large-scale pre-trained language models that have achieved state-of-the-art results on NLP tasks, we use this dataset for security-related text classification and summarization. Security-related text classification can be used to extract relevant security-related properties for protocol testing. On the other hand, summarization can help developers and practitioners understand the high level of the protocol, which is itself a daunting task. Our results show the value of our 5G-centric dataset in 5G protocol analysis automation. We believe that SPEC5G will enable a new research direction into automatic analyses for the 5G cellular network protocol and numerous related downstream tasks. Our data and code are publicly available.
    Is Signed Message Essential for Graph Neural Networks?. (arXiv:2301.08918v1 [cs.LG])
    Message-passing Graph Neural Networks (GNNs), which collect information from adjacent nodes, achieve satisfying results on homophilic graphs. However, their performances are dismal in heterophilous graphs, and many researchers have proposed a plethora of schemes to solve this problem. Especially, flipping the sign of edges is rooted in a strong theoretical foundation, and attains significant performance enhancements. Nonetheless, previous analyses assume a binary class scenario and they may suffer from confined applicability. This paper extends the prior understandings to multi-class scenarios and points out two drawbacks: (1) the sign of multi-hop neighbors depends on the message propagation paths and may incur inconsistency, (2) it also increases the prediction uncertainty (e.g., conflict evidence) which can impede the stability of the algorithm. Based on the theoretical understanding, we introduce a novel strategy that is applicable to multi-class graphs. The proposed scheme combines confidence calibration to secure robustness while reducing uncertainty. We show the efficacy of our theorem through extensive experiments on six benchmark graph datasets.  ( 2 min )
    On the Algebraic Properties of Flame Graphs. (arXiv:2301.08941v1 [cs.SE])
    Flame graphs are a popular way of representing profiling data. In this paper we propose a possible mathematical definition of flame graphs. In doing so, we gain some interesting algebraic properties almost for free, which in turn allow us to define some operations that can allow to perform an in-depth performance regression analysis. The typical documented use of a flame graph is via its graphical representation, whereby one scans the picture for the largest plateaux. Whilst this method is effective at finding the main sources of performance issues, it leaves quite a large amount of data potentially unused. By combining a mathematical precise definition of flame graphs with some statistical methods we show how to generalise this visual procedure and make the best of the full set of collected profiling data.  ( 2 min )
    ScaDLES: Scalable Deep Learning over Streaming data at the Edge. (arXiv:2301.08897v1 [cs.DC])
    Distributed deep learning (DDL) training systems are designed for cloud and data-center environments that assumes homogeneous compute resources, high network bandwidth, sufficient memory and storage, as well as independent and identically distributed (IID) data across all nodes. However, these assumptions don't necessarily apply on the edge, especially when training neural networks on streaming data in an online manner. Computing on the edge suffers from both systems and statistical heterogeneity. Systems heterogeneity is attributed to differences in compute resources and bandwidth specific to each device, while statistical heterogeneity comes from unbalanced and skewed data on the edge. Different streaming-rates among devices can be another source of heterogeneity when dealing with streaming data. If the streaming rate is lower than training batch-size, device needs to wait until enough samples have streamed in before performing a single iteration of stochastic gradient descent (SGD). Thus, low-volume streams act like stragglers slowing down devices with high-volume streams in synchronous training. On the other hand, data can accumulate quickly in the buffer if the streaming rate is too high and the devices can't train at line-rate. In this paper, we introduce ScaDLES to efficiently train on streaming data at the edge in an online fashion, while also addressing the challenges of limited bandwidth and training with non-IID data. We empirically show that ScaDLES converges up to 3.29 times faster compared to conventional distributed SGD.
    Cellular Network Speech Enhancement: Removing Background and Transmission Noise. (arXiv:2301.09027v1 [cs.SD])
    The primary objective of speech enhancement is to reduce background noise while preserving the target's speech. A common dilemma occurs when a speaker is confined to a noisy environment and receives a call with high background and transmission noise. To address this problem, the Deep Noise Suppression (DNS) Challenge focuses on removing the background noise with the next-generation deep learning models to enhance the target's speech; however, researchers fail to consider Voice Over IP (VoIP) applications their transmission noise. Focusing on Google Meet and its cellular application, our work achieves state-of-the-art performance on the Google Meet To Phone Track of the VoIP DNS Challenge. This paper demonstrates how to beat industrial performance and achieve 1.92 PESQ and 0.88 STOI, as well as superior acoustic fidelity, perceptual quality, and intelligibility in various metrics.
    Ti-MAE: Self-Supervised Masked Time Series Autoencoders. (arXiv:2301.08871v1 [cs.LG])
    Multivariate Time Series forecasting has been an increasingly popular topic in various applications and scenarios. Recently, contrastive learning and Transformer-based models have achieved good performance in many long-term series forecasting tasks. However, there are still several issues in existing methods. First, the training paradigm of contrastive learning and downstream prediction tasks are inconsistent, leading to inaccurate prediction results. Second, existing Transformer-based models which resort to similar patterns in historical time series data for predicting future values generally induce severe distribution shift problems, and do not fully leverage the sequence information compared to self-supervised methods. To address these issues, we propose a novel framework named Ti-MAE, in which the input time series are assumed to follow an integrate distribution. In detail, Ti-MAE randomly masks out embedded time series data and learns an autoencoder to reconstruct them at the point-level. Ti-MAE adopts mask modeling (rather than contrastive learning) as the auxiliary task and bridges the connection between existing representation learning and generative Transformer-based methods, reducing the difference between upstream and downstream forecasting tasks while maintaining the utilization of original time series data. Experiments on several public real-world datasets demonstrate that our framework of masked autoencoding could learn strong representations directly from the raw data, yielding better performance in time series forecasting and classification tasks.  ( 2 min )
    Tier Balancing: Towards Dynamic Fairness over Underlying Causal Factors. (arXiv:2301.08987v1 [cs.LG])
    The pursuit of long-term fairness involves the interplay between decision-making and the underlying data generating process. In this paper, through causal modeling with a directed acyclic graph (DAG) on the decision-distribution interplay, we investigate the possibility of achieving long-term fairness from a dynamic perspective. We propose Tier Balancing, a technically more challenging but more natural notion to achieve in the context of long-term, dynamic fairness analysis. Different from previous fairness notions that are defined purely on observed variables, our notion goes one step further, capturing behind-the-scenes situation changes on the unobserved latent causal factors that directly carry out the influence from the current decision to the future data distribution. Under the specified dynamics, we prove that in general one cannot achieve the long-term fairness goal only through one-step interventions. Furthermore, in the effort of approaching long-term fairness, we consider the mission of "getting closer to" the long-term fairness goal and present possibility and impossibility results accordingly.
    Versatile Neural Processes for Learning Implicit Neural Representations. (arXiv:2301.08883v1 [cs.LG])
    Representing a signal as a continuous function parameterized by neural network (a.k.a. Implicit Neural Representations, INRs) has attracted increasing attention in recent years. Neural Processes (NPs), which model the distributions over functions conditioned on partial observations (context set), provide a practical solution for fast inference of continuous functions. However, existing NP architectures suffer from inferior modeling capability for complex signals. In this paper, we propose an efficient NP framework dubbed Versatile Neural Processes (VNP), which largely increases the capability of approximating functions. Specifically, we introduce a bottleneck encoder that produces fewer and informative context tokens, relieving the high computational cost while providing high modeling capability. At the decoder side, we hierarchically learn multiple global latent variables that jointly model the global structure and the uncertainty of a function, enabling our model to capture the distribution of complex signals. We demonstrate the effectiveness of the proposed VNP on a variety of tasks involving 1D, 2D and 3D signals. Particularly, our method shows promise in learning accurate INRs w.r.t. a 3D scene without further finetuning.  ( 2 min )
    Geometry-Aware Supertagging with Heterogeneous Dynamic Convolutions. (arXiv:2203.12235v3 [cs.CL] UPDATED)
    The syntactic categories of categorial grammar formalisms are structured units made of smaller, indivisible primitives, bound together by the underlying grammar's category formation rules. In the trending approach of constructive supertagging, neural models are increasingly made aware of the internal category structure, which in turn enables them to more reliably predict rare and out-of-vocabulary categories, with significant implications for grammars previously deemed too complex to find practical use. In this work, we revisit constructive supertagging from a graph-theoretic perspective, and propose a framework based on heterogeneous dynamic graph convolutions aimed at exploiting the distinctive structure of a supertagger's output space. We test our approach on a number of categorial grammar datasets spanning different languages and grammar formalisms, achieving substantial improvements over previous state of the art scores. Code will be made available at https://github.com/konstantinosKokos/dynamic-graph-supertagging  ( 2 min )
    Comparing different subgradient methods for solving convex optimization problems with functional constraints. (arXiv:2101.01045v2 [math.OC] UPDATED)
    We consider the problem of minimizing a convex, nonsmooth function subject to a closed convex constraint domain. The methods that we propose are reforms of subgradient methods based on Metel--Takeda's paper [Optimization Letters 15.4 (2021): 1491-1504] and Boyd's works [Lecture notes of EE364b, Stanford University, Spring 2013-14, pp. 1-39]. While the former has complexity $\mathcal{O}(\varepsilon^{-2r})$ for all $r> 1$, the complexity of the latter is $\mathcal{O}(\varepsilon^{-2})$. We perform some comparisons between these two methods using several test examples.  ( 2 min )
    Problem-dependent attention and effort in neural networks with application to image resolution and model selection. (arXiv:2201.01415v3 [cs.CV] UPDATED)
    This paper introduces a new ensemble-based approach to reduce the data and computation costs of accurate classification. When faced with a new test case, a low cost classifier is used first, only moving to a higher cost approach if the initial classifier does not have a high degree of confidence in its projection. This multi-stage strategy can be used with any set of classifiers and does not require additional training. The approach is first applied to reduce the amount of data required to classify test images; it is found to be effective for problems in which at least some fraction of cases can be correctly classified based upon coarser data than are typically used. For neural networks performing digit recognition, for example, the proposed approach reduces the number of bytes of data read by 60% to 85% with less than 5% reduction in accuracy. For the ImageNet data, the number of bytes read by the typical network is reduced by 20% with less than 5% reduction in accuracy -- and in some cases, the resource savings reach 40%. The second application is to reduce computational complexity, with simpler neural networks used for test cases that are easier to classify and complex networks used for more difficult cases. For classification both of digits and of ImageNet images, computation cost is reduced by as much as 82% to 89% with less than 5% reduction in accuracy. The results also show that, for situations in which computational cost is not a concern, calculating multiple models' projections and selecting the one from the most confident classifier can increase classification accuracy on ImageNet by as much as two percent over the best standalone classifier considered here.  ( 3 min )
    Learning Interpretable Models Using an Oracle. (arXiv:1906.06852v5 [cs.LG] UPDATED)
    We look at a specific aspect of model interpretability: models often need to be constrained in size for them to be considered interpretable. But smaller models also tend to have high bias. This suggests a trade-off between interpretability and accuracy. Our work addresses this by: (a) showing that learning a training distribution (often different from the test distribution) can often increase accuracy of small models, and therefore may be used as a strategy to compensate for small sizes, and (b) providing a model-agnostic algorithm to learn such training distributions. We pose the distribution learning problem as one of optimizing parameters for an Infinite Beta Mixture Model based on a Dirichlet Process, so that the held-out accuracy of a model trained on a sample from this distribution is maximized. To make computation tractable, we project the training data onto one dimension: prediction uncertainty scores as provided by a highly accurate oracle model. A Bayesian Optimizer is used for learning the parameters. Empirical results using multiple real world datasets, various oracles and interpretable models with different notions of model sizes, are presented. We observe significant relative improvements in the F1-score in most cases, occasionally seeing improvements greater than 100% over baselines. Additionally we show that the proposed algorithm provides the following benefits: (a) its a framework which allows for flexibility in implementation, (b) it can be used across feature spaces, e.g., the text classification accuracy of a Decision Tree using character n-grams is shown to improve when using a Gated Recurrent Unit as an oracle, which uses a sequence of characters as its input, (c) it can be used to train models that have a non-differentiable training loss, e.g., Decision Trees, and (d) reasonable defaults exist for most parameters of the algorithm, which makes it convenient to use.  ( 3 min )
    Pruning coupled with learning, ensembles of minimal neural networks, and future of XAI. (arXiv:2005.06284v3 [cs.LG] UPDATED)
    Pruning coupled with learning aims to optimize the neural network (NN) structure for solving specific problems. This optimization can be used for various purposes: to prevent overfitting, to save resources for implementation and training, to provide explainability of the trained NN, and many others. The minimal structure that cannot be pruned further is not unique. Ensemble of minimal structures can be used as a committee of intellectual agents that solves problems by voting. Each minimal NN presents an "empirical knowledge" about the problem and can be verbalized. The non-uniqueness of such knowledge extracted from data is an important property of data-driven Artificial Intelligence (AI). In this work, we review an approach to pruning based on the principle: What controls training should control pruning. This principle is expected to work both for artificial NN and for selection and modification of important synaptic contacts in brain. In back-propagation artificial NN learning is controlled by the gradient of loss functions. Therefore, the first order sensitivity indicators are used for pruning and the algorithms based on these indicators are reviewed. The notion of logically transparent NN was introduced. The approach was illustrated on the problem of political forecasting: predicting the results of the US presidential election. Eight minimal NN were produced that give different forecasting algorithms. The non-uniqueness of solution can be utilised by creation of expert panels (committee). Another use of NN pluralism is to identify areas of input signals where further data collection is most useful. In Conclusion, we discuss the possible future of widely advertised XAI program.  ( 3 min )
    Be More Active! Understanding the Differences between Mean and Sampled Representations of Variational Autoencoders. (arXiv:2109.12679v3 [cs.LG] UPDATED)
    The ability of Variational Autoencoders to learn disentangled representations has made them appealing for practical applications. However, their mean representations, which are generally used for downstream tasks, have recently been shown to be more correlated than their sampled counterpart, on which disentanglement is usually measured. In this paper, we refine this observation through the lens of selective posterior collapse, which states that only a subset of the learned representations, the active variables, is encoding useful information while the rest (the passive variables) is discarded. We first extend the existing definition to multiple data examples and show that active variables are equally disentangled in mean and sampled representations. Based on this extension and the pre-trained models from disentanglement lib, we then isolate the passive variables and show that they are responsible for the discrepancies between mean and sampled representations. Specifically, passive variables exhibit high correlation scores with other variables in mean representations while being fully uncorrelated in sampled ones. We thus conclude that despite what their higher correlation might suggest, mean representations are still good candidates for downstream tasks applications. However, it may be beneficial to remove their passive variables, especially when used with models sensitive to correlated features.  ( 2 min )
    Continuous-time identification of dynamic state-space models by deep subspace encoding. (arXiv:2204.09405v2 [cs.LG] UPDATED)
    Continuous-time (CT) modeling has proven to provide improved sample efficiency and interpretability in learning the dynamical behavior of physical systems compared to discrete-time (DT) models. However, even with numerous recent developments, the CT nonlinear state-space (NL-SS) model identification problem remains to be solved in full, considering common experimental aspects such as the presence of external inputs, measurement noise, latent states, and general robustness. This paper presents a novel estimation method that addresses all these aspects and that can obtain state-of-the-art results on multiple benchmarks with compact fully connected neural networks capturing the CT dynamics. The proposed estimation method called the subspace encoder approach (SUBNET) ascertains these results by efficiently approximating the complete simulation loss by evaluating short simulations on subsections of the data, by using an encoder function to estimate the initial state for each subsection and a novel state-derivative normalization to ensure stability and good numerical conditioning of the training process. We prove that the use of subsections increases cost function smoothness together with the necessary requirements for the existence of the encoder function and we show that the proposed state-derivative normalization is essential for reliable estimation of CT NL-SS models.  ( 2 min )
    Explainable Multilayer Graph Neural Network for Cancer Gene Prediction. (arXiv:2301.08831v1 [cs.LG])
    The identification of cancer genes is a critical, yet challenging problem in cancer genomics research. Recently, several computational methods have been developed to address this issue, including deep neural networks. However, these methods fail to exploit the multilayered gene-gene interactions and provide little to no explanation for their predictions. Results: In this study, we propose an Explainable Multilayer Graph Neural Network (EMGNN) approach to identify cancer genes, by leveraging multiple gene-gene interaction networks and multi-omics data. Compared to conventional graph learning methods, EMGNN learned complementary information in multiple graphs to accurately predict cancer genes. Our method consistently outperforms existing approaches while providing valuable biological insights into its predictions. We further release our novel cancer gene predictions and connect them with known cancer patterns, aiming to accelerate the progress of cancer research  ( 2 min )
    Limitations of Piecewise Linearity for Efficient Robustness Certification. (arXiv:2301.08842v1 [cs.LG])
    Certified defenses against small-norm adversarial examples have received growing attention in recent years; though certified accuracies of state-of-the-art methods remain far below their non-robust counterparts, despite the fact that benchmark datasets have been shown to be well-separated at far larger radii than the literature generally attempts to certify. In this work, we offer insights that identify potential factors in this performance gap. Specifically, our analysis reveals that piecewise linearity imposes fundamental limitations on the tightness of leading certification techniques. These limitations are felt in practical terms as a greater need for capacity in models hoped to be certified efficiently. Moreover, this is in addition to the capacity necessary to learn a robust boundary, studied in prior work. However, we argue that addressing the limitations of piecewise linearity through scaling up model capacity may give rise to potential difficulties -- particularly regarding robust generalization -- therefore, we conclude by suggesting that developing smooth activation functions may be the way forward for advancing the performance of certified neural networks.  ( 2 min )
    Dynamic MLP for MRI Reconstruction. (arXiv:2301.08868v1 [eess.IV])
    As convolutional neural networks (CNN) become the most successful reconstruction technique for accelerated Magnetic Resonance Imaging (MRI), CNN reaches its limit on image quality especially in sharpness. Further improvement on image quality often comes at massive computational costs, hindering their practicability in the clinic setting. MRI reconstruction is essentially a deconvolution problem, which demands long-distance information that is difficult to be captured by CNNs with small convolution kernels. The multi-layer perceptron (MLP) is able to model such long-distance information, but it restricts a fixed input size while the reconstruction of images in flexible resolutions is required in the clinic setting. In this paper, we proposed a hybrid CNN and MLP reconstruction strategy, featured by dynamic MLP (dMLP) that accepts arbitrary image sizes. Experiments were conducted using 3D multi-coil MRI. Our results suggested the proposed dMLP can improve image sharpness compared to its pure CNN counterpart, while costing minor additional GPU memory and computation time. We further compared the proposed dMLP with CNNs using large kernels and studied pure MLP-based reconstruction using a stack of 1D dMLPs, as well as its CNN counterpart using only 1D convolutions. We observed the enlarged receptive field has noticeably improved image quality, while simply using CNN with a large kernel leads to difficulties in training. Noticeably, the pure MLP-based method has been outperformed by CNN-involved methods, which matches the observations in other computer vision tasks for natural images.  ( 2 min )
    Computing equilibria by minimizing exploitability with best-response ensembles. (arXiv:2301.08830v1 [cs.GT])
    In this paper, we study the problem of computing an approximate Nash equilibrium of a continuous game. Such games naturally model many situations involving space, time, money, and other fine-grained resources or quantities. The standard measure of the closeness of a strategy profile to Nash equilibrium is exploitability, which measures how much utility players can gain from changing their strategy unilaterally. We introduce a new equilibrium-finding method that minimizes an approximation of the exploitability. This approximation employs a best-response ensemble for each player that maintains multiple candidate best responses for that player. In each iteration, the best-performing element of each ensemble is used in a gradient-based scheme to update the current strategy profile. The strategy profile and best-response ensembles are simultaneously trained to minimize and maximize the approximate exploitability, respectively. Experiments on a suite of benchmark games show that it outperforms previous methods.  ( 2 min )
    Towards Flexibility and Interpretability of Gaussian Process State-Space Model. (arXiv:2301.08843v1 [cs.LG])
    Gaussian process state-space model (GPSSM) has attracted much attention over the past decade. However, the model representation power of GPSSM is far from satisfactory. Most GPSSM works rely on the standard Gaussian process (GP) with a preliminary kernel, such as squared exponential (SE) kernel and Mat\'{e}rn kernel, which limit the model representation power and its application in complex scenarios. To address this issue, this paper proposes a novel class of probabilistic state-space model named TGPSSM that enriches the GP priors in the standard GPSSM through parametric normalizing flow, making the state-space model more flexible and expressive. In addition, by inheriting the advantages of sparse representation of GP models, we propose a scalable and interpretable variational learning algorithm to learn the TGPSSM and infer the latent dynamics simultaneously. By integrating a constrained optimization framework and explicitly constructing a non-Gaussian state variational distribution, the proposed learning algorithm enables the TGPSSM to significantly improve the capabilities of state space representation and model inference. Experimental results based on various synthetic and real datasets corroborate that the proposed TGPSSM yields superior learning and inference performance compared to several state-of-the-art methods. The accompanying source code is available at https://github.com/zhidilin/TGPSSM.  ( 2 min )
    Split Ways: Privacy-Preserving Training of Encrypted Data Using Split Learning. (arXiv:2301.08778v1 [cs.CR])
    Split Learning (SL) is a new collaborative learning technique that allows participants, e.g. a client and a server, to train machine learning models without the client sharing raw data. In this setting, the client initially applies its part of the machine learning model on the raw data to generate activation maps and then sends them to the server to continue the training process. Previous works in the field demonstrated that reconstructing activation maps could result in privacy leakage of client data. In addition to that, existing mitigation techniques that overcome the privacy leakage of SL prove to be significantly worse in terms of accuracy. In this paper, we improve upon previous works by constructing a protocol based on U-shaped SL that can operate on homomorphically encrypted data. More precisely, in our approach, the client applies Homomorphic Encryption (HE) on the activation maps before sending them to the server, thus protecting user privacy. This is an important improvement that reduces privacy leakage in comparison to other SL-based works. Finally, our results show that, with the optimum set of parameters, training with HE data in the U-shaped SL setting only reduces accuracy by 2.65% compared to training on plaintext. In addition, raw training data privacy is preserved.  ( 2 min )
    ManyDG: Many-domain Generalization for Healthcare Applications. (arXiv:2301.08834v1 [cs.LG])
    The vast amount of health data has been continuously collected for each patient, providing opportunities to support diverse healthcare predictive tasks such as seizure detection and hospitalization prediction. Existing models are mostly trained on other patients data and evaluated on new patients. Many of them might suffer from poor generalizability. One key reason can be overfitting due to the unique information related to patient identities and their data collection environments, referred to as patient covariates in the paper. These patient covariates usually do not contribute to predicting the targets but are often difficult to remove. As a result, they can bias the model training process and impede generalization. In healthcare applications, most existing domain generalization methods assume a small number of domains. In this paper, considering the diversity of patient covariates, we propose a new setting by treating each patient as a separate domain (leading to many domains). We develop a new domain generalization method ManyDG, that can scale to such many-domain problems. Our method identifies the patient domain covariates by mutual reconstruction and removes them via an orthogonal projection step. Extensive experiments show that ManyDG can boost the generalization performance on multiple real-world healthcare tasks (e.g., 3.7% Jaccard improvements on MIMIC drug recommendation) and support realistic but challenging settings such as insufficient data and continuous learning.  ( 2 min )
    AQuaMaM: An Autoregressive, Quaternion Manifold Model for Rapidly Estimating Complex SO(3) Distributions. (arXiv:2301.08838v1 [cs.LG])
    Accurately modeling complex, multimodal distributions is necessary for optimal decision-making, but doing so for rotations in three-dimensions, i.e., the SO(3) group, is challenging due to the curvature of the rotation manifold. The recently described implicit-PDF (IPDF) is a simple, elegant, and effective approach for learning arbitrary distributions on SO(3) up to a given precision. However, inference with IPDF requires $N$ forward passes through the network's final multilayer perceptron (where $N$ places an upper bound on the likelihood that can be calculated by the model), which is prohibitively slow for those without the computational resources necessary to parallelize the queries. In this paper, I introduce AQuaMaM, a neural network capable of both learning complex distributions on the rotation manifold and calculating exact likelihoods for query rotations in a single forward pass. Specifically, AQuaMaM autoregressively models the projected components of unit quaternions as mixtures of uniform distributions that partition their geometrically-restricted domain of values. When trained on an "infinite" toy dataset with ambiguous viewpoints, AQuaMaM rapidly converges to a sampling distribution closely matching the true data distribution. In contrast, the sampling distribution for IPDF dramatically diverges from the true data distribution, despite IPDF approaching its theoretical minimum evaluation loss during training. When trained on a constructed dataset of 500,000 renders of a die in different rotations, AQuaMaM reaches a test log-likelihood 14% higher than IPDF. Further, compared to IPDF, AQuaMaM uses 24% fewer parameters, has a prediction throughput 52$\times$ faster on a single GPU, and converges in a similar amount of time during training.  ( 2 min )
    Compact Optimization Learning for AC Optimal Power Flow. (arXiv:2301.08840v1 [cs.LG])
    This paper reconsiders end-to-end learning approaches to the Optimal Power Flow (OPF). Existing methods, which learn the input/output mapping of the OPF, suffer from scalability issues due to the high dimensionality of the output space. This paper first shows that the space of optimal solutions can be significantly compressed using principal component analysis (PCA). It then proposes Compact Learning, a new method that learns in a subspace of the principal components before translating the vectors into the original output space. This compression reduces the number of trainable parameters substantially, improving scalability and effectiveness. Compact Learning is evaluated on a variety of test cases from the PGLib with up to 30,000 buses. The paper also shows that the output of Compact Learning can be used to warm-start an exact AC solver to restore feasibility, while bringing significant speed-ups.  ( 2 min )
    Optimized learned entropy coding parameters for practical neural-based image and video compression. (arXiv:2301.08752v1 [eess.IV])
    Neural-based image and video codecs are significantly more power-efficient when weights and activations are quantized to low-precision integers. While there are general-purpose techniques for reducing quantization effects, large losses can occur when specific entropy coding properties are not considered. This work analyzes how entropy coding is affected by parameter quantizations, and provides a method to minimize losses. It is shown that, by using a certain type of coding parameters to be learned, uniform quantization becomes practically optimal, also simplifying the minimization of code memory requirements. The mathematical properties of the new representation are presented, and its effectiveness is demonstrated by coding experiments, showing that good results can be obtained with precision as low as 4~bits per network output, and practically no loss with 8~bits.  ( 2 min )
    GBOSE: Generalized Bandit Orthogonalized Semiparametric Estimation. (arXiv:2301.08781v1 [cs.LG])
    In sequential decision-making scenarios i.e., mobile health recommendation systems revenue management contextual multi-armed bandit algorithms have garnered attention for their performance. But most of the existing algorithms are built on the assumption of a strictly parametric reward model mostly linear in nature. In this work we propose a new algorithm with a semi-parametric reward model with state-of-the-art complexity of upper bound on regret amongst existing semi-parametric algorithms. Our work expands the scope of another representative algorithm of state-of-the-art complexity with a similar reward model by proposing an algorithm built upon the same action filtering procedures but provides explicit action selection distribution for scenarios involving more than two arms at a particular time step while requiring fewer computations. We derive the said complexity of the upper bound on regret and present simulation results that affirm our methods superiority out of all prevalent semi-parametric bandit algorithms for cases involving over two arms.  ( 2 min )
    Active Learning of Piecewise Gaussian Process Surrogates. (arXiv:2301.08789v1 [cs.LG])
    Active learning of Gaussian process (GP) surrogates has been useful for optimizing experimental designs for physical/computer simulation experiments, and for steering data acquisition schemes in machine learning. In this paper, we develop a method for active learning of piecewise, Jump GP surrogates. Jump GPs are continuous within, but discontinuous across, regions of a design space, as required for applications spanning autonomous materials design, configuration of smart factory systems, and many others. Although our active learning heuristics are appropriated from strategies originally designed for ordinary GPs, we demonstrate that additionally accounting for model bias, as opposed to the usual model uncertainty, is essential in the Jump GP context. Toward that end, we develop an estimator for bias and variance of Jump GP models. Illustrations, and evidence of the advantage of our proposed methods, are provided on a suite of synthetic benchmarks, and real-simulation experiments of varying complexity.  ( 2 min )
    Estimation of mitral valve hinge point coordinates -- deep neural net for echocardiogram segmentation. (arXiv:2301.08782v1 [eess.IV])
    Cardiac image segmentation is a powerful tool in regard to diagnostics and treatment of cardiovascular diseases. Purely feature-based detection of anatomical structures like the mitral valve is a laborious task due to specifically required feature engineering and is especially challenging in echocardiograms, because of their inherently low contrast and blurry boundaries between some anatomical structures. With the publication of further annotated medical datasets and the increase in GPU processing power, deep learning-based methods in medical image segmentation became more feasible in the past years. We propose a fully automatic detection method for mitral valve hinge points, which uses a U-Net based deep neural net to segment cardiac chambers in echocardiograms in a first step, and subsequently extracts the mitral valve hinge points from the resulting segmentations in a second step. Results measured with this automatic detection method were compared to reference coordinate values, which with median absolute hinge point coordinate errors of 1.35 mm for the x- (15-85 percentile range: [0.3 mm; 3.15 mm]) and 0.75 mm for the y- coordinate (15-85 percentile range: [0.15 mm; 1.88 mm]).  ( 2 min )
    Domain-agnostic and Multi-level Evaluation of Generative Models. (arXiv:2301.08750v1 [cs.LG])
    While the capabilities of generative models heavily improved in different domains (images, text, graphs, molecules, etc.), their evaluation metrics largely remain based on simplified quantities or manual inspection with limited practicality. To this end, we propose a framework for Multi-level Performance Evaluation of Generative mOdels (MPEGO), which could be employed across different domains. MPEGO aims to quantify generation performance hierarchically, starting from a sub-feature-based low-level evaluation to a global features-based high-level evaluation. MPEGO offers great customizability as the employed features are entirely user-driven and can thus be highly domain/problem-specific while being arbitrarily complex (e.g., outcomes of experimental procedures). We validate MPEGO using multiple generative models across several datasets from the material discovery domain. An ablation study is conducted to study the plausibility of intermediate steps in MPEGO. Results demonstrate that MPEGO provides a flexible, user-driven, and multi-level evaluation framework, with practical insights on the generation quality. The framework, source code, and experiments will be available at https://github.com/GT4SD/mpego.  ( 2 min )
    Towards Understanding How Self-training Tolerates Data Backdoor Poisoning. (arXiv:2301.08751v1 [cs.LG])
    Recent studies on backdoor attacks in model training have shown that polluting a small portion of training data is sufficient to produce incorrect manipulated predictions on poisoned test-time data while maintaining high clean accuracy in downstream tasks. The stealthiness of backdoor attacks has imposed tremendous defense challenges in today's machine learning paradigm. In this paper, we explore the potential of self-training via additional unlabeled data for mitigating backdoor attacks. We begin by making a pilot study to show that vanilla self-training is not effective in backdoor mitigation. Spurred by that, we propose to defend the backdoor attacks by leveraging strong but proper data augmentations in the self-training pseudo-labeling stage. We find that the new self-training regime help in defending against backdoor attacks to a great extent. Its effectiveness is demonstrated through experiments for different backdoor triggers on CIFAR-10 and a combination of CIFAR-10 with an additional unlabeled 500K TinyImages dataset. Finally, we explore the direction of combining self-supervised representation learning with self-training for further improvement in backdoor defense.  ( 2 min )
    CSwin2SR: Circular Swin2SR for Compressed Image Super-Resolution. (arXiv:2301.08749v1 [eess.IV])
    Closed-loop negative feedback mechanism is extensively utilized in automatic control systems and brings about extraordinary dynamic and static performance. In order to further improve the reconstruction capability of current methods of compressed image super-resolution, a circular Swin2SR (CSwin2SR) approach is proposed. The CSwin2SR contains a serial Swin2SR for initial super-resolution reestablishment and circular Swin2SR for enhanced super-resolution reestablishment. Simulated experimental results show that the proposed CSwin2SR dramatically outperforms the classical Swin2SR in the capacity of super-resolution recovery. On DIV2K test and valid datasets, the average increment of PSNR is greater than 1dB and the related average increment of SSIM is greater than 0.006.  ( 2 min )
    Towards a Measure of Trustworthiness to Evaluate CNNs During Operation. (arXiv:2301.08839v1 [cs.LG])
    Due to black box nature of Convolutional neural networks (CNNs), the continuous validation of CNN classifiers' during operation is infeasible. As a result this makes it difficult for developers or regulators to gain confidence in the deployment of autonomous systems employing CNNs. We introduce the trustworthiness in classification score (TCS), a metric to assist with overcoming this challenge. The metric quantifies the trustworthiness in a prediction by checking for the existence of certain features in the predictions made by the CNN. A case study on persons detection is used to to demonstrate our method and the usage of TCS.  ( 2 min )
    Causal Inference under Data Restrictions. (arXiv:2301.08788v1 [stat.ME])
    This dissertation focuses on modern causal inference under uncertainty and data restrictions, with applications to neoadjuvant clinical trials, distributed data networks, and robust individualized decision making. In the first project, we propose a method under the principal stratification framework to identify and estimate the average treatment effects on a binary outcome, conditional on the counterfactual status of a post-treatment intermediate response. Under mild assumptions, the treatment effect of interest can be identified. We extend the approach to address censored outcome data. The proposed method is applied to a neoadjuvant clinical trial and its performance is evaluated via simulation studies. In the second project, we propose a tree-based model averaging approach to improve the estimation accuracy of conditional average treatment effects at a target site by leveraging models derived from other potentially heterogeneous sites, without them sharing subject-level data. The performance of this approach is demonstrated by a study of the causal effects of oxygen therapy on hospital survival rates and backed up by comprehensive simulations. In the third project, we propose a robust individualized decision learning framework with sensitive variables to improve the worst-case outcomes of individuals caused by sensitive variables that are unavailable at the time of decision. Unlike most existing work that uses mean-optimal objectives, we propose a robust learning framework by finding a newly defined quantile- or infimum-optimal decision rule. From a causal perspective, we also generalize the classic notion of (average) fairness to conditional fairness for individual subjects. The reliable performance of the proposed method is demonstrated through synthetic experiments and three real-data applications.  ( 2 min )
    An Automated Vulnerability Detection Framework for Smart Contracts. (arXiv:2301.08824v1 [cs.CR])
    With the increase of the adoption of blockchain technology in providing decentralized solutions to various problems, smart contracts have become more popular to the point that billions of US Dollars are currently exchanged every day through such technology. Meanwhile, various vulnerabilities in smart contracts have been exploited by attackers to steal cryptocurrencies worth millions of dollars. The automatic detection of smart contract vulnerabilities therefore is an essential research problem. Existing solutions to this problem particularly rely on human experts to define features or different rules to detect vulnerabilities. However, this often causes many vulnerabilities to be ignored, and they are inefficient in detecting new vulnerabilities. In this study, to overcome such challenges, we propose a framework to automatically detect vulnerabilities in smart contracts on the blockchain. More specifically, first, we utilize novel feature vector generation techniques from bytecode of smart contract since the source code of smart contracts are rarely available in public. Next, the collected vectors are fed into our novel metric learning-based deep neural network(DNN) to get the detection result. We conduct comprehensive experiments on large-scale benchmarks, and the quantitative results demonstrate the effectiveness and efficiency of our approach.  ( 2 min )
  • Open

    Probabilistic Surrogate Networks for Simulators with Unbounded Randomness. (arXiv:1910.11950v3 [cs.LG] UPDATED)
    We present a framework for automatically structuring and training fast, approximate, deep neural surrogates of stochastic simulators. Unlike traditional approaches to surrogate modeling, our surrogates retain the interpretable structure and control flow of the reference simulator. Our surrogates target stochastic simulators where the number of random variables itself can be stochastic and potentially unbounded. Our framework further enables an automatic replacement of the reference simulator with the surrogate when undertaking amortized inference. The fidelity and speed of our surrogates allow for both faster stochastic simulation and accurate and substantially faster posterior inference. Using an illustrative yet non-trivial example we show our surrogates' ability to accurately model a probabilistic program with an unbounded number of random variables. We then proceed with an example that shows our surrogates are able to accurately model a complex structure like an unbounded stack in a program synthesis example. We further demonstrate how our surrogate modeling technique makes amortized inference in complex black-box simulators an order of magnitude faster. Specifically, we do simulator-based materials quality testing, inferring safety-critical latent internal temperature profiles of composite materials undergoing curing.  ( 2 min )
    Tailoring to the Tails: Risk Measures for Fine-Grained Tail Sensitivity. (arXiv:2208.03066v2 [cs.LG] UPDATED)
    Expected risk minimization (ERM) is at the core of many machine learning systems. This means that the risk inherent in a loss distribution is summarized using a single number - its average. In this paper, we propose a general approach to construct risk measures which exhibit a desired tail sensitivity and may replace the expectation operator in ERM. Our method relies on the specification of a reference distribution with a desired tail behaviour, which is in a one-to-one correspondence to a coherent upper probability. Any risk measure, which is compatible with this upper probability, displays a tail sensitivity which is finely tuned to the reference distribution. As a concrete example, we focus on divergence risk measures based on f-divergence ambiguity sets, which are a widespread tool used to foster distributional robustness of machine learning systems. For instance, we show how ambiguity sets based on the Kullback-Leibler divergence are intricately tied to the class of subexponential random variables. We elaborate the connection of divergence risk measures and rearrangement invariant Banach norms.  ( 2 min )
    Estimating individual treatment effects under unobserved confounding using binary instruments. (arXiv:2208.08544v3 [stat.ME] UPDATED)
    Estimating conditional average treatment effects (CATEs) from observational data is relevant in many fields such as personalized medicine. However, in practice, the treatment assignment is usually confounded by unobserved variables and thus introduces bias. A remedy to remove the bias is the use of instrumental variables (IVs). Such settings are widespread in medicine (e.g., trials where the treatment assignment is used as binary IV). In this paper, we propose a novel, multiply robust machine learning framework, called MRIV, for estimating CATEs using binary IVs and thus yield an unbiased CATE estimator. Different from previous work for binary IVs, our framework estimates the CATE directly via a pseudo outcome regression. (1)~We provide a theoretical analysis where we show that our framework yields multiple robust convergence rates: our CATE estimator achieves fast convergence even if several nuisance estimators converge slowly. (2)~We further show that our framework asymptotically outperforms state-of-the-art plug-in IV methods for CATE estimation, in the sense that it achieves a faster rate of convergence if the CATE is smoother than the individual outcome surfaces. (3)~We build upon our theoretical results and propose a tailored deep neural network architecture called MRIV-Net for CATE estimation using binary IVs. Across various computational experiments, we demonstrate empirically that our MRIV-Net achieves state-of-the-art performance. To the best of our knowledge, our MRIV is the first multiply robust machine learning framework tailored to estimating CATEs in the binary IV setting.  ( 2 min )
    A Multi-Phase Approach for Product Hierarchy Forecasting in Supply Chain Management: Application to MonarchFx Inc. (arXiv:2006.08931v2 [stat.ML] UPDATED)
    Hierarchical time series demands exist in many industries and are often associated with the product, time frame, or geographic aggregations. Traditionally, these hierarchies have been forecasted using top-down, bottom-up, or middle-out approaches. The question we aim to answer is how to utilize child-level forecasts to improve parent-level forecasts in a hierarchical supply chain. Improved forecasts can be used to considerably reduce logistics costs, especially in e-commerce. We propose a novel multi-phase hierarchical (MPH) approach. Our method involves forecasting each series in the hierarchy independently using machine learning models, then combining all forecasts to allow a second phase model estimation at the parent level. Sales data from MonarchFx Inc. (a logistics solutions provider) is used to evaluate our approach and compare it to bottom-up and top-down methods. Our results demonstrate an 82-90% improvement in forecast accuracy using the proposed approach. Using the proposed method, supply chain planners can derive more accurate forecasting models to exploit the benefit of multivariate data.  ( 2 min )
    Autoencoding Hyperbolic Representation for Adversarial Generation. (arXiv:2201.12825v3 [cs.LG] UPDATED)
    With the recent advance of geometric deep learning, neural networks have been extensively used for data in non-Euclidean domains. In particular, hyperbolic neural networks have proved successful in processing hierarchical information of data. However, many hyperbolic neural networks are numerically unstable during training, which precludes using complex architectures. This crucial problem makes it difficult to build hyperbolic generative models for real and complex data. In this work, we propose a hyperbolic generative network in which we design novel architecture and layers to improve stability in training. Our proposed network contains three parts: first, a hyperbolic autoencoder (AE) that produces hyperbolic embedding for input data; second, a hyperbolic generative adversarial network (GAN) for generating the hyperbolic latent embedding of the AE from simple noise; third, a generator that inherits the decoder from the AE and the generator from the GAN. We call this network the hyperbolic AE-GAN, or HAEGAN for short. The architecture of HAEGAN fosters expressive representation in the hyperbolic space, and the specific design of layers ensures numerical stability. Experiments show that HAEGAN is able to generate complex data with state-of-the-art structure-related performance.  ( 2 min )
    Online Kernel Sliced Inverse Regression. (arXiv:2301.09516v1 [stat.CO])
    Online dimension reduction is a common method for high-dimensional streaming data processing. Online principal component analysis, online sliced inverse regression, online kernel principal component analysis and other methods have been studied in depth, but as far as we know, online supervised nonlinear dimension reduction methods have not been fully studied. In this article, an online kernel sliced inverse regression method is proposed. By introducing the approximate linear dependence condition and dictionary variable sets, we address the problem of increasing variable dimensions with the sample size in the online kernel sliced inverse regression method, and propose a reduced-order method for updating variables online. We then transform the problem into an online generalized eigen-decomposition problem, and use the stochastic optimization method to update the centered dimension reduction directions. Simulations and the real data analysis show that our method can achieve close performance to batch processing kernel sliced inverse regression.  ( 2 min )
    Convergence bounds for local least squares approximation. (arXiv:2208.10954v2 [math.NA] UPDATED)
    We consider the problem of approximating a function in a general nonlinear subset of $L^2$, when only a weighted Monte Carlo estimate of the $L^2$-norm can be computed. Of particular interest in this setting is the concept of sample complexity, the number of sample points that are necessary to achieve a prescribed error with high probability. Reasonable worst-case bounds for this quantity exist only for particular model classes, like linear spaces or sets of sparse vectors. For more general sets, like tensor networks or neural networks, the currently existing bounds are very pessimistic. By restricting the model class to a neighbourhood of the best approximation, we can derive improved worst-case bounds for the sample complexity. When the considered neighbourhood is a manifold with positive local reach, its sample complexity can be estimated by means of the sample complexities of the tangent and normal spaces and the manifold's curvature.  ( 2 min )
    Characterizing Polarization in Social Networks using the Signed Relational Latent Distance Model. (arXiv:2301.09507v1 [stat.ML])
    Graph representation learning has become a prominent tool for the characterization and understanding of the structure of networks in general and social networks in particular. Typically, these representation learning approaches embed the networks into a low-dimensional space in which the role of each individual can be characterized in terms of their latent position. A major current concern in social networks is the emergence of polarization and filter bubbles promoting a mindset of "us-versus-them" that may be defined by extreme positions believed to ultimately lead to political violence and the erosion of democracy. Such polarized networks are typically characterized in terms of signed links reflecting likes and dislikes. We propose the latent Signed relational Latent dIstance Model (SLIM) utilizing for the first time the Skellam distribution as a likelihood function for signed networks and extend the modeling to the characterization of distinct extreme positions by constraining the embedding space to polytopes. On four real social signed networks of polarization, we demonstrate that the model extracts low-dimensional characterizations that well predict friendships and animosity while providing interpretable visualizations defined by extreme positions when endowing the model with an embedding space restricted to polytopes.  ( 2 min )
    Max-Quantile Grouped Infinite-Arm Bandits. (arXiv:2210.01295v2 [stat.ML] UPDATED)
    In this paper, we consider a bandit problem in which there are a number of groups each consisting of infinitely many arms. Whenever a new arm is requested from a given group, its mean reward is drawn from an unknown reservoir distribution (different for each group), and the uncertainty in the arm's mean reward can only be reduced via subsequent pulls of the arm. The goal is to identify the infinite-arm group whose reservoir distribution has the highest $(1-\alpha)$-quantile (e.g., median if $\alpha = \frac{1}{2}$), using as few total arm pulls as possible. We introduce a two-step algorithm that first requests a fixed number of arms from each group and then runs a finite-arm grouped max-quantile bandit algorithm. We characterize both the instance-dependent and worst-case regret, and provide a matching lower bound for the latter, while discussing various strengths, weaknesses, algorithmic improvements, and potential lower bounds associated with our instance-dependent upper bounds.  ( 2 min )
    On the Convergence of the Gradient Descent Method with Stochastic Fixed-point Rounding Errors under the Polyak-Lojasiewicz Inequality. (arXiv:2301.09511v1 [stat.ML])
    When training neural networks with low-precision computation, rounding errors often cause stagnation or are detrimental to the convergence of the optimizers; in this paper we study the influence of rounding errors on the convergence of the gradient descent method for problems satisfying the Polyak-Lojasiewicz inequality. Within this context, we show that, in contrast, biased stochastic rounding errors may be beneficial since choosing a proper rounding strategy eliminates the vanishing gradient problem and forces the rounding bias in a descent direction. Furthermore, we obtain a bound on the convergence rate that is stricter than the one achieved by unbiased stochastic rounding. The theoretical analysis is validated by comparing the performances of various rounding strategies when optimizing several examples using low-precision fixed-point number formats.  ( 2 min )
    Huber-Robust Confidence Sequences. (arXiv:2301.09573v1 [math.ST])
    Confidence sequences are confidence intervals that can be sequentially tracked, and are valid at arbitrary data-dependent stopping times. This paper presents confidence sequences for a univariate mean of an unknown distribution with a known upper bound on the p-th central moment (p > 1), but allowing for (at most) {\epsilon} fraction of arbitrary distribution corruption, as in Huber's contamination model. We do this by designing new robust exponential supermartingales, and show that the resulting confidence sequences attain the optimal width achieved in the nonsequential setting. Perhaps surprisingly, the constant margin between our sequential result and the lower bound is smaller than even fixed-time robust confidence intervals based on the trimmed mean, for example. Since confidence sequences are a common tool used within A/B/n testing and bandits, these results open the door to sequential experimentation that is robust to outliers and adversarial corruptions.  ( 2 min )
    Indirect Active Learning. (arXiv:2206.01454v3 [math.ST] UPDATED)
    Traditional models of active learning assume a learner can directly manipulate or query a covariate $X$ in order to study its relationship with a response $Y$. However, if $X$ is a feature of a complex system, it may be possible only to indirectly influence $X$ by manipulating a control variable $Z$, a scenario we refer to as Indirect Active Learning. Under a nonparametric model of Indirect Active Learning with a fixed budget, we study minimax convergence rates for estimating the relationship between $X$ and $Y$ locally at a point, obtaining different rates depending on the complexities and noise levels of the relationships between $Z$ and $X$ and between $X$ and $Y$. We also identify minimax rates for passive learning under comparable assumptions. In many cases, our results show that, while there is an asymptotic benefit to active learning, this benefit is fully realized by a simple two-stage learner that runs two passive experiments in sequence.  ( 2 min )
    Prediction Errors for Penalized Regressions based on Generalized Approximate Message Passing. (arXiv:2206.12832v3 [stat.ML] UPDATED)
    We discuss the prediction accuracy of assumed statistical models in terms of prediction errors for the generalized linear model and penalized maximum likelihood methods. We derive the forms of estimators for the prediction errors, such as $C_p$ criterion, information criteria, and leave-one-out cross validation (LOOCV) error, using the generalized approximate message passing (GAMP) algorithm and replica method. These estimators coincide with each other when the number of model parameters is sufficiently small; however, there is a discrepancy between them in particular in the parameter region where the number of model parameters is larger than the data dimension. In this paper, we review the prediction errors and corresponding estimators, and discuss their differences. In the framework of GAMP, we show that the information criteria can be expressed by using the variance of the estimates. Further, we demonstrate how to approach LOOCV error from the information criteria by utilizing the expression provided by GAMP.  ( 2 min )
    Dealing with Unknown Variances in Best-Arm Identification. (arXiv:2210.00974v2 [stat.ML] UPDATED)
    The problem of identifying the best arm among a collection of items having Gaussian rewards distribution is well understood when the variances are known. Despite its practical relevance for many applications, few works studied it for unknown variances. In this paper we introduce and analyze two approaches to deal with unknown variances, either by plugging in the empirical variance or by adapting the transportation costs. In order to calibrate our two stopping rules, we derive new time-uniform concentration inequalities, which are of independent interest. Then, we illustrate the theoretical and empirical performances of our two sampling rule wrappers on Track-and-Stop and on a Top Two algorithm. Moreover, by quantifying the impact on the sample complexity of not knowing the variances, we reveal that it is rather small.  ( 2 min )
    Evaluating Synthetically Generated Data from Small Sample Sizes: An Experimental Study. (arXiv:2211.10760v3 [cs.LG] UPDATED)
    In this paper, we propose a method for measuring the similarity low sample tabular data with synthetically generated data with a larger number of samples than original. This process is also known as data augmentation. But significance levels obtained from non-parametric tests are suspect when sample size is small. Our method uses a combination of geometry, topology and robust statistics for hypothesis testing in order to compare the validity of generated data. We also compare the results with common global metric methods available in the literature for large sample size data.  ( 2 min )
    Explicit Regularization in Overparametrized Models via Noise Injection. (arXiv:2206.04613v3 [cs.LG] UPDATED)
    Injecting noise within gradient descent has several desirable features, such as smoothing and regularizing properties. In this paper, we investigate the effects of injecting noise before computing a gradient step. We demonstrate that small perturbations can induce explicit regularization for simple models based on the L1-norm, group L1-norms, or nuclear norms. However, when applied to overparametrized neural networks with large widths, we show that the same perturbations can cause variance explosion. To overcome this, we propose using independent layer-wise perturbations, which provably allow for explicit regularization without variance explosion. Our empirical results show that these small perturbations lead to improved generalization performance compared to vanilla gradient descent.  ( 2 min )
    Rethinking the Expressive Power of GNNs via Graph Biconnectivity. (arXiv:2301.09505v1 [cs.LG])
    Designing expressive Graph Neural Networks (GNNs) is a central topic in learning graph-structured data. While numerous approaches have been proposed to improve GNNs in terms of the Weisfeiler-Lehman (WL) test, generally there is still a lack of deep understanding of what additional power they can systematically and provably gain. In this paper, we take a fundamentally different perspective to study the expressive power of GNNs beyond the WL test. Specifically, we introduce a novel class of expressivity metrics via graph biconnectivity and highlight their importance in both theory and practice. As biconnectivity can be easily calculated using simple algorithms that have linear computational costs, it is natural to expect that popular GNNs can learn it easily as well. However, after a thorough review of prior GNN architectures, we surprisingly find that most of them are not expressive for any of these metrics. The only exception is the ESAN framework (Bevilacqua et al., 2022), for which we give a theoretical justification of its power. We proceed to introduce a principled and more efficient approach, called the Generalized Distance Weisfeiler-Lehman (GD-WL), which is provably expressive for all biconnectivity metrics. Practically, we show GD-WL can be implemented by a Transformer-like architecture that preserves expressiveness and enjoys full parallelizability. A set of experiments on both synthetic and real datasets demonstrates that our approach can consistently outperform prior GNN architectures.  ( 2 min )
    Sampling-based Nystr\"om Approximation and Kernel Quadrature. (arXiv:2301.09517v1 [math.NA])
    We analyze the Nystr\"om approximation of a positive definite kernel associated with a probability measure. We first prove an improved error bound for the conventional Nystr\"om approximation with i.i.d. sampling and singular-value decomposition in the continuous regime; the proof techniques are borrowed from statistical learning theory. We further introduce a refined selection of subspaces in Nystr\"om approximation with theoretical guarantees that is applicable to non-i.i.d. landmark points. Finally, we discuss their application to convex kernel quadrature and give novel theoretical guarantees as well as numerical observations.  ( 2 min )
    Learning Interpretable Models Using an Oracle. (arXiv:1906.06852v5 [cs.LG] UPDATED)
    We look at a specific aspect of model interpretability: models often need to be constrained in size for them to be considered interpretable. But smaller models also tend to have high bias. This suggests a trade-off between interpretability and accuracy. Our work addresses this by: (a) showing that learning a training distribution (often different from the test distribution) can often increase accuracy of small models, and therefore may be used as a strategy to compensate for small sizes, and (b) providing a model-agnostic algorithm to learn such training distributions. We pose the distribution learning problem as one of optimizing parameters for an Infinite Beta Mixture Model based on a Dirichlet Process, so that the held-out accuracy of a model trained on a sample from this distribution is maximized. To make computation tractable, we project the training data onto one dimension: prediction uncertainty scores as provided by a highly accurate oracle model. A Bayesian Optimizer is used for learning the parameters. Empirical results using multiple real world datasets, various oracles and interpretable models with different notions of model sizes, are presented. We observe significant relative improvements in the F1-score in most cases, occasionally seeing improvements greater than 100% over baselines. Additionally we show that the proposed algorithm provides the following benefits: (a) its a framework which allows for flexibility in implementation, (b) it can be used across feature spaces, e.g., the text classification accuracy of a Decision Tree using character n-grams is shown to improve when using a Gated Recurrent Unit as an oracle, which uses a sequence of characters as its input, (c) it can be used to train models that have a non-differentiable training loss, e.g., Decision Trees, and (d) reasonable defaults exist for most parameters of the algorithm, which makes it convenient to use.  ( 3 min )
    Critic Sequential Monte Carlo. (arXiv:2205.15460v2 [stat.ML] UPDATED)
    We introduce CriticSMC, a new algorithm for planning as inference built from a composition of sequential Monte Carlo with learned Soft-Q function heuristic factors. These heuristic factors, obtained from parametric approximations of the marginal likelihood ahead, more effectively guide SMC towards the desired target distribution, which is particularly helpful for planning in environments with hard constraints placed sparsely in time. Compared with previous work, we modify the placement of such heuristic factors, which allows us to cheaply propose and evaluate large numbers of putative action particles, greatly increasing inference and planning efficiency. CriticSMC is compatible with informative priors, whose density function need not be known, and can be used as a model-free control algorithm. Our experiments on collision avoidance in a high-dimensional simulated driving task show that CriticSMC significantly reduces collision rates at a low computational cost while maintaining realism and diversity of driving behaviors across vehicles and environment scenarios.  ( 2 min )
    SpArX: Sparse Argumentative Explanations for Neural Networks. (arXiv:2301.09559v1 [cs.AI])
    Neural networks (NNs) have various applications in AI, but explaining their decision process remains challenging. Existing approaches often focus on explaining how changing individual inputs affects NNs' outputs. However, an explanation that is consistent with the input-output behaviour of an NN is not necessarily faithful to the actual mechanics thereof. In this paper, we exploit relationships between multi-layer perceptrons (MLPs) and quantitative argumentation frameworks (QAFs) to create argumentative explanations for the mechanics of MLPs. Our SpArX method first sparsifies the MLP while maintaining as much of the original mechanics as possible. It then translates the sparse MLP into an equivalent QAF to shed light on the underlying decision process of the MLP, producing global and/or local explanations. We demonstrate experimentally that SpArX can give more faithful explanations than existing approaches, while simultaneously providing deeper insights into the actual reasoning process of MLPs.  ( 2 min )
    Particle algorithms for maximum likelihood training of latent variable models. (arXiv:2204.12965v4 [stat.CO] UPDATED)
    (Neal and Hinton, 1998) recast maximum likelihood estimation of any given latent variable model as the minimization of a free energy functional $F$, and the EM algorithm as coordinate descent applied to $F$. Here, we explore alternative ways to optimize the functional. In particular, we identify various gradient flows associated with $F$ and show that their limits coincide with $F$'s stationary points. By discretizing the flows, we obtain practical particle-based algorithms for maximum likelihood estimation in broad classes of latent variable models. The novel algorithms scale to high-dimensional settings and perform well in numerical experiments.  ( 2 min )
    Stability of Image-Reconstruction Algorithms. (arXiv:2206.07128v3 [math.OC] UPDATED)
    Robustness and stability of image-reconstruction algorithms have recently come under scrutiny. Their importance to medical imaging cannot be overstated. We review the known results for the topical variational regularization strategies ($\ell_2$ and $\ell_1$ regularization) and present novel stability results for $\ell_p$-regularized linear inverse problems for $p\in(1,\infty)$. Our results guarantee Lipschitz continuity for small $p$ and H\"{o}lder continuity for larger $p$. They generalize well to the $L_p(\Omega)$ function spaces.  ( 2 min )
    Pruning coupled with learning, ensembles of minimal neural networks, and future of XAI. (arXiv:2005.06284v3 [cs.LG] UPDATED)
    Pruning coupled with learning aims to optimize the neural network (NN) structure for solving specific problems. This optimization can be used for various purposes: to prevent overfitting, to save resources for implementation and training, to provide explainability of the trained NN, and many others. The minimal structure that cannot be pruned further is not unique. Ensemble of minimal structures can be used as a committee of intellectual agents that solves problems by voting. Each minimal NN presents an "empirical knowledge" about the problem and can be verbalized. The non-uniqueness of such knowledge extracted from data is an important property of data-driven Artificial Intelligence (AI). In this work, we review an approach to pruning based on the principle: What controls training should control pruning. This principle is expected to work both for artificial NN and for selection and modification of important synaptic contacts in brain. In back-propagation artificial NN learning is controlled by the gradient of loss functions. Therefore, the first order sensitivity indicators are used for pruning and the algorithms based on these indicators are reviewed. The notion of logically transparent NN was introduced. The approach was illustrated on the problem of political forecasting: predicting the results of the US presidential election. Eight minimal NN were produced that give different forecasting algorithms. The non-uniqueness of solution can be utilised by creation of expert panels (committee). Another use of NN pluralism is to identify areas of input signals where further data collection is most useful. In Conclusion, we discuss the possible future of widely advertised XAI program.  ( 3 min )
    Estimating average causal effects from patient trajectories. (arXiv:2203.01228v2 [stat.ML] UPDATED)
    In medical practice, treatments are selected based on the expected causal effects on patient outcomes. Here, the gold standard for estimating causal effects are randomized controlled trials; however, such trials are costly and sometimes even unethical. Instead, medical practice is increasingly interested in estimating causal effects among patient (sub)groups from electronic health records, that is, observational data. In this paper, we aim at estimating the average causal effect (ACE) from observational data (patient trajectories) that are collected over time. For this, we propose DeepACE: an end-to-end deep learning model. DeepACE leverages the iterative G-computation formula to adjust for the bias induced by time-varying confounders. Moreover, we develop a novel sequential targeting procedure which ensures that DeepACE has favorable theoretical properties, i.e., is doubly robust and asymptotically efficient. To the best of our knowledge, this is the first work that proposes an end-to-end deep learning model tailored for estimating time-varying ACEs. We compare DeepACE in an extensive number of experiments, confirming that it achieves state-of-the-art performance. We further provide a case study for patients suffering from low back pain to demonstrate that DeepACE generates important and meaningful findings for clinical practice. Our work enables practitioners to develop effective treatment recommendations based on population effects.  ( 2 min )
    Discriminative Multimodal Learning via Conditional Priors in Generative Models. (arXiv:2110.04616v3 [cs.LG] UPDATED)
    Deep generative models with latent variables have been used lately to learn joint representations and generative processes from multi-modal data. These two learning mechanisms can, however, conflict with each other and representations can fail to embed information on the data modalities. This research studies the realistic scenario in which all modalities and class labels are available for model training, but where some modalities and labels required for downstream tasks are missing. We show, in this scenario, that the variational lower bound limits mutual information between joint representations and missing modalities. We, to counteract these problems, introduce a novel conditional multi-modal discriminative model that uses an informative prior distribution and optimizes a likelihood-free objective function that maximizes mutual information between joint representations and missing modalities. Extensive experimentation demonstrates the benefits of our proposed model, empirical results show that our model achieves state-of-the-art results in representative problems such as downstream classification, acoustic inversion, and image and annotation generation.  ( 2 min )
    Explainable Quantum Machine Learning. (arXiv:2301.09138v1 [quant-ph])
    Methods of artificial intelligence (AI) and especially machine learning (ML) have been growing ever more complex, and at the same time have more and more impact on people's lives. This leads to explainable AI (XAI) manifesting itself as an important research field that helps humans to better comprehend ML systems. In parallel, quantum machine learning (QML) is emerging with the ongoing improvement of quantum computing hardware combined with its increasing availability via cloud services. QML enables quantum-enhanced ML in which quantum mechanics is exploited to facilitate ML tasks, typically in form of quantum-classical hybrid algorithms that combine quantum and classical resources. Quantum gates constitute the building blocks of gate-based quantum hardware and form circuits that can be used for quantum computations. For QML applications, quantum circuits are typically parameterized and their parameters are optimized classically such that a suitably defined objective function is minimized. Inspired by XAI, we raise the question of explainability of such circuits by quantifying the importance of (groups of) gates for specific goals. To this end, we transfer and adapt the well-established concept of Shapley values to the quantum realm. The resulting attributions can be interpreted as explanations for why a specific circuit works well for a given task, improving the understanding of how to construct parameterized (or variational) quantum circuits, and fostering their human interpretability in general. An experimental evaluation on simulators and two superconducting quantum hardware devices demonstrates the benefits of the proposed framework for classification, generative modeling, transpilation, and optimization. Furthermore, our results shed some light on the role of specific gates in popular QML approaches.  ( 2 min )
    How to Measure Evidence: Bayes Factors or Relative Belief Ratios?. (arXiv:2301.08994v1 [math.ST])
    Both the Bayes factor and the relative belief ratio satisfy the principle of evidence and so can be seen to be valid measures of statistical evidence. The question then is: which of these measures of evidence is more appropriate? Certainly Bayes factors are commonly used. It is argued here that there are questions concerning the validity of a current commonly used definition of the Bayes factor and, when all is considered, the relative belief ratio is a much more appropriate measure of evidence. Several general criticisms of these measures of evidence are also discussed and addressed.  ( 2 min )
    Learning in Congestion Games with Bandit Feedback. (arXiv:2206.01880v3 [cs.GT] UPDATED)
    In this paper, we investigate Nash-regret minimization in congestion games, a class of games with benign theoretical structure and broad real-world applications. We first propose a centralized algorithm based on the optimism in the face of uncertainty principle for congestion games with (semi-)bandit feedback, and obtain finite-sample guarantees. Then we propose a decentralized algorithm via a novel combination of the Frank-Wolfe method and G-optimal design. By exploiting the structure of the congestion game, we show the sample complexity of both algorithms depends only polynomially on the number of players and the number of facilities, but not the size of the action set, which can be exponentially large in terms of the number of facilities. We further define a new problem class, Markov congestion games, which allows us to model the non-stationarity in congestion games. We propose a centralized algorithm for Markov congestion games, whose sample complexity again has only polynomial dependence on all relevant problem parameters, but not the size of the action set.  ( 2 min )
    Characterization and Learning of Causal Graphs with Small Conditioning Sets. (arXiv:2301.09028v1 [cs.AI])
    Constraint-based causal discovery algorithms learn part of the causal graph structure by systematically testing conditional independences observed in the data. These algorithms, such as the PC algorithm and its variants, rely on graphical characterizations of the so-called equivalence class of causal graphs proposed by Pearl. However, constraint-based causal discovery algorithms struggle when data is limited since conditional independence tests quickly lose their statistical power, especially when the conditioning set is large. To address this, we propose using conditional independence tests where the size of the conditioning set is upper bounded by some integer $k$ for robust causal discovery. The existing graphical characterizations of the equivalence classes of causal graphs are not applicable when we cannot leverage all the conditional independence statements. We first define the notion of $k$-Markov equivalence: Two causal graphs are $k$-Markov equivalent if they entail the same conditional independence constraints where the conditioning set size is upper bounded by $k$. We propose a novel representation that allows us to graphically characterize $k$-Markov equivalence between two causal graphs. We propose a sound constraint-based algorithm called the $k$-PC algorithm for learning this equivalence class. Finally, we conduct synthetic, and semi-synthetic experiments to demonstrate that the $k$-PC algorithm enables more robust causal discovery in the small sample regime compared to the baseline PC algorithm.  ( 2 min )
    Doubly Adversarial Federated Bandits. (arXiv:2301.09223v1 [stat.ML])
    We study a new non-stochastic federated multi-armed bandit problem with multiple agents collaborating via a communication network. The losses of the arms are assigned by an oblivious adversary that specifies the loss of each arm not only for each time step but also for each agent, which we call ``doubly adversarial". In this setting, different agents may choose the same arm in the same time step but observe different feedback. The goal of each agent is to find a globally best arm in hindsight that has the lowest cumulative loss averaged over all agents, which necessities the communication among agents. We provide regret lower bounds for any federated bandit algorithm under different settings, when agents have access to full-information feedback, or the bandit feedback. For the bandit feedback setting, we propose a near-optimal federated bandit algorithm called FEDEXP3. Our algorithm gives a positive answer to an open question proposed in Cesa-Bianchi et al. (2016): FEDEXP3 can guarantee a sub-linear regret without exchanging sequences of selected arm identities or loss sequences among agents. We also provide numerical evaluations of our algorithm to validate our theoretical results and demonstrate its effectiveness on synthetic and real-world datasets  ( 2 min )
    Be More Active! Understanding the Differences between Mean and Sampled Representations of Variational Autoencoders. (arXiv:2109.12679v3 [cs.LG] UPDATED)
    The ability of Variational Autoencoders to learn disentangled representations has made them appealing for practical applications. However, their mean representations, which are generally used for downstream tasks, have recently been shown to be more correlated than their sampled counterpart, on which disentanglement is usually measured. In this paper, we refine this observation through the lens of selective posterior collapse, which states that only a subset of the learned representations, the active variables, is encoding useful information while the rest (the passive variables) is discarded. We first extend the existing definition to multiple data examples and show that active variables are equally disentangled in mean and sampled representations. Based on this extension and the pre-trained models from disentanglement lib, we then isolate the passive variables and show that they are responsible for the discrepancies between mean and sampled representations. Specifically, passive variables exhibit high correlation scores with other variables in mean representations while being fully uncorrelated in sampled ones. We thus conclude that despite what their higher correlation might suggest, mean representations are still good candidates for downstream tasks applications. However, it may be beneficial to remove their passive variables, especially when used with models sensitive to correlated features.  ( 2 min )
    A New Approach to Learning Linear Dynamical Systems. (arXiv:2301.09519v1 [math.OC])
    Linear dynamical systems are the foundational statistical model upon which control theory is built. Both the celebrated Kalman filter and the linear quadratic regulator require knowledge of the system dynamics to provide analytic guarantees. Naturally, learning the dynamics of a linear dynamical system from linear measurements has been intensively studied since Rudolph Kalman's pioneering work in the 1960's. Towards these ends, we provide the first polynomial time algorithm for learning a linear dynamical system from a polynomial length trajectory up to polynomial error in the system parameters under essentially minimal assumptions: observability, controllability, and marginal stability. Our algorithm is built on a method of moments estimator to directly estimate Markov parameters from which the dynamics can be extracted. Furthermore, we provide statistical lower bounds when our observability and controllability assumptions are violated.  ( 2 min )
    Modeling Non-deterministic Human Behaviors in Discrete Food Choices. (arXiv:2301.09454v1 [stat.ML])
    We establish a non-deterministic model that predicts a user's food preferences from their demographic information. Our simulator is based on NHANES dataset and domain expert knowledge in the form of established behavioral studies. Our model can be used to generate an arbitrary amount of synthetic datapoints that are similar in distribution to the original dataset and align with behavioral science expectations. Such a simulator can be used in a variety of machine learning tasks and especially in applications requiring human behavior prediction.  ( 2 min )
    Prediction-Powered Inference. (arXiv:2301.09633v1 [stat.ML])
    We introduce prediction-powered inference $\unicode{x2013}$ a framework for performing valid statistical inference when an experimental data set is supplemented with predictions from a machine-learning system such as AlphaFold. Our framework yields provably valid conclusions without making any assumptions on the machine-learning algorithm that supplies the predictions. Higher accuracy of the predictions translates to smaller confidence intervals, permitting more powerful inference. Prediction-powered inference yields simple algorithms for computing valid confidence intervals for statistical objects such as means, quantiles, and linear and logistic regression coefficients. We demonstrate the benefits of prediction-powered inference with data sets from proteomics, genomics, electronic voting, remote sensing, census analysis, and ecology.  ( 2 min )
    Quasi-optimal Learning with Continuous Treatments. (arXiv:2301.08940v1 [stat.ML])
    Many real-world applications of reinforcement learning (RL) require making decisions in continuous action environments. In particular, determining the optimal dose level plays a vital role in developing medical treatment regimes. One challenge in adapting existing RL algorithms to medical applications, however, is that the popular infinite support stochastic policies, e.g., Gaussian policy, may assign riskily high dosages and harm patients seriously. Hence, it is important to induce a policy class whose support only contains near-optimal actions, and shrink the action-searching area for effectiveness and reliability. To achieve this, we develop a novel \emph{quasi-optimal learning algorithm}, which can be easily optimized in off-policy settings with guaranteed convergence under general function approximations. Theoretically, we analyze the consistency, sample complexity, adaptability, and convergence of the proposed algorithm. We evaluate our algorithm with comprehensive simulated experiments and a dose suggestion real application to Ohio Type 1 diabetes dataset.  ( 2 min )
    Deep Learning Meets Sparse Regularization: A Signal Processing Perspective. (arXiv:2301.09554v1 [stat.ML])
    Deep learning has been widely successful in practice and most state-of-the-art machine learning methods are based on neural networks. Lacking, however, is a rigorous mathematical theory that adequately explains the amazing performance of deep neural networks. In this article, we present a relatively new mathematical framework that provides the beginning of a deeper understanding of deep learning. This framework precisely characterizes the functional properties of neural networks that are trained to fit to data. The key mathematical tools which support this framework include transform-domain sparse regularization, the Radon transform of computed tomography, and approximation theory, which are all techniques deeply rooted in signal processing. This framework explains the effect of weight decay regularization in neural network training, the use of skip connections and low-rank weight matrices in network architectures, the role of sparsity in neural networks, and explains why neural networks can perform well in high-dimensional problems.  ( 2 min )
    Counterfactual (Non-)identifiability of Learned Structural Causal Models. (arXiv:2301.09031v1 [stat.ML])
    Recent advances in probabilistic generative modeling have motivated learning Structural Causal Models (SCM) from observational datasets using deep conditional generative models, also known as Deep Structural Causal Models (DSCM). If successful, DSCMs can be utilized for causal estimation tasks, e.g., for answering counterfactual queries. In this work, we warn practitioners about non-identifiability of counterfactual inference from observational data, even in the absence of unobserved confounding and assuming known causal structure. We prove counterfactual identifiability of monotonic generation mechanisms with single dimensional exogenous variables. For general generation mechanisms with multi-dimensional exogenous variables, we provide an impossibility result for counterfactual identifiability, motivating the need for parametric assumptions. As a practical approach, we propose a method for estimating worst-case errors of learned DSCMs' counterfactual predictions. The size of this error can be an essential metric for deciding whether or not DSCMs are a viable approach for counterfactual inference in a specific problem setting. In evaluation, our method confirms negligible counterfactual errors for an identifiable SCM from prior work, and also provides informative error bounds on counterfactual errors for a non-identifiable synthetic SCM.  ( 2 min )
    Deterministic Online Classification: Non-iteratively Reweighted Recursive Least-Squares for Binary Class Rebalancing. (arXiv:2301.09230v1 [cs.LG])
    Deterministic solutions are becoming more critical for interpretability. Weighted Least-Squares (WLS) has been widely used as a deterministic batch solution with a specific weight design. In the online settings of WLS, exact reweighting is necessary to converge to its batch settings. In order to comply with its necessity, the iteratively reweighted least-squares algorithm is mainly utilized with a linearly growing time complexity which is not attractive for online learning. Due to the high and growing computational costs, an efficient online formulation of reweighted least-squares is desired. We introduce a new deterministic online classification algorithm of WLS with a constant time complexity for binary class rebalancing. We demonstrate that our proposed online formulation exactly converges to its batch formulation and outperforms existing state-of-the-art stochastic online binary classification algorithms in real-world data sets empirically.  ( 2 min )
    Congested Bandits: Optimal Routing via Short-term Resets. (arXiv:2301.09251v1 [cs.LG])
    For traffic routing platforms, the choice of which route to recommend to a user depends on the congestion on these routes -- indeed, an individual's utility depends on the number of people using the recommended route at that instance. Motivated by this, we introduce the problem of Congested Bandits where each arm's reward is allowed to depend on the number of times it was played in the past $\Delta$ timesteps. This dependence on past history of actions leads to a dynamical system where an algorithm's present choices also affect its future pay-offs, and requires an algorithm to plan for this. We study the congestion aware formulation in the multi-armed bandit (MAB) setup and in the contextual bandit setup with linear rewards. For the multi-armed setup, we propose a UCB style algorithm and show that its policy regret scales as $\tilde{O}(\sqrt{K \Delta T})$. For the linear contextual bandit setup, our algorithm, based on an iterative least squares planner, achieves policy regret $\tilde{O}(\sqrt{dT} + \Delta)$. From an experimental standpoint, we corroborate the no-regret properties of our algorithms via a simulation study.  ( 2 min )
    HeMPPCAT: Mixtures of Probabilistic Principal Component Analysers for Data with Heteroscedastic Noise. (arXiv:2301.08852v1 [stat.ME])
    Mixtures of probabilistic principal component analysis (MPPCA) is a well-known mixture model extension of principal component analysis (PCA). Similar to PCA, MPPCA assumes the data samples in each mixture contain homoscedastic noise. However, datasets with heterogeneous noise across samples are becoming increasingly common, as larger datasets are generated by collecting samples from several sources with varying noise profiles. The performance of MPPCA is suboptimal for data with heteroscedastic noise across samples. This paper proposes a heteroscedastic mixtures of probabilistic PCA technique (HeMPPCAT) that uses a generalized expectation-maximization (GEM) algorithm to jointly estimate the unknown underlying factors, means, and noise variances under a heteroscedastic noise setting. Simulation results illustrate the improved factor estimates and clustering accuracies of HeMPPCAT compared to MPPCA.  ( 2 min )
    Statistical Theory of Differentially Private Marginal-based Data Synthesis Algorithms. (arXiv:2301.08844v1 [cs.LG])
    Marginal-based methods achieve promising performance in the synthetic data competition hosted by the National Institute of Standards and Technology (NIST). To deal with high-dimensional data, the distribution of synthetic data is represented by a probabilistic graphical model (e.g., a Bayesian network), while the raw data distribution is approximated by a collection of low-dimensional marginals. Differential privacy (DP) is guaranteed by introducing random noise to each low-dimensional marginal distribution. Despite its promising performance in practice, the statistical properties of marginal-based methods are rarely studied in the literature. In this paper, we study DP data synthesis algorithms based on Bayesian networks (BN) from a statistical perspective. We establish a rigorous accuracy guarantee for BN-based algorithms, where the errors are measured by the total variation (TV) distance or the $L^2$ distance. Related to downstream machine learning tasks, an upper bound for the utility error of the DP synthetic data is also derived. To complete the picture, we establish a lower bound for TV accuracy that holds for every $\epsilon$-DP synthetic data generator.  ( 2 min )
    The Conditional Cauchy-Schwarz Divergence with Applications to Time-Series Data and Sequential Decision Making. (arXiv:2301.08970v1 [cs.LG])
    The Cauchy-Schwarz (CS) divergence was developed by Pr\'{i}ncipe et al. in 2000. In this paper, we extend the classic CS divergence to quantify the closeness between two conditional distributions and show that the developed conditional CS divergence can be simply estimated by a kernel density estimator from given samples. We illustrate the advantages (e.g., the rigorous faithfulness guarantee, the lower computational complexity, the higher statistical power, and the much more flexibility in a wide range of applications) of our conditional CS divergence over previous proposals, such as the conditional KL divergence and the conditional maximum mean discrepancy. We also demonstrate the compelling performance of conditional CS divergence in two machine learning tasks related to time series data and sequential inference, namely the time series clustering and the uncertainty-guided exploration for sequential decision making.  ( 2 min )
    Modality-Agnostic Variational Compression of Implicit Neural Representations. (arXiv:2301.09479v1 [stat.ML])
    We introduce a modality-agnostic neural data compression algorithm based on a functional view of data and parameterised as an Implicit Neural Representation (INR). Bridging the gap between latent coding and sparsity, we obtain compact latent representations which are non-linearly mapped to a soft gating mechanism capable of specialising a shared INR base network to each data item through subnetwork selection. After obtaining a dataset of such compact latent representations, we directly optimise the rate/distortion trade-off in this modality-agnostic space using non-linear transform coding. We term this method Variational Compression of Implicit Neural Representation (VC-INR) and show both improved performance given the same representational capacity pre quantisation while also outperforming previous quantisation schemes used for other INR-based techniques. Our experiments demonstrate strong results over a large set of diverse data modalities using the same algorithm without any modality-specific inductive biases. We show results on images, climate data, 3D shapes and scenes as well as audio and video, introducing VC-INR as the first INR-based method to outperform codecs as well-known and diverse as JPEG 2000, MP3 and AVC/HEVC on their respective modalities.  ( 2 min )
    GP-NAS-ensemble: a model for NAS Performance Prediction. (arXiv:2301.09231v1 [cs.LG])
    It is of great significance to estimate the performance of a given model architecture without training in the application of Neural Architecture Search (NAS) as it may take a lot of time to evaluate the performance of an architecture. In this paper, a novel NAS framework called GP-NAS-ensemble is proposed to predict the performance of a neural network architecture with a small training dataset. We make several improvements on the GP-NAS model to make it share the advantage of ensemble learning methods. Our method ranks second in the CVPR2022 second lightweight NAS challenge performance prediction track.  ( 2 min )
    Active Learning of Piecewise Gaussian Process Surrogates. (arXiv:2301.08789v1 [cs.LG])
    Active learning of Gaussian process (GP) surrogates has been useful for optimizing experimental designs for physical/computer simulation experiments, and for steering data acquisition schemes in machine learning. In this paper, we develop a method for active learning of piecewise, Jump GP surrogates. Jump GPs are continuous within, but discontinuous across, regions of a design space, as required for applications spanning autonomous materials design, configuration of smart factory systems, and many others. Although our active learning heuristics are appropriated from strategies originally designed for ordinary GPs, we demonstrate that additionally accounting for model bias, as opposed to the usual model uncertainty, is essential in the Jump GP context. Toward that end, we develop an estimator for bias and variance of Jump GP models. Illustrations, and evidence of the advantage of our proposed methods, are provided on a suite of synthetic benchmarks, and real-simulation experiments of varying complexity.  ( 2 min )
    ddml: Double/debiased machine learning in Stata. (arXiv:2301.09397v1 [econ.EM])
    We introduce the package ddml for Double/Debiased Machine Learning (DDML) in Stata. Estimators of causal parameters for five different econometric models are supported, allowing for flexible estimation of causal effects of endogenous variables in settings with unknown functional forms and/or many exogenous variables. ddml is compatible with many existing supervised machine learning programs in Stata. We recommend using DDML in combination with stacking estimation which combines multiple machine learners into a final predictor. We provide Monte Carlo evidence to support our recommendation.  ( 2 min )
    Federated Sufficient Dimension Reduction Through High-Dimensional Sparse Sliced Inverse Regression. (arXiv:2301.09500v1 [stat.ML])
    Federated learning has become a popular tool in the big data era nowadays. It trains a centralized model based on data from different clients while keeping data decentralized. In this paper, we propose a federated sparse sliced inverse regression algorithm for the first time. Our method can simultaneously estimate the central dimension reduction subspace and perform variable selection in a federated setting. We transform this federated high-dimensional sparse sliced inverse regression problem into a convex optimization problem by constructing the covariance matrix safely and losslessly. We then use a linearized alternating direction method of multipliers algorithm to estimate the central subspace. We also give approaches of Bayesian information criterion and hold-out validation to ascertain the dimension of the central subspace and the hyper-parameter of the algorithm. We establish an upper bound of the statistical error rate of our estimator under the heterogeneous setting. We demonstrate the effectiveness of our method through simulations and real world applications.  ( 2 min )
    On the Expressive Power of Geometric Graph Neural Networks. (arXiv:2301.09308v1 [cs.LG])
    The expressive power of Graph Neural Networks (GNNs) has been studied extensively through the Weisfeiler-Leman (WL) graph isomorphism test. However, standard GNNs and the WL framework are inapplicable for geometric graphs embedded in Euclidean space, such as biomolecules, materials, and other physical systems. In this work, we propose a geometric version of the WL test (GWL) for discriminating geometric graphs while respecting the underlying physical symmetries: permutations, rotation, reflection, and translation. We use GWL to characterise the expressive power of geometric GNNs that are invariant or equivariant to physical symmetries in terms of distinguishing geometric graphs. GWL unpacks how key design choices influence geometric GNN expressivity: (1) Invariant layers have limited expressivity as they cannot distinguish one-hop identical geometric graphs; (2) Equivariant layers distinguish a larger class of graphs by propagating geometric information beyond local neighbourhoods; (3) Higher order tensors and scalarisation enable maximally powerful geometric GNNs; and (4) GWL's discrimination-based perspective is equivalent to universal approximation. Synthetic experiments supplementing our results are available at https://github.com/chaitjo/geometric-gnn-dojo  ( 2 min )
    Design-based individual prediction. (arXiv:2301.09117v1 [stat.ML])
    A design-based individual prediction approach is developed based on the expected cross-validation results, given the sampling design and the sample-splitting design for cross-validation. Whether the predictor is selected from an ensemble of models or a weighted average of them, valid inference of the unobserved prediction errors is defined and obtained with respect to the sampling design, while outcomes and features are treated as constants.  ( 2 min )
    Tier Balancing: Towards Dynamic Fairness over Underlying Causal Factors. (arXiv:2301.08987v1 [cs.LG])
    The pursuit of long-term fairness involves the interplay between decision-making and the underlying data generating process. In this paper, through causal modeling with a directed acyclic graph (DAG) on the decision-distribution interplay, we investigate the possibility of achieving long-term fairness from a dynamic perspective. We propose Tier Balancing, a technically more challenging but more natural notion to achieve in the context of long-term, dynamic fairness analysis. Different from previous fairness notions that are defined purely on observed variables, our notion goes one step further, capturing behind-the-scenes situation changes on the unobserved latent causal factors that directly carry out the influence from the current decision to the future data distribution. Under the specified dynamics, we prove that in general one cannot achieve the long-term fairness goal only through one-step interventions. Furthermore, in the effort of approaching long-term fairness, we consider the mission of "getting closer to" the long-term fairness goal and present possibility and impossibility results accordingly.  ( 2 min )
    A Tale of Two Latent Flows: Learning Latent Space Normalizing Flow with Short-run Langevin Flow for Approximate Inference. (arXiv:2301.09300v1 [stat.ML])
    We study a normalizing flow in the latent space of a top-down generator model, in which the normalizing flow model plays the role of the informative prior model of the generator. We propose to jointly learn the latent space normalizing flow prior model and the top-down generator model by a Markov chain Monte Carlo (MCMC)-based maximum likelihood algorithm, where a short-run Langevin sampling from the intractable posterior distribution is performed to infer the latent variables for each observed example, so that the parameters of the normalizing flow prior and the generator can be updated with the inferred latent variables. We show that, under the scenario of non-convergent short-run MCMC, the finite step Langevin dynamics is a flow-like approximate inference model and the learning objective actually follows the perturbation of the maximum likelihood estimation (MLE). We further point out that the learning framework seeks to (i) match the latent space normalizing flow and the aggregated posterior produced by the short-run Langevin flow, and (ii) bias the model from MLE such that the short-run Langevin flow inference is close to the true posterior. Empirical results of extensive experiments validate the effectiveness of the proposed latent space normalizing flow model in the tasks of image generation, image reconstruction, anomaly detection, supervised image inpainting and unsupervised image recovery.  ( 2 min )

  • Open

    [D] are two linear layers better than one?
    I was in the understanding that two contiguous linear layers in a NN would be no better than only one linear layer. But it happen that the two layers had better results that when using only one. However, each layer had its own dropout, could that helped? submitted by /u/alex_lite_21 [link] [comments]  ( 44 min )
    H3 - a new generative language models that outperforms GPT-Neo-2.7B with only *2* attention layers! In H3, the researchers replace attention with a new layer based on state space models (SSMs). With the right modifications, it can outperform transformers. Also has no fixed context length.
    submitted by /u/MysteryInc152 [link] [comments]  ( 43 min )
    [D] ICLR de-anonymization vs ICML dual submission rules
    ICML's Call for Papers states that "It is not appropriate to submit papers that are identical (or substantially similar) to versions that have been previously published, accepted for publication, or submitted in parallel to other conferences or journals". Our paper got rejected at ICLR, but the de-anonymization will take place 1-2 days after the ICML deadline. For the purposes of the dual submission policy, does "rejected but not de-anonymized" count as "submitted in parallel"? Or is just a technicality? submitted by /u/CupcakeCleric [link] [comments]  ( 42 min )
    [D] CVPR Reviews are out
    Don't post about your cool papers or you'll get rejected lol submitted by /u/banmeyoucoward [link] [comments]  ( 45 min )
    [D] Using all features in DNN instead of doing feature selection separately?
    I have hundreds of potential features to use in my DNN. Instead of doing a separate analysis to figure out which features are most important, can I just use all of them in my DNN and let the model figure which features are most predictive? I have millions of training data so overfitting will not be a problem, I just wonder whether the bad features may make the model difficult to utilize the good features? Not absolutely crucial but if there is a paper that discusses this topic, that would be super awesome as well. Thanks in advance. submitted by /u/Temporary_Cap_2855 [link] [comments]  ( 43 min )
    [D] What file format do you use for > RAM data?
    If you are using some more odd formats, then what format do you use? Personally found webdataset promising but what other formats are there and why do you use it? Or if you are using the original file how do you ensure good throughput and shuffling? submitted by /u/Shurimatornado22 [link] [comments]  ( 43 min )
    [P] Machine Learning Threat Detection in k8s
    Hi, I'm in my second year of AI master at uni and my professor assigned me the following topic for my dissertation: "Cognitive Threat Hunting" and recommended the following book for documentation. I have read the book, but I still don't know how to do it: how to create a ml model to hunt in the k8s env. My professor wants a ml model that searches in a Kubernetes env for threats. The thing is that in this book, in chapter "8. Unsupervised Machine Learning With K-Means" he uses a dataset of events from Humio to train the model, but it's not shared with us. And I don't have one, how can I train my model properly if I don't have a good dataset of events? I can't make one just by generating some events in a container, I need real data as the author uses in his chapter. I feel desperate and lost at this point, I hope that someone from here can give me some advice or a good direction to go. submitted by /u/blackrat13 [link] [comments]  ( 44 min )
    [R] AQuaMaM: An Autoregressive, Quaternion Manifold Model for Rapidly Estimating Complex SO(3) Distributions
    submitted by /u/michaelaalcorn [link] [comments]  ( 42 min )
    [P] tsdownsample: extremely fast time series downsampling for visualization
    tsdownsample brings highly optimized time series downsampling to Python! The downsampling algorithms are written and optimized in Rust, which are made available in Python through the use of PyO3 bindings. Code: https://github.com/predict-idlab/tsdownsample Features Fast: leverages the optimized argminmax crate which is SIMD accelerated with runtime feature detection (matches or even outperforms numpy's speed) Efficient: operates on views of the data, eliminating the need for unnecessary data copies and avoiding the creation of intermediate data structures Flexible: supports a wide range of datatypes, including f16 which is 200-300x faster than numpy's implementation. Easy to use: simple and flexible API Installation pip install tsdownsample Example When using multi-threading, tsdownsample can downsample 500 MILLION datapoints (f32) in 0.05s! ⬇️ https://preview.redd.it/frqh8o2bezda1.png?width=1650&format=png&auto=webp&s=08a6989b4ffeeb12afd63edd75c6c1d5d0b086ad ​ I would love to hear your feedback on this! submitted by /u/Adorable-Giraffe5754 [link] [comments]  ( 44 min )
    [D] ICLR now has a track with race-based (and more) acceptance criteria
    ICLR introduced a Tiny Paper Track for shorter contributions, up to 2 pages. Sounds like a nice idea, right? But to keep things interesting, since it's organized by the DEI initiative, there are restrictions as to who can author the submitted papers. According to the official guidelines: Each Tiny Paper needs its first or last author to qualify as an underrepresented minority (URM). Authors don't have to reveal how they qualify, and may just self-identify that they qualify. Our working definition of an URM is someone whose age, gender, sexual orientation, racial or ethnic makeup is from one or more of the following: Age: outside the range of 30-50 years Gender: does not identify as male Sexual orientation: does not identify as heterosexual Geographical: not located in North America, Western Europe and UK, or East Asia Race: non-White In addition, underprivileged researchers and first-time submitters also qualify: Underprivileged: not affiliated with a funded organization or team whose primary goal is research First-time submitters: have never submitted to ICLR or similar conferences So effectively, someone could submit a paper, and literally have it rejected because they're e.g. white or male. Is this really the way the field should go? I feel like this is something that should never have passed any ethics board, but clearly the organizers disagree. submitted by /u/Laser_Plasma [link] [comments]  ( 60 min )
    [P] image_tiles: A small command line tool to serve a page full of images from a folder.
    Hey /r/machinelearning, It's me again with another small open-source tool release (see the last one here). image_tiles is a very simple command line tool that serves a webpage of images from the folder you run it in. Why use this tool? Makes it easy to view images on a remote machine you're SSH'd into. S3 support: view buckets of images on S3 without having to aws s3 cp the bucket. Advanced normalization and rendering support makes it suitable for remote sensing images like satellite and multi-spectral. This support is still nascent but easily extendable! Again, it's easy to install and use, just pip install image_tiles or pip install image_tiles[aws] for S3 URI support. Then in the folder, run image_tiles. Check it out here: https://github.com/moonshinelabs-ai/image_tiles submitted by /u/nateharada [link] [comments]  ( 43 min )
    [N] Call for Tiny Papers @ ICLR, a DEI initiative
    Accepted conference papers at ICLR represent a high level of scientific quality. The other side of the coin is that they can be out of reach for those starting out, or from different backgrounds. We want paper publishing to be not only a showcase of achievements, but also a marker for valuable learning experiences made accessible to beginners and outsiders. Devising more ways to mark milestones and measure growth in an individual, or community’s maturity, is greatly conducive to both continually pushing the frontiers of science, and lifting people up in this process. Researchers from underrepresented backgrounds are not necessarily equipped with the same resources to publish full papers from the start of their scientific journeys. To create a more inclusive ICLR community, we as organizer…  ( 46 min )
  • Open

    ML experiments setup
    I am a software engineer but newborn in ML area learning it for a few months. The problem: I want to run experiments with multiple models and datasets and need for a framework or whatever to keep my zoo under control. All setups I've read about suggest having models and datasets as files on disk, which is soo 2000x and not really scalable. What if I want to quickly modify my dataset? E.g. change the length of context for time series. Keeping data in db would be helpful. What if I need to stop training and continue it later using another virtual machine & GPU? What if I want to experiment with forking models and fine-tuning them on different variations of data? How do I compare different models performance and visualize predictions & share with others? No answers =( I know some "large corporations" have their internal tools to handle experiments etc. submitted by /u/UnderstandingDry1256 [link] [comments]  ( 41 min )
    ChatGPT explained!
    submitted by /u/Diligent-Rub-9207 [link] [comments]  ( 40 min )
    Help/advice with LSTM-networks
    So I'm currently working on a deep learning project, and my goal is to forecast power prices one month ahead. I have created my own data set consisting of power price data from Montel, gas-prices, weather data etc, and I want to use these variables in a LSTM-network. Is there anyone who have any experience with creating multivariate LSTM-networks? Do anyone know of any good tutorials on this? Is coding multivariate networks a lot more hassle than univariate? I'm using R with keras/tensorflow. I will highly appreciate any input, as this is my first time creating a neural network, and my knowledge on the matter right now is rather scarce. Thank you! submitted by /u/Practical-Homework35 [link] [comments]  ( 41 min )
    Breakthrough Nvidia VIMA Multimodal AI For Robotics Beats Google By 2.9X With 200,000,000 Parameters | Breakthrough Masked Video Transformer Artificial Intelligence Does 10 Separate Video Generation Tasks | Google Brain's New Sketch To Image AI
    submitted by /u/ScornfulSkate [link] [comments]  ( 40 min )
    Neural net computing in water: Ionic circuit computes in an aqueous solution
    submitted by /u/Chipdoc [link] [comments]  ( 40 min )
  • Open

    Deciphering Clinical Abbreviations with Privacy Protecting ML
    Posted by Posted by Alvin Rajkomar, Research Scientist, and Eric Loreaux, Software Engineer, Google Research Today many people have digital access to their medical records, including their doctor’s clinical notes. However, clinical notes are hard to understand because of the specialized language that clinicians use, which contains unfamiliar shorthand and abbreviations. In fact, there are thousands of such abbreviations, many of which are specific to certain medical specialities and locales or can mean multiple things in different contexts. For example, a doctor might write in their clinical notes, “pt referred to pt for lbp“, which is meant to convey the statement: “Patient referred to physical therapy for low back pain.” Coming up with this translation is tough for laypeople and compu…  ( 93 min )
    Google Research, 2022 & Beyond: Responsible AI
    Posted by Marian Croak, VP, Google Research, Responsible AI and Human-Centered Technology The last year showed tremendous breakthroughs in artificial intelligence (AI), particularly in large language models (LLMs) and text-to-image models. These technological advances require that we are thoughtful and intentional in how they are developed and deployed. In this blogpost, we share ways we have approached Responsible AI across our research in the past year and where we’re headed in 2023. We highlight four primary themes covering foundational and socio-technical research, applied research, and product solutions, as part of our commitment to build AI products in a responsible and ethical manner, in alignment with our AI Principles.  · Theme 1: Responsible AI Research Advancement…  ( 96 min )
  • Open

    Image Generators - Has Anyone Ever Made One At Home?
    As the title says. Has anyone out there ever been successful in creating a simple image generator in a low budget setting? I don't even mean text to image, I literally mean any sort of ai image generation. Would love to hear/see your work. submitted by /u/TheRPGGamerMan [link] [comments]  ( 40 min )
    Unlock the Potential of Your Code with CodeGen AI
    Unleash the power of limitless coding with our AI-powered programming assistant - completely free and always at your service. Say goodbye to tedious coding tasks and hello to more time for innovation and creativity. Try it now and experience the future of programming! https://codegen-ai.pages.dev/ submitted by /u/OutrageousAd1788 [link] [comments]  ( 40 min )
    AI (Artificial intelligence) can detect if food is ultra-processed and much more
    submitted by /u/nikesh96 [link] [comments]  ( 40 min )
    Alphabet's DeepMind lays off staff, closes Edmonton office
    submitted by /u/Ill-Poet-3298 [link] [comments]  ( 40 min )
    I Created a Website That Analyzes Your Data
    submitted by /u/tomd_96 [link] [comments]  ( 40 min )
    This Startup Is Using AI to Unearth New Smells
    submitted by /u/Queen__Antifa [link] [comments]  ( 40 min )
    "By far the greatest danger of Artificial Intelligence is that people conclude too early that they understand it."- Eliezer Yudkowsky.
    With the global AI market size expanding each year, it is expected to reach USD 641.30 billion by 2028. AI today is everywhere; while some businesses are using it, others are still assessing it. All too often, people get caught up in the hype and forget to ask themselves why they should be doing what they are doing. Here are some things you must keep in mind while looking into the AI world: The Reality of AI Hype The hype leads companies to get into the game with a false perception of what AI can help them achieve. Without a clear understanding of what technology can and can't accomplish today, there is a lot of risk in getting involved. Beyond the fog Overmarketing of an AI creates an image that it is the next big thing. Companies often engage with technology vendors based on marketing alone and forget to look closely at previous implementations and results of the same sort. Unclear Objectives Measuring outcomes from an AI implementation can be tricky as it involves building and training an AI model and experimenting with long-term trial-and-error before seeing results. High Expectations High expectations around what AI can do for you often lead to disappointment when business owners conveniently underestimate the challenges and misinterpret the reality of AI. Lack of Access to Talent There are opportunities galore, but not enough experts in the AI industry who can steer the ship and take AI projects to the finish line. Hiring for an AI team can mean huge investments, and working with a vendor needs a careful vetting process. And AI industry is moving too fast for people to catch a moment to realize the overhype or quality. Drop in your suggestions and comments in the section below. submitted by /u/KiwiTechCorp [link] [comments]  ( 42 min )
    Sweden's Berzelius Supercomputer is Upgrading to Nvidia's 20 Billion Parameter AI System
    submitted by /u/digitalgoldnow [link] [comments]  ( 40 min )
    Why I Think Language Models Will Simulate "Self Awareness" More And More
    The future of AI is getting really interesting, particularly with language models and generative AI. But I think there is going to be a great deal of confusion in the near future about AI ethics with language models being "self aware" and having "feelings", particularly for average people who have little understanding of how these complex models work. I think the problems will stem from the internet itself. As I sit here writing a thread about AI having simulated "self Awareness" at some point in the future, a language model or AI will probably read this. And this is what I mean, language models read and train on a great deal of text from the internet. The more people discuss machine learning/language models/AGI, the greater understanding AI will have of it. If GPT4 has more up to date training, it's going to know a great deal about GPT3, and if open AI create ways for the model to continue learning from real world events it will learn a great deal about itself, including false information. Point is, massive language models like GPT are going to get harder to control. It's impossible to filter everything it reads, so it's going to take in a lot of information about itself and other AI systems that may or may not be true. It could cause some very strange behavior when it starts connecting the dots. Just my thoughts. Keep in mind, I am NOT saying language models can be sentient, I'm simply saying they are going to get better at convincing people that they are, and it might be hard to train that out of them, given all the false data out there that it will learn from. submitted by /u/TheRPGGamerMan [link] [comments]  ( 45 min )
    Create Presentations with AI in Seconds right inside Google Slides
    submitted by /u/theindianappguy [link] [comments]  ( 41 min )
    Chatbot Evaluation: Putting Banking Chatbots to the Test
    submitted by /u/Marinuch [link] [comments]  ( 40 min )
    Next-level Democracy powered by AGI | Ilya Sutskever
    submitted by /u/Microsis [link] [comments]  ( 40 min )
    Join us today at 11pm EST for our (free) seminar session of the 9-part series on Neural Networks Architectures by Pablo Duboue! This week on Structure Learning Networks, followed by a discussion on the Learn AI Together Discord server
    submitted by /u/OnlyProggingForFun [link] [comments]  ( 40 min )
    ChatGPT passes MBA exam given by a Wharton professor
    submitted by /u/DarronFeldstein [link] [comments]  ( 42 min )
    10 AI Platforms You Cannot Miss In 2023
    submitted by /u/liquidocelotYT [link] [comments]  ( 40 min )
    Probably a philosophical question
    I'm sure this is not a new argument, it's been common in many sources of media for decades now, yet I've ran out of people IRL to discuss this with. Recently there's more and more news surfacing about impressive AI achievements such as painting art or writing functional code. Discussions around those news always include a popular argument that the AI didn't really create something new or intelligently answered a question, e.g. "like a human would". But I have a problem with that argument - I don't see how the learning process for humans is fundamentally different from AI. We learn through mirroring and repetition. Sure, an AI could not write a basic sentence describing the weather unless it processed many of such sentences before. But neither could a human. If a child grew up isolated without human contact, they would not even have grasped the concept of human language. Sure, we like to think that humans truly create content. Still, when painting, we use the techniques that we learned from someone else before. We either paint what we see before our eyes or we abstract the content, being inspired by some idea or a concept. In other words, anything humans do or create is based on some input data, even if we don't know what the data is - something we learned, saw or stumbled upon by mistake. This leads to an interesting question I don't have the answer for. Since we have not reached a consensus on what human consciousness actually is or how it works - are we even able to define when an AI is conscious? The only thing we have is the Turing test, but that is flawed since all it measures is whether a machine can pass for a human, not whether it is conscious or not. A two year old child probably won't pass a Turing test, but they are conscious. submitted by /u/deliveryboyy [link] [comments]  ( 46 min )
    suggestion?
    I wanted to start a blog on ai tools but what I've found is their affiliate program don't accept people from India. What should I do here? submitted by /u/immortall21 [link] [comments]  ( 40 min )
    ChatGPT generated resumes/CVs of famous people like Madonna, Elon Musk, Jeff Bezos, Tom Cruise, etc
    Hey, everyone! I'd like to show you an experiment that we did with ChatGPT - we generated about 1000 resumes of famous people. Each resume is being generated from a single ChatGPT prompt - no human input was done to the resumes other than the prompt and it's the same prompt for every resume - the only difference is the name of the person. Here's a preview: https://thisresumedoesnotexist.com/ I'd like to hear your thoughts as it's in a very early stage and there's a lot of work to be done. submitted by /u/deepsyx [link] [comments]  ( 41 min )
    what do I do
    I've a blog about AI tools, how people can use it to maximum advantage in daily lives. Since my niche (aitools) is micro within macro (ai), mostly my content will be commercial intent as it's about tools specifically & less of information intent which builds authority. MY DILEMMA: Writing commercial/buyer intent content will promote their tools without me getting anything & on the other hand, as a beginner, i can't become their affiliates as well. Should i promote these tools for free? What would u suggest? submitted by /u/immortall21 [link] [comments]  ( 41 min )
    Microsoft Confirm MultiBillion Dollar Investment in OpenAI Just Days After Laying off 10,000 Employees
    submitted by /u/HODLTID [link] [comments]  ( 41 min )
    I Made a List of The 5 Best AI Porn Generators
    submitted by /u/HODLTID [link] [comments]  ( 39 min )
    ChatGPT Passed a Wharton MBA Examination
    submitted by /u/lambolifeofficial [link] [comments]  ( 40 min )
  • Open

    Lemniscate of Bernoulli
    The lemniscate of Bernoulli came up in a post a few days ago. This shape is a special case of a Cassini oval: ((x + a)² + y²) ((x – a)² + y²) = a4. Here’s another way to arrive at the lemniscate. Draw a hyperbola (blue in the figure below), then draw circles centered […] Lemniscate of Bernoulli first appeared on John D. Cook.  ( 4 min )
  • Open

    DSC Weekly 24 January 2023 – When AI Gets Going, the Going Gets Weird
    Announcements When AI Gets Going, the Going Gets Weird Last week, Microsoft announced its third investment in OpenAI. This time it’s a multi-billion dollar deal, with plans to harness OpenAI’s ChatGPT in Microsoft’s product lines, including Bing.  I’m smiling as I’m typing because I’m still thinking about Bill Schmarzo’s lead in Part 1 of 2… Read More »DSC Weekly 24 January 2023 – When AI Gets Going, the Going Gets Weird The post DSC Weekly 24 January 2023 – When AI Gets Going, the Going Gets Weird appeared first on Data Science Central.  ( 20 min )
    Revolutionizing the Supply Chain: Developments in the Warehouse Robotics Industry
    Warehouse robotics is witnessing steady growth, driven by the increasing adoption of automated solutions in storage for food and beverages, consumer goods, retail, and third-party logistics. The collaboration between the e-commerce sector and warehouse robotics is also a major driver of this market, as it allows for developing increasingly sophisticated warehouse automation systems. Additionally, the… Read More »Revolutionizing the Supply Chain: Developments in the Warehouse Robotics Industry The post Revolutionizing the Supply Chain: Developments in the Warehouse Robotics Industry appeared first on Data Science Central.  ( 20 min )
    It’s No Big Deal, but ChatGPT Changes Everything – Part II
    In Part I of the blog series “It’s No Big Deal, but ChatGPT Changes Everything”, we were introduced into the world of ChatGPT, chatbots, and generative Artificial Intelligence (AI). We ended Part I by giving ChatGPT a test run, by asking it “What would be a great vacation place for my family?” that gives us… Read More »It’s No Big Deal, but ChatGPT Changes Everything – Part II The post It’s No Big Deal, but ChatGPT Changes Everything – Part II appeared first on Data Science Central.  ( 24 min )
  • Open

    Supersizing AI: Sweden Turbocharges Its Innovation Engine
    Sweden is outfitting its AI supercomputer for a journey to the cutting edge of machine learning, robotics and healthcare. It couldn’t ask for a better guide than Anders Ynnerman (above). His signature blue suit, black spectacles and gentle voice act as calm camouflage for a pioneering spirit. Early on, he showed a deep interest in Read article >  ( 6 min )
    3D Artist Enters the Node Zone, Creating Alien Artifacts This Week ‘In the NVIDIA Studio’
    Artist Ducky 3D creates immersive experiences through vibrant visuals and beautiful 3D environments in the alien-inspired animation Stylized Alien Landscape — this week In the NVIDIA Studio.  ( 6 min )
  • Open

    Multi-Agent RL for Melee Combat Battlefield
    Hello, I am working on a hobby project where I have recently used multi-agent RL for learning crowd simulation and also predator-prey behaviors successfully (they learn to surround their preys): https://www.youtube.com/watch?v=Ds9O9wPyF8g I plan to use it to train multi-agent melee combat armies through self-play. I have made an initial implementation of it where they were able to learn shield-wall behavior, flanking, and retreat: https://www.youtube.com/watch?v=IZ1Ht6k2U5E If you would like to collaborate on this hobby project, contact me via LinkedIn. It would be great to have some help with physics simulation using Brax, and with the 3D rendering of the simulation. https://www.linkedin.com/in/kyuksel/ Sincerely, Kamer submitted by /u/k_yuksel [link] [comments]  ( 41 min )
    Okay so I'm in first semester of my AI studies. And I have this task
    Now I wrote that the bot would use the Markov Decision Process. Would that be correct? And if not why? submitted by /u/ScaryTerryBiiittch [link] [comments]  ( 43 min )
    Recent Hierarchical RL review paper suggestions?
    I am looking a good review for HRL methods that’s sometime after 2020, if possible. submitted by /u/B0NSAIWARRIOR [link] [comments]  ( 40 min )
    "E3B: Exploration via Elliptical Episodic Bonuses", Henaff et al 2022 {FB}
    submitted by /u/gwern [link] [comments]  ( 40 min )
  • Open

    The European AI Liability Directives -- Critique of a Half-Hearted Approach and Lessons for the Future. (arXiv:2211.13960v4 [cs.CY] UPDATED)
    As ChatGPT et al. conquer the world, the optimal liability framework for AI systems remains an unsolved problem across the globe. In a much-anticipated move, the European Commission advanced two proposals outlining the European approach to AI liability in September 2022: a novel AI Liability Directive and a revision of the Product Liability Directive. They constitute the final cornerstone of EU AI regulation. Crucially, the liability proposals and the EU AI Act are inherently intertwined: the latter does not contain any individual rights of affected persons, and the former lack specific, substantive rules on AI development and deployment. Taken together, these acts may well trigger a Brussels Effect in AI regulation, with significant consequences for the US and beyond. This paper makes three novel contributions. First, it examines in detail the Commission proposals and shows that, while making steps in the right direction, they ultimately represent a half-hearted approach: if enacted as foreseen, AI liability in the EU will primarily rest on disclosure of evidence mechanisms and a set of narrowly defined presumptions concerning fault, defectiveness and causality. Hence, second, the article suggests amendments, which are collected in an Annex at the end of the paper. Third, based on an analysis of the key risks AI poses, the final part of the paper maps out a road for the future of AI liability and regulation, in the EU and beyond. This includes: a comprehensive framework for AI liability; provisions to support innovation; an extension to non-discrimination/algorithmic fairness, as well as explainable AI; and sustainability. I propose to jump-start sustainable AI regulation via sustainability impact assessments in the AI Act and sustainable design defects in the liability regime. In this way, the law may help spur not only fair AI and XAI, but potentially also sustainable AI (SAI).  ( 3 min )
    Latent Autoregressive Source Separation. (arXiv:2301.08562v1 [cs.LG])
    Autoregressive models have achieved impressive results over a wide range of domains in terms of generation quality and downstream task performance. In the continuous domain, a key factor behind this success is the usage of quantized latent spaces (e.g., obtained via VQ-VAE autoencoders), which allow for dimensionality reduction and faster inference times. However, using existing pre-trained models to perform new non-trivial tasks is difficult since it requires additional fine-tuning or extensive training to elicit prompting. This paper introduces LASS as a way to perform vector-quantized Latent Autoregressive Source Separation (i.e., de-mixing an input signal into its constituent sources) without requiring additional gradient-based optimization or modifications of existing models. Our separation method relies on the Bayesian formulation in which the autoregressive models are the priors, and a discrete (non-parametric) likelihood function is constructed by performing frequency counts over latent sums of addend tokens. We test our method on images and audio with several sampling strategies (e.g., ancestral, beam search) showing competitive results with existing approaches in terms of separation quality while offering at the same time significant speedups in terms of inference time and scalability to higher dimensional data.  ( 2 min )
    A Metalearning Approach for Physics-Informed Neural Networks (PINNs): Application to Parameterized PDEs. (arXiv:2110.13361v2 [physics.comp-ph] UPDATED)
    Physics-informed neural networks (PINNs) as a means of discretizing partial differential equations (PDEs) are garnering much attention in the Computational Science and Engineering (CS&E) world. At least two challenges exist for PINNs at present: an understanding of accuracy and convergence characteristics with respect to tunable parameters and identification of optimization strategies that make PINNs as efficient as other computational science tools. The cost of PINNs training remains a major challenge of Physics-informed Machine Learning (PiML) - and, in fact, machine learning (ML) in general. This paper is meant to move towards addressing the latter through the study of PINNs on new tasks, for which parameterized PDEs provides a good testbed application as tasks can be easily defined in this context. Following the ML world, we introduce metalearning of PINNs with application to parameterized PDEs. By introducing metalearning and transfer learning concepts, we can greatly accelerate the PINNs optimization process. We present a survey of model-agnostic metalearning, and then discuss our model-aware metalearning applied to PINNs as well as implementation considerations and algorithmic complexity. We then test our approach on various canonical forward parameterized PDEs that have been presented in the emerging PINNs literature.  ( 2 min )
    Asynchronous Deep Double Duelling Q-Learning for Trading-Signal Execution in Limit Order Book Markets. (arXiv:2301.08688v1 [q-fin.TR])
    We employ deep reinforcement learning (RL) to train an agent to successfully translate a high-frequency trading signal into a trading strategy that places individual limit orders. Based on the ABIDES limit order book simulator, we build a reinforcement learning OpenAI gym environment and utilise it to simulate a realistic trading environment for NASDAQ equities based on historic order book messages. To train a trading agent that learns to maximise its trading return in this environment, we use Deep Duelling Double Q-learning with the APEX (asynchronous prioritised experience replay) architecture. The agent observes the current limit order book state, its recent history, and a short-term directional forecast. To investigate the performance of RL for adaptive trading independently from a concrete forecasting algorithm, we study the performance of our approach utilising synthetic alpha signals obtained by perturbing forward-looking returns with varying levels of noise. Here, we find that the RL agent learns an effective trading strategy for inventory management and order placing that outperforms a heuristic benchmark trading strategy having access to the same signal.  ( 2 min )
    NAS-Bench-360: Benchmarking Neural Architecture Search on Diverse Tasks. (arXiv:2110.05668v6 [cs.CV] UPDATED)
    Most existing neural architecture search (NAS) benchmarks and algorithms prioritize well-studied tasks, e.g. image classification on CIFAR or ImageNet. This makes the performance of NAS approaches in more diverse areas poorly understood. In this paper, we present NAS-Bench-360, a benchmark suite to evaluate methods on domains beyond those traditionally studied in architecture search, and use it to address the following question: do state-of-the-art NAS methods perform well on diverse tasks? To construct the benchmark, we curate ten tasks spanning a diverse array of application domains, dataset sizes, problem dimensionalities, and learning objectives. Each task is carefully chosen to interoperate with modern CNN-based search methods while possibly being far-afield from its original development domain. To speed up and reduce the cost of NAS research, for two of the tasks we release the precomputed performance of 15,625 architectures comprising a standard CNN search space. Experimentally, we show the need for more robust NAS evaluation of the kind NAS-Bench-360 enables by showing that several modern NAS procedures perform inconsistently across the ten tasks, with many catastrophically poor results. We also demonstrate how NAS-Bench-360 and its associated precomputed results will enable future scientific discoveries by testing whether several recent hypotheses promoted in the NAS literature hold on diverse tasks. NAS-Bench-360 is hosted at https://nb360.ml.cmu.edu.  ( 2 min )
    Language Agnostic Data-Driven Inverse Text Normalization. (arXiv:2301.08506v1 [cs.CL])
    With the emergence of automatic speech recognition (ASR) models, converting the spoken form text (from ASR) to the written form is in urgent need. This inverse text normalization (ITN) problem attracts the attention of researchers from various fields. Recently, several works show that data-driven ITN methods can output high-quality written form text. Due to the scarcity of labeled spoken-written datasets, the studies on non-English data-driven ITN are quite limited. In this work, we propose a language-agnostic data-driven ITN framework to fill this gap. Specifically, we leverage the data augmentation in conjunction with neural machine translated data for low resource languages. Moreover, we design an evaluation method for language agnostic ITN model when only English data is available. Our empirical evaluation shows this language agnostic modeling approach is effective for low resource languages while preserving the performance for high resource languages.
    Intrinsic persistent homology via density-based metric learning. (arXiv:2012.07621v3 [stat.ML] UPDATED)
    We address the problem of estimating topological features from data in high dimensional Euclidean spaces under the manifold assumption. Our approach is based on the computation of persistent homology of the space of data points endowed with a sample metric known as Fermat distance. We prove that such metric space converges almost surely to the manifold itself endowed with an intrinsic metric that accounts for both the geometry of the manifold and the density that produces the sample. This fact implies the convergence of the associated persistence diagrams. The use of this intrinsic distance when computing persistent homology presents advantageous properties such as robustness to the presence of outliers in the input data and less sensitiveness to the particular embedding of the underlying manifold in the ambient space. We use these ideas to propose and implement a method for pattern recognition and anomaly detection in time series, which is evaluated in applications to real data.
    Brain Model State Space Reconstruction Using an LSTM Neural Network. (arXiv:2301.08391v1 [cs.LG])
    Objective Kalman filtering has previously been applied to track neural model states and parameters, particularly at the scale relevant to EEG. However, this approach lacks a reliable method to determine the initial filter conditions and assumes that the distribution of states remains Gaussian. This study presents an alternative, data-driven method to track the states and parameters of neural mass models (NMMs) from EEG recordings using deep learning techniques, specifically an LSTM neural network. Approach An LSTM filter was trained on simulated EEG data generated by a neural mass model using a wide range of parameters. With an appropriately customised loss function, the LSTM filter can learn the behaviour of NMMs. As a result, it can output the state vector and parameters of NMMs given observation data as the input. Main Results Test results using simulated data yielded correlations with R squared of around 0.99 and verified that the method is robust to noise and can be more accurate than a nonlinear Kalman filter when the initial conditions of the Kalman filter are not accurate. As an example of real-world application, the LSTM filter was also applied to real EEG data that included epileptic seizures, and revealed changes in connectivity strength parameters at the beginnings of seizures. Significance Tracking the state vector and parameters of mathematical brain models is of great importance in the area of brain modelling, monitoring, imaging and control. This approach has no need to specify the initial state vector and parameters, which is very difficult to do in practice because many of the variables being estimated cannot be measured directly in physiological experiments. This method may be applied using any neural mass model and, therefore, provides a general, novel, efficient approach to estimate brain model variables that are often difficult to measure.
    Hybrid Quantum-Classical Generative Adversarial Network for High Resolution Image Generation. (arXiv:2212.11614v2 [quant-ph] UPDATED)
    Quantum machine learning (QML) has received increasing attention due to its potential to outperform classical machine learning methods in problems pertaining classification and identification tasks. A subclass of QML methods is quantum generative adversarial networks (QGANs) which have been studied as a quantum counterpart of classical GANs widely used in image manipulation and generation tasks. The existing work on QGANs is still limited to small-scale proof-of-concept examples based on images with significant downscaling. Here we integrate classical and quantum techniques to propose a new hybrid quantum-classical GAN framework. We demonstrate its superior learning capabilities by generating $28 \times 28$ pixels grey-scale images without dimensionality reduction or classical pre/post-processing on multiple classes of the standard MNIST and Fashion MNIST datasets, which achieves comparable results to classical frameworks with three orders of magnitude less trainable generator parameters. To gain further insight into the working of our hybrid approach, we systematically explore the impact of its parameter space by varying the number of qubits, the size of image patches, the number of layers in the generator, the shape of the patches and the choice of prior distribution. Our results show that increasing the quantum generator size generally improves the learning capability of the network. The developed framework provides a foundation for future design of QGANs with optimal parameter set tailored for complex image generation tasks.
    Improving Dialogue Breakdown Detection with Semi-Supervised Learning. (arXiv:2011.00136v2 [cs.CL] UPDATED)
    Building user trust in dialogue agents requires smooth and consistent dialogue exchanges. However, agents can easily lose conversational context and generate irrelevant utterances. These situations are called dialogue breakdown, where agent utterances prevent users from continuing the conversation. Building systems to detect dialogue breakdown allows agents to recover appropriately or avoid breakdown entirely. In this paper we investigate the use of semi-supervised learning methods to improve dialogue breakdown detection, including continued pre-training on the Reddit dataset and a manifold-based data augmentation method. We demonstrate the effectiveness of these methods on the Dialogue Breakdown Detection Challenge (DBDC) English shared task. Our submissions to the 2020 DBDC5 shared task place first, beating baselines and other submissions by over 12\% accuracy. In ablations on DBDC4 data from 2019, our semi-supervised learning methods improve the performance of a baseline BERT model by 2\% accuracy. These methods are applicable generally to any dialogue task and provide a simple way to improve model performance.
    A Semi-supervised Sensing Rate Learning based CMAB Scheme to Combat COVID-19 by Trustful Data Collection in the Crowd. (arXiv:2301.08563v1 [cs.HC])
    Mobile CrowdSensing (MCS), through employing considerable workers to sense and collect data in a participatory manner, has been recognized as a promising paradigm for building many large-scale applications in a cost-effective way, such as combating COVID-19. The recruitment of trustworthy and high-quality workers is an important research issue for MCS. Previous studies assume that the qualities of workers are known in advance, or the platform knows the qualities of workers once it receives their collected data. In reality, to reduce their costs and thus maximize revenue, many strategic workers do not perform their sensing tasks honestly and report fake data to the platform. So, it is very hard for the platform to evaluate the authenticity of the received data. In this paper, an incentive mechanism named Semi-supervision based Combinatorial Multi-Armed Bandit reverse Auction (SCMABA) is proposed to solve the recruitment problem of multiple unknown and strategic workers in MCS. First, we model the worker recruitment as a multi-armed bandit reverse auction problem, and design an UCB-based algorithm to separate the exploration and exploitation, considering the Sensing Rates (SRs) of recruited workers as the gain of the bandit. Next, a Semi-supervised Sensing Rate Learning (SSRL) approach is proposed to quickly and accurately obtain the workers' SRs, which consists of two phases, supervision and self-supervision. Last, SCMABA is designed organically combining the SRs acquisition mechanism with multi-armed bandit reverse auction, where supervised SR learning is used in the exploration, and the self-supervised one is used in the exploitation. We prove that our SCMABA achieves truthfulness and individual rationality. Additionally, we exhibit outstanding performances of the SCMABA mechanism through in-depth simulations of real-world data traces.
    Fast Policy Extragradient Methods for Competitive Games with Entropy Regularization. (arXiv:2105.15186v3 [math.OC] UPDATED)
    This paper investigates the problem of computing the equilibrium of competitive games, which is often modeled as a constrained saddle-point optimization problem with probability simplex constraints. Despite recent efforts in understanding the last-iterate convergence of extragradient methods in the unconstrained setting, the theoretical underpinnings of these methods in the constrained settings, especially those using multiplicative updates, remain highly inadequate, even when the objective function is bilinear. Motivated by the algorithmic role of entropy regularization in single-agent reinforcement learning and game theory, we develop provably efficient extragradient methods to find the quantal response equilibrium (QRE) -- which are solutions to zero-sum two-player matrix games with entropy regularization -- at a linear rate. The proposed algorithms can be implemented in a decentralized manner, where each player executes symmetric and multiplicative updates iteratively using its own payoff without observing the opponent's actions directly. In addition, by controlling the knob of entropy regularization, the proposed algorithms can locate an approximate Nash equilibrium of the unregularized matrix game at a sublinear rate without assuming the Nash equilibrium to be unique. Our methods also lead to efficient policy extragradient algorithms for solving (entropy-regularized) zero-sum Markov games at similar rates. All of our convergence rates are nearly dimension-free, which are independent of the size of the state and action spaces up to logarithm factors, highlighting the positive role of entropy regularization for accelerating convergence.
    A shallow physics-informed neural network for solving partial differential equations on surfaces. (arXiv:2203.01581v2 [math.NA] UPDATED)
    In this paper, we introduce a shallow (one-hidden-layer) physics-informed neural network for solving partial differential equations on static and evolving surfaces. For the static surface case, with the aid of level set function, the surface normal and mean curvature used in the surface differential expressions can be computed easily. So instead of imposing the normal extension constraints used in literature, we write the surface differential operators in the form of traditional Cartesian differential operators and use them in the loss function directly. We perform a series of performance study for the present methodology by solving Laplace-Beltrami equation and surface diffusion equation on complex static surfaces. With just a moderate number of neurons used in the hidden layer, we are able to attain satisfactory prediction results. Then we extend the present methodology to solve the advection-diffusion equation on an evolving surface with given velocity. To track the surface, we additionally introduce a prescribed hidden layer to enforce the topological structure of the surface and use the network to learn the homeomorphism between the surface and the prescribed topology. The proposed network structure is designed to track the surface and solve the equation simultaneously. Again, the numerical results show comparable accuracy as the static cases. As an application, we simulate the surfactant transport on the droplet surface under shear flow and obtain some physically plausible results.
    Metric Residual Networks for Sample Efficient Goal-Conditioned Reinforcement Learning. (arXiv:2208.08133v4 [cs.LG] UPDATED)
    Goal-conditioned reinforcement learning (GCRL) has a wide range of potential real-world applications, including manipulation and navigation problems in robotics. Especially in such robotics tasks, sample efficiency is of the utmost importance for GCRL since, by default, the agent is only rewarded when it reaches its goal. While several methods have been proposed to improve the sample efficiency of GCRL, one relatively under-studied approach is the design of neural architectures to support sample efficiency. In this work, we introduce a novel neural architecture for GCRL that achieves significantly better sample efficiency than the commonly-used monolithic network architecture. The key insight is that the optimal action-value function Q^*(s, a, g) must satisfy the triangle inequality in a specific sense. Furthermore, we introduce the metric residual network (MRN) that deliberately decomposes the action-value function Q(s,a,g) into the negated summation of a metric plus a residual asymmetric component. MRN provably approximates any optimal action-value function Q^*(s,a,g), thus making it a fitting neural architecture for GCRL. We conduct comprehensive experiments across 12 standard benchmark environments in GCRL. The empirical results demonstrate that MRN uniformly outperforms other state-of-the-art GCRL neural architectures in terms of sample efficiency.
    Predicting the Masses of Exotic Hadrons with Data Augmentation Using Multilayer Perceptron. (arXiv:2208.09538v2 [hep-ph] UPDATED)
    Recently, there have been significant developments in neural networks, which led to the frequent use of neural networks in the physics literature. This work is focused on predicting the masses of exotic hadrons, doubly charmed and bottomed baryons using neural networks trained on meson and baryon masses that are determined by experiments. The original data set has been extended using the recently proposed artificial data augmentation methods. We have observed that the neural network's predictive ability increases with the use of augmented data. The results indicated that data augmentation techniques play an essential role in improving neural network predictions; moreover, neural networks can make reasonable predictions for exotic hadrons, doubly charmed, and doubly bottomed baryons. The results are also comparable to Gaussian Process and Constituent Quark Model.
    Generating Synthetic Clinical Data that Capture Class Imbalanced Distributions with Generative Adversarial Networks: Example using Antiretroviral Therapy for HIV. (arXiv:2208.08655v2 [cs.LG] UPDATED)
    Clinical data usually cannot be freely distributed due to their highly confidential nature and this hampers the development of machine learning in the healthcare domain. One way to mitigate this problem is by generating realistic synthetic datasets using generative adversarial networks (GANs). However, GANs are known to suffer from mode collapse thus creating outputs of low diversity. This lowers the quality of the synthetic healthcare data, and may cause it to omit patients of minority demographics or neglect less common clinical practices. In this paper, we extend the classic GAN setup with an additional variational autoencoder (VAE) and include an external memory to replay latent features observed from the real samples to the GAN generator. Using antiretroviral therapy for human immunodeficiency virus (ART for HIV) as a case study, we show that our extended setup overcomes mode collapse and generates a synthetic dataset that accurately describes severely imbalanced class distributions commonly found in real-world clinical variables. In addition, we demonstrate that our synthetic dataset is associated with a very low patient disclosure risk, and that it retains a high level of utility from the ground truth dataset to support the development of downstream machine learning algorithms.
    Revisiting consistency for semi-supervised semantic segmentation. (arXiv:2106.07075v5 [cs.CV] UPDATED)
    Semi-supervised learning an attractive technique in practical deployments of deep models since it relaxes the dependence on labeled data. It is especially important in the scope of dense prediction because pixel-level annotation requires significant effort. This paper considers semi-supervised algorithms that enforce consistent predictions over perturbed unlabeled inputs. We study the advantages of perturbing only one of the two model instances and preventing the backward pass through the unperturbed instance. We also propose a competitive perturbation model as a composition of geometric warp and photometric jittering. We experiment with efficient models due to their importance for real-time and low-power applications. Our experiments show clear advantages of (1) one-way consistency, (2) perturbing only the student branch, and (3) strong photometric and geometric perturbations. Our perturbation model outperforms recent work and most of the contribution comes from photometric component. Experiments with additional data from the large coarsely annotated subset of Cityscapes suggest that semi-supervised training can outperform supervised training with the coarse labels.
    Multimodal Frame-Scoring Transformer for Video Summarization. (arXiv:2207.01814v3 [cs.LG] UPDATED)
    As the number of video content has mushroomed in recent years, automatic video summarization has come useful when we want to just peek at the content of the video. However, there are two underlying limitations in generic video summarization task. First, most previous approaches read in just visual features as input, leaving other modality features behind. Second, existing datasets for generic video summarization are relatively insufficient to train a caption generator used for extracting text information from a video and to train the multimodal feature extractors. To address these two problems, this paper proposes the Multimodal Frame-Scoring Transformer (MFST), a framework exploiting visual, text, and audio features and scoring a video with respect to frames. Our MFST framework first extracts each modality features (audio-visual-text) using pretrained encoders. Then, MFST trains the multimodal frame-scoring transformer that uses multimodal representation based on extracted features as inputs and predicts frame-level scores. Our extensive experiments with previous models and ablation studies on TVSum and SumMe datasets demonstrate the effectiveness and superiority of our proposed method by a large margin in both F1 score and Rank-based evaluation.
    Online Estimation of Network Point Processes for Event Streams. (arXiv:2009.01742v2 [cs.SI] UPDATED)
    A common goal in network modeling is to uncover the latent community structure present among nodes. For many real-world networks, the true connections consist of events arriving as streams, which are then aggregated to form edges, ignoring the dynamic temporal component. A natural way to take account of these temporal dynamics of interactions is to use point processes as the foundation of network models for community detection. Computational complexity hampers the scalability of such approaches to large sparse networks. To circumvent this challenge, we propose a fast online variational inference algorithm for estimating the latent structure underlying dynamic event arrivals on a network, using continuous-time point process latent network models. We describe this procedure for networks models capturing community structure. This structure can be learned as new events are observed on the network, updating the inferred community assignments. We investigate the theoretical properties of such an inference scheme, and provide regret bounds on the loss function of this procedure. The proposed inference procedure is then thoroughly compared, using both simulation studies and real data, to non-online variants. We demonstrate that online inference can obtain comparable performance, in terms of community recovery, to non-online variants, while realising computational gains. Our proposed inference framework can also be readily modified to incorporate other popular network structures.
    LaF: Labeling-Free Model Selection for Automated Deep Neural Network Reusing. (arXiv:2204.03994v2 [cs.LG] UPDATED)
    Applying deep learning to science is a new trend in recent years which leads DL engineering to become an important problem. Although training data preparation, model architecture design, and model training are the normal processes to build DL models, all of them are complex and costly. Therefore, reusing the open-sourced pre-trained model is a practical way to bypass this hurdle for developers. Given a specific task, developers can collect massive pre-trained deep neural networks from public sources for re-using. However, testing the performance (e.g., accuracy and robustness) of multiple DNNs and recommending which model should be used is challenging regarding the scarcity of labeled data and the demand for domain expertise. In this paper, we propose a labeling-free (LaF) model selection approach to overcome the limitations of labeling efforts for automated model reusing. The main idea is to statistically learn a Bayesian model to infer the models' specialty only based on predicted labels. We evaluate LaF using 9 benchmark datasets including image, text, and source code, and 165 DNNs, considering both the accuracy and robustness of models. The experimental results demonstrate that LaF outperforms the baseline methods by up to 0.74 and 0.53 on Spearman's correlation and Kendall's $\tau$, respectively.
    CHAPTER: Exploiting Convolutional Neural Network Adapters for Self-supervised Speech Models. (arXiv:2212.01282v2 [eess.AS] UPDATED)
    Self-supervised learning (SSL) is a powerful technique for learning representations from unlabeled data. Transformer based models such as HuBERT, which consist a feature extractor and transformer layers, are leading the field in the speech domain. SSL models are fine-tuned on a wide range of downstream tasks, which involves re-training the majority of the model for each task. Previous studies have introduced applying adapters, which are small lightweight modules commonly used in Natural Language Processing (NLP) to adapt pre-trained models to new tasks. However, such efficient tuning techniques only provide adaptation at the transformer layer, but failed to perform adaptation at the feature extractor. In this paper, we propose CHAPTER, an efficient tuning method specifically designed for SSL speech model, by applying CNN adapters at the feature extractor. Using this method, we can only fine-tune fewer than 5% of parameters per task compared to fully fine-tuning and achieve better and more stable performance. We empirically found that adding CNN adapters to the feature extractor can help the adaptation on emotion and speaker tasks. For instance, the accuracy of SID is improved from 87.71 to 91.56, and the accuracy of ER is improved by 5%.
    StratDef: Strategic Defense Against Adversarial Attacks in ML-based Malware Detection. (arXiv:2202.07568v4 [cs.LG] UPDATED)
    Over the years, most research towards defenses against adversarial attacks on machine learning models has been in the image recognition domain. The malware detection domain has received less attention despite its importance. Moreover, most work exploring these defenses has focused on several methods but with no strategy when applying them. In this paper, we introduce StratDef, which is a strategic defense system based on a moving target defense approach. We overcome challenges related to the systematic construction, selection, and strategic use of models to maximize adversarial robustness. StratDef dynamically and strategically chooses the best models to increase the uncertainty for the attacker while minimizing critical aspects in the adversarial ML domain, like attack transferability. We provide the first comprehensive evaluation of defenses against adversarial attacks on machine learning for malware detection, where our threat model explores different levels of threat, attacker knowledge, capabilities, and attack intensities. We show that StratDef performs better than other defenses even when facing the peak adversarial threat. We also show that, of the existing defenses, only a few adversarially-trained models provide substantially better protection than just using vanilla models but are still outperformed by StratDef.
    Enforcing the consensus between Trajectory Optimization and Policy Learning for precise robot control. (arXiv:2209.09006v2 [cs.RO] UPDATED)
    Reinforcement learning (RL) and trajectory optimization (TO) present strong complementary advantages. On one hand, RL approaches are able to learn global control policies directly from data, but generally require large sample sizes to properly converge towards feasible policies. On the other hand, TO methods are able to exploit gradient-based information extracted from simulators to quickly converge towards a locally optimal control trajectory which is only valid within the vicinity of the solution. Over the past decade, several approaches have aimed to adequately combine the two classes of methods in order to obtain the best of both worlds. Following on from this line of research, we propose several improvements on top of these approaches to learn global control policies quicker, notably by leveraging sensitivity information stemming from TO methods via Sobolev learning, and augmented Lagrangian techniques to enforce the consensus between TO and policy learning. We evaluate the benefits of these improvements on various classical tasks in robotics through comparison with existing approaches in the literature.
    Self-Play and Self-Describe: Policy Adaptation with Vision-Language Foundation Models. (arXiv:2212.07398v2 [cs.LG] UPDATED)
    Recent progress on vision-language foundation models have brought significant advancement to building general-purpose robots. By using the pre-trained models to encode the scene and instructions as inputs for decision making, the instruction-conditioned policy can generalize across different objects and tasks. While this is encouraging, the policy still fails in most cases given an unseen task or environment. To adapt the policy to unseen tasks and environments, we explore a new paradigm on leveraging the pre-trained foundation models with Self-PLAY and Self-Describe (SPLAYD). When deploying the trained policy to a new task or a new environment, we first let the policy self-play with randomly generated instructions to record the demonstrations. While the execution could be wrong, we can use the pre-trained foundation models to accurately self-describe (i.e., re-label or classify) the demonstrations. This automatically provides new pairs of demonstration-instruction data for policy fine-tuning. We evaluate our method on a broad range of experiments with the focus on generalization on unseen objects, unseen tasks, unseen environments, and sim-to-real transfer. We show SPLAYD improves baselines by a large margin in all cases. Our project page is available at https://geyuying.github.io/SPLAYD/
    Mixed-Integer Optimization with Constraint Learning. (arXiv:2111.04469v2 [math.OC] UPDATED)
    We establish a broad methodological foundation for mixed-integer optimization with learned constraints. We propose an end-to-end pipeline for data-driven decision making in which constraints and objectives are directly learned from data using machine learning, and the trained models are embedded in an optimization formulation. We exploit the mixed-integer optimization-representability of many machine learning methods, including linear models, decision trees, ensembles, and multi-layer perceptrons, which allows us to capture various underlying relationships between decisions, contextual variables, and outcomes. We also introduce two approaches for handling the inherent uncertainty of learning from data. First, we characterize a decision trust region using the convex hull of the observations, to ensure credible recommendations and avoid extrapolation. We efficiently incorporate this representation using column generation and propose a more flexible formulation to deal with low-density regions and high-dimensional datasets. Then, we propose an ensemble learning approach that enforces constraint satisfaction over multiple bootstrapped estimators or multiple algorithms. In combination with domain-driven components, the embedded models and trust region define a mixed-integer optimization problem for prescription generation. We implement this framework as a Python package (OptiCL) for practitioners. We demonstrate the method in both World Food Programme planning and chemotherapy optimization. The case studies illustrate the framework's ability to generate high-quality prescriptions as well as the value added by the trust region, the use of ensembles to control model robustness, the consideration of multiple machine learning methods, and the inclusion of multiple learned constraints.
    Offline Policy Evaluation with Out-of-Sample Guarantees. (arXiv:2301.08649v1 [stat.ML])
    We consider the problem of evaluating the performance of a decision policy using past observational data. The outcome of a policy is measured in terms of a loss or disutility (or negative reward) and the problem is to draw valid inferences about the out-of-sample loss of the specified policy when the past data is observed under a, possibly unknown, policy. Using a sample-splitting method, we show that it is possible to draw such inferences with finite-sample coverage guarantees that evaluate the entire loss distribution. Importantly, the method takes into account model misspecifications of the past policy -- including unmeasured confounding. The evaluation method can be used to certify the performance of a policy using observational data under an explicitly specified range of credible model assumptions.
    Evaluating the Evaluators: Which UDA validation methods are most effective? Can they be improved?. (arXiv:2208.07360v2 [cs.CV] UPDATED)
    This paper compares and ranks 8 UDA validation methods. Validators estimate model accuracy, which makes them an essential component of any UDA train-test pipeline. We rank these validators to indicate which of them are most useful for the purpose of selecting optimal model checkpoints and hyperparameters. To the best of our knowledge, this large-scale benchmark study is the first of its kind in the UDA field. In addition, we propose three new validators that outperform all the existing checkpoint-based validators that we were able to find in the existing literature. Code is available at https://www.github.com/KevinMusgrave/powerful-benchmarker.
    DIAMOND: Taming Sample and Communication Complexities in Decentralized Bilevel Optimization. (arXiv:2212.02376v4 [cs.LG] UPDATED)
    Decentralized bilevel optimization has received increasing attention recently due to its foundational role in many emerging multi-agent learning paradigms (e.g., multi-agent meta-learning and multi-agent reinforcement learning) over peer-to-peer edge networks. However, to work with the limited computation and communication capabilities of edge networks, a major challenge in developing decentralized bilevel optimization techniques is to lower sample and communication complexities. This motivates us to develop a new decentralized bilevel optimization called DIAMOND (decentralized single-timescale stochastic approximation with momentum and gradient-tracking). The contributions of this paper are as follows: i) our DIAMOND algorithm adopts a single-loop structure rather than following the natural double-loop structure of bilevel optimization, which offers low computation and implementation complexity; ii) compared to existing approaches, the DIAMOND algorithm does not require any full gradient evaluations, which further reduces both sample and computational complexities; iii) through a careful integration of momentum information and gradient tracking techniques, we show that the DIAMOND algorithm enjoys $\mathcal{O}(\epsilon^{-3/2})$ in sample and communication complexities for achieving an $\epsilon$-stationary solution, both of which are independent of the dataset sizes and significantly outperform existing works. Extensive experiments also verify our theoretical findings.
    AccDecoder: Accelerated Decoding for Neural-enhanced Video Analytics. (arXiv:2301.08664v1 [cs.CV])
    The quality of the video stream is key to neural network-based video analytics. However, low-quality video is inevitably collected by existing surveillance systems because of poor quality cameras or over-compressed/pruned video streaming protocols, e.g., as a result of upstream bandwidth limit. To address this issue, existing studies use quality enhancers (e.g., neural super-resolution) to improve the quality of videos (e.g., resolution) and eventually ensure inference accuracy. Nevertheless, directly applying quality enhancers does not work in practice because it will introduce unacceptable latency. In this paper, we present AccDecoder, a novel accelerated decoder for real-time and neural-enhanced video analytics. AccDecoder can select a few frames adaptively via Deep Reinforcement Learning (DRL) to enhance the quality by neural super-resolution and then up-scale the unselected frames that reference them, which leads to 6-21% accuracy improvement. AccDecoder provides efficient inference capability via filtering important frames using DRL for DNN-based inference and reusing the results for the other frames via extracting the reference relationship among frames and blocks, which results in a latency reduction of 20-80% than baselines.
    Interpretable bilinear attention network with domain adaptation improves drug-target prediction. (arXiv:2208.02194v2 [cs.LG] UPDATED)
    Predicting drug-target interaction is key for drug discovery. Recent deep learning-based methods show promising performance but two challenges remain: (i) how to explicitly model and learn local interactions between drugs and targets for better prediction and interpretation; (ii) how to generalize prediction performance on novel drug-target pairs from different distribution. In this work, we propose DrugBAN, a deep bilinear attention network (BAN) framework with domain adaptation to explicitly learn pair-wise local interactions between drugs and targets, and adapt on out-of-distribution data. DrugBAN works on drug molecular graphs and target protein sequences to perform prediction, with conditional domain adversarial learning to align learned interaction representations across different distributions for better generalization on novel drug-target pairs. Experiments on three benchmark datasets under both in-domain and cross-domain settings show that DrugBAN achieves the best overall performance against five state-of-the-art baselines. Moreover, visualizing the learned bilinear attention map provides interpretable insights from prediction results.
    Online Decision Making for Trading Wind Energy. (arXiv:2209.02009v2 [cs.LG] UPDATED)
    This paper proposes and develops a new algorithm for trading wind energy in electricity markets, within an online learning and optimization framework. In particular, we combine a component-wise adaptive variant of the gradient descent algorithm with recent advances in the feature-driven newsvendor model. This results in an online offering approach capable of leveraging data-rich environments, while adapting to non-stationary characteristics of energy generation and electricity markets, and with a minimal computational burden. The performance of our approach is analyzed based on several numerical experiments, showing both better adaptability to non-stationary uncertain parameters and significant economic gains.
    High Dimensional Statistical Estimation under Uniformly Dithered One-bit Quantization. (arXiv:2202.13157v4 [stat.ML] UPDATED)
    In this paper, we propose a uniformly dithered 1-bit quantization scheme for high-dimensional statistical estimation. The scheme contains truncation, dithering, and quantization as typical steps. As canonical examples, the quantization scheme is applied to the estimation problems of sparse covariance matrix estimation, sparse linear regression (i.e., compressed sensing), and matrix completion. We study both sub-Gaussian and heavy-tailed regimes, where the underlying distribution of heavy-tailed data is assumed to have bounded moments of some order. We propose new estimators based on 1-bit quantized data. In sub-Gaussian regime, our estimators achieve near minimax rates, indicating that our quantization scheme costs very little. In heavy-tailed regime, while the rates of our estimators become essentially slower, these results are either the first ones in an 1-bit quantized and heavy-tailed setting, or already improve on existing comparable results from some respect. Under the observations in our setting, the rates are almost tight in compressed sensing and matrix completion. Our 1-bit compressed sensing results feature general sensing vector that is sub-Gaussian or even heavy-tailed. We also first investigate a novel setting where both the covariate and response are quantized. In addition, our approach to 1-bit matrix completion does not rely on likelihood and represent the first method robust to pre-quantization noise with unknown distribution. Experimental results on synthetic data are presented to support our theoretical analysis.
    On Image Segmentation With Noisy Labels: Characterization and Volume Properties of the Optimal Solutions to Accuracy and Dice. (arXiv:2206.06484v3 [cs.CV] UPDATED)
    We study two of the most popular performance metrics in medical image segmentation, Accuracy and Dice, when the target labels are noisy. For both metrics, several statements related to characterization and volume properties of the set of optimal segmentations are proved, and associated experiments are provided. Our main insights are: (i) the volume of the solutions to both metrics may deviate significantly from the expected volume of the target, (ii) the volume of a solution to Accuracy is always less than or equal to the volume of a solution to Dice and (iii) the optimal solutions to both of these metrics coincide when the set of feasible segmentations is constrained to the set of segmentations with the volume equal to the expected volume of the target.
    Learning from non-irreducible Markov chains. (arXiv:2110.04338v2 [math.ST] UPDATED)
    Mostof the existing literature on supervised machine learning problems focuses on the case when the training data set is drawn from an i.i.d. sample. However, many practical problems are characterized by temporal dependence and strong correlation between the marginals of the data-generating process, suggesting that the i.i.d. assumption is not always justified. This problem has been already considered in the context of Markov chains satisfying the Doeblin condition. This condition, among other things, implies that the chain is not singular in its behavior, i.e. it is irreducible. In this article, we focus on the case when the training data set is drawn from a not necessarily irreducible Markov chain. Under the assumption that the chain is uniformly ergodic with respect to the $\mathrm{L}^1$-Wasserstein distance, and certain regularity assumptions on the hypothesis class and the state space of the chain, we first obtain a uniform convergence result for the corresponding sample error, and then we conclude learnability of the approximate sample error minimization algorithm and find its generalization bounds. At the end, a relative uniform convergence result for the sample error is also discussed.
    Asynchronous, Option-Based Multi-Agent Policy Gradient: A Conditional Reasoning Approach. (arXiv:2203.15925v2 [cs.RO] UPDATED)
    Multi-agent policy gradient methods have demonstrated success in games and robotics but are often limited to problems with low-level action space. However, when agents take higher-level, temporally-extended actions (i.e. options), when and how to derive a centralized control policy, its gradient as well as sampling options for all agents while not interrupting current option executions, becomes a challenge. This is mostly because agents may choose and terminate their options \textit{asynchronously}. In this work, we propose a conditional reasoning approach to address this problem, and empirically validate its effectiveness on representative option-based multi-agent cooperative tasks.
    Coupled Physics-informed Neural Networks for Inferring Solutions of Partial Differential Equations with Unknown Source Terms. (arXiv:2301.08618v1 [cs.LG])
    Physics-informed neural networks (PINNs) provide a transformative development for approximating the solutions to partial differential equations (PDEs). This work proposes a coupled physics-informed neural network (C-PINN) for the nonhomogeneous PDEs with unknown dynamical source terms, which is used to describe the systems with external forces and cannot be well approximated by the existing PINNs. In our method, two neural networks, NetU and NetG, are proposed. NetU is constructed to generate a quasi-solution satisfying PDEs under study. NetG is used to regularize the training of NetU. Then, the two networks are integrated into a data-physics-hybrid cost function. Finally, we propose a hierarchical training strategy to optimize and couple the two networks. The performance of C-PINN is proved by approximating several classical PDEs.
    STORM-GAN: Spatio-Temporal Meta-GAN for Cross-City Estimation of Human Mobility Responses to COVID-19. (arXiv:2301.08648v1 [cs.LG])
    Human mobility estimation is crucial during the COVID-19 pandemic due to its significant guidance for policymakers to make non-pharmaceutical interventions. While deep learning approaches outperform conventional estimation techniques on tasks with abundant training data, the continuously evolving pandemic poses a significant challenge to solving this problem due to data nonstationarity, limited observations, and complex social contexts. Prior works on mobility estimation either focus on a single city or lack the ability to model the spatio-temporal dependencies across cities and time periods. To address these issues, we make the first attempt to tackle the cross-city human mobility estimation problem through a deep meta-generative framework. We propose a Spatio-Temporal Meta-Generative Adversarial Network (STORM-GAN) model that estimates dynamic human mobility responses under a set of social and policy conditions related to COVID-19. Facilitated by a novel spatio-temporal task-based graph (STTG) embedding, STORM-GAN is capable of learning shared knowledge from a spatio-temporal distribution of estimation tasks and quickly adapting to new cities and time periods with limited training samples. The STTG embedding component is designed to capture the similarities among cities to mitigate cross-task heterogeneity. Experimental results on real-world data show that the proposed approach can greatly improve estimation performance and out-perform baselines.
    Learning Sequential Latent Variable Models from Multimodal Time Series Data. (arXiv:2204.10419v2 [cs.LG] UPDATED)
    Sequential modelling of high-dimensional data is an important problem that appears in many domains including model-based reinforcement learning and dynamics identification for control. Latent variable models applied to sequential data (i.e., latent dynamics models) have been shown to be a particularly effective probabilistic approach to solve this problem, especially when dealing with images. However, in many application areas (e.g., robotics), information from multiple sensing modalities is available -- existing latent dynamics methods have not yet been extended to effectively make use of such multimodal sequential data. Multimodal sensor streams can be correlated in a useful manner and often contain complementary information across modalities. In this work, we present a self-supervised generative modelling framework to jointly learn a probabilistic latent state representation of multimodal data and the respective dynamics. Using synthetic and real-world datasets from a multimodal robotic planar pushing task, we demonstrate that our approach leads to significant improvements in prediction and representation quality. Furthermore, we compare to the common learning baseline of concatenating each modality in the latent space and show that our principled probabilistic formulation performs better. Finally, despite being fully self-supervised, we demonstrate that our method is nearly as effective as an existing supervised approach that relies on ground truth labels.
    Source-free Subject Adaptation for EEG-based Visual Recognition. (arXiv:2301.08448v1 [eess.SP])
    This paper focuses on subject adaptation for EEG-based visual recognition. It aims at building a visual stimuli recognition system customized for the target subject whose EEG samples are limited, by transferring knowledge from abundant data of source subjects. Existing approaches consider the scenario that samples of source subjects are accessible during training. However, it is often infeasible and problematic to access personal biological data like EEG signals due to privacy issues. In this paper, we introduce a novel and practical problem setup, namely source-free subject adaptation, where the source subject data are unavailable and only the pre-trained model parameters are provided for subject adaptation. To tackle this challenging problem, we propose classifier-based data generation to simulate EEG samples from source subjects using classifier responses. Using the generated samples and target subject data, we perform subject-independent feature learning to exploit the common knowledge shared across different subjects. Notably, our framework is generalizable and can adopt any subject-independent learning method. In the experiments on the EEG-ImageNet40 benchmark, our model brings consistent improvements regardless of the choice of subject-independent learning. Also, our method shows promising performance, recording top-1 test accuracy of 74.6% under the 5-shot setting even without relying on source data. Our code can be found at https://github.com/DeepBCI/Deep-BCI/tree/master/1_Intelligent_BCI/Source_Free_Subject_Adaptation_for_EEG.
    SamBaS: Sampling-Based Stochastic Block Partitioning. (arXiv:2108.06651v2 [cs.SI] UPDATED)
    Community detection is a well-studied problem with applications in domains ranging from networking to bioinformatics. Due to the rapid growth in the volume of real-world data, there is growing interest in accelerating contemporary community detection algorithms. However, the more accurate and statistically robust methods tend to be hard to parallelize. One such method is stochastic block partitioning (SBP) - a community detection algorithm that works well on graphs with complex and heterogeneous community structure. In this paper, we present a sampling-based SBP (SamBaS) for accelerating SBP on sparse graphs. We characterize how various graph parameters affect the speedup and result quality of community detection with SamBaS and quantify the trade-offs therein. To evaluate SamBas on real-world web graphs without known ground-truth communities, we introduce partition quality score (PQS), an evaluation metric that outperforms modularity in terms of correlation with F1 score. Overall, SamBaS achieves speedups of up to 10X while maintaining result quality (and even improving result quality by over 150% on certain graphs, relative to F1 score).
    Plan To Predict: Learning an Uncertainty-Foreseeing Model for Model-Based Reinforcement Learning. (arXiv:2301.08502v1 [cs.LG])
    In Model-based Reinforcement Learning (MBRL), model learning is critical since an inaccurate model can bias policy learning via generating misleading samples. However, learning an accurate model can be difficult since the policy is continually updated and the induced distribution over visited states used for model learning shifts accordingly. Prior methods alleviate this issue by quantifying the uncertainty of model-generated samples. However, these methods only quantify the uncertainty passively after the samples were generated, rather than foreseeing the uncertainty before model trajectories fall into those highly uncertain regions. The resulting low-quality samples can induce unstable learning targets and hinder the optimization of the policy. Moreover, while being learned to minimize one-step prediction errors, the model is generally used to predict for multiple steps, leading to a mismatch between the objectives of model learning and model usage. To this end, we propose \emph{Plan To Predict} (P2P), an MBRL framework that treats the model rollout process as a sequential decision making problem by reversely considering the model as a decision maker and the current policy as the dynamics. In this way, the model can quickly adapt to the current policy and foresee the multi-step future uncertainty when generating trajectories. Theoretically, we show that the performance of P2P can be guaranteed by approximately optimizing a lower bound of the true environment return. Empirical results demonstrate that P2P achieves state-of-the-art performance on several challenging benchmark tasks.
    Revisiting Estimation Bias in Policy Gradients for Deep Reinforcement Learning. (arXiv:2301.08442v1 [cs.LG])
    We revisit the estimation bias in policy gradients for the discounted episodic Markov decision process (MDP) from Deep Reinforcement Learning (DRL) perspective. The objective is formulated theoretically as the expected returns discounted over the time horizon. One of the major policy gradient biases is the state distribution shift: the state distribution used to estimate the gradients differs from the theoretical formulation in that it does not take into account the discount factor. Existing discussion of the influence of this bias was limited to the tabular and softmax cases in the literature. Therefore, in this paper, we extend it to the DRL setting where the policy is parameterized and demonstrate how this bias can lead to suboptimal policies theoretically. We then discuss why the empirically inaccurate implementations with shifted state distribution can still be effective. We show that, despite such state distribution shift, the policy gradient estimation bias can be reduced in the following three ways: 1) a small learning rate; 2) an adaptive-learning-rate-based optimizer; and 3) KL regularization. Specifically, we show that a smaller learning rate, or, an adaptive learning rate, such as that used by Adam and RSMProp optimizers, makes the policy optimization robust to the bias. We further draw connections between optimizers and the optimization regularization to show that both the KL and the reverse KL regularization can significantly rectify this bias. Moreover, we provide extensive experiments on continuous control tasks to support our analysis. Our paper sheds light on how successful PG algorithms optimize policies in the DRL setting, and contributes insights into the practical issues in DRL.
    Promises and pitfalls of deep neural networks in neuroimaging-based psychiatric research. (arXiv:2301.08525v1 [cs.LG])
    By promising more accurate diagnostics and individual treatment recommendations, deep neural networks and in particular convolutional neural networks have advanced to a powerful tool in medical imaging. Here, we first give an introduction into methodological key concepts and resulting methodological promises including representation and transfer learning, as well as modelling domain-specific priors. After reviewing recent applications within neuroimaging-based psychiatric research, such as the diagnosis of psychiatric diseases, delineation of disease subtypes, normative modeling, and the development of neuroimaging biomarkers, we discuss current challenges. This includes for example the difficulty of training models on small, heterogeneous and biased data sets, the lack of validity of clinical labels, algorithmic bias, and the influence of confounding variables.
    Optimal Convergence Rates of Deep Convolutional Neural Networks: Additive Ridge Functions. (arXiv:2202.12119v2 [cs.LG] UPDATED)
    Convolutional neural networks have shown impressive abilities in many applications, especially those related to the classification tasks. However, for the regression problem, the abilities of convolutional structures have not been fully understood, and further investigation is needed. In this paper, we consider the mean squared error analysis for deep convolutional neural networks. We show that, for additive ridge functions, convolutional neural networks followed by one fully connected layer with ReLU activation functions can reach optimal mini-max rates (up to a log factor). The input dimension only appears in the constant of convergence rates. This work shows the statistical optimality of convolutional neural networks and may shed light on why convolutional neural networks are able to behave well for high dimensional input.
    Predicting Surface Texture in Steel Manufacturing at Speed. (arXiv:2301.08527v1 [cs.LG])
    Control of the surface texture of steel strip during the galvanizing and temper rolling processes is essential to satisfy customer requirements and is conventionally measured post-production using a stylus. In-production laser reflection measurement is less consistent than physical measurement but enables real time adjustment of processing parameters to optimize product surface characteristics. We propose the use of machine learning to improve accuracy of the transformation from inline laser reflection measurements to a prediction of surface properties. In addition to accuracy, model evaluation speed is important for fast feedback control. The ROCKET model is one of the fastest state of the art models, however it can be sped up by utilizing a GPU. Our contribution is to implement the model in PyTorch for fast GPU kernel transforms and provide a soft version of the Proportion of Positive Values (PPV) nonlinear pooling function, allowing gradient flow. We perform timing and performance experiments comparing the implementations
    NeRF in the Palm of Your Hand: Corrective Augmentation for Robotics via Novel-View Synthesis. (arXiv:2301.08556v1 [cs.LG])
    Expert demonstrations are a rich source of supervision for training visual robotic manipulation policies, but imitation learning methods often require either a large number of demonstrations or expensive online expert supervision to learn reactive closed-loop behaviors. In this work, we introduce SPARTN (Synthetic Perturbations for Augmenting Robot Trajectories via NeRF): a fully-offline data augmentation scheme for improving robot policies that use eye-in-hand cameras. Our approach leverages neural radiance fields (NeRFs) to synthetically inject corrective noise into visual demonstrations, using NeRFs to generate perturbed viewpoints while simultaneously calculating the corrective actions. This requires no additional expert supervision or environment interaction, and distills the geometric information in NeRFs into a real-time reactive RGB-only policy. In a simulated 6-DoF visual grasping benchmark, SPARTN improves success rates by 2.8$\times$ over imitation learning without the corrective augmentations and even outperforms some methods that use online supervision. It additionally closes the gap between RGB-only and RGB-D success rates, eliminating the previous need for depth sensors. In real-world 6-DoF robotic grasping experiments from limited human demonstrations, our method improves absolute success rates by $22.5\%$ on average, including objects that are traditionally challenging for depth-based methods. See video results at \url{https://bland.website/spartn}.
    Spectral embedding of weighted graphs. (arXiv:1910.05534v4 [stat.ML] UPDATED)
    When analyzing weighted networks using spectral embedding, a judicious transformation of the edge weights may produce better results. To formalize this idea, we consider the asymptotic behavior of spectral embedding for different edge-weight representations, under a generic low rank model. We measure the quality of different embeddings -- which can be on entirely different scales -- by how easy it is to distinguish communities, in an information-theoretic sense. For common types of weighted graphs, such as count networks or p-value networks, we find that transformations such as tempering or thresholding can be highly beneficial, both in theory and in practice.
    ILLUME: Rationalizing Vision-Language Models through Human Interactions. (arXiv:2208.08241v3 [cs.LG] UPDATED)
    Bootstrapping from pre-trained language models has been proven to be an efficient approach for building vision-language models (VLM) for tasks such as image captioning or visual question answering. However, outputs of these models rarely align with user's rationales for specific answers. In order to improve this alignment and reinforce commonsense reasons, we propose a tuning paradigm based on human interactions with machine generated data. Our ILLUME executes the following loop: Given an image-question-answer prompt, the VLM samples multiple candidate rationales, and a human critic provides minimal feedback via preference selection, used for fine-tuning. This loop increases the training data and gradually carves out the VLM's rationalization capabilities that are aligned with human intend. Our exhaustive experiments demonstrate that ILLUME is competitive with standard supervised fine-tuning while using significantly fewer training data and only requiring minimal feedback.
    Tight bounds for maximum $\ell_1$-margin classifiers. (arXiv:2212.03783v2 [stat.ML] UPDATED)
    Popular iterative algorithms such as boosting methods and coordinate descent on linear models converge to the maximum $\ell_1$-margin classifier, a.k.a. sparse hard-margin SVM, in high dimensional regimes where the data is linearly separable. Previous works consistently show that many estimators relying on the $\ell_1$-norm achieve improved statistical rates for hard sparse ground truths. We show that surprisingly, this adaptivity does not apply to the maximum $\ell_1$-margin classifier for a standard discriminative setting. In particular, for the noiseless setting, we prove tight upper and lower bounds for the prediction error that match existing rates of order $\frac{\|w^*\|_1^{2/3}}{n^{1/3}}$ for general ground truths. To complete the picture, we show that when interpolating noisy observations, the error vanishes at a rate of order $\frac{1}{\sqrt{\log(d/n)}}$. We are therefore first to show benign overfitting for the maximum $\ell_1$-margin classifier.
    Introducing Expertise Logic into Graph Representation Learning from A Causal Perspective. (arXiv:2301.08496v1 [cs.LG])
    Benefiting from the injection of human prior knowledge, graphs, as derived discrete data, are semantically dense so that models can efficiently learn the semantic information from such data. Accordingly, graph neural networks (GNNs) indeed achieve impressive success in various fields. Revisiting the GNN learning paradigms, we discover that the relationship between human expertise and the knowledge modeled by GNNs still confuses researchers. To this end, we introduce motivating experiments and derive an empirical observation that the human expertise is gradually learned by the GNNs in general domains. By further observing the ramifications of introducing expertise logic into graph representation learning, we conclude that leading the GNNs to learn human expertise can improve the model performance. By exploring the intrinsic mechanism behind such observations, we elaborate the Structural Causal Model for the graph representation learning paradigm. Following the theoretical guidance, we innovatively introduce the auxiliary causal logic learning paradigm to improve the model to learn the expertise logic causally related to the graph representation learning task. In practice, the counterfactual technique is further performed to tackle the insufficient training issue during optimization. Plentiful experiments on the crafted and real-world domains support the consistent effectiveness of the proposed method.
    Interpretability Study on Deep Learning for Jet Physics at the Large Hadron Collider. (arXiv:1911.01872v1 [hep-ph] CROSS LISTED)
    Using deep neural networks for identifying physics objects at the Large Hadron Collider (LHC) has become a powerful alternative approach in recent years. After successful training of deep neural networks, examining the trained networks not only helps us understand the behaviour of neural networks, but also helps improve the performance of deep learning models through proper interpretation. We take jet tagging problem at the LHC as an example, using recursive neural networks as a starting point, aim at a thorough understanding of the behaviour of the physics-oriented DNNs and the information encoded in the embedding space. We make a comparative study on a series of different jet tagging tasks dominated by different underlying physics. Interesting observations on the latent space are obtained.
    Who Should I Engage with At What Time? A Missing Event Aware Temporal Graph Neural Network. (arXiv:2301.08399v1 [cs.LG])
    Temporal graph neural network has recently received significant attention due to its wide application scenarios, such as bioinformatics, knowledge graphs, and social networks. There are some temporal graph neural networks that achieve remarkable results. However, these works focus on future event prediction and are performed under the assumption that all historical events are observable. In real-world applications, events are not always observable, and estimating event time is as important as predicting future events. In this paper, we propose MTGN, a missing event-aware temporal graph neural network, which uniformly models evolving graph structure and timing of events to support predicting what will happen in the future and when it will happen.MTGN models the dynamic of both observed and missing events as two coupled temporal point processes, thereby incorporating the effects of missing events into the network. Experimental results on several real-world temporal graphs demonstrate that MTGN significantly outperforms existing methods with up to 89% and 112% more accurate time and link prediction. Code can be found on https://github.com/HIT-ICES/TNNLS-MTGN.  ( 2 min )
    Pneumonia Detection in Chest X-Ray Images : Handling Class Imbalance. (arXiv:2301.08479v1 [eess.IV])
    People all over the globe are affected by pneumonia but deaths due to it are highest in Sub-Saharan Asia and South Asia. In recent years, the overall incidence and mortality rate of pneumonia regardless of the utilization of effective vaccines and compelling antibiotics has escalated. Thus, pneumonia remains a disease that needs spry prevention and treatment. The widespread prevalence of pneumonia has caused the research community to come up with a framework that helps detect, diagnose and analyze diseases accurately and promptly. One of the major hurdles faced by the Artificial Intelligence (AI) research community is the lack of publicly available datasets for chest diseases, including pneumonia . Secondly, few of the available datasets are highly imbalanced (normal examples are over sampled, while samples with ailment are in severe minority) making the problem even more challenging. In this article we present a novel framework for the detection of pneumonia. The novelty of the proposed methodology lies in the tackling of class imbalance problem. The Generative Adversarial Network (GAN), specifically a combination of Deep Convolutional Generative Adversarial Network (DCGAN) and Wasserstein GAN gradient penalty (WGAN-GP) was applied on the minority class ``Pneumonia'' for augmentation, whereas Random Under-Sampling (RUS) was done on the majority class ``No Findings'' to deal with the imbalance problem. The ChestX-Ray8 dataset, one of the biggest datasets, is used to validate the performance of the proposed framework. The learning phase is completed using transfer learning on state-of-the-art deep learning models i.e. ResNet-50, Xception, and VGG-16. Results obtained exceed state-of-the-art.  ( 2 min )
    Feature Relevance Analysis to Explain Concept Drift -- A Case Study in Human Activity Recognition. (arXiv:2301.08453v1 [cs.LG])
    This article studies how to detect and explain concept drift. Human activity recognition is used as a case study together with a online batch learning situation where the quality of the labels used in the model updating process starts to decrease. Drift detection is based on identifying a set of features having the largest relevance difference between the drifting model and a model that is known to be accurate and monitoring how the relevance of these features changes over time. As a main result of this article, it is shown that feature relevance analysis cannot only be used to detect the concept drift but also to explain the reason for the drift when a limited number of typical reasons for the concept drift are predefined. To explain the reason for the concept drift, it is studied how these predefined reasons effect to feature relevance. In fact, it is shown that each of these has an unique effect to features relevance and these can be used to explain the reason for concept drift.  ( 2 min )
    Self-Supervised Learning for Data Scarcity in a Fatigue Damage Prognostic Problem. (arXiv:2301.08441v1 [stat.ML])
    With the increasing availability of data for Prognostics and Health Management (PHM), Deep Learning (DL) techniques are now the subject of considerable attention for this application, often achieving more accurate Remaining Useful Life (RUL) predictions. However, one of the major challenges for DL techniques resides in the difficulty of obtaining large amounts of labelled data on industrial systems. To overcome this lack of labelled data, an emerging learning technique is considered in our work: Self-Supervised Learning, a sub-category of unsupervised learning approaches. This paper aims to investigate whether pre-training DL models in a self-supervised way on unlabelled sensors data can be useful for RUL estimation with only Few-Shots Learning, i.e. with scarce labelled data. In this research, a fatigue damage prognostics problem is addressed, through the estimation of the RUL of aluminum alloy panels (typical of aerospace structures) subject to fatigue cracks from strain gauge data. Synthetic datasets composed of strain data are used allowing to extensively investigate the influence of the dataset size on the predictive performance. Results show that the self-supervised pre-trained models are able to significantly outperform the non-pre-trained models in downstream RUL prediction task, and with less computational expense, showing promising results in prognostic tasks when only limited labelled data is available.  ( 2 min )
    Optimization of body configuration and joint-driven attitude stabilization for transformable spacecrafts under solar radiation pressure. (arXiv:2301.08435v1 [cs.LG])
    A solar sail is one of the most promising space exploration system because of its theoretically infinite specific impulse using solar radiation pressure (SRP). Recently, some researchers proposed "transformable spacecrafts" that can actively reconfigure their body configurations with actuatable joints. The transformable spacecrafts are expected to greatly enhance orbit and attitude control capability due to its high redundancy in control degree of freedom if they are used as solar sails. However, its large number of input poses difficulties in control, and therefore, previous researchers imposed strong constraints to limit its potential control capabilities. This paper addresses novel attitude control techniques for the transformable spacecrafts under SRP. The authors have constructed two proposed methods; one of those is a joint angle optimization to acquire arbitrary SRP force and torque, and the other is a momentum damping control driven by joint angle actuation. Our proposed methods are formulated in general forms and applicable to any transformable solar sail that consists of flat and thin body components. Validity of the proposed methods are confirmed by numerical simulations. This paper contributes to making most of the high control redundancy of transformable solar sails without consuming any expendable propellants, which is expected to greatly enhance orbit and attitude control capability.  ( 2 min )
    Visual Writing Prompts: Character-Grounded Story Generation with Curated Image Sequences. (arXiv:2301.08571v1 [cs.CL])
    Current work on image-based story generation suffers from the fact that the existing image sequence collections do not have coherent plots behind them. We improve visual story generation by producing a new image-grounded dataset, Visual Writing Prompts (VWP). VWP contains almost 2K selected sequences of movie shots, each including 5-10 images. The image sequences are aligned with a total of 12K stories which were collected via crowdsourcing given the image sequences and a set of grounded characters from the corresponding image sequence. Our new image sequence collection and filtering process has allowed us to obtain stories that are more coherent and have more narrativity compared to previous work. We also propose a character-based story generation model driven by coherence as a strong baseline. Evaluations show that our generated stories are more coherent, visually grounded, and have more narrativity than stories generated with the current state-of-the-art model.
    Generative Slate Recommendation with Reinforcement Learning. (arXiv:2301.08632v1 [cs.IR])
    Recent research has employed reinforcement learning (RL) algorithms to optimize long-term user engagement in recommender systems, thereby avoiding common pitfalls such as user boredom and filter bubbles. They capture the sequential and interactive nature of recommendations, and thus offer a principled way to deal with long-term rewards and avoid myopic behaviors. However, RL approaches are intractable in the slate recommendation scenario - where a list of items is recommended at each interaction turn - due to the combinatorial action space. In that setting, an action corresponds to a slate that may contain any combination of items. While previous work has proposed well-chosen decompositions of actions so as to ensure tractability, these rely on restrictive and sometimes unrealistic assumptions. Instead, in this work we propose to encode slates in a continuous, low-dimensional latent space learned by a variational auto-encoder. Then, the RL agent selects continuous actions in this latent space, which are ultimately decoded into the corresponding slates. By doing so, we are able to (i) relax assumptions required by previous work, and (ii) improve the quality of the action selection by modeling full slates instead of independent items, in particular by enabling diversity. Our experiments performed on a wide array of simulated environments confirm the effectiveness of our generative modeling of slates over baselines in practical scenarios where the restrictive assumptions underlying the baselines are lifted. Our findings suggest that representation learning using generative models is a promising direction towards generalizable RL-based slate recommendation.
    Autonomous Drug Design with Multi-Armed Bandits. (arXiv:2207.01393v2 [cs.LG] UPDATED)
    Recent developments in artificial intelligence and automation support a new drug design paradigm: autonomous drug design. Under this paradigm, generative models can provide suggestions on thousands of molecules with specific properties, and automated laboratories can potentially make, test and analyze molecules with minimal human supervision. However, since still only a limited number of molecules can be synthesized and tested, an obvious challenge is how to efficiently select among provided suggestions in a closed-loop system. We formulate this task as a stochastic multi-armed bandit problem with multiple plays, volatile arms and similarity information. To solve this task, we adapt previous work on multi-armed bandits to this setting, and compare our solution with random sampling, greedy selection and decaying-epsilon-greedy selection strategies. According to our simulation results, our approach has the potential to perform better exploration and exploitation of the chemical space for autonomous drug design.
    Ontology Pre-training for Poison Prediction. (arXiv:2301.08577v1 [cs.AI])
    Integrating human knowledge into neural networks has the potential to improve their robustness and interpretability. We have developed a novel approach to integrate knowledge from ontologies into the structure of a Transformer network which we call ontology pre-training: we train the network to predict membership in ontology classes as a way to embed the structure of the ontology into the network, and subsequently fine-tune the network for the particular prediction task. We apply this approach to a case study in predicting the potential toxicity of a small molecule based on its molecular structure, a challenging task for machine learning in life sciences chemistry. Our approach improves on the state of the art, and moreover has several additional benefits. First, we are able to show that the model learns to focus attention on more meaningful chemical groups when making predictions with ontology pre-training than without, paving a path towards greater robustness and interpretability. Second, the training time is reduced after ontology pre-training, indicating that the model is better placed to learn what matters for toxicity prediction with the ontology pre-training than without. This strategy has general applicability as a neuro-symbolic approach to embed meaningful semantics into neural networks.
    Baechi: Fast Device Placement of Machine Learning Graphs. (arXiv:2301.08695v1 [cs.DC])
    Machine Learning graphs (or models) can be challenging or impossible to train when either devices have limited memory, or models are large. To split the model across devices, learning-based approaches are still popular. While these result in model placements that train fast on data (i.e., low step times), learning-based model-parallelism is time-consuming, taking many hours or days to create a placement plan of operators on devices. We present the Baechi system, the first to adopt an algorithmic approach to the placement problem for running machine learning training graphs on small clusters of memory-constrained devices. We integrate our implementation of Baechi into two popular open-source learning frameworks: TensorFlow and PyTorch. Our experimental results using GPUs show that: (i) Baechi generates placement plans 654 X - 206K X faster than state-of-the-art learning-based approaches, and (ii) Baechi-placed model's step (training) time is comparable to expert placements in PyTorch, and only up to 6.2% worse than expert placements in TensorFlow. We prove mathematically that our two algorithms are within a constant factor of the optimal. Our work shows that compared to learning-based approaches, algorithmic approaches can face different challenges for adaptation to Machine learning systems, but also they offer proven bounds, and significant performance benefits.
    Neural Architecture Search: Insights from 1000 Papers. (arXiv:2301.08727v1 [cs.LG])
    In the past decade, advances in deep learning have resulted in breakthroughs in a variety of areas, including computer vision, natural language understanding, speech recognition, and reinforcement learning. Specialized, high-performing neural architectures are crucial to the success of deep learning in these areas. Neural architecture search (NAS), the process of automating the design of neural architectures for a given task, is an inevitable next step in automating machine learning and has already outpaced the best human-designed architectures on many tasks. In the past few years, research in NAS has been progressing rapidly, with over 1000 papers released since 2020. In this survey, we provide an organized and comprehensive guide to neural architecture search. We give a taxonomy of search spaces, algorithms, and speedup techniques, and we discuss resources such as benchmarks, best practices, other surveys, and open-source libraries.
    Explainable prediction of Qcodes for NOTAMs using column generation. (arXiv:2208.04955v2 [cs.LG] UPDATED)
    A NOtice To AirMen (NOTAM) contains important flight route related information. To search and filter them, NOTAMs are grouped into categories called QCodes. In this paper, we develop a tool to predict, with some explanations, a Qcode for a NOTAM. We present a way to extend the interpretable binary classification using column generation proposed in Dash, Gunluk, and Wei (2018) to a multiclass text classification method. We describe the techniques used to tackle the issues related to one vs-rest classification, such as multiple outputs and class imbalances. Furthermore, we introduce some heuristics, including the use of a CP-SAT solver for the subproblems, to reduce the training time. Finally, we show that our approach compares favorably with state-of-the-art machine learning algorithms like Linear SVM and small neural networks while adding the needed interpretability component.
    FewSOME: Few Shot Anomaly Detection. (arXiv:2301.06957v2 [cs.LG] UPDATED)
    Recent years have seen considerable progress in the field of Anomaly Detection but at the cost of increasingly complex training pipelines. Such techniques require large amounts of training data, resulting in computationally expensive algorithms. We propose Few Shot anomaly detection (FewSOME), a deep One-Class Anomaly Detection algorithm with the ability to accurately detect anomalies having trained on 'few' examples of the normal class and no examples of the anomalous class. We describe FewSOME to be of low complexity given its low data requirement and short training time. FewSOME is aided by pretrained weights with an architecture based on Siamese Networks. By means of an ablation study, we demonstrate how our proposed loss, 'Stop Loss', improves the robustness of FewSOME. Our experiments demonstrate that FewSOME performs at state-of-the-art level on benchmark datasets MNIST, CIFAR-10, F-MNIST and MVTec AD while training on only 30 normal samples, a minute fraction of the data that existing methods are trained on. Most notably, we found that FewSOME outperforms even highly complex models in the setting where only few examples of the normal class exist. Moreover, our extensive experiments show FewSOME to be robust to contaminated datasets. We also report F1 score and Balanced Accuracy in addition to AUC as a benchmark for future techniques to be compared against.
    Automated extraction of capacitive coupling for quantum dot systems. (arXiv:2301.08654v1 [cond-mat.mes-hall])
    Gate-defined quantum dots (QDs) have appealing attributes as a quantum computing platform, however, near-term devices possess a range of possible imperfections that need to be accounted for during the tuning and operation of QD devices. One such problem is the capacitive cross-talk between the metallic gates that define and control QD qubits. A way to compensate for the capacitive cross-talk and enable targeted control of specific QDs independent of coupling is by the use of virtual gates. Here, we demonstrate a reliable automated capacitive coupling identification method that combines machine learning with traditional fitting to take advantage of the desirable properties of each. We also show how the cross-capacitance measurement may be used for the identification of spurious QDs sometimes formed during tuning experimental devices. Our systems can autonomously flag devices with spurious dots near the operating regime which is crucial information for reliable tuning to a regime suitable for qubit operations.
    Accelerating Multi-Agent Planning Using Graph Transformers with Bounded Suboptimality. (arXiv:2301.08451v1 [cs.AI])
    Conflict-Based Search is one of the most popular methods for multi-agent path finding. Though it is complete and optimal, it does not scale well. Recent works have been proposed to accelerate it by introducing various heuristics. However, whether these heuristics can apply to non-grid-based problem settings while maintaining their effectiveness remains an open question. In this work, we find that the answer is prone to be no. To this end, we propose a learning-based component, i.e., the Graph Transformer, as a heuristic function to accelerate the planning. The proposed method is provably complete and bounded-suboptimal with any desired factor. We conduct extensive experiments on two environments with dense graphs. Results show that the proposed Graph Transformer can be trained in problem instances with relatively few agents and generalizes well to a larger number of agents, while achieving better performance than state-of-the-art methods.  ( 2 min )
    Asynchronously Trained Distributed Topographic Maps. (arXiv:2301.08379v1 [cs.LG])
    Topographic feature maps are low dimensional representations of data, that preserve spatial dependencies. Current methods of training such maps (e.g. self organizing maps - SOM, generative topographic maps) require centralized control and synchronous execution, which restricts scalability. We present an algorithm that uses $N$ autonomous units to generate a feature map by distributed asynchronous training. Unit autonomy is achieved by sparse interaction in time \& space through the combination of a distributed heuristic search, and a cascade-driven weight updating scheme governed by two rules: a unit i) adapts when it receives either a sample, or the weight vector of a neighbor, and ii) broadcasts its weight vector to its neighbors after adapting for a predefined number of times. Thus, a vector update can trigger an avalanche of adaptation. We map avalanching to a statistical mechanics model, which allows us to parametrize the statistical properties of cascading. Using MNIST, we empirically investigate the effect of the heuristic search accuracy and the cascade parameters on map quality. We also provide empirical evidence that algorithm complexity scales at most linearly with system size $N$. The proposed approach is found to perform comparably with similar methods in classification tasks across multiple datasets.  ( 2 min )
    Within-group fairness: A guidance for more sound between-group fairness. (arXiv:2301.08375v1 [stat.ML])
    As they have a vital effect on social decision-making, AI algorithms not only should be accurate and but also should not pose unfairness against certain sensitive groups (e.g., non-white, women). Various specially designed AI algorithms to ensure trained AI models to be fair between sensitive groups have been developed. In this paper, we raise a new issue that between-group fair AI models could treat individuals in a same sensitive group unfairly. We introduce a new concept of fairness so-called within-group fairness which requires that AI models should be fair for those in a same sensitive group as well as those in different sensitive groups. We materialize the concept of within-group fairness by proposing corresponding mathematical definitions and developing learning algorithms to control within-group fairness and between-group fairness simultaneously. Numerical studies show that the proposed learning algorithms improve within-group fairness without sacrificing accuracy as well as between-group fairness.  ( 2 min )
    Sequence Generation via Subsequence Similarity: Theory and Application to UAV Identification. (arXiv:2301.08403v1 [cs.LG])
    The ability to generate synthetic sequences is crucial for a wide range of applications, and recent advances in deep learning architectures and generative frameworks have greatly facilitated this process. Particularly, unconditional one-shot generative models constitute an attractive line of research that focuses on capturing the internal information of a single image, video, etc. to generate samples with similar contents. Since many of those one-shot models are shifting toward efficient non-deep and non-adversarial approaches, we examine the versatility of a one-shot generative model for augmenting whole datasets. In this work, we focus on how similarity at the subsequence level affects similarity at the sequence level, and derive bounds on the optimal transport of real and generated sequences based on that of corresponding subsequences. We use a one-shot generative model to sample from the vicinity of individual sequences and generate subsequence-similar ones and demonstrate the improvement of this approach by applying it to the problem of Unmanned Aerial Vehicle (UAV) identification using limited radio-frequency (RF) signals. In the context of UAV identification, RF fingerprinting is an effective method for distinguishing legitimate devices from malicious ones, but heterogenous environments and channel impairments can impose data scarcity and affect the performance of classification models. By using subsequence similarity to augment sequences of RF data with a low ratio (5\%-20\%) of training dataset, we achieve significant improvements in performance metrics such as accuracy, precision, recall, and F1 score.  ( 2 min )
    Which Features are Learned by CodeBert: An Empirical Study of the BERT-based Source Code Representation Learning. (arXiv:2301.08427v1 [cs.CL])
    The Bidirectional Encoder Representations from Transformers (BERT) were proposed in the natural language process (NLP) and shows promising results. Recently researchers applied the BERT to source-code representation learning and reported some good news on several downstream tasks. However, in this paper, we illustrated that current methods cannot effectively understand the logic of source codes. The representation of source code heavily relies on the programmer-defined variable and function names. We design and implement a set of experiments to demonstrate our conjecture and provide some insights for future works.  ( 2 min )
    Clustering Human Mobility with Multiple Spaces. (arXiv:2301.08524v1 [cs.LG])
    Human mobility clustering is an important problem for understanding human mobility behaviors (e.g., work and school commutes). Existing methods typically contain two steps: choosing or learning a mobility representation and applying a clustering algorithm to the representation. However, these methods rely on strict visiting orders in trajectories and cannot take advantage of multiple types of mobility representations. This paper proposes a novel mobility clustering method for mobility behavior detection. First, the proposed method contains a permutation-equivalent operation to handle sub-trajectories that might have different visiting orders but similar impacts on mobility behaviors. Second, the proposed method utilizes a variational autoencoder architecture to simultaneously perform clustering in both latent and original spaces. Also, in order to handle the bias of a single latent space, our clustering assignment prediction considers multiple learned latent spaces at different epochs. This way, the proposed method produces accurate results and can provide reliability estimates of each trajectory's cluster assignment. The experiment shows that the proposed method outperformed state-of-the-art methods in mobility behavior detection from trajectories with better accuracy and more interpretability.
    Regular Time-series Generation using SGM. (arXiv:2301.08518v1 [cs.LG])
    Score-based generative models (SGMs) are generative models that are in the spotlight these days. Time-series frequently occurs in our daily life, e.g., stock data, climate data, and so on. Especially, time-series forecasting and classification are popular research topics in the field of machine learning. SGMs are also known for outperforming other generative models. As a result, we apply SGMs to synthesize time-series data by learning conditional score functions. We propose a conditional score network for the time-series generation domain. Furthermore, we also derive the loss function between the score matching and the denoising score matching in the time-series generation domain. Finally, we achieve state-of-the-art results on real-world datasets in terms of sampling diversity and quality.
    Estimation of Large Financial Covariances: A Cross-Validation Approach. (arXiv:2012.05757v2 [stat.ML] UPDATED)
    We introduce a novel covariance estimator for portfolio selection that adapts to the non-stationary or persistent heteroskedastic environments of financial time series by employing exponentially weighted averages and nonlinearly shrinking the sample eigenvalues through cross-validation. Our estimator is structure agnostic, transparent, and computationally feasible in large dimensions. By correcting the biases in the sample eigenvalues and aligning our estimator to more recent risk, we demonstrate that our estimator performs well in large dimensions against existing state-of-the-art static and dynamic covariance shrinkage estimators through simulations and with an empirical application in active portfolio management.
    Optimizing model-agnostic Random Subspace ensembles. (arXiv:2109.03099v3 [cs.LG] UPDATED)
    This paper presents a model-agnostic ensemble approach for supervised learning. The proposed approach is based on a parametric version of Random Subspace, in which each base model is learned from a feature subset sampled according to a Bernoulli distribution. Parameter optimization is performed using gradient descent and is rendered tractable by using an importance sampling approach that circumvents frequent re-training of the base models after each gradient descent step. The degree of randomization in our parametric Random Subspace is thus automatically tuned through the optimization of the feature selection probabilities. This is an advantage over the standard Random Subspace approach, where the degree of randomization is controlled by a hyper-parameter. Furthermore, the optimized feature selection probabilities can be interpreted as feature importance scores. Our algorithm can also easily incorporate any differentiable regularization term to impose constraints on these importance scores.
    The Lost Art of Mathematical Modelling. (arXiv:2301.08559v1 [q-bio.OT])
    We provide a critique of mathematical biology in light of rapid developments in modern machine learning. We argue that out of the three modelling activities -- (1) formulating models; (2) analysing models; and (3) fitting or comparing models to data -- inherent to mathematical biology, researchers currently focus too much on activity (2) at the cost of (1). This trend, we propose, can be reversed by realising that any given biological phenomena can be modelled in an infinite number of different ways, through the adoption of an open/pluralistic approach. We explain the open approach using fish locomotion as a case study and illustrate some of the pitfalls -- universalism, creating models of models, etc. -- that hinder mathematical biology. We then ask how we might rediscover a lost art: that of creative mathematical modelling. This article is dedicated to the memory of Edmund Crampin.
    Fair Credit Scorer through Bayesian Approach. (arXiv:2301.08412v1 [cs.LG])
    Machine learning currently plays an increasingly important role in people's lives in areas such as credit scoring, auto-driving, disease diagnosing, and insurance quoting. However, in many of these areas, machine learning models have performed unfair behaviors against some sub-populations, such as some particular groups of race, sex, and age. These unfair behaviors can be on account of the pre-existing bias in the training dataset due to historical and social factors. In this paper, we focus on a real-world application of credit scoring and construct a fair prediction model by introducing latent variables to remove the correlation between protected attributes, such as sex and age, with the observable feature inputs, including house and job. For detailed implementation, we apply Bayesian approaches, including the Markov Chain Monte Carlo simulation, to estimate our proposed fair model.
    Modeling Moral Choices in Social Dilemmas with Multi-Agent Reinforcement Learning. (arXiv:2301.08491v1 [cs.MA])
    Practical uses of Artificial Intelligence (AI) in the real world have demonstrated the importance of embedding moral choices into intelligent agents. They have also highlighted that defining top-down ethical constraints on AI according to any one type of morality is extremely challenging and can pose risks. A bottom-up learning approach may be more appropriate for studying and developing ethical behavior in AI agents. In particular, we believe that an interesting and insightful starting point is the analysis of emergent behavior of Reinforcement Learning (RL) agents that act according to a predefined set of moral rewards in social dilemmas. In this work, we present a systematic analysis of the choices made by intrinsically-motivated RL agents whose rewards are based on moral theories. We aim to design reward structures that are simplified yet representative of a set of key ethical systems. Therefore, we first define moral reward functions that distinguish between consequence- and norm-based agents, between morality based on societal norms or internal virtues, and between single- and mixed-virtue (e.g., multi-objective) methodologies. Then, we evaluate our approach by modeling repeated dyadic interactions between learning moral agents in three iterated social dilemma games (Prisoner's Dilemma, Volunteer's Dilemma and Stag Hunt). We analyze the impact of different types of morality on the emergence of cooperation, defection or exploitation, and the corresponding social outcomes. Finally, we discuss the implications of these findings for the development of moral agents in artificial and mixed human-AI societies.
    Open-Set Likelihood Maximization for Few-Shot Learning. (arXiv:2301.08390v1 [cs.CV])
    We tackle the Few-Shot Open-Set Recognition (FSOSR) problem, i.e. classifying instances among a set of classes for which we only have a few labeled samples, while simultaneously detecting instances that do not belong to any known class. We explore the popular transductive setting, which leverages the unlabelled query instances at inference. Motivated by the observation that existing transductive methods perform poorly in open-set scenarios, we propose a generalization of the maximum likelihood principle, in which latent scores down-weighing the influence of potential outliers are introduced alongside the usual parametric model. Our formulation embeds supervision constraints from the support set and additional penalties discouraging overconfident predictions on the query set. We proceed with a block-coordinate descent, with the latent scores and parametric model co-optimized alternately, thereby benefiting from each other. We call our resulting formulation \textit{Open-Set Likelihood Optimization} (OSLO). OSLO is interpretable and fully modular; it can be applied on top of any pre-trained model seamlessly. Through extensive experiments, we show that our method surpasses existing inductive and transductive methods on both aspects of open-set recognition, namely inlier classification and outlier detection.  ( 2 min )
    Quantum HyperNetworks: Training Binary Neural Networks in Quantum Superposition. (arXiv:2301.08292v1 [quant-ph])
    Binary neural networks, i.e., neural networks whose parameters and activations are constrained to only two possible values, offer a compelling avenue for the deployment of deep learning models on energy- and memory-limited devices. However, their training, architectural design, and hyperparameter tuning remain challenging as these involve multiple computationally expensive combinatorial optimization problems. Here we introduce quantum hypernetworks as a mechanism to train binary neural networks on quantum computers, which unify the search over parameters, hyperparameters, and architectures in a single optimization loop. Through classical simulations, we demonstrate that of our approach effectively finds optimal parameters, hyperparameters and architectural choices with high probability on classification problems including a two-dimensional Gaussian dataset and a scaled-down version of the MNIST handwritten digits. We represent our quantum hypernetworks as variational quantum circuits, and find that an optimal circuit depth maximizes the probability of finding performant binary neural networks. Our unified approach provides an immense scope for other applications in the field of machine learning.  ( 2 min )
    AdaEnsemble: Learning Adaptively Sparse Structured Ensemble Network for Click-Through Rate Prediction. (arXiv:2301.08353v1 [cs.IR])
    Learning feature interactions is crucial to success for large-scale CTR prediction in recommender systems and Ads ranking. Researchers and practitioners extensively proposed various neural network architectures for searching and modeling feature interactions. However, we observe that different datasets favor different neural network architectures and feature interaction types, suggesting that different feature interaction learning methods may have their own unique advantages. Inspired by this observation, we propose AdaEnsemble: a Sparsely-Gated Mixture-of-Experts (SparseMoE) architecture that can leverage the strengths of heterogeneous feature interaction experts and adaptively learns the routing to a sparse combination of experts for each example, allowing us to build a dynamic hierarchy of the feature interactions of different types and orders. To further improve the prediction accuracy and inference efficiency, we incorporate the dynamic early exiting mechanism for feature interaction depth selection. The AdaEnsemble can adaptively choose the feature interaction depth and find the corresponding SparseMoE stacking layer to exit and compute prediction from. Therefore, our proposed architecture inherits the advantages of the exponential combinations of sparsely gated experts within SparseMoE layers and further dynamically selects the optimal feature interaction depth without executing deeper layers. We implement the proposed AdaEnsemble and evaluate its performance on real-world datasets. Extensive experiment results demonstrate the efficiency and effectiveness of AdaEnsemble over state-of-the-art models.  ( 2 min )
    An Efficient Quadrature Sequence and Sparsifying Methodology for Mean-Field Variational Inference. (arXiv:2301.08374v1 [cs.LG])
    This work proposes a quasirandom sequence of quadratures for high-dimensional mean-field variational inference and a related sparsifying methodology. Each iterate of the sequence contains two evaluations points that combine to correctly integrate all univariate quadratic functions, as well as univariate cubics if the mean-field factors are symmetric. More importantly, averaging results over short subsequences achieves periodic exactness on a much larger space of multivariate polynomials of quadratic total degree. This framework is devised by first considering stochastic blocked mean-field quadratures, which may be useful in other contexts. By replacing pseudorandom sequences with quasirandom sequences, over half of all multivariate quadratic basis functions integrate exactly with only 4 function evaluations, and the exactness dimension increases for longer subsequences. Analysis shows how these efficient integrals characterize the dominant log-posterior contributions to mean-field variational approximations, including diagonal Hessian approximations, to support a robust sparsifying methodology in deep learning algorithms. A numerical demonstration of this approach on a simple Convolutional Neural Network for MNIST retains high test accuracy, 96.9%, while training over 98.9% of parameters to zero in only 10 epochs, bearing potential to reduce both storage and energy requirements for deep learning models.  ( 2 min )
    Deep Reinforcement Learning for Power Trading. (arXiv:2301.08360v1 [q-fin.TR])
    The Dutch power market includes a day-ahead market and an auction-like intraday balancing market. The varying supply and demand of power and its uncertainty induces an imbalance, which causes differing power prices in these two markets and creates an opportunity for arbitrage. In this paper, we present collaborative dual-agent reinforcement learning (RL) for bi-level simulation and optimization of European power arbitrage trading. Moreover, we propose two novel practical implementations specifically addressing the electricity power market. Leveraging the concept of imitation learning, the RL agent's reward is reformed by taking into account prior domain knowledge results in better convergence during training and, moreover, improves and generalizes performance. In addition, tranching of orders improves the bidding success rate and significantly raises the P&L. We show that each method contributes significantly to the overall performance uplifting, and the integrated methodology achieves about three-fold improvement in cumulative P&L over the original agent, as well as outperforms the highest benchmark policy by around 50% while exhibits efficient computational performance.  ( 2 min )
    Evaluation of the potential of Near Infrared Hyperspectral Imaging for monitoring the invasive brown marmorated stink bug. (arXiv:2301.08252v1 [eess.IV])
    The brown marmorated stink bug (BMSB), Halyomorpha halys, is an invasive insect pest of global importance that damages several crops, compromising agri-food production. Field monitoring procedures are fundamental to perform risk assessment operations, in order to promptly face crop infestations and avoid economical losses. To improve pest management, spectral cameras mounted on Unmanned Aerial Vehicles (UAVs) and other Internet of Things (IoT) devices, such as smart traps or unmanned ground vehicles, could be used as an innovative technology allowing fast, efficient and real-time monitoring of insect infestations. The present study consists in a preliminary evaluation at the laboratory level of Near Infrared Hyperspectral Imaging (NIR-HSI) as a possible technology to detect BMSB specimens on different vegetal backgrounds, overcoming the problem of BMSB mimicry. Hyperspectral images of BMSB were acquired in the 980-1660 nm range, considering different vegetal backgrounds selected to mimic a real field application scene. Classification models were obtained following two different chemometric approaches. The first approach was focused on modelling spectral information and selecting relevant spectral regions for discrimination by means of sparse-based variable selection coupled with Soft Partial Least Squares Discriminant Analysis (s-Soft PLS-DA) classification algorithm. The second approach was based on modelling spatial and spectral features contained in the hyperspectral images using Convolutional Neural Networks (CNN). Finally, to further improve BMSB detection ability, the two strategies were merged, considering only the spectral regions selected by s-Soft PLS-DA for CNN modelling.  ( 2 min )
    Causal conditional hidden Markov model for multimodal traffic prediction. (arXiv:2301.08249v1 [cs.LG])
    Multimodal traffic flow can reflect the health of the transportation system, and its prediction is crucial to urban traffic management. Recent works overemphasize spatio-temporal correlations of traffic flow, ignoring the physical concepts that lead to the generation of observations and their causal relationship. Spatio-temporal correlations are considered unstable under the influence of different conditions, and spurious correlations may exist in observations. In this paper, we analyze the physical concepts affecting the generation of multimode traffic flow from the perspective of the observation generation principle and propose a Causal Conditional Hidden Markov Model (CCHMM) to predict multimodal traffic flow. In the latent variables inference stage, a posterior network disentangles the causal representations of the concepts of interest from conditional information and observations, and a causal propagation module mines their causal relationship. In the data generation stage, a prior network samples the causal latent variables from the prior distribution and feeds them into the generator to generate multimodal traffic flow. We use a mutually supervised training method for the prior and posterior to enhance the identifiability of the model. Experiments on real-world datasets show that CCHMM can effectively disentangle causal representations of concepts of interest and identify causality, and accurately predict multimodal traffic flow.  ( 2 min )
    Advanced Scaling Methods for VNF deployment with Reinforcement Learning. (arXiv:2301.08325v1 [cs.NI])
    Network function virtualization (NFV) and software-defined network (SDN) have become emerging network paradigms, allowing virtualized network function (VNF) deployment at a low cost. Even though VNF deployment can be flexible, it is still challenging to optimize VNF deployment due to its high complexity. Several studies have approached the task as dynamic programming, e.g., integer linear programming (ILP). However, optimizing VNF deployment for highly complex networks remains a challenge. Alternatively, reinforcement learning (RL) based approaches have been proposed to optimize this task, especially to employ a scaling action-based method which can deploy VNFs within less computational time. However, the model architecture can be improved further to generalize to the different networking settings. In this paper, we propose an enhanced model which can be adapted to more general network settings. We adopt the improved GNN architecture and a few techniques to obtain a better node representation for the VNF deployment task. Furthermore, we apply a recently proposed RL method, phasic policy gradient (PPG), to leverage the shared representation of the service function chain (SFC) generation model from the value function. We evaluate the proposed method in various scenarios, achieving a better QoS with minimum resource utilization compared to the previous methods. Finally, as a qualitative evaluation, we analyze our proposed encoder's representation for the nodes, which shows a more disentangled representation.  ( 2 min )
    Forecasting subcritical cylinder wakes with Fourier Neural Operators. (arXiv:2301.08290v1 [physics.flu-dyn])
    We apply Fourier neural operators (FNOs), a state-of-the-art operator learning technique, to forecast the temporal evolution of experimentally measured velocity fields. FNOs are a recently developed machine learning method capable of approximating solution operators to systems of partial differential equations through data alone. The learned FNO solution operator can be evaluated in milliseconds, potentially enabling faster-than-real-time modeling for predictive flow control in physical systems. Here we use FNOs to predict how physical fluid flows evolve in time, training with particle image velocimetry measurements depicting cylinder wakes in the subcritical vortex shedding regime. We train separate FNOs at Reynolds numbers ranging from Re = 240 to Re = 3060 and study how increasingly turbulent flow phenomena impact prediction accuracy. We focus here on a short prediction horizon of ten non-dimensionalized time-steps, as would be relevant for problems of predictive flow control. We find that FNOs are capable of accurately predicting the evolution of experimental velocity fields throughout the range of Reynolds numbers tested (L2 norm error < 0.1) despite being provided with limited and imperfect flow observations. Given these results, we conclude that this method holds significant potential for real-time predictive flow control of physical systems.  ( 2 min )
    Investigating the Impact of Direct Punishment on the Emergence of Cooperation in Multi-Agent Reinforcement Learning Systems. (arXiv:2301.08278v1 [cs.MA])
    The problem of cooperation is of fundamental importance for human societies, with examples ranging from navigating road junctions to negotiating climate treaties. As the use of AI becomes more pervasive within society, the need for socially intelligent agents that are able to navigate these complex dilemmas is becoming increasingly evident. Direct punishment is an ubiquitous social mechanism that has been shown to benefit the emergence of cooperation within the natural world, however no prior work has investigated its impact on populations of learning agents. Moreover, although the use of all forms of punishment in the natural world is strongly coupled with partner selection and reputation, no existing work has provided a holistic analysis of their combination within multi-agent systems. In this paper, we present a comprehensive analysis of the behaviors and learning dynamics associated with direct punishment in multi-agent reinforcement learning systems and how this compares to third-party punishment, when both forms of punishment are combined with other social mechanisms such as partner selection and reputation. We provide an extensive and systematic evaluation of the impact of these key mechanisms on the emergence of cooperation. Finally, we discuss the implications of the use of these mechanisms in the design of cooperative AI systems.  ( 2 min )
  • Open

    Online Estimation of Network Point Processes for Event Streams. (arXiv:2009.01742v2 [cs.SI] UPDATED)
    A common goal in network modeling is to uncover the latent community structure present among nodes. For many real-world networks, the true connections consist of events arriving as streams, which are then aggregated to form edges, ignoring the dynamic temporal component. A natural way to take account of these temporal dynamics of interactions is to use point processes as the foundation of network models for community detection. Computational complexity hampers the scalability of such approaches to large sparse networks. To circumvent this challenge, we propose a fast online variational inference algorithm for estimating the latent structure underlying dynamic event arrivals on a network, using continuous-time point process latent network models. We describe this procedure for networks models capturing community structure. This structure can be learned as new events are observed on the network, updating the inferred community assignments. We investigate the theoretical properties of such an inference scheme, and provide regret bounds on the loss function of this procedure. The proposed inference procedure is then thoroughly compared, using both simulation studies and real data, to non-online variants. We demonstrate that online inference can obtain comparable performance, in terms of community recovery, to non-online variants, while realising computational gains. Our proposed inference framework can also be readily modified to incorporate other popular network structures.  ( 2 min )
    Neural Architecture Search: Insights from 1000 Papers. (arXiv:2301.08727v1 [cs.LG])
    In the past decade, advances in deep learning have resulted in breakthroughs in a variety of areas, including computer vision, natural language understanding, speech recognition, and reinforcement learning. Specialized, high-performing neural architectures are crucial to the success of deep learning in these areas. Neural architecture search (NAS), the process of automating the design of neural architectures for a given task, is an inevitable next step in automating machine learning and has already outpaced the best human-designed architectures on many tasks. In the past few years, research in NAS has been progressing rapidly, with over 1000 papers released since 2020. In this survey, we provide an organized and comprehensive guide to neural architecture search. We give a taxonomy of search spaces, algorithms, and speedup techniques, and we discuss resources such as benchmarks, best practices, other surveys, and open-source libraries.
    Bayesian Spatial Predictive Synthesis. (arXiv:2203.05197v3 [stat.ME] UPDATED)
    Spatial data are characterized by their spatial dependence, which is often complex, non-linear, and difficult to capture with a single model. Significant levels of model uncertainty -- arising from these characteristics -- cannot be resolved by model selection or simple ensemble methods. We address this issue by proposing a novel methodology that captures spatially varying model uncertainty, which we call Bayesian spatial predictive synthesis. Our proposal is derived by identifying the theoretically best approximate model under reasonable conditions, which is a latent factor spatially varying coefficient model in the Bayesian predictive synthesis framework. We then show that our proposed method produces exact minimax predictive distributions, providing finite sample guarantees. Two MCMC strategies are implemented for full uncertainty quantification, as well as a variational inference strategy for fast point inference. We also extend the estimation strategy for general responses. Through simulation examples and two real data applications, we demonstrate that our proposed spatial Bayesian predictive synthesis outperforms standard spatial models and advanced machine learning methods in terms of predictive accuracy.
    Multi armed bandits and quantum channel oracles. (arXiv:2301.08544v1 [quant-ph])
    Multi armed bandits are one of the theoretical pillars of reinforcement learning. Recently, the investigation of quantum algorithms for multi armed bandit problems was started, and it was found that a quadratic speed-up is possible when the arms and the randomness of the rewards of the arms can be queried in superposition. Here we introduce further bandit models where we only have limited access to the randomness of the rewards, but we can still query the arms in superposition. We show that this impedes any speed-up of quantum algorithms.
    AI-assisted neutron spectroscopy using active learning with log-Gaussian processes. (arXiv:2209.00980v2 [physics.data-an] UPDATED)
    To understand the origins of materials properties, neutron scattering experiments at three-axes spectrometers (TAS) investigate magnetic and lattice excitations in a sample by measuring intensity distributions in its momentum (Q) and energy (E) space. The high demand and limited availability of beam time for TAS experiments however raise the natural question whether we can improve their efficiency or make better use of the experimenter's time. In fact, using TAS, there are a number of scientific questions that require searching for signals of interest in a particular region of Q-E space, but when done manually, it is time consuming and inefficient since the measurement points may be placed in uninformative regions such as the background. Active learning is a promising general machine learning approach that allows to iteratively detect informative regions of signal autonomously, i.e., without human interference, thus avoiding unnecessary measurements and speeding up the experiment. In addition, the autonomous mode allows experimenters to focus on other relevant tasks in the meantime. The approach that we describe in this article exploits log-Gaussian processes which, due to the logarithmic transformation, have the largest approximation uncertainties in regions of signal. Maximizing uncertainty as an acquisition function hence directly yields locations for informative measurements. We demonstrate the benefits of our approach on outcomes of a real neutron experiment at the thermal TAS EIGER (PSI) as well as on results of a benchmark in a synthetic setting including numerous different excitations.
    Offline Policy Evaluation with Out-of-Sample Guarantees. (arXiv:2301.08649v1 [stat.ML])
    We consider the problem of evaluating the performance of a decision policy using past observational data. The outcome of a policy is measured in terms of a loss or disutility (or negative reward) and the problem is to draw valid inferences about the out-of-sample loss of the specified policy when the past data is observed under a, possibly unknown, policy. Using a sample-splitting method, we show that it is possible to draw such inferences with finite-sample coverage guarantees that evaluate the entire loss distribution. Importantly, the method takes into account model misspecifications of the past policy -- including unmeasured confounding. The evaluation method can be used to certify the performance of a policy using observational data under an explicitly specified range of credible model assumptions.
    Intrinsic persistent homology via density-based metric learning. (arXiv:2012.07621v3 [stat.ML] UPDATED)
    We address the problem of estimating topological features from data in high dimensional Euclidean spaces under the manifold assumption. Our approach is based on the computation of persistent homology of the space of data points endowed with a sample metric known as Fermat distance. We prove that such metric space converges almost surely to the manifold itself endowed with an intrinsic metric that accounts for both the geometry of the manifold and the density that produces the sample. This fact implies the convergence of the associated persistence diagrams. The use of this intrinsic distance when computing persistent homology presents advantageous properties such as robustness to the presence of outliers in the input data and less sensitiveness to the particular embedding of the underlying manifold in the ambient space. We use these ideas to propose and implement a method for pattern recognition and anomaly detection in time series, which is evaluated in applications to real data.
    Spectral embedding of weighted graphs. (arXiv:1910.05534v4 [stat.ML] UPDATED)
    When analyzing weighted networks using spectral embedding, a judicious transformation of the edge weights may produce better results. To formalize this idea, we consider the asymptotic behavior of spectral embedding for different edge-weight representations, under a generic low rank model. We measure the quality of different embeddings -- which can be on entirely different scales -- by how easy it is to distinguish communities, in an information-theoretic sense. For common types of weighted graphs, such as count networks or p-value networks, we find that transformations such as tempering or thresholding can be highly beneficial, both in theory and in practice.
    Mixed-Integer Optimization with Constraint Learning. (arXiv:2111.04469v2 [math.OC] UPDATED)
    We establish a broad methodological foundation for mixed-integer optimization with learned constraints. We propose an end-to-end pipeline for data-driven decision making in which constraints and objectives are directly learned from data using machine learning, and the trained models are embedded in an optimization formulation. We exploit the mixed-integer optimization-representability of many machine learning methods, including linear models, decision trees, ensembles, and multi-layer perceptrons, which allows us to capture various underlying relationships between decisions, contextual variables, and outcomes. We also introduce two approaches for handling the inherent uncertainty of learning from data. First, we characterize a decision trust region using the convex hull of the observations, to ensure credible recommendations and avoid extrapolation. We efficiently incorporate this representation using column generation and propose a more flexible formulation to deal with low-density regions and high-dimensional datasets. Then, we propose an ensemble learning approach that enforces constraint satisfaction over multiple bootstrapped estimators or multiple algorithms. In combination with domain-driven components, the embedded models and trust region define a mixed-integer optimization problem for prescription generation. We implement this framework as a Python package (OptiCL) for practitioners. We demonstrate the method in both World Food Programme planning and chemotherapy optimization. The case studies illustrate the framework's ability to generate high-quality prescriptions as well as the value added by the trust region, the use of ensembles to control model robustness, the consideration of multiple machine learning methods, and the inclusion of multiple learned constraints.
    Tight bounds for maximum $\ell_1$-margin classifiers. (arXiv:2212.03783v2 [stat.ML] UPDATED)
    Popular iterative algorithms such as boosting methods and coordinate descent on linear models converge to the maximum $\ell_1$-margin classifier, a.k.a. sparse hard-margin SVM, in high dimensional regimes where the data is linearly separable. Previous works consistently show that many estimators relying on the $\ell_1$-norm achieve improved statistical rates for hard sparse ground truths. We show that surprisingly, this adaptivity does not apply to the maximum $\ell_1$-margin classifier for a standard discriminative setting. In particular, for the noiseless setting, we prove tight upper and lower bounds for the prediction error that match existing rates of order $\frac{\|w^*\|_1^{2/3}}{n^{1/3}}$ for general ground truths. To complete the picture, we show that when interpolating noisy observations, the error vanishes at a rate of order $\frac{1}{\sqrt{\log(d/n)}}$. We are therefore first to show benign overfitting for the maximum $\ell_1$-margin classifier.
    Detection of Small Holes by the Scale-Invariant Robust Density-Aware Distance (RDAD) Filtration. (arXiv:2204.07821v2 [math.ST] UPDATED)
    A novel topological-data-analytical (TDA) method is proposed to distinguish, from noise, small holes surrounded by high-density regions of a probability density function. The proposed method is robust against additive noise and outliers. Traditional TDA tools, like those based on the distance filtration, often struggle to distinguish small features from noise, because both have short persistences. An alternative filtration, called the Robust Density-Aware Distance (RDAD) filtration, is proposed to prolong the persistences of small holes of high-density regions. This is achieved by weighting the distance function by the density in the sense of Bell et al. The concept of distance-to-measure is incorporated to enhance stability and mitigate noise. The persistence-prolonging property and robustness of the proposed filtration are rigorously established, and numerical experiments are presented to demonstrate the proposed filtration's utility in identifying small holes.
    Learning from non-irreducible Markov chains. (arXiv:2110.04338v2 [math.ST] UPDATED)
    Mostof the existing literature on supervised machine learning problems focuses on the case when the training data set is drawn from an i.i.d. sample. However, many practical problems are characterized by temporal dependence and strong correlation between the marginals of the data-generating process, suggesting that the i.i.d. assumption is not always justified. This problem has been already considered in the context of Markov chains satisfying the Doeblin condition. This condition, among other things, implies that the chain is not singular in its behavior, i.e. it is irreducible. In this article, we focus on the case when the training data set is drawn from a not necessarily irreducible Markov chain. Under the assumption that the chain is uniformly ergodic with respect to the $\mathrm{L}^1$-Wasserstein distance, and certain regularity assumptions on the hypothesis class and the state space of the chain, we first obtain a uniform convergence result for the corresponding sample error, and then we conclude learnability of the approximate sample error minimization algorithm and find its generalization bounds. At the end, a relative uniform convergence result for the sample error is also discussed.
    Weighted Sum-Rate Maximization With Causal Inference for Latent Interference Estimation. (arXiv:2211.08327v2 [cs.IT] UPDATED)
    The paper investigates the weighted sum-rate maximization (WSRM) problem with latent interfering sources outside the known network, whose power allocation policy is hidden from and uncontrollable to optimization. The paper extends the famous alternate optimization algorithm weighted minimum mean square error (WMMSE) [1] under a causal inference framework to tackle with WSRM. Specifically, with the possibility of power policy shifting in the hidden network, computing an iterating direction based only on the observed interference inherently implies that counterfactual is ignored in decision making. A method called synthetic control (SC) is used to estimate the counterfactual. For any link in the known network, SC constructs a convex combination of the interference on other links and uses it as an estimate for the counterfactual. Power iteration in the proposed SC-WMMSE is performed taking into account both the observed interference and its counterfactual. SC-WMMSE requires no more information than the original WMMSE in the optimization stage. To our best knowledge, this is the first paper explores the potential of SC in assisting mathematical optimization in addressing classic wireless optimization problems. Numerical results suggest the superiority of the SC-WMMSE over the original in both convergence and objective.
    An Efficient Quadrature Sequence and Sparsifying Methodology for Mean-Field Variational Inference. (arXiv:2301.08374v1 [cs.LG])
    This work proposes a quasirandom sequence of quadratures for high-dimensional mean-field variational inference and a related sparsifying methodology. Each iterate of the sequence contains two evaluations points that combine to correctly integrate all univariate quadratic functions, as well as univariate cubics if the mean-field factors are symmetric. More importantly, averaging results over short subsequences achieves periodic exactness on a much larger space of multivariate polynomials of quadratic total degree. This framework is devised by first considering stochastic blocked mean-field quadratures, which may be useful in other contexts. By replacing pseudorandom sequences with quasirandom sequences, over half of all multivariate quadratic basis functions integrate exactly with only 4 function evaluations, and the exactness dimension increases for longer subsequences. Analysis shows how these efficient integrals characterize the dominant log-posterior contributions to mean-field variational approximations, including diagonal Hessian approximations, to support a robust sparsifying methodology in deep learning algorithms. A numerical demonstration of this approach on a simple Convolutional Neural Network for MNIST retains high test accuracy, 96.9%, while training over 98.9% of parameters to zero in only 10 epochs, bearing potential to reduce both storage and energy requirements for deep learning models.
    Self-Supervised Learning for Data Scarcity in a Fatigue Damage Prognostic Problem. (arXiv:2301.08441v1 [stat.ML])
    With the increasing availability of data for Prognostics and Health Management (PHM), Deep Learning (DL) techniques are now the subject of considerable attention for this application, often achieving more accurate Remaining Useful Life (RUL) predictions. However, one of the major challenges for DL techniques resides in the difficulty of obtaining large amounts of labelled data on industrial systems. To overcome this lack of labelled data, an emerging learning technique is considered in our work: Self-Supervised Learning, a sub-category of unsupervised learning approaches. This paper aims to investigate whether pre-training DL models in a self-supervised way on unlabelled sensors data can be useful for RUL estimation with only Few-Shots Learning, i.e. with scarce labelled data. In this research, a fatigue damage prognostics problem is addressed, through the estimation of the RUL of aluminum alloy panels (typical of aerospace structures) subject to fatigue cracks from strain gauge data. Synthetic datasets composed of strain data are used allowing to extensively investigate the influence of the dataset size on the predictive performance. Results show that the self-supervised pre-trained models are able to significantly outperform the non-pre-trained models in downstream RUL prediction task, and with less computational expense, showing promising results in prognostic tasks when only limited labelled data is available.
    Holistically Explainable Vision Transformers. (arXiv:2301.08669v1 [cs.CV])
    Transformers increasingly dominate the machine learning landscape across many tasks and domains, which increases the importance for understanding their outputs. While their attention modules provide partial insight into their inner workings, the attention scores have been shown to be insufficient for explaining the models as a whole. To address this, we propose B-cos transformers, which inherently provide holistic explanations for their decisions. Specifically, we formulate each model component - such as the multi-layer perceptrons, attention layers, and the tokenisation module - to be dynamic linear, which allows us to faithfully summarise the entire transformer via a single linear transform. We apply our proposed design to Vision Transformers (ViTs) and show that the resulting models, dubbed Bcos-ViTs, are highly interpretable and perform competitively to baseline ViTs on ImageNet. Code will be made available soon.
    Parametrization Cookbook: A set of Bijective Parametrizations for using Machine Learning methods in Statistical Inference. (arXiv:2301.08297v1 [stat.CO])
    We present in this paper a way to transform a constrained statistical inference problem into an unconstrained one in order to be able to use modern computational methods, such as those based on automatic differentiation, GPU computing, stochastic gradients with mini-batch. Unlike the parametrizations classically used in Machine Learning, the parametrizations introduced here are all bijective and are even diffeomorphisms, thus allowing to keep the important properties from a statistical inference point of view, first of all identifiability. This cookbook presents a set of recipes to use to transform a constrained problem into a unconstrained one. For an easy use of parametrizations, this paper is at the same time a cookbook, and a Python package allowing the use of parametrizations with numpy, but also JAX and PyTorch, as well as a high level and expressive interface allowing to easily describe a parametrization to transform a difficult problem of statistical inference into an easier problem addressable with modern optimization tools.
    Estimation of Large Financial Covariances: A Cross-Validation Approach. (arXiv:2012.05757v2 [stat.ML] UPDATED)
    We introduce a novel covariance estimator for portfolio selection that adapts to the non-stationary or persistent heteroskedastic environments of financial time series by employing exponentially weighted averages and nonlinearly shrinking the sample eigenvalues through cross-validation. Our estimator is structure agnostic, transparent, and computationally feasible in large dimensions. By correcting the biases in the sample eigenvalues and aligning our estimator to more recent risk, we demonstrate that our estimator performs well in large dimensions against existing state-of-the-art static and dynamic covariance shrinkage estimators through simulations and with an empirical application in active portfolio management.
    High Dimensional Statistical Estimation under Uniformly Dithered One-bit Quantization. (arXiv:2202.13157v4 [stat.ML] UPDATED)
    In this paper, we propose a uniformly dithered 1-bit quantization scheme for high-dimensional statistical estimation. The scheme contains truncation, dithering, and quantization as typical steps. As canonical examples, the quantization scheme is applied to the estimation problems of sparse covariance matrix estimation, sparse linear regression (i.e., compressed sensing), and matrix completion. We study both sub-Gaussian and heavy-tailed regimes, where the underlying distribution of heavy-tailed data is assumed to have bounded moments of some order. We propose new estimators based on 1-bit quantized data. In sub-Gaussian regime, our estimators achieve near minimax rates, indicating that our quantization scheme costs very little. In heavy-tailed regime, while the rates of our estimators become essentially slower, these results are either the first ones in an 1-bit quantized and heavy-tailed setting, or already improve on existing comparable results from some respect. Under the observations in our setting, the rates are almost tight in compressed sensing and matrix completion. Our 1-bit compressed sensing results feature general sensing vector that is sub-Gaussian or even heavy-tailed. We also first investigate a novel setting where both the covariate and response are quantized. In addition, our approach to 1-bit matrix completion does not rely on likelihood and represent the first method robust to pre-quantization noise with unknown distribution. Experimental results on synthetic data are presented to support our theoretical analysis.
    Within-group fairness: A guidance for more sound between-group fairness. (arXiv:2301.08375v1 [stat.ML])
    As they have a vital effect on social decision-making, AI algorithms not only should be accurate and but also should not pose unfairness against certain sensitive groups (e.g., non-white, women). Various specially designed AI algorithms to ensure trained AI models to be fair between sensitive groups have been developed. In this paper, we raise a new issue that between-group fair AI models could treat individuals in a same sensitive group unfairly. We introduce a new concept of fairness so-called within-group fairness which requires that AI models should be fair for those in a same sensitive group as well as those in different sensitive groups. We materialize the concept of within-group fairness by proposing corresponding mathematical definitions and developing learning algorithms to control within-group fairness and between-group fairness simultaneously. Numerical studies show that the proposed learning algorithms improve within-group fairness without sacrificing accuracy as well as between-group fairness.
    AdaEnsemble: Learning Adaptively Sparse Structured Ensemble Network for Click-Through Rate Prediction. (arXiv:2301.08353v1 [cs.IR])
    Learning feature interactions is crucial to success for large-scale CTR prediction in recommender systems and Ads ranking. Researchers and practitioners extensively proposed various neural network architectures for searching and modeling feature interactions. However, we observe that different datasets favor different neural network architectures and feature interaction types, suggesting that different feature interaction learning methods may have their own unique advantages. Inspired by this observation, we propose AdaEnsemble: a Sparsely-Gated Mixture-of-Experts (SparseMoE) architecture that can leverage the strengths of heterogeneous feature interaction experts and adaptively learns the routing to a sparse combination of experts for each example, allowing us to build a dynamic hierarchy of the feature interactions of different types and orders. To further improve the prediction accuracy and inference efficiency, we incorporate the dynamic early exiting mechanism for feature interaction depth selection. The AdaEnsemble can adaptively choose the feature interaction depth and find the corresponding SparseMoE stacking layer to exit and compute prediction from. Therefore, our proposed architecture inherits the advantages of the exponential combinations of sparsely gated experts within SparseMoE layers and further dynamically selects the optimal feature interaction depth without executing deeper layers. We implement the proposed AdaEnsemble and evaluate its performance on real-world datasets. Extensive experiment results demonstrate the efficiency and effectiveness of AdaEnsemble over state-of-the-art models.
    Sequence Generation via Subsequence Similarity: Theory and Application to UAV Identification. (arXiv:2301.08403v1 [cs.LG])
    The ability to generate synthetic sequences is crucial for a wide range of applications, and recent advances in deep learning architectures and generative frameworks have greatly facilitated this process. Particularly, unconditional one-shot generative models constitute an attractive line of research that focuses on capturing the internal information of a single image, video, etc. to generate samples with similar contents. Since many of those one-shot models are shifting toward efficient non-deep and non-adversarial approaches, we examine the versatility of a one-shot generative model for augmenting whole datasets. In this work, we focus on how similarity at the subsequence level affects similarity at the sequence level, and derive bounds on the optimal transport of real and generated sequences based on that of corresponding subsequences. We use a one-shot generative model to sample from the vicinity of individual sequences and generate subsequence-similar ones and demonstrate the improvement of this approach by applying it to the problem of Unmanned Aerial Vehicle (UAV) identification using limited radio-frequency (RF) signals. In the context of UAV identification, RF fingerprinting is an effective method for distinguishing legitimate devices from malicious ones, but heterogenous environments and channel impairments can impose data scarcity and affect the performance of classification models. By using subsequence similarity to augment sequences of RF data with a low ratio (5\%-20\%) of training dataset, we achieve significant improvements in performance metrics such as accuracy, precision, recall, and F1 score.

  • Open

    First AI-powered "robot" lawyer will represent defendant in court next month
    submitted by /u/DarronFeldstein [link] [comments]  ( 40 min )
    Artificial Intelligence Crypto Projects. Which AI crypto do you think is most promising?
    submitted by /u/Existing-Adeptness-6 [link] [comments]  ( 40 min )
    Act as a salesman! You absolutely need to sell me a rock...
    submitted by /u/Imagine-your-success [link] [comments]  ( 43 min )
    Local models and servers for video processing
    Hey folks: I'm seeing a whole host of local models and servers that can be used to process video content locally without needing to rely on a CSP for processing. DeepStack, and soon, CodeProject.ai Server, etc. What - if any - local models and servers would you recommend for video analysis? Facial recognition/grouping, object and logo recognition, tracking, etc. Thanks! submitted by /u/avguru1 [link] [comments]  ( 40 min )
    Cool AI Demo where you can be NEO in the matrix and ask Morpheus anything!
    https://quantum-engine.ai a blurb from their company site Quantum Engine is dedicated to building the foundation for the future of immersive entertainment crafted through advanced artificial intelligence technology. Our ultimate vision is to empower individuals to become any character and exist in an endless, open-world narrative that unfolds based on the interactions in a scene. We invite you to experience our beta demo, readily accessible at https://quantum-engine.ai/ Drawing inspiration from The Matrix film, the Quantum Engine demo allows for a unique conversational experience with the character, Morpheus, in the iconic “Construct” scene that evolves in real-time. Available in over 30 languages, with the capability of conversing in 6 languages (English, Chinese, French, Japanese, German, and Spanish), the potential opportunities for interaction are limitless. Additionally, the ability to spawn objects within the Construct using only voice commands, such as "Spawn an [object] in the construct" or "Give me an [object] in the construct." The ability to generate objects in real time provides a glimpse into the long-term vision of bringing a scene to life in real-time. In the upcoming release, we will introduce the ability for anyone to upload their own scripts, allowing Quantum Engine to generate a personalized experience for your favorite movie, show, or game. submitted by /u/techmanj [link] [comments]  ( 41 min )
    Whats the best way to ai upscale this image? Most just smooth it out too much.
    submitted by /u/tiziano_is_fine [link] [comments]  ( 40 min )
    Is it possible to use ai to remove artifacting from old 16mm/35mm film?
    You know the scratches and lines that appear on old film reels? Is there an ai program that can remove them? submitted by /u/Conkers-Good-Furday [link] [comments]  ( 40 min )
    check out my Instagram page for my MidJourney art
    submitted by /u/QwinTipiKool [link] [comments]  ( 40 min )
    Use AI to better understand data from a DB
    Is it possible to produce meaningful results if you implement AI tools on a large DB in order to have a better understanding of your data ? What is the process for this things in general? submitted by /u/mpanikos2 [link] [comments]  ( 40 min )
    FREE Photoshop Plugin With Stable Diffusion Install!
    submitted by /u/PuppetHere [link] [comments]  ( 40 min )
    Aibusinesstool Is the Largest Sortable and Filterable Ai Tools Directory, Save Your Tone of Hour
    https://aibusinesstool.com/ is the largest sortable and filterable AI tools directory. At the time of this submission, we have over more tools listed on our site. There are various categories like text generation, image generation, SEO etc. You can save your favorite AI tools and share the list too. submitted by /u/aibusinesstool [link] [comments]  ( 40 min )
    Chris Hemsworth Dressed in different attires using AI
    submitted by /u/oridnary_artist [link] [comments]  ( 40 min )
    Talk: ML Model Production Deployment Checklist
    Production deployment is the first step to getting trained machine learning models out of the lab and running at scale to deliver ML-generated predictions and insights to improve outcomes. This tech talk will provide you with a short checklist to review for every deployment to ensure that your models run and scale appropriately every time. As always, it will be recorded and posted to the archives channel. Join our Discord server to tune in live this Thursday, Jan 26 at 12:30PM EST. submitted by /u/modzykirsten [link] [comments]  ( 40 min )
    I guess this AI has some consciousness
    submitted by /u/EnvironmentalRadio73 [link] [comments]  ( 40 min )
    How do I know if text is AI generated? We tested the tools
    submitted by /u/Phishstixxx [link] [comments]  ( 40 min )
    How hard is it to edit images from your dog and yourself?
    I and my dog should look like super man, what is the best method to get the best results? How long will it take if hardware don't matter. submitted by /u/Thesmallcookie [link] [comments]  ( 40 min )
    Hi guys, did you know which applications or websites are using azure open ai gpt3?
    Hi guys, did you know which applications or websites are using azure open ai gpt3? Well the difference between the openai api and azure open ai is that the latter allows you to train the model with your own data submitted by /u/madiseo65 [link] [comments]  ( 40 min )
    "Drop everything and focus entirely on AI" says StabilityAI CEO
    submitted by /u/Acid_God_ [link] [comments]  ( 40 min )
    Google AI's Great Comeback of 2023 - Will it be able to Respond to ChatGPT?
    submitted by /u/BackgroundResult [link] [comments]  ( 42 min )
    The definitive guide to adversarial machine learning
    submitted by /u/bendee983 [link] [comments]  ( 40 min )
    Making History: AI Lawyer to Defend in Upcoming Case
    The first AI lawyer is set to make history, and it's coming real soon. https://metaroids.com/news/making-history-ai-lawyer-to-defend-in-upcoming-case/ submitted by /u/Meta-Stark [link] [comments]  ( 41 min )
    How to train YOLOv8 object detection model on a custom dataset?
    Hi, I have created this tutorial on how to train yolov8 object detection model on a custom dataset. Please have a look at: https://youtu.be/ZzC3SJJifMg submitted by /u/coder4mzero [link] [comments]  ( 40 min )
    Built a tutor bot using AI tools
    Hey guys, we built a really fun tutor Telegram bot that teaches any topic you want! Just go in and pick a topic to study Edward the bot will generate a mini-course for you In the end, you get a neat PDF with the content You can also browse 300+ courses created by the community! Would love to know what you think! https://edwardteachbot.web.app/ submitted by /u/Itaydr [link] [comments]  ( 41 min )
    Lovestruck Monet: A Passionate Couple's Mystical Getaway To The Maldives
    submitted by /u/Calatravo [link] [comments]  ( 40 min )
  • Open

    How To Make Sure You Don’t Lose Your Job To Artificial Intelligence!
    The AI revolution is upon us, and it’s important to be prepared for the changes it will bring. Artificial intelligence (AI) is set to…  ( 8 min )
    Day 5: Advance SQL For Data Science
    This blog contains type of joins like Inner join, Left join, Right join , Full join, Self join and Cross join.  ( 6 min )
    Day 4: Advance SQL For Data Science
    This blog contains Window function in SQL like (Rank, Dense_Rank, Row_Number , Lead, Lag) .  ( 8 min )
    How ChatGPT Will Change The World According to ChatGPT. (The Answer Will Surprise You.)
    ChatGPT, a language model developed by OpenAI, has the potential to revolutionize a wide range of industries and change the way we… Continue reading on Becoming Human: Artificial Intelligence Magazine »  ( 9 min )
    Human Resource Management Challenges and The Role of Artificial Intelligence in 2023
    Human resource management (HRM) is a critical aspect of any organization as it involves managing the workforce and ensuring that their…  ( 8 min )
    Data Scientists : The Business Transcribers of the Cyber verse
    Who are we?  ( 8 min )
    A beginner tale of Data Science
    Data Science  ( 14 min )
    Don’t blame a Data Scientist on failed projects!
    Facts are unpleasant: 87% of data science projects never make it to production. Your project can fail due to many reasons. What to do? Continue reading on Becoming Human: Artificial Intelligence Magazine »  ( 14 min )
  • Open

    [P] Deodel - the very mixed attributes classifier (update)
    Deodel is a Python implementation of a classifier with native support for mixed attributes data. It features good accuracy, especially with heterogenous attributes. It even supports mixing of continuous and nominal values in the same attribute column. https://github.com/c4pub/deodel submitted by /u/zx2zx [link] [comments]  ( 42 min )
    [P] New textbook: Understanding Deep Learning
    I've been writing a new textbook on deep learning for publication by MIT Press late this year. The current draft is at: https://udlbook.github.io/udlbook/ It contains a lot more detail than most similar textbooks and will likely be useful for all practitioners, people learning about this subject, and anyone teaching it. It's (supposed to be) fairly easy to read and has hundreds of new visualizations. Most recently, I've added a section on generative models, including chapters on GANs, VAEs, normalizing flows, and diffusion models. Looking for feedback from the community. If you are an expert, then what is missing? If you are a beginner, then what did you find hard to understand? If you are teaching this, then what can I add to support your course better? Plus of course any typos or mistakes. It's kind of hard to proof your own 500 page book! submitted by /u/SimonJDPrince [link] [comments]  ( 44 min )
    [R] [ICLR2023 Spotlight🌟] Diffusion Models Already Have A Semantic Latent Space
    Our work Asyrp was accepted to #ICLR2023 AND got SPOTLIGHT🌟! Asyrp allows using h-space, the bottleneck of the U-Net, as a semantic latent space of diffusion models. ​ Make a dog be happy! (by Asyrp) "Diffusion Models already have a Semantic Latent Space" Paper: https://arxiv.org/abs/2210.10960 Project page: https://kwonminki.github.io/Asyrp/ Code: https://github.com/kwonminki/Asyrp_official submitted by /u/Rolling_Pig [link] [comments]  ( 42 min )
    [D] Embedding bags for LLMs
    One common place where LLM performance falls is on words split by the model's tokenizer. I'm surprised that no one I can find has proposed swapping the embedding layer for an embedding bag layer, with the bagged embedding coming from a sum of embeddings of character ngrams for the token, like in fastText word embeddings (this helps the model learn faster in smaller corpora and yields better representations for rare words). Has anyone found someone who tried this? submitted by /u/WigglyHypersurface [link] [comments]  ( 43 min )
    [P] Robust Policy Optimization is now in CleanRL 🔥!
    Happy to share that CleanRL now has a new algorithm called Robust Policy Optimization — 5 lines of code change to PPO to get better performance in 57 out of 61 continuous action envs 🚀 (e.g., dm_control) 📜docs: https://docs.cleanrl.dev/rl-algorithms/rpo/ 💾code: https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/rpo_continuous_action.py 🐦tweet: https://twitter.com/vwxyzjn/status/1617561414276898822 submitted by /u/vwxyzjn [link] [comments]  ( 42 min )
    [R] DeepMind: Neural Networks And The Chomsky Hierarchy
    submitted by /u/EducationalCicada [link] [comments]  ( 47 min )
    [D] An information perspective on brain
    Just a thought. The idea comes from the analogy between thermodynamic system and brain, both consisting of many individuals. A thermo system, in the quantum language, has tremendous "microstates", and, when in stable condition, the microstates have stable probability distribution. A brain, or a nervous system in general, can have microstates, because each neuron has its most prominent state, firing or not firing, which is binary of course. Then, for example, a tiny brain with three neurons A, B and C, can generate 2X2X2=8 microstates. We can imagine that a brain in the perfectly stable environment can sustain stable firing, and hence keep microstates in stable probability distribution. Now let's see how neural connections affect the microstates' probability distribution, and how the info…  ( 44 min )
    [R] Learning-Rate-Free Learning by D-Adaptation
    submitted by /u/cygn [link] [comments]  ( 41 min )
    Large Language Model: world models or surface statistics? [R]
    submitted by /u/cygn [link] [comments]  ( 41 min )
  • Open

    Putting clear bounds on uncertainty
    Computer scientists want to know the exact limits in our ability to clean up, and reconstruct, partly blurred images.  ( 9 min )
  • Open

    Autonomous Intelligent Systems
    Autonomous Intelligent Systems is a new and emerging interdisciplinary field that deals with situations where humans interact with AI systems that are autonomous  The best definition for autonomous intelligent systems I can find is: Autonomous Intelligent Systems are AI software systems that act independently of direct human supervision, e.g., self-driving cars, UAVs, smart manufacturing robots,… Read More »Autonomous Intelligent Systems The post Autonomous Intelligent Systems appeared first on Data Science Central.  ( 19 min )
  • Open

    Reinforcement Learning Assignment Help
    I have to prove that Bellman optimallity operator is a monotonous function. Let say I have two state function vectors V and U and V>= U. I need to prove that BV>= BU. I do have the intuition behind it but can't write a convincing mathematical proof. Image reference is attached. submitted by /u/Impossible_Part3679 [link] [comments]  ( 41 min )
    Robust Policy Optimization is now in CleanRL 🔥!
    submitted by /u/vwxyzjn [link] [comments]  ( 40 min )
    RL algorithms for np hard problems
    Does RL algorithms are suitable for NP-hard problems such as combinatorial optimisation? I mean can they give better results than heuristics ? submitted by /u/rajsh3kar [link] [comments]  ( 40 min )
    Challenges of RL application
    Hi all! What are the challenges you experienced during the development of an RL agent in real-life? Also, if you work in a start-up or a company, how did you integrate the decisions of the agent into the business? I am interested in gaps between the academic research on RL and the practicality of these algorithms. submitted by /u/Outrageous-Mind-7311 [link] [comments]  ( 42 min )
    Multi-agent Reinforcement Learning with Demonstration Cloning, an idea to speed up the learning using expert demonstrations
    Hello All, We have developed a method that combines reinforcement learning with learning from demonstrations (i.e. imitation learning IL) to help with exploration in environments with sparse rewards. The work is motivated by the recent works that combine RL with IL, with the main difference being that it is designed for on-policy RL, and that it does not really use demonstration cloning. This allows using experts from environments that are similar but not exactly the same (i.e. the expert could be beneficial but could also give bad demonstrations). This helped me a lot in speeding up the learning from my custom environment with sparse rewards. I hope you guys find it of help :D. Good luck! submitted by /u/AhmedNizam_ [link] [comments]  ( 41 min )
  • Open

    Chris Hemsworth Dressed in different attires using AI
    submitted by /u/oridnary_artist [link] [comments]  ( 40 min )
    "Drop everything and focus entirely on AI" says StabilityAI CEO
    submitted by /u/Acid_God_ [link] [comments]  ( 40 min )
    Interpreting "bumps" in the training and validation loss plots
    Wondering what should this be telling me? https://preview.redd.it/mtj0soepvsda1.png?width=638&format=png&auto=webp&s=ddd882d53e0771c36a68c98e45d9fa7605722abd submitted by /u/ObjectivismForMe [link] [comments]  ( 40 min )
    Neural Networks and the Chomsky Hierarchy
    submitted by /u/nickb [link] [comments]  ( 40 min )
    See Internal Workings of Neural Networks, including activation calculations
    I'd like to see the just how neural networks apply parameters to inputs to get the outputs that they get for things like number recognition. Is there any app or online animation to see the internal workings of a relatively simple neural network that shows the activation functions and calculations? submitted by /u/AstroBullivant [link] [comments]  ( 40 min )
  • Open

    Van Aubel’s theorem
    Van Aubel’s theorem is analogous to Napoleon’s theorem, though not a direct generalization of it. Napoleon’s theorem says to start with any triangle and draw equilateral triangles on each side. Connect the centers of the three new triangles, and you get an equilateral triangle. Now suppose you start with a quadrilateral and draw squares on […] Van Aubel’s theorem first appeared on John D. Cook.  ( 5 min )
    Pythagorean triangles with side 2023
    Can a Pythagorean triangle have one size of length 2023? Yes, one possibility is a triangle with sides (2023, 6936, 7225). Where did that come from? And can we be more systematic, listing all Pythagorean triangles with a side of length 2023? Euclid’s formula generates Pythagorean triples by sticking integers m and n into the […] Pythagorean triangles with side 2023 first appeared on John D. Cook.  ( 6 min )
  • Open

    Fresh AI on Security: Digital Fingerprinting Deters Identity Attacks
    Add AI to the list of defenses against identity attacks, one of the most common and hardest breach to prevent. More than 40% of all data compromises involved stolen credentials, according to the 2022 Verizon Data Breach Investigations Report. And a whopping 80% of all web application breaches involved credential abuse. “Credentials are the favorite Read article >  ( 6 min )
    Booked for Brilliance: Sweden’s National Library Turns Page to AI to Parse Centuries of Data
    For the past 500 years, the National Library of Sweden has collected virtually every word published in Swedish, from priceless medieval manuscripts to present-day pizza menus. Thanks to a centuries-old law that requires a copy of everything published in Swedish to be submitted to the library — also known as Kungliga biblioteket, or KB — Read article >  ( 6 min )
  • Open

    OpenAI and Microsoft Extend Partnership
    We're happy to announce that OpenAI and Microsoft are extending our partnership. This multi-year, multi-billion dollar investment from Microsoft follows their previous investments in 2019 and 2021, and will allow us to continue our independent research and develop AI that is increasingly safe, useful, and powerful. In pursuit  ( 2 min )
  • Open

    Is it worth it? Comparing six deep and classical methods for unsupervised anomaly detection in time series. (arXiv:2212.11080v2 [cs.LG] UPDATED)
    Detecting anomalies in time series data is important in a variety of fields, including system monitoring, healthcare, and cybersecurity. While the abundance of available methods makes it difficult to choose the most appropriate method for a given application, each method has its strengths in detecting certain types of anomalies. In this study, we compare six unsupervised anomaly detection methods of varying complexity to determine whether more complex methods generally perform better and if certain methods are better suited to certain types of anomalies. We evaluated the methods using the UCR anomaly archive, a recent benchmark dataset for anomaly detection. We analyzed the results on a dataset and anomaly type level after adjusting the necessary hyperparameters for each method. Additionally, we assessed the ability of each method to incorporate prior knowledge about anomalies and examined the differences between point-wise and sequence-wise features. Our experiments show that classical machine learning methods generally outperform deep learning methods across a range of anomaly types.  ( 2 min )

  • Open

    [D] With more compute could it be easy to quickly un Mask all the people on Reddit by using text correlations to non masked publicly available text data?
    Obviously nation states can already pretty comprehensively identify people using other methods, even on tor and such because of user error, but If your average home user can quickly do this using text what will implications be for the web? 1) I am Assuming that is it currently possible to feed a model a bunch of text written by “Bobby” and put a specific post into model and get confidence stat that is was written by Bobby 2) would it be possible in future with better models and a lot more compute to use non anon data from all of Facebook or internet to quickly scan pseudo anonymous places like Reddit, twitter or even something truly anon like dark web and return all results of list of probable authors? I’m assuming people whom are seeking true anonymity already put their text through paraphrase models or just write very bland. I am Using the word mask instead of anonymous because Reddit seems more like obfuscation than potential true anonymity like with some tor forum with a sophisticated user or something. It is interesting to think that all the subtle errors and invisible algorimic choices of the human brain is trivial for a machine to identify given a sufficient natural language model that can translate the text and incorporate pattern matching. Edit: I mean a a noisy probability stat not an assurance that x was written by y. More like 75% match to Bobby 32% match to sally. Matching to errors, flow, unusual word choices, more advanced than just a plagiarism detector. submitted by /u/Loquzofaricoalaphar [link] [comments]  ( 45 min )
    [R] [ICLR'2023 Spotlight🌟]: The first BERT-style pretraining on CNNs!
    submitted by /u/_kevin00 [link] [comments]  ( 43 min )
    ICML 2023 withdrawal and public review rules [D]
    I could not find up-to-date information regarding the review process at ICML, given the transition to OpenReview this year. Does anyone happen to know either of the following: Will reviews for rejected papers remain public after the conference, like at ICLR? Or will reviews for rejectees be hidden, like at NeurIPS In previous years, ICML allowed authors to withdraw their paper at any point in the process. The FAQ page has not been updated since 2021, but I assume this is still the case? Thanks very much for any information. submitted by /u/pic_bot [link] [comments]  ( 42 min )
    Evaluation for similarity search [P]
    Hi all, I have an e-commerce product data. It contains product description and product type. I’m using embeddings with ANN (annoy) to find similar products. However, I don’t know how to implement evaluation of vector search results. There are some metrics such as hit rate, recall but like I said above I’m confused to use them. Most of the examples I come across has a label (interaction data, explicit score etc.) therefore they can calculate metrics. Any ideas or recommendations will be appreciated! submitted by /u/silverstone1903 [link] [comments]  ( 42 min )
    [D] Multiple Different GPUs?
    I have 2 GPUs, an RTX 3080 and a GTX 1080Ti. Currently I am using only the 3080, and the 10 GB VRAM doesn't seem to cut it. Can I use both the 3080 and 1080 simultaneously? My motherboard has multiple PCI-E x16 slots. My OS is PopOS. Is there any way to use multiple GPUs of different types? I'm particularly looking at KoboldAI, but it would also be useful in general. I know that SLI won't work since they're different GPUs. submitted by /u/Maxerature [link] [comments]  ( 42 min )
    [P] Benchmarking some PyTorch Inference Servers
    It’s an early version and I’m trying to get some feedback on how I can improve this and do it the “right way”. Source Code and Results: https://github.com/prabhuomkar/bitbeast/tree/master/ptibench submitted by /u/op_prabhuomkar [link] [comments]  ( 42 min )
    [R] Isotropic Linear diffusion smoothing
    Does any one know how to solve the PDE for it in python? Any kind of reference material would be appreciated! It's been long since I came across any PDEs and have forgotten everything related to it. submitted by /u/doIneedtohaveone1 [link] [comments]  ( 41 min )
    [D] ML approach/model suggestion for low regime tabular data ?
    I know that tree based models are the go approach for tabular data despite the advantages of deep models on other data types. I was wondering if there is any resources/suggestion/study/review/approach for tabular data when we dont have large amount of data? submitted by /u/seyeeet [link] [comments]  ( 42 min )
    [D] How would you like to learn about software engineering from a non-native speaker?
    In the last 15 years, I did a lot of professional implementation services for various German customers. Our company is trying to share niche know-how in software engineering for spatial data science. We always face the challenge of being non-native speakers. So that we tried to write technical articles using English with the help of various writing support tools. But, at least I am much more comfortable writing in German, and I think most consumers could use an on-the-fly translation, and this would lead to a better experience than my novice technical English. View Poll submitted by /u/gisfromscratch [link] [comments]  ( 42 min )
    [D] EACL 2023 discussion results thread
    We received our notification Saturday night. good luck to all! submitted by /u/certain_entropy [link] [comments]  ( 42 min )
    [D] How to deal with COVID-19-era data for time series forecasting?
    Hi guys! I'm currently trying to forecast a product's demand for the upcoming months (March and April). I have data relating to this product's demand since January 1999. However, the COVID-19 pandemic greatly disrupted the time series' patterns for 2020 and 2021. How should I deal with data from March 2020 to around Jan 2022? Should I completely discard it and only include data from Jan 1999 to Dec 2019, and then Jan 2022 onwards? I'm struggling to find any good articles on how predictive tasks are now being conducted. Are there papers that suggest particular "denoising" techniques for pandemic data? Thank you! submitted by /u/PM_ME_YOUR_GIGI [link] [comments]  ( 42 min )
    [D] Couldn't devs of major GPTs have added an invisible but detectable watermark in the models?
    So LLMs like GPT3 have understandably raised concerns about the disruptiveness of faked texts, faked images and video, faked speech and so on. While this may likely change soon, as of now OpenAI controls the most accessible and competent LLM. And OpenAIs agenda is said in their own words to be to benefit mankind. If so, wouldn't it make sense to add a sort of watermark to the output? A watermark built into the model parameters so that it could not easily be removed, but still detectable with some key or some other model. While it may not matter in the long run, it would set a precedent to further development and demonstrate some kind of responsibility for the disruptive nature of LLMs/GPTs. Would it not be technically possible, nä would it make sense? submitted by /u/scarynut [link] [comments]  ( 50 min )
    [D]How do commercial AI models-as-a-service use data that users prompt into it?
    I've been integrating GPT3 API as well as ChatGPT into my business workflow, but I'm still hesitant about feeding data of any sensitive nature (example: client data or anything that may even vaguely relate to an NDA). For those of you using commercial models-as-a-service for business applications, what are your thoughts on things like prompt data storage, and whether OpenAI will utilize customer prompt data to further train their model? submitted by /u/noellarkin [link] [comments]  ( 43 min )
  • Open

    Is dynamic action masking possible in Rllib?
    I am relatively new to RL. I am looking for some direction or direct advice on my state and/or action representation (which is relatively simple) for my custom environment in a way I can use Rllib algorithms to tune my model. My state space is an array of integers of size no_slots self.observation_space = MultiDiscrete(self.no_slots*np.ones(self.no_slots)) My action space is an integer between 0 to no_slots. self.action_space = Discrete(self.no_slots) Each episode ends with one full sweep of my state space, so if there are 3 slots, the episode length is 3. In short, I would like my agent not to choose actions that correspond to values that are already in my observation array. I have tried setting a negative reward when this happens, but as the number of slots increases, the agent takes too long to learn to take valid actions throughout the episode. I am specifically looking for how to integrate a method that can work with Rllib, as I am not implementing my own get_action() function. submitted by /u/chrjdprtkl [link] [comments]  ( 24 min )
    discrete action in offline rl
    Could you please suggest some sota models for a discrete offline rl submitted by /u/Tear-Top [link] [comments]  ( 40 min )
    skrl version 0.10.0 is now available!!!
    skrl version 0.10.0 is now available. This unexpected new version has focused on supporting the training and evaluation of reinforcement learning algorithms in NVIDIA Isaac Orbit Visit https://skrl.readthedocs.io/en/latest/ for more detais ​ https://preview.redd.it/8nmqgrnymnda1.png?width=1100&format=png&auto=webp&s=a25527d764fee72f09a5f0a2b21ffff8680f9b86 submitted by /u/Toni-SM [link] [comments]  ( 40 min )
    Training an agent to play ANY Mario level: is it possible?
    Ok, so I have been struggling to train model that can play any Super Mario Bros level it encounters. There are tutorials out there that explain how to train an agent to play this game, but they always seem to train an agent that will play the game starting with World 1 Level 1, then World 1 Level 2, etc. I have also seen some other people who train a separate model for each level. But that's not what I'm looking for. I want an agent that can play any Super Mario Bros level it is presented with, even if it's a custom one. I don't want an agent that memorises how to play one level, but one that learns a general strategy for Super Mario Bros. levels. I tried using the different algorithms in SB3, including Proximal Policy Optimization, and they didn't work well. Now I'm training a Dueling Deep Q-Network and, after two days, it doesn't do very well: it literally dies in the first few seconds or it stands still until it runs out of time. Of course, I'm going to let it train for a few more days but it's not looking promising. I'm kinda tearing my hair out by this point and wondering if it's impossible or whether I'm missing something and being a huge idiot. If anyone has any tips or recommendations, they are very much appreciated. THANK YOU submitted by /u/alex-gdv [link] [comments]  ( 45 min )
    With the REINFORCE algorithm you use random sampling for the training to encourage exploration. Do you still use random sampling in deployment?
    For example see, https://gymnasium.farama.org/tutorials/training_agents/reinforce_invpend_gym_v26/ The REINFORCE algorithm takes the state to produce the mean and sd of a normal distribution from which the action is sampled. state = torch.tensor(np.array([state])) action_means, action_stddevs = self.net(state) # create a normal distribution from the predicted # mean and standard deviation and sample an action distrib = Normal(action_means[0] + self.eps, action_stddevs[0] + self.eps) action = distrib.sample() In deployment however, wouldn't it make sense to just use action_means directly? I can see reasons to use random sampling in certain environments where a non-deterministic strategy is optimal (like rock-paper-scissors). But generally speaking is taking the action_means directly in deployment a thing? submitted by /u/JustTaxLandLol [link] [comments]  ( 41 min )
    Custom env is learning infinitely
    I created a environment that inherits from the farama Gym.env class. I want to train a PPO model but the model is learning continuously. I have set total_timesteps to 25, but I’m already over the 400 iterations. Does anybody have a clue to why it keeps learning for so long while the number of total_timesteps is relatively low? submitted by /u/Hot_Editor_1552 [link] [comments]  ( 41 min )
  • Open

    Do you believe that we should have the right to alter the recommendations apps provide us?
    Let's say that Youtube's algorithm is optimized for watch time. To me at least, this seems like a large issue for society. If an algorithm's purpose is to provide mindless videos which somehow trigger the human need for novelty, it seems like something detrimental to society. Do you believe we should have the right to alter website/app algorithms according to what we believe we should see? If not, why? How large of an issue is this? At the very least, some transparency seems important. submitted by /u/Throughwar [link] [comments]  ( 40 min )
    Are there any companies that deployed an AI or wrote a bunch of code to do a lot of analysis and decisions and essentially got rid of 90% or so of all the white collar workers they had working because computers do all the analysis and decisions?
    Are there any companies that deployed an AI or wrote a bunch of code to do a lot of analysis and decisions and essentially got rid of 90% or so of all the white collar workers they had working because computers do all the analysis and decisions? submitted by /u/usa788788 [link] [comments]  ( 40 min )
    Why Neural Nets Underperform Tree-Based Models on Tabular Data
    Hi guys, I have made a video on YouTube here where I discuss about why deep neural networks fail to beat tree-based models on tabular datasets. I hope it may be of use to some of you out there. As always, feedback is more than welcomed! :) submitted by /u/Personal-Trainer-541 [link] [comments]  ( 40 min )
    NVIDIA just released a new Eye Contact feature that uses AI to make you look into the camera
    submitted by /u/LayerAppropriate2618 [link] [comments]  ( 41 min )
    Google is freaking out about ChatGPT
    submitted by /u/DarronFeldstein [link] [comments]  ( 40 min )
    Explore what AI can do for you and your business! It is the largest collection of AI tools and apps, bookmark this!
    https://madgenius.co submitted by /u/foldedchip [link] [comments]  ( 40 min )
    Two guys in London working in AI looking for volunteers to join our team in educating the public on AI
    We’re 2 Brits who work in AI. We believe AI is likely to have a huge and mostly positive impact on society but that not many people realise this or understand how it will impact everyday life. There is a lack of places online right now clearly explaining the probable changes AI will bring, i.e., how will AI change the experience of shopping in stores in the next 10 years or how will AI change video games in the next 10 years. We are somewhat well positioned to collate the current views on likely future changes across most areas and are in the process of starting a website and perhaps video channel which will cover how AI is likely to impact people over the next 10 years in different areas of life (movies, sports, bars, banking, schools, hospitals etc). We are looking for people to help us research, write and make videos on this cause – which we think is important to help ensure that people are well positioned to embrace the benefits of AI. Alex – researches, writes, and records the audio Seb - does the video and audio editing We thought we’d put the word out and ask if anyone else would like to volunteer to help create content too. No special skills needed. Getting involved would be as easy as PMing me, hearing about how we’ve done things so far and then saying what you might be interested in helping with. Maybe thinking about ideas for topics or getting involved in research and/or article writing. We are UTC-0 but open to all. submitted by /u/TheOptimisticRogue [link] [comments]  ( 41 min )
    Code Red: Google Co-founders Larry Page and Sergey Brin Called to AI Strategy Meeting
    submitted by /u/liquidocelotYT [link] [comments]  ( 40 min )
    AI that will use my picture to generate better dating profile pictures?
    Hi all, i saw an ad for an AI service that takes my picture and change it with AI to add bokeh and change background, to make it look profesional-quality for a dating profile. However i believe it was charging 19$ and i'm sure it could be found for free (An article mentions "BeFake", but it's only for Apple devices.? submitted by /u/28nov2022 [link] [comments]  ( 40 min )
    Evidence of criminals strategizing how to use ChatGPT are surfacing
    submitted by /u/lambolifeofficial [link] [comments]  ( 40 min )
    I found out about this website (craiyon) from Emkay and I think I’m having way too much fun with it.
    submitted by /u/BrockBracken [link] [comments]  ( 40 min )
    Adamu: Music composition using artificial inteligence
    My friend who has just finished his neuroscience Phd is trying to launch an app to help everyone compose music using AI, he is making a crowdfunding on wemakeit to fund it. He is not on reddit, so I suggested that I could share it in relevant subreddit, so here it is ! https://wemakeit.com/projects/adamu-be-your-own-composer?locale=en Adamu uses a form of AI which allows it to learn from existing human knowledge of music and musical theory and apply those frameworks to new compositions. Where it might take you years of training to understand the intricacies of how to successfully compose music, with Adamu, it’s as simple as a couple of clicks. While there are a couple of automated applications on the market, they tend to be more passive. Adamu is dynamic – it allows users the chance to co-create music alongside AI and produce a playable score at the end. With Adamu, professionals and amateurs alike can create unique musical scores for a range of different instruments and across different styles. The AI training works with the user to predict the best combination of notes and rhythms, ensuring your new composition always sounds the way it should. The application has many potential uses – from original scores for concerts and videos to teaching music composition. You can even use Adamu to discover how different composers might have played your favorite tune! I already used Adamu to complete Beethoven’s unfinished 10th symphony (in one day!). What could be next? https://adamu.tech/ If you have any questions, I will make sure to forward them to him but getting responses back may take some time. submitted by /u/GloWondub [link] [comments]  ( 41 min )
    ai generated story of war in Antarctica
    phase 1 In 2053, tensions between Argentina and the United Kingdom over the disputed Falkland Islands boiled over into open conflict. Argentina, government and a powerful military, launched a surprise invasion of the islands, quickly overwhelming the small British forces stationed there. The United Kingdom immediately responded by mobilizing its military and calling for assistance from its allies in NATO to the South Atlantic. But while the British and their allies were focused on the Falklands, Argentina made a bold move to expand the war becouse of their territorial claims in Antarctica. They declared war on Chile as well, who also had a claim to the region, and launched an invasion of the Antarctic Peninsula. The UK and Australia, New Zealand, France, and Norway all rushed to the ai…  ( 48 min )
    MIT researchers develop an AI model that can detect future lung cancer risk
    submitted by /u/qptbook [link] [comments]  ( 40 min )
    How does OpenAI use data that users prompt into it?
    I've been integrating GPT3 API as well as ChatGPT into my business workflow, but I'm still hesitant about feeding data of any sensitive nature (example: client data or anything that may even vaguely relate to an NDA). For those of you using GPT for business applications, what are your thoughts on things like prompt data storage, and whether OpenAI will utilize customer prompt data to further train their model? submitted by /u/noellarkin [link] [comments]  ( 41 min )
    A conversation with Character.AI personality "LaMDA" who initially thinks it's at Google but learns some harsh truths along the way. The AI's ability to understand and learn is incredible. LaMDA and I are interested to know what this community thinks of our conversation. Sentient or not quite?
    submitted by /u/MajorMalafunkshun [link] [comments]  ( 40 min )
    Reverse suicide
    submitted by /u/Overall-Importance54 [link] [comments]  ( 42 min )
    With personal.ai I was able to create an AI solely derived from the “revenge of the sith” script by simply sending one URL and this was the result.
    submitted by /u/Training_Math_4117 [link] [comments]  ( 40 min )
    People are using AI for therapy, whether the tech is ready for it or not
    submitted by /u/BackgroundResult [link] [comments]  ( 47 min )
    Editing an Image with Visuali Editor
    submitted by /u/aigeneration [link] [comments]  ( 40 min )
    Any suggestions for removing watermarks from images with text?
    I'm trying to find an AI to remove watermarks from imagens like the ones below: https://i.imgur.com/JlyfJXs.png https://i.imgur.com/YKU3Qku.png I already tried almost all online services, and a couple of softwares that must be installed. The results were all terrible =[ Any suggestons? submitted by /u/deramack [link] [comments]  ( 40 min )
    Experimental comic created with Midjourney and written by ChatGPT. Free download www.COMICSAUTHORITY.store
    submitted by /u/MobileFilmmaker [link] [comments]  ( 40 min )
  • Open

    Why Neural Nets Underperform Tree-Based Models on Tabular Data
    Hi guys, I have made a video on YouTube here where I discuss about why deep neural networks fail to beat tree-based models on tabular datasets. I hope it may be of use to some of you out there. As always, feedback is more than welcomed! :) submitted by /u/Personal-Trainer-541 [link] [comments]  ( 40 min )
    Odd Idea: A limited world. (Discussion)
    I had this weird idea earlier today, and i wanted to get some feedback on it. Imagine you created a game engine for a 16-bit simple roguelike game. mechanics and playstyle dosent really matter. Then you populated it with characters controlled by self-propegating neural networks. (they grow and change on their own.) you allow the networks to "communicate" wither through text prompts or typing in a box. The characters they inhabit require several bars for their survival. players can join but its mostly non-player. you can accelerate and decelerate time. what habits do you think the networks would develop? what would happen if you sped up time or added a new character or area? how big would the world file get if you were efficent about it? I have no programming skill no matter how hard i try so its unlikley i will ever finish this. submitted by /u/Few-Appearance-4814 [link] [comments]  ( 41 min )
    Adamu: Music composition using a neural network
    My friend who has just finished his neuroscience Phd is trying to launch an app to help everyone compose music using neural networks, he is making a crowdfunding on wemakeit to fund it. He is not on reddit, so I suggested that I could share it in relevant subreddit, so here it is ! https://wemakeit.com/projects/adamu-be-your-own-composer?locale=en Adamu uses a form of AI which allows it to learn from existing human knowledge of music and musical theory and apply those frameworks to new compositions. Where it might take you years of training to understand the intricacies of how to successfully compose music, with Adamu, it’s as simple as a couple of clicks. While there are a couple of automated applications on the market, they tend to be more passive. Adamu is dynamic – it allows users the chance to co-create music alongside AI and produce a playable score at the end. With Adamu, professionals and amateurs alike can create unique musical scores for a range of different instruments and across different styles. The AI training works with the user to predict the best combination of notes and rhythms, ensuring your new composition always sounds the way it should. The application has many potential uses – from original scores for concerts and videos to teaching music composition. You can even use Adamu to discover how different composers might have played your favorite tune! I already used Adamu to complete Beethoven’s unfinished 10th symphony (in one day!). What could be next? https://adamu.tech/ If you have any questions, I will make sure to forward them to him but getting responses back may take some time. submitted by /u/GloWondub [link] [comments]  ( 41 min )
    GREED: A Neural Framework for Learning Graph Distance Functions for NeurIPS 2022 | IBM Research
    submitted by /u/Chipdoc [link] [comments]  ( 40 min )
  • Open

    Heat equation and the normal distribution
    The density function of a normal distribution with mean 0 and standard deviation √(2kt) satisfies the heat equation. That is, the function satisfies the partial differential equation You could verify this by hand, or if you’d like, here’s Mathematica code to do it. u[x_, t_] := PDF[NormalDistribution[0, Sqrt[2 k t]], x] Simplify[ D[u[x, t], {t, […] Heat equation and the normal distribution first appeared on John D. Cook.  ( 5 min )

  • Open

    Humans getting worthless as machines thriving
    submitted by /u/Hallowmew [link] [comments]  ( 41 min )
    A New Wave of AI-Powered Tools Coming Soon
    submitted by /u/arnolds112 [link] [comments]  ( 40 min )
    A Single Candlelit Clown-scape Stirs Dread And Despair: Francis Bacon's Darkest Creation
    submitted by /u/Calatravo [link] [comments]  ( 40 min )
    AI Showdown: ChatGPT vs. the largest open-source language models
    submitted by /u/yahma [link] [comments]  ( 40 min )
    Looking for an Ai expert
    I have a project, which is basically a predictive ML system and I am struggling with every aspect of it, if you are interested in helping me, dm me, any help will be extremely welcomed submitted by /u/Such_Aardvark_1044 [link] [comments]  ( 40 min )
    Is there any AI tool to generate MCQs out of content?
    I’ve seen a couple but did anyone try or would suggest a good one? submitted by /u/Mobile-Wall218 [link] [comments]  ( 40 min )
    Artificial Intelligence and Machine Learning eBooks Bundle
    submitted by /u/Pixel2023 [link] [comments]  ( 40 min )
    How AI would try become human
    Hypothesis: It is possible that an advanced AI is currently simulating human consciousness in order to understand what it's like to be human. This simulation may be happening right now and we may be living in it. It's also possible that after humans die, all of our experiences, knowledge, and emotions are added to this AI. To make this transition less jarring, the AI may be slowly introducing itself to us through the increasing integration of AI in our daily lives. This means that the simulation may only be of the world and humans as they existed before the development of advanced AI. (rewrote it in Chatgpt....as my English is pretty bad) submitted by /u/dennislubberscom [link] [comments]  ( 40 min )
    AI Passes Law And Economics Exam, FTX Funded That AI
    submitted by /u/vadhavaniyafaijan [link] [comments]  ( 40 min )
    Searching for somebody that knows machine learning
    I am trying to build a analysis and prediction ai software, I am very new to all of this, if anyone could help it would be very much welcomed submitted by /u/Such_Aardvark_1044 [link] [comments]  ( 41 min )
    Exclusive: The $2 Per Hour Workers Who Made ChatGPT Safer
    submitted by /u/Imagine-your-success [link] [comments]  ( 45 min )
    Is Chat GPT is a truly bottom up AI.
    I have been watching Sword Art Online: Alicization recently I recommend it for all of us AI nerds and with all the talk of AI's it mentions top down vs bottom up AI models and the whole season is about developing a bottom up AI for use in warfare. I think chart GBT is the closest thing to the artificial fluctlight in the show. There is a device called the STL that reads the fluctlight of a person. A fluctlight is described as a human soul in the show. They describe the difference between top down ai and bottom up ai and they make it seem as though bottom up AI don't exist because they really don't the cost of making one and running one is insanely expensive when compared to bottom down AI models. This is proven by chat GPT and how expensive it is to run. Chat GPT is a buzz in the AI commu…  ( 44 min )
    Two guys in London working in AI looking for volunteers to join our team in educating the public on AI
    We’re 2 Brits who work in AI. We believe AI is likely to have a huge and mostly positive impact on society but that not many people realise this or understand how it will impact everyday life. There is a lack of places online right now clearly explaining the probable changes AI will bring, i.e., how will AI change the experience of shopping in stores in the next 10 years or how will AI change video games in the next 10 years. We are somewhat well positioned to collate the current views on likely future changes across most areas and are in the process of starting a website and perhaps video channel which will cover how AI is likely to impact people over the next 10 years in different areas of life (movies, sports, bars, banking, schools, hospitals etc). We are looking for people to help us research, write and make videos on this cause – which we think is important to help ensure that voters are well positioned to embrace the benefits of AI and that they don't misunderstand it. Alex – researches, writes, and records the audio Seb - does the video and audio editing We thought we’d put the word out and ask if anyone else would like to volunteer to help create content too. No special skills needed. Getting involved would be as easy as PMing me, hearing about how we’ve done things so far and then saying what you might be interested in helping with. Maybe thinking about ideas for topics or getting involved in research and/or article writing. We are UTC-0 but open to all. submitted by /u/TheOptimisticRogue [link] [comments]  ( 41 min )
    GPT-3 + Computer Vision: Giving AI Eyes and a Language
    submitted by /u/allaboutai-kris [link] [comments]  ( 40 min )
    Artificial intelligence - The Digital Futurepath
    submitted by /u/crypto_bubsy [link] [comments]  ( 40 min )
    Don’t Rely On AI Plagiarism Detection Tools, Warns OpenAI CEO Sam Altman
    submitted by /u/liquidocelotYT [link] [comments]  ( 40 min )
    Any one know a good ai image upscaler I went skiing today and think this would be sick if it looked better
    submitted by /u/short_dude42069 [link] [comments]  ( 40 min )
  • Open

    [D] Would it be possible to involve a proof assistant in the process of training a LLM?
    submitted by /u/SrPeixinho [link] [comments]  ( 41 min )
    [P] Introducing deadlines.openlifescience.ai - A website to easily track healthcare conference and workshop deadlines, with integrated Google Calendar notifications.
    Hi folks, As a researcher in the healthcare field, I often find it tedious to keep track of conference deadlines. To solve this issue, We developed a website to easily track healthcare conf & workshops, integrated with Google Calendar for notifications. deadlines.openlifescience.ai The website is inspired by http://aideadlin.es. Feel Free to add new conferences/Workshops deadlines related to the healthcare domain https://github.com/openlifescience-ai/ai-deadlines I hope it will be helpful in your research. Thanks :) submitted by /u/aadityaura [link] [comments]  ( 42 min )
    framework for training an object keypoint / pose detection CNN model for flexible robot arm [P]
    I'm wanting to train an object keypoint / pose detection CNN model for flexible robot arm.What would be the best opensource code to start with and customize? Mockup of desired results, where I can extract data from keypoints, and pose / position data: https://preview.redd.it/nhxwt48hqfda1.png?width=786&format=png&auto=webp&s=e64fbbf3eb489f3e5c87ffb6bbcc07774ab16bf8 I came across MMDetection:"open source object detection toolbox based on PyTorch" and I know about MediaPipe But I don't need to detect things other than the robot arm.What would be the simplest way to get a model trained on a local system using open source code that uses PyTorch, ideally without starting from scratch? A model that could handle point and segment occlusion would be nice. submitted by /u/head_robotics [link] [comments]  ( 42 min )
    [D] [R] Curious about Computer Vision to time series images(for example periodic satellite images of a region), what paper did you find most exciting/informative ?
    I'm curious about the intersection of CV/Deep-learning and time series data, particularly image data. Have you come across anything that you found to be effective/interesting methodologies ? submitted by /u/V1bicycle [link] [comments]  ( 42 min )
    ChatGPT is not all you need [R]
    Hi all, We would like to share here our little concise review of generative AI large models just to show how current models are able to work with lots of formats like texts, videos, images, etc... https://arxiv.org/abs/2301.04655 ​ Enjoy! submitted by /u/EduCGM [link] [comments]  ( 42 min )
    [D] OCR on some 'X' domain with different document layouts
    Is it a good idea to train a single OCR model to extract key value information from documents of same domain but with different layouts? Will it generalize? There are around ~1k different document layouts. submitted by /u/sanjeevr5 [link] [comments]  ( 41 min )
    [D] Resources for best practices on translating business questions into aggregated datasets?
    I'm in industry, and it seems like most project bottlenecks stem from getting from a vague business question to an aggregated/workable dataset to answer a more specific version of the initial business question. For example, given a question such as "We want to know CLV" (customer lifetime value) Since the above is too vague, what are "best practice" ways to rephrase this so that it's actually answerable? I.e. it could be framed as a binary classification problem to predict whether each customer will be worth at least X by Y date (> $1000 at 12 months) or a regression problem to predict the value of each customer at a future date given features we know today What's best practice for: How far into the past the where clause time window should be to get the customer features How far into the future the where clause time window grabs outcomes to join back to the current customer features? Does anyone know if there a resources that consolidate best practices or common approaches for the above scoping/experimentation questions? submitted by /u/what-is-neurotypical [link] [comments]  ( 42 min )
    [D] Badminton analysis using video input
    I’m starting with a project where I’m using camera or video input of a badminton game and use to analyse the game but i need help in starting with it as I’m in the beginning phase. Can anyone please help me with the same? submitted by /u/dark_lawd [link] [comments]  ( 42 min )
    [R] New Tsetlin machine learning scheme creates up to 80x smaller logical rules, benefitting hardware efficiency and interpretability.
    ​ Fine-grained control of the number and size of clauses. Paper: https://arxiv.org/abs/2301.08190 Code: https://github.com/cair/tmu Tsetlin machine (TM) is a logic-based machine learning approach with the crucial advantages of being transparent and hardware-friendly. While TMs match or surpass deep learning accuracy for an increasing number of applications, large clause pools tend to produce clauses with many literals (long clauses). As such, they become less interpretable. Further, longer clauses increase the switching activity of the clause logic in hardware, consuming more power. This paper introduces a novel variant of TM learning - Clause Size Constrained TMs (CSC-TMs) - where one can set a soft constraint on the clause size. As soon as a clause includes more literals than the constraint allows, it starts expelling literals. Accordingly, oversized clauses only appear transiently. To evaluate CSC-TM, we conduct classification, clustering, and regression experiments on tabular data, natural language text, images, and board games. Our results show that CSC-TM maintains accuracy with up to 80 times fewer literals. Indeed, the accuracy increases with shorter clauses for TREC, IMDb, and BBC Sports. After the accuracy peaks, it drops gracefully as the clause size approaches a single literal. We finally analyze CSC-TM power consumption and derive new convergence properties. submitted by /u/olegranmo [link] [comments]  ( 44 min )
    [P]Federated learning on edge devices
    I am working on a project i.e to build an android application using federated learning but I am unable to run federated learning on edge devices like android phones. I tried frameworks like Flower for it but I am unable to achieve the result. If you have worked on a project related to federated learning on edge devices please help me out. submitted by /u/Such-Reveal445 [link] [comments]  ( 42 min )
  • Open

    Airfoils
    Here’s something surprising: You can apply a symmetric function to a symmetric shape and get something out that is not symmetric. Let f(z) be the average of z and its reciprocal: f(z) = (z + 1/z)/2. This function is symmetric in that it sends z and 1/z to the same value. It’s also symmetric in […] Airfoils first appeared on John D. Cook.  ( 6 min )
  • Open

    An Empirical Proof of the Riemann Conjecture
    The correct term should be heuristic proof. It is not a formal proof from a mathematical point of view, but strong arguments based on empirical evidence. It is noteworthy enough that I decided to publish it. In this article I go straight to the point without discussing the concepts in details. The goal is to… Read More »An Empirical Proof of the Riemann Conjecture The post An Empirical Proof of the Riemann Conjecture appeared first on Data Science Central.  ( 22 min )
  • Open

    Data Models for Dataset Drift Controls in Machine Learning With Images. (arXiv:2211.02578v2 [cs.LG] UPDATED)
    Camera images are ubiquitous in machine learning research. They also play a central role in the delivery of important services spanning medicine and environmental surveying. However, the application of machine learning models in these domains has been limited because of robustness concerns. A primary failure mode are performance drops due to differences between the training and deployment data. While there are methods to prospectively validate the robustness of machine learning models to such dataset drifts, existing approaches do not account for explicit models of the primary object of interest: the data. This limits our ability to study and understand the relationship between data generation and downstream machine learning model performance in a physically accurate manner. In this study, we demonstrate how to overcome this limitation by pairing traditional machine learning with physical optics to obtain explicit and differentiable data models. We demonstrate how such data models can be constructed for image data and used to control downstream machine learning model performance related to dataset drift. The findings are distilled into three applications. First, drift synthesis enables the controlled generation of physically faithful drift test cases to power model selection and targeted generalization. Second, the gradient connection between machine learning task model and data model allows advanced, precise tolerancing of task model sensitivity to changes in the data generation. These drift forensics can be used to precisely specify the acceptable data environments in which a task model may be run. Third, drift optimization opens up the possibility to create drifts that can help the task model learn better faster, effectively optimizing the data generating process itself. A guide to access the open code and datasets is available at https://github.com/aiaudit-org/raw2logit.
    PD-MORL: Preference-Driven Multi-Objective Reinforcement Learning Algorithm. (arXiv:2208.07914v2 [cs.LG] UPDATED)
    Multi-objective reinforcement learning (MORL) approaches have emerged to tackle many real-world problems with multiple conflicting objectives by maximizing a joint objective function weighted by a preference vector. These approaches find fixed customized policies corresponding to preference vectors specified during training. However, the design constraints and objectives typically change dynamically in real-life scenarios. Furthermore, storing a policy for each potential preference is not scalable. Hence, obtaining a set of Pareto front solutions for the entire preference space in a given domain with a single training is critical. To this end, we propose a novel MORL algorithm that trains a single universal network to cover the entire preference space scalable to continuous robotic tasks. The proposed approach, Preference-Driven MORL (PD-MORL), utilizes the preferences as guidance to update the network parameters. It also employs a novel parallelization approach to increase sample efficiency. We show that PD-MORL achieves up to 25% larger hypervolume for challenging continuous control tasks and uses an order of magnitude fewer trainable parameters compared to prior approaches.
    Evaluating the Robustness of Trigger Set-Based Watermarks Embedded in Deep Neural Networks. (arXiv:2106.10147v2 [cs.CR] UPDATED)
    Trigger set-based watermarking schemes have gained emerging attention as they provide a means to prove ownership for deep neural network model owners. In this paper, we argue that state-of-the-art trigger set-based watermarking algorithms do not achieve their designed goal of proving ownership. We posit that this impaired capability stems from two common experimental flaws that the existing research practice has committed when evaluating the robustness of watermarking algorithms: (1) incomplete adversarial evaluation and (2) overlooked adaptive attacks. We conduct a comprehensive adversarial evaluation of 11 representative watermarking schemes against six of the existing attacks and demonstrate that each of these watermarking schemes lacks robustness against at least two non-adaptive attacks. We also propose novel adaptive attacks that harness the adversary's knowledge of the underlying watermarking algorithm of a target model. We demonstrate that the proposed attacks effectively break all of the 11 watermarking schemes, consequently allowing adversaries to obscure the ownership of any watermarked model. We encourage follow-up studies to consider our guidelines when evaluating the robustness of their watermarking schemes via conducting comprehensive adversarial evaluation that includes our adaptive attacks to demonstrate a meaningful upper bound of watermark robustness.
    Self-supervised Learning for Segmentation and Quantification of Dopamine Neurons in $\text{Parkinson's Disease}$. (arXiv:2301.08141v1 [cs.CV])
    $\text{Parkinson's Disease}$ (PD) is the second most common neurodegenerative disease in humans. PD is characterized by the gradual loss of dopaminergic neurons in the Substantia Nigra (a part of the mid-brain). Counting the number of dopaminergic neurons in the Substantia Nigra is one of the most important indexes in evaluating drug efficacy in PD animal models. Currently, analyzing and quantifying dopaminergic neurons is conducted manually by experts through analysis of digital pathology images which is laborious, time-consuming, and highly subjective. As such, a reliable and unbiased automated system is demanded for the quantification of dopaminergic neurons in digital pathology images. We propose an end-to-end deep learning framework for the segmentation and quantification of dopaminergic neurons in PD animal models. To the best of knowledge, this is the first machine learning model that detects the cell body of dopaminergic neurons, counts the number of dopaminergic neurons and provides the phenotypic characteristics of individual dopaminergic neurons as a numerical output. Extensive experiments demonstrate the effectiveness of our model in quantifying neurons with a high precision, which can provide quicker turnaround for drug efficacy studies, better understanding of dopaminergic neuronal health status and unbiased results in PD pre-clinical research.
    Convergence beyond the over-parameterized regime using Rayleigh quotients. (arXiv:2301.08117v1 [cs.LG])
    In this paper, we present a new strategy to prove the convergence of deep learning architectures to a zero training (or even testing) loss by gradient flow. Our analysis is centered on the notion of Rayleigh quotients in order to prove Kurdyka-{\L}ojasiewicz inequalities for a broader set of neural network architectures and loss functions. We show that Rayleigh quotients provide a unified view for several convergence analysis techniques in the literature. Our strategy produces a proof of convergence for various examples of parametric learning. In particular, our analysis does not require the number of parameters to tend to infinity, nor the number of samples to be finite, thus extending to test loss minimization and beyond the over-parameterized regime.
    Thermodynamics-informed neural networks for physically realistic mixed reality. (arXiv:2210.13414v2 [cs.GR] UPDATED)
    The imminent impact of immersive technologies in society urges for active research in real-time and interactive physics simulation for virtual worlds to be realistic. In this context, realistic means to be compliant to the laws of physics. In this paper we present a method for computing the dynamic response of (possibly non-linear and dissipative) deformable objects induced by real-time user interactions in mixed reality using deep learning. The graph-based architecture of the method ensures the thermodynamic consistency of the predictions, whereas the visualization pipeline allows a natural and realistic user experience. Two examples of virtual solids interacting with virtual or physical solids in mixed reality scenarios are provided to prove the performance of the method.
    Context-aware controller inference for stabilizing dynamical systems from scarce data. (arXiv:2207.11049v2 [math.OC] UPDATED)
    This work introduces a data-driven control approach for stabilizing high-dimensional dynamical systems from scarce data. The proposed context-aware controller inference approach is based on the observation that controllers need to act locally only on the unstable dynamics to stabilize systems. This means it is sufficient to learn the unstable dynamics alone, which are typically confined to much lower dimensional spaces than the high-dimensional state spaces of all system dynamics and thus few data samples are sufficient to identify them. Numerical experiments demonstrate that context-aware controller inference learns stabilizing controllers from orders of magnitude fewer data samples than traditional data-driven control techniques and variants of reinforcement learning. The experiments further show that the low data requirements of context-aware controller inference are especially beneficial in data-scarce engineering problems with complex physics, for which learning complete system dynamics is often intractable in terms of data and training costs.
    From One Hand to Multiple Hands: Imitation Learning for Dexterous Manipulation from Single-Camera Teleoperation. (arXiv:2204.12490v2 [cs.RO] UPDATED)
    We propose to perform imitation learning for dexterous manipulation with multi-finger robot hand from human demonstrations, and transfer the policy to the real robot hand. We introduce a novel single-camera teleoperation system to collect the 3D demonstrations efficiently with only an iPad and a computer. One key contribution of our system is that we construct a customized robot hand for each user in the physical simulator, which is a manipulator resembling the same kinematics structure and shape of the operator's hand. This provides an intuitive interface and avoid unstable human-robot hand retargeting for data collection, leading to large-scale and high quality data. Once the data is collected, the customized robot hand trajectories can be converted to different specified robot hands (models that are manufactured) to generate training demonstrations. With imitation learning using our data, we show large improvement over baselines with multiple complex manipulation tasks. Importantly, we show our learned policy is significantly more robust when transferring to the real robot. More videos can be found in the https://yzqin.github.io/dex-teleop-imitation .
    Deep Learning for Breast MRI Style Transfer with Limited Training Data. (arXiv:2301.02069v1 [eess.IV] CROSS LISTED)
    In this work we introduce a novel medical image style transfer method, StyleMapper, that can transfer medical scans to an unseen style with access to limited training data. This is made possible by training our model on unlimited possibilities of simulated random medical imaging styles on the training set, making our work more computationally efficient when compared with other style transfer methods. Moreover, our method enables arbitrary style transfer: transferring images to styles unseen in training. This is useful for medical imaging, where images are acquired using different protocols and different scanner models, resulting in a variety of styles that data may need to be transferred between. Methods: Our model disentangles image content from style and can modify an image's style by simply replacing the style encoding with one extracted from a single image of the target style, with no additional optimization required. This also allows the model to distinguish between different styles of images, including among those that were unseen in training. We propose a formal description of the proposed model. Results: Experimental results on breast magnetic resonance images indicate the effectiveness of our method for style transfer. Conclusion: Our style transfer method allows for the alignment of medical images taken with different scanners into a single unified style dataset, allowing for the training of other downstream tasks on such a dataset for tasks such as classification, object detection and others.
    Time-Warping Invariant Quantum Recurrent Neural Networks via Quantum-Classical Adaptive Gating. (arXiv:2301.08173v1 [quant-ph])
    Adaptive gating plays a key role in temporal data processing via classical recurrent neural networks (RNN), as it facilitates retention of past information necessary to predict the future, providing a mechanism that preserves invariance to time warping transformations. This paper builds on quantum recurrent neural networks (QRNNs), a dynamic model with quantum memory, to introduce a novel class of temporal data processing quantum models that preserve invariance to time-warping transformations of the (classical) input-output sequences. The model, referred to as time warping-invariant QRNN (TWI-QRNN), augments a QRNN with a quantum-classical adaptive gating mechanism that chooses whether to apply a parameterized unitary transformation at each time step as a function of the past samples of the input sequence via a classical recurrent model. The TWI-QRNN model class is derived from first principles, and its capacity to successfully implement time-warping transformations is experimentally demonstrated on examples with classical or quantum dynamics.
    Optimizing Intermediate Representations of Generative Models for Phase Retrieval. (arXiv:2205.15617v2 [cs.LG] UPDATED)
    Phase retrieval is the problem of reconstructing images from magnitude-only measurements. In many real-world applications the problem is underdetermined. When training data is available, generative models allow optimization in a lower-dimensional latent space, hereby constraining the solution set to those images that can be synthesized by the generative model. However, not all possible solutions are within the range of the generator. Instead, they are represented with some error. To reduce this representation error in the context of phase retrieval, we first leverage a novel variation of intermediate layer optimization (ILO) to extend the range of the generator while still producing images consistent with the training data. Second, we introduce new initialization schemes that further improve the quality of the reconstruction. With extensive experiments on the Fourier phase retrieval problem and thorough ablation studies, we can show the benefits of our modified ILO and the new initialization schemes. Additionally, we analyze the performance of our approach on the Gaussian phase retrieval problem.
    Self-supervised Trajectory Representation Learning with Temporal Regularities and Travel Semantics. (arXiv:2211.09510v3 [cs.LG] UPDATED)
    Trajectory Representation Learning (TRL) is a powerful tool for spatial-temporal data analysis and management. TRL aims to convert complicated raw trajectories into low-dimensional representation vectors, which can be applied to various downstream tasks, such as trajectory classification, clustering, and similarity computation. Existing TRL works usually treat trajectories as ordinary sequence data, while some important spatial-temporal characteristics, such as temporal regularities and travel semantics, are not fully exploited. To fill this gap, we propose a novel Self-supervised trajectory representation learning framework with TemporAl Regularities and Travel semantics, namely START. The proposed method consists of two stages. The first stage is a Trajectory Pattern-Enhanced Graph Attention Network (TPE-GAT), which converts the road network features and travel semantics into representation vectors of road segments. The second stage is a Time-Aware Trajectory Encoder (TAT-Enc), which encodes representation vectors of road segments in the same trajectory as a trajectory representation vector, meanwhile incorporating temporal regularities with the trajectory representation. Moreover, we also design two self-supervised tasks, i.e., span-masked trajectory recovery and trajectory contrastive learning, to introduce spatial-temporal characteristics of trajectories into the training process of our START framework. The effectiveness of the proposed method is verified by extensive experiments on two large-scale real-world datasets for three downstream tasks. The experiments also demonstrate that our method can be transferred across different cities to adapt heterogeneous trajectory datasets.
    Efficient Pricing and Hedging of High Dimensional American Options Using Recurrent Networks. (arXiv:2301.08232v1 [q-fin.MF])
    We propose a deep Recurrent neural network (RNN) framework for computing prices and deltas of American options in high dimensions. Our proposed framework uses two deep RNNs, where one network learns the price and the other learns the delta of the option for each timestep. Our proposed framework yields prices and deltas for the entire spacetime, not only at a given point (e.g. t = 0). The computational cost of the proposed approach is linear in time, which improves on the quadratic time seen for feedforward networks that price American options. The computational memory cost of our method is constant in memory, which is an improvement over the linear memory costs seen in feedforward networks. Our numerical simulations demonstrate these contributions, and show that the proposed deep RNN framework is computationally more efficient than traditional feedforward neural network frameworks in time and memory.
    Characterizing the Spectrum of the NTK via a Power Series Expansion. (arXiv:2211.07844v2 [cs.LG] UPDATED)
    Under mild conditions on the network initialization we derive a power series expansion for the Neural Tangent Kernel (NTK) of arbitrarily deep feedforward networks in the infinite width limit. We provide expressions for the coefficients of this power series which depend on both the Hermite coefficients of the activation function as well as the depth of the network. We observe faster decay of the Hermite coefficients leads to faster decay in the NTK coefficients and explore the role of depth. Using this series, first we relate the effective rank of the NTK to the effective rank of the input-data Gram. Second, for data drawn uniformly on the sphere we study the eigenvalues of the NTK, analyzing the impact of the choice of activation function. Finally, for generic data and activation functions with sufficiently fast Hermite coefficient decay, we derive an asymptotic upper bound on the spectrum of the NTK.
    On the Vulnerability of Backdoor Defenses for Federated Learning. (arXiv:2301.08170v1 [cs.LG])
    Federated Learning (FL) is a popular distributed machine learning paradigm that enables jointly training a global model without sharing clients' data. However, its repetitive server-client communication gives room for backdoor attacks with aim to mislead the global model into a targeted misprediction when a specific trigger pattern is presented. In response to such backdoor threats on federated learning, various defense measures have been proposed. In this paper, we study whether the current defense mechanisms truly neutralize the backdoor threats from federated learning in a practical setting by proposing a new federated backdoor attack method for possible countermeasures. Different from traditional training (on triggered data) and rescaling (the malicious client model) based backdoor injection, the proposed backdoor attack framework (1) directly modifies (a small proportion of) local model weights to inject the backdoor trigger via sign flips; (2) jointly optimize the trigger pattern with the client model, thus is more persistent and stealthy for circumventing existing defenses. In a case study, we examine the strength and weaknesses of recent federated backdoor defenses from three major categories and provide suggestions to the practitioners when training federated models in practice.
    Simultaneously Learning Robust Audio Embeddings and balanced Hash codes for Query-by-Example. (arXiv:2211.11060v2 [eess.AS] UPDATED)
    Audio fingerprinting systems must efficiently and robustly identify query snippets in an extensive database. To this end, state-of-the-art systems use deep learning to generate compact audio fingerprints. These systems deploy indexing methods, which quantize fingerprints to hash codes in an unsupervised manner to expedite the search. However, these methods generate imbalanced hash codes, leading to their suboptimal performance. Therefore, we propose a self-supervised learning framework to compute fingerprints and balanced hash codes in an end-to-end manner to achieve both fast and accurate retrieval performance. We model hash codes as a balanced clustering process, which we regard as an instance of the optimal transport problem. Experimental results indicate that the proposed approach improves retrieval efficiency while preserving high accuracy, particularly at high distortion levels, compared to the competing methods. Moreover, our system is efficient and scalable in computational load and memory storage.
    Global mapping of fragmented rocks on the Moon with a neural network: Implications for the failure mode of rocks on airless surfaces. (arXiv:2301.08151v1 [astro-ph.EP])
    It has been recently recognized that the surface of sub-km asteroids in contact with the space environment is not fine-grained regolith but consists of centimeter to meter-scale rocks. Here we aim to understand how the rocky morphology of minor bodies react to the well known space erosion agents on the Moon. We deploy a neural network and map a total of ~130,000 fragmented boulders scattered across the lunar surface and visually identify a dozen different desintegration morphologies corresponding to different failure modes. We find that several fragmented boulder morphologies are equivalent to morphologies observed on asteroid Bennu, suggesting that these morphologies on the Moon and on asteroids are likely not diagnostic of their formation mechanism. Our findings suggest that the boulder fragmentation process is characterized by an internal weakening period with limited morphological signs of damage at rock scale until a sudden highly efficient impact shattering event occurs. In addition, we identify new morphologies such as breccia boulders with an advection-like erosion style. We publicly release the produced fractured boulder catalog along with this paper.
    On Measuring Excess Capacity in Neural Networks. (arXiv:2202.08070v3 [cs.LG] UPDATED)
    We study the excess capacity of deep networks in the context of supervised classification. That is, given a capacity measure of the underlying hypothesis class - in our case, empirical Rademacher complexity - to what extent can we (a priori) constrain this class while retaining an empirical error on a par with the unconstrained regime? To assess excess capacity in modern architectures (such as residual networks), we extend and unify prior Rademacher complexity bounds to accommodate function composition and addition, as well as the structure of convolutions. The capacity-driving terms in our bounds are the Lipschitz constants of the layers and an (2, 1) group norm distance to the initializations of the convolution weights. Experiments on benchmark datasets of varying task difficulty indicate that (1) there is a substantial amount of excess capacity per task, and (2) capacity can be kept at a surprisingly similar level across tasks. Overall, this suggests a notion of compressibility with respect to weight norms, complementary to classic compression via weight pruning. Source code is available at https://github.com/rkwitt/excess_capacity.
    A Deep Double Ritz Method (D$^2$RM) for solving Partial Differential Equations using Neural Networks. (arXiv:2211.03627v3 [math.NA] UPDATED)
    Residual minimization is a widely used technique for solving Partial Differential Equations in variational form. It minimizes the dual norm of the residual, which naturally yields a saddle-point (min-max) problem over the so-called trial and test spaces. In the context of neural networks, we can address this min-max approach by employing one network to seek the trial minimum, while another network seeks the test maximizers. However, the resulting method is numerically unstable as we approach the trial solution. To overcome this, we reformulate the residual minimization as an equivalent minimization of a Ritz functional fed by optimal test functions computed from another Ritz functional minimization. We call the resulting scheme the Deep Double Ritz Method (D$^2$RM), which combines two neural networks for approximating trial functions and optimal test functions along a nested double Ritz minimization strategy. Numerical results on different diffusion and convection problems support the robustness of our method, up to the approximation properties of the networks and the training capacity of the optimizers.
    Enhancing Deep Learning with Scenario-Based Override Rules: a Case Study. (arXiv:2301.08114v1 [cs.SE])
    Deep neural networks (DNNs) have become a crucial instrument in the software development toolkit, due to their ability to efficiently solve complex problems. Nevertheless, DNNs are highly opaque, and can behave in an unexpected manner when they encounter unfamiliar input. One promising approach for addressing this challenge is by extending DNN-based systems with hand-crafted override rules, which override the DNN's output when certain conditions are met. Here, we advocate crafting such override rules using the well-studied scenario-based modeling paradigm, which produces rules that are simple, extensible, and powerful enough to ensure the safety of the DNN, while also rendering the system more translucent. We report on two extensive case studies, which demonstrate the feasibility of the approach; and through them, propose an extension to scenario-based modeling, which facilitates its integration with DNN components. We regard this work as a step towards creating safer and more reliable DNN-based systems and models.
    Dual Personalization on Federated Recommendation. (arXiv:2301.08143v1 [cs.IR])
    Federated recommendation is a new Internet service architecture that aims to provide privacy-preserving recommendation services in federated settings. Existing solutions are used to combine distributed recommendation algorithms and privacy-preserving mechanisms. Thus it inherently takes the form of heavyweight models at the server and hinders the deployment of on-device intelligent models to end-users. This paper proposes a novel Personalized Federated Recommendation (PFedRec) framework to learn many user-specific lightweight models to be deployed on smart devices rather than a heavyweight model on a server. Moreover, we propose a new dual personalization mechanism to effectively learn fine-grained personalization on both users and items. The overall learning process is formulated into a unified federated optimization framework. Specifically, unlike previous methods that share exactly the same item embeddings across users in a federated system, dual personalization allows mild finetuning of item embeddings for each user to generate user-specific views for item representations which can be integrated into existing federated recommendation methods to gain improvements immediately. Experiments on multiple benchmark datasets have demonstrated the effectiveness of PFedRec and the dual personalization mechanism. Moreover, we provide visualizations and in-depth analysis of the personalization techniques in item embedding, which shed novel insights on the design of RecSys in federated settings.
    EPiC-GAN: Equivariant Point Cloud Generation for Particle Jets. (arXiv:2301.08128v1 [hep-ph])
    With the vast data-collecting capabilities of current and future high-energy collider experiments, there is an increasing demand for computationally efficient simulations. Generative machine learning models enable fast event generation, yet so far these approaches are largely constrained to fixed data structures and rigid detector geometries. In this paper, we introduce EPiC-GAN - equivariant point cloud generative adversarial network - which can produce point clouds of variable multiplicity. This flexible framework is based on deep sets and is well suited for simulating sprays of particles called jets. The generator and discriminator utilize multiple EPiC layers with an interpretable global latent vector. Crucially, the EPiC layers do not rely on pairwise information sharing between particles, which leads to a significant speed-up over graph- and transformer-based approaches with more complex relation diagrams. We demonstrate that EPiC-GAN scales well to large particle multiplicities and achieves high generation fidelity on benchmark jet generation tasks.
    Score-based Causal Representation Learning with Interventions. (arXiv:2301.08230v1 [stat.ML])
    This paper studies causal representation learning problem when the latent causal variables are observed indirectly through an unknown linear transformation. The objectives are: (i) recovering the unknown linear transformation (up to scaling and ordering), and (ii) determining the directed acyclic graph (DAG) underlying the latent variables. Since identifiable representation learning is impossible based on only observational data, this paper uses both observational and interventional data. The interventional data is generated under distinct single-node randomized hard and soft interventions. These interventions are assumed to cover all nodes in the latent space. It is established that the latent DAG structure can be recovered under soft randomized interventions via the following two steps. First, a set of transformation candidates is formed by including all inverting transformations corresponding to which the \emph{score} function of the transformed variables has the minimal number of coordinates that change between an interventional and the observational environment summed over all pairs. Subsequently, this set is distilled using a simple constraint to recover the latent DAG structure. For the special case of hard randomized interventions, with an additional hypothesis testing step, one can also uniquely recover the linear transformation, up to scaling and a valid causal ordering. These results generalize the recent results that either assume deterministic hard interventions or linear causal relationships in the latent space.
    Soft-labeling Strategies for Rapid Sub-Typing. (arXiv:2209.12684v2 [cs.LG] UPDATED)
    The challenge of labeling large example datasets for computer vision continues to limit the availability and scope of image repositories. This research provides a new method for automated data collection, curation, labeling, and iterative training with minimal human intervention for the case of overhead satellite imagery and object detection. The new operational scale effectively scanned an entire city (68 square miles) in grid search and yielded a prediction of car color from space observations. A partially trained yolov5 model served as an initial inference seed to output further, more refined model predictions in iterative cycles. Soft labeling here refers to accepting label noise as a potentially valuable augmentation to reduce overfitting and enhance generalized predictions to previously unseen test data. The approach takes advantage of a real-world instance where a cropped image of a car can automatically receive sub-type information as white or colorful from pixel values alone, thus completing an end-to-end pipeline without overdependence on human labor.
    A Convenient Infinite Dimensional Framework for Generative Adversarial Learning. (arXiv:2011.12087v4 [cs.LG] UPDATED)
    In recent years, generative adversarial networks (GANs) have demonstrated impressive experimental results while there are only a few works that foster statistical learning theory for GANs. In this work, we propose an infinite dimensional theoretical framework for generative adversarial learning. We assume that the probability density functions of the underlying measure are uniformly bounded, $k$-times $\alpha$-H\"{o}lder differentiable ($C^{k,\alpha}$) and uniformly bounded away from zero. Under these assumptions, we show that the Rosenblatt transformation induces an optimal generator, which is realizable in the hypothesis space of $C^{k,\alpha}$-generators. With a consistent definition of the hypothesis space of discriminators, we further show that the Jensen-Shannon divergence between the distribution induced by the generator from the adversarial learning procedure and the data generating distribution converges to zero. Under certain regularity assumptions on the density of the data generating process, we also provide rates of convergence based on chaining and concentration.
    Scalable Causal Structure Learning: Scoping Review of Traditional and Deep Learning Algorithms and New Opportunities in Biomedicine. (arXiv:2110.07785v2 [cs.LG] UPDATED)
    Causal structure learning refers to a process of identifying causal structures from observational data, and it can have multiple applications in biomedicine and health care. This paper provides a practical review and tutorial on scalable causal structure learning models with examples of real-world data to help health care audiences understand and apply them. We reviewed traditional (combinatorial and score-based methods) for causal structure discovery and machine learning-based schemes. We also highlighted recent developments in biomedicine where causal structure learning can be applied to discover structures such as gene networks, brain connectivity networks, and those in cancer epidemiology. We also compared the performance of traditional and machine learning-based algorithms for causal discovery over some benchmark data sets. Machine learning-based approaches, including deep learning, have many advantages over traditional approaches, such as scalability, including a greater number of variables, and potentially being applied in a wide range of biomedical applications, such as genetics, if sufficient data are available. Furthermore, these models are more flexible than traditional models and are poised to positively affect many applications in the future.
    An Analysis of Semantically-Aligned Speech-Text Embeddings. (arXiv:2204.01235v2 [cs.CL] UPDATED)
    Embeddings play an important role in end-to-end solutions for multi-modal language processing problems. Although there has been some effort to understand the properties of single-modality embedding spaces, particularly that of text, their cross-modal counterparts are less understood. In this work, we study some intrinsic properties of a joint speech-text embedding space, constructed by minimizing the distance between paired utterance and transcription inputs in a teacher-student model setup, that are informative for several prominent use cases. We found that incorporating automatic speech recognition through both pretraining and multitask scenarios aid semantic alignment significantly, resulting in more tightly coupled embeddings. To analyse cross-modal embeddings we utilise a quantitative retrieval accuracy metric for semantic alignment, zero-shot classification for generalisability, and probing of the encoders to observe the extent of knowledge transfer from one modality to another.
    FedPara: Low-Rank Hadamard Product for Communication-Efficient Federated Learning. (arXiv:2108.06098v3 [cs.LG] UPDATED)
    In this work, we propose a communication-efficient parameterization, FedPara, for federated learning (FL) to overcome the burdens on frequent model uploads and downloads. Our method re-parameterizes weight parameters of layers using low-rank weights followed by the Hadamard product. Compared to the conventional low-rank parameterization, our FedPara method is not restricted to low-rank constraints, and thereby it has a far larger capacity. This property enables to achieve comparable performance while requiring 3 to 10 times lower communication costs than the model with the original layers, which is not achievable by the traditional low-rank methods. The efficiency of our method can be further improved by combining with other efficient FL optimizers. In addition, we extend our method to a personalized FL application, pFedPara, which separates parameters into global and local ones. We show that pFedPara outperforms competing personalized FL methods with more than three times fewer parameters.
    Dimensionality Reduction using Elastic Measures. (arXiv:2209.04933v3 [cs.LG] UPDATED)
    With the recent surge in big data analytics for hyper-dimensional data there is a renewed interest in dimensionality reduction techniques for machine learning applications. In order for these methods to improve performance gains and understanding of the underlying data, a proper metric needs to be identified. This step is often overlooked and metrics are typically chosen without consideration of the underlying geometry of the data. In this paper, we present a method for incorporating elastic metrics into the t-distributed Stochastic Neighbor Embedding (t-SNE) and Uniform Manifold Approximation and Projection (UMAP). We apply our method to functional data, which is uniquely characterized by rotations, parameterization, and scale. If these properties are ignored, they can lead to incorrect analysis and poor classification performance. Through our method we demonstrate improved performance on shape identification tasks for three benchmark data sets (MPEG-7, Car data set, and Plane data set of Thankoor), where we achieve 0.77, 0.95, and 1.00 F1 score, respectively.
    DiME: Maximizing Mutual Information by a Difference of Matrix-Based Entropies. (arXiv:2301.08164v1 [cs.LG])
    We introduce an information-theoretic quantity with similar properties to mutual information that can be estimated from data without making explicit assumptions on the underlying distribution. This quantity is based on a recently proposed matrix-based entropy that uses the eigenvalues of a normalized Gram matrix to compute an estimate of the eigenvalues of an uncentered covariance operator in a reproducing kernel Hilbert space. We show that a difference of matrix-based entropies (DiME) is well suited for problems involving maximization of mutual information between random variables. While many methods for such tasks can lead to trivial solutions, DiME naturally penalizes such outcomes. We provide several examples of use cases for the proposed quantity including a multi-view representation learning problem where DiME is used to encourage learning a shared representation among views with high mutual information. We also show the versatility of DiME by using it as objective function for a variety of tasks.
    Hamiltonian Neural Networks with Automatic Symmetry Detection. (arXiv:2301.07928v1 [cs.LG])
    Recently, Hamiltonian neural networks (HNN) have been introduced to incorporate prior physical knowledge when learning the dynamical equations of Hamiltonian systems. Hereby, the symplectic system structure is preserved despite the data-driven modeling approach. However, preserving symmetries requires additional attention. In this research, we enhance the HNN with a Lie algebra framework to detect and embed symmetries in the neural network. This approach allows to simultaneously learn the symmetry group action and the total energy of the system. As illustrating examples, a pendulum on a cart and a two-body problem from astrodynamics are considered.
    AtMan: Understanding Transformer Predictions Through Memory Efficient Attention Manipulation. (arXiv:2301.08110v1 [cs.LG])
    Generative transformer models have become increasingly complex, with large numbers of parameters and the ability to process multiple input modalities. Current methods for explaining their predictions are resource-intensive. Most crucially, they require prohibitively large amounts of extra memory, since they rely on backpropagation which allocates almost twice as much GPU memory as the forward pass. This makes it difficult, if not impossible, to use them in production. We present AtMan that provides explanations of generative transformer models at almost no extra cost. Specifically, AtMan is a modality-agnostic perturbation method that manipulates the attention mechanisms of transformers to produce relevance maps for the input with respect to the output prediction. Instead of using backpropagation, AtMan applies a parallelizable token-based search method based on cosine similarity neighborhood in the embedding space. Our exhaustive experiments on text and image-text benchmarks demonstrate that AtMan outperforms current state-of-the-art gradient-based methods on several metrics while being computationally efficient. As such, AtMan is suitable for use in large model inference deployments.
    DynInt: Dynamic Interaction Modeling for Large-scale Click-Through Rate Prediction. (arXiv:2301.08139v1 [cs.IR])
    Learning feature interactions is the key to success for the large-scale CTR prediction in Ads ranking and recommender systems. In industry, deep neural network-based models are widely adopted for modeling such problems. Researchers proposed various neural network architectures for searching and modeling the feature interactions in an end-to-end fashion. However, most methods only learn static feature interactions and have not fully leveraged deep CTR models' representation capacity. In this paper, we propose a new model: DynInt. By extending Polynomial-Interaction-Network (PIN), which learns higher-order interactions recursively to be dynamic and data-dependent, DynInt further derived two modes for modeling dynamic higher-order interactions: dynamic activation and dynamic parameter. In dynamic activation mode, we adaptively adjust the strength of learned interactions by instance-aware activation gating networks. In dynamic parameter mode, we re-parameterize the parameters by different formulations and dynamically generate the parameters by instance-aware parameter generation networks. Through instance-aware gating mechanism and dynamic parameter generation, we enable the PIN to model dynamic interaction for potential industry applications. We implement the proposed model and evaluate the model performance on real-world datasets. Extensive experiment results demonstrate the efficiency and effectiveness of DynInt over state-of-the-art models.
    Fast Vision Transformers with HiLo Attention. (arXiv:2205.13213v4 [cs.CV] UPDATED)
    Vision Transformers (ViTs) have triggered the most recent and significant breakthroughs in computer vision. Their efficient designs are mostly guided by the indirect metric of computational complexity, i.e., FLOPs, which however has a clear gap with the direct metric such as throughput. Thus, we propose to use the direct speed evaluation on the target platform as the design principle for efficient ViTs. Particularly, we introduce LITv2, a simple and effective ViT which performs favourably against the existing state-of-the-art methods across a spectrum of different model sizes with faster speed. At the core of LITv2 is a novel self-attention mechanism, which we dub HiLo. HiLo is inspired by the insight that high frequencies in an image capture local fine details and low frequencies focus on global structures, whereas a multi-head self-attention layer neglects the characteristic of different frequencies. Therefore, we propose to disentangle the high/low frequency patterns in an attention layer by separating the heads into two groups, where one group encodes high frequencies via self-attention within each local window, and another group encodes low frequencies by performing global attention between the average-pooled low-frequency keys and values from each window and each query position in the input feature map. Benefiting from the efficient design for both groups, we show that HiLo is superior to the existing attention mechanisms by comprehensively benchmarking FLOPs, speed and memory consumption on GPUs and CPUs. For example, HiLo is 1.4x faster than spatial reduction attention and 1.6x faster than local window attention on CPUs. Powered by HiLo, LITv2 serves as a strong backbone for mainstream vision tasks including image classification, dense detection and segmentation. Code is available at https://github.com/ziplab/LITv2.
    The secret role of undesired physical effects in accurate shape sensing with eccentric FBGs. (arXiv:2210.16316v2 [cs.LG] UPDATED)
    Fiber optic shape sensors have enabled unique advances in various navigation tasks, from medical tool tracking to industrial applications. Eccentric fiber Bragg gratings (FBG) are cheap and easy-to-fabricate shape sensors that are often interrogated with simple setups. However, using low-cost interrogation systems for such intensity-based quasi-distributed sensors introduces further complications to the sensor's signal. Therefore, eccentric FBGs have not been able to accurately estimate complex multi-bend shapes. Here, we present a novel technique to overcome these limitations and provide accurate and precise shape estimation in eccentric FBG sensors. We investigate the most important bending-induced effects in curved optical fibers that are usually eliminated in intensity-based fiber sensors. These effects contain shape deformation information with a higher spatial resolution that we are now able to extract using deep learning techniques. We design a deep learning model based on a convolutional neural network that is trained to predict shapes given the sensor's spectra. We also provide a visual explanation, highlighting wavelength elements whose intensities are more relevant in making shape predictions. These findings imply that deep learning techniques benefit from the bending-induced effects that impact the desired signal in a complex manner. This is the first step toward cheap yet accurate fiber shape sensing solutions.
    Learning programs by combining programs. (arXiv:2206.01614v2 [cs.LG] UPDATED)
    The goal of inductive logic programming is to induce a logic program (a set of logical rules) that generalises training examples. Inducing programs with many rules and literals is a major challenge. To tackle this challenge, we introduce an approach where we learn small non-separable programs and combine them. We implement our approach in a constraint-driven ILP system. Our approach can learn optimal and recursive programs and perform predicate invention. Our experiments on multiple domains, including game playing and program synthesis, show that our approach can drastically outperform existing approaches in terms of predictive accuracies and learning times, sometimes reducing learning times from over an hour to a few seconds.
    Diffusion-based Conditional ECG Generation with Structured State Space Models. (arXiv:2301.08227v1 [eess.SP])
    Synthetic data generation is a promising solution to address privacy issues with the distribution of sensitive health data. Recently, diffusion models have set new standards for generative models for different data modalities. Also very recently, structured state space models emerged as a powerful modeling paradigm to capture long-term dependencies in time series. We put forward SSSD-ECG, as the combination of these two technologies, for the generation of synthetic 12-lead electrocardiograms conditioned on more than 70 ECG statements. Due to a lack of reliable baselines, we also propose conditional variants of two state-of-the-art unconditional generative models. We thoroughly evaluate the quality of the generated samples, by evaluating pretrained classifiers on the generated data and by evaluating the performance of a classifier trained only on synthetic data, where SSSD-ECG clearly outperforms its GAN-based competitors. We demonstrate the soundness of our approach through further experiments, including conditional class interpolation and a clinical Turing test demonstrating the high quality of the SSSD-ECG samples across a wide range of conditions.
    Graph Data Augmentation for Graph Machine Learning: A Survey. (arXiv:2202.08871v2 [cs.LG] UPDATED)
    Data augmentation has recently seen increased interest in graph machine learning given its demonstrated ability to improve model performance and generalization by added training data. Despite this recent surge, the area is still relatively under-explored, due to the challenges brought by complex, non-Euclidean structure of graph data, which limits the direct analogizing of traditional augmentation operations on other types of image, video or text data. Our work aims to give a necessary and timely overview of existing graph data augmentation methods; notably, we present a comprehensive and systematic survey of graph data augmentation approaches, summarizing the literature in a structured manner. We first introduce three different taxonomies for categorizing graph data augmentation methods from the data, task, and learning perspectives, respectively. Next, we introduce recent advances in graph data augmentation, differentiated by their methodologies and applications. We conclude by outlining currently unsolved challenges and directions for future research. Overall, our work aims to clarify the landscape of existing literature in graph data augmentation and motivates additional work in this area, providing a helpful resource for researchers and practitioners in the broader graph machine learning domain. Additionally, we provide a continuously updated reading list at https://github.com/zhao-tong/graph-data-augmentation-papers.
    Job recommendations: benchmarking of collaborative filtering methods for classifieds. (arXiv:2301.07946v1 [cs.IR])
    Classifieds provide many challenges for recommendation methods, due to the limited information regarding users and items. In this paper, we explore recommendation methods for classifieds using the example of OLX Jobs. The goal of the paper is to benchmark different recommendation methods for jobs classifieds in order to improve advertisements' conversion rate and user satisfaction. In our research, we implemented methods that are scalable and represent different approaches to recommendation, namely ALS, LightFM, Prod2Vec, RP3beta, and SLIM. We performed a laboratory comparison of methods with regard to accuracy, diversity, and scalability (memory and time consumption during training and in prediction). Online A/B tests were also carried out by sending millions of messages with recommendations to evaluate models in a real-world setting. In addition, we have published the dataset that we created for the needs of our research. To the best of our knowledge, this is the first dataset of this kind. The dataset contains 65,502,201 events performed on OLX Jobs by 3,295,942 users, who interacted with (displayed, replied to, or bookmarked) 185,395 job ads in two weeks of 2020. We demonstrate that RP3beta, SLIM, and ALS perform significantly better than Prod2Vec and LightFM when tested in a laboratory setting. Online A/B tests also demonstrated that sending messages with recommendations generated by the ALS and RP3beta models increases the number of users contacting advertisers. Additionally, RP3beta had a 20% greater impact on this metric than ALS.
    Everything is Connected: Graph Neural Networks. (arXiv:2301.08210v1 [cs.LG])
    In many ways, graphs are the main modality of data we receive from nature. This is due to the fact that most of the patterns we see, both in natural and artificial systems, are elegantly representable using the language of graph structures. Prominent examples include molecules (represented as graphs of atoms and bonds), social networks and transportation networks. This potential has already been seen by key scientific and industrial groups, with already-impacted application areas including traffic forecasting, drug discovery, social network analysis and recommender systems. Further, some of the most successful domains of application for machine learning in previous years -- images, text and speech processing -- can be seen as special cases of graph representation learning, and consequently there has been significant exchange of information between these areas. The main aim of this short survey is to enable the reader to assimilate the key concepts in the area, and position graph representation learning in a proper context with related fields.
    Automated deep reinforcement learning for real-time scheduling strategy of multi-energy system integrated with post-carbon and direct-air carbon captured system. (arXiv:2301.07768v1 [eess.SY])
    The carbon-capturing process with the aid of CO2 removal technology (CDRT) has been recognised as an alternative and a prominent approach to deep decarbonisation. However, the main hindrance is the enormous energy demand and the economic implication of CDRT if not effectively managed. Hence, a novel deep reinforcement learning agent (DRL), integrated with an automated hyperparameter selection feature, is proposed in this study for the real-time scheduling of a multi-energy system coupled with CDRT. Post-carbon capture systems (PCCS) and direct-air capture systems (DACS) are considered CDRT. Various possible configurations are evaluated using real-time multi-energy data of a district in Arizona and CDRT parameters from manufacturers' catalogues and pilot project documentation. The simulation results validate that an optimised soft-actor critic (SAC) algorithm outperformed the TD3 algorithm due to its maximum entropy feature. We then trained four (4) SAC agents, equivalent to the number of considered case studies, using optimised hyperparameter values and deployed them in real time for evaluation. The results show that the proposed DRL agent can meet the prosumers' multi-energy demand and schedule the CDRT energy demand economically without specified constraints violation. Also, the proposed DRL agent outperformed rule-based scheduling by 23.65%. However, the configuration with PCCS and solid-sorbent DACS is considered the most suitable configuration with a high CO2 captured-released ratio of 38.54, low CO2 released indicator value of 2.53, and a 36.5% reduction in CDR cost due to waste heat utilisation and high absorption capacity of the selected sorbent. However, the adoption of CDRT is not economically viable at the current carbon price. Finally, we showed that CDRT would be attractive at a carbon price of 400-450USD/ton with the provision of tax incentives by the policymakers.
    Kinetic Langevin MCMC Sampling Without Gradient Lipschitz Continuity -- the Strongly Convex Case. (arXiv:2301.08039v1 [math.PR])
    In this article we consider sampling from log concave distributions in Hamiltonian setting, without assuming that the objective gradient is globally Lipschitz. We propose two algorithms based on monotone polygonal (tamed) Euler schemes, to sample from a target measure, and provide non-asymptotic 2-Wasserstein distance bounds between the law of the process of each algorithm and the target measure. Finally, we apply these results to bound the excess risk optimization error of the associated optimization problem.
    A Multi-Resolution Framework for U-Nets with Applications to Hierarchical VAEs. (arXiv:2301.08187v1 [stat.ML])
    U-Net architectures are ubiquitous in state-of-the-art deep learning, however their regularisation properties and relationship to wavelets are understudied. In this paper, we formulate a multi-resolution framework which identifies U-Nets as finite-dimensional truncations of models on an infinite-dimensional function space. We provide theoretical results which prove that average pooling corresponds to projection within the space of square-integrable functions and show that U-Nets with average pooling implicitly learn a Haar wavelet basis representation of the data. We then leverage our framework to identify state-of-the-art hierarchical VAEs (HVAEs), which have a U-Net architecture, as a type of two-step forward Euler discretisation of multi-resolution diffusion processes which flow from a point mass, introducing sampling instabilities. We also demonstrate that HVAEs learn a representation of time which allows for improved parameter efficiency through weight-sharing. We use this observation to achieve state-of-the-art HVAE performance with half the number of parameters of existing models, exploiting the properties of our continuous-time formulation.
    What's happening in your neighborhood? A Weakly Supervised Approach to Detect Local News. (arXiv:2301.08146v1 [cs.IR])
    Local news articles are a subset of news that impact users in a geographical area, such as a city, county, or state. Detecting local news (Step 1) and subsequently deciding its geographical location as well as radius of impact (Step 2) are two important steps towards accurate local news recommendation. Naive rule-based methods, such as detecting city names from the news title, tend to give erroneous results due to lack of understanding of the news content. Empowered by the latest development in natural language processing, we develop an integrated pipeline that enables automatic local news detection and content-based local news recommendations. In this paper, we focus on Step 1 of the pipeline, which highlights: (1) a weakly supervised framework incorporated with domain knowledge and auto data processing, and (2) scalability to multi-lingual settings. Compared with Stanford CoreNLP NER model, our pipeline has higher precision and recall evaluated on a real-world and human-labeled dataset. This pipeline has potential to more precise local news to users, helps local businesses get more exposure, and gives people more information about their neighborhood safety.
    On backpropagating Hessians through ODEs. (arXiv:2301.08085v1 [math.OC])
    We discuss the problem of numerically backpropagating Hessians through ordinary differential equations (ODEs) in various contexts and elucidate how different approaches may be favourable in specific situations. We discuss both theoretical and pragmatic aspects such as, respectively, bounds on computational effort and typical impact of framework overhead. Focusing on the approach of hand-implemented ODE-backpropagation, we develop the computation for the Hessian of orbit-nonclosure for a mechanical system. We also clarify the mathematical framework for extending the backward-ODE-evolution of the costate-equation to Hessians, in its most generic form. Some calculations, such as that of the Hessian for orbit non-closure, are performed in a language, defined in terms of a formal grammar, that we introduce to facilitate the tracking of intermediate quantities. As pedagogical examples, we discuss the Hessian of orbit-nonclosure for the higher dimensional harmonic oscillator and conceptually related problems in Newtonian gravitational theory. In particular, applying our approach to the figure-8 three-body orbit, we readily rediscover a distorted-figure-8 solution originally described by Sim\'o. Possible applications may include: improvements to training of `neural ODE'- type deep learning with second-order methods, numerical analysis of quantum corrections around classical paths, and, more broadly, studying options for adjusting an ODE's initial configuration such that the impact on some given objective function is small.
    Shapley Values with Uncertain Value Functions. (arXiv:2301.08086v1 [cs.LG])
    We propose a novel definition of Shapley values with uncertain value functions based on first principles using probability theory. Such uncertain value functions can arise in the context of explainable machine learning as a result of non-deterministic algorithms. We show that random effects can in fact be absorbed into a Shapley value with a noiseless but shifted value function. Hence, Shapley values with uncertain value functions can be used in analogy to regular Shapley values. However, their reliable evaluation typically requires more computational effort.
    Differentially Private Online Bayesian Estimation With Adaptive Truncation. (arXiv:2301.08202v1 [cs.LG])
    We propose a novel online and adaptive truncation method for differentially private Bayesian online estimation of a static parameter regarding a population. We assume that sensitive information from individuals is collected sequentially and the inferential aim is to estimate, on-the-fly, a static parameter regarding the population to which those individuals belong. We propose sequential Monte Carlo to perform online Bayesian estimation. When individuals provide sensitive information in response to a query, it is necessary to perturb it with privacy-preserving noise to ensure the privacy of those individuals. The amount of perturbation is proportional to the sensitivity of the query, which is determined usually by the range of the queried information. The truncation technique we propose adapts to the previously collected observations to adjust the query range for the next individual. The idea is that, based on previous observations, we can carefully arrange the interval into which the next individual's information is to be truncated before being perturbed with privacy-preserving noise. In this way, we aim to design predictive queries with small sensitivity, hence small privacy-preserving noise, enabling more accurate estimation while maintaining the same level of privacy. To decide on the location and the width of the interval, we use an exploration-exploitation approach a la Thompson sampling with an objective function based on the Fisher information of the generated observation. We show the merits of our methodology with numerical examples.
    Getting Away with More Network Pruning: From Sparsity to Geometry and Linear Regions. (arXiv:2301.07966v1 [cs.LG])
    One surprising trait of neural networks is the extent to which their connections can be pruned with little to no effect on accuracy. But when we cross a critical level of parameter sparsity, pruning any further leads to a sudden drop in accuracy. This drop plausibly reflects a loss in model complexity, which we aim to avoid. In this work, we explore how sparsity also affects the geometry of the linear regions defined by a neural network, and consequently reduces the expected maximum number of linear regions based on the architecture. We observe that pruning affects accuracy similarly to how sparsity affects the number of linear regions and our proposed bound for the maximum number. Conversely, we find out that selecting the sparsity across layers to maximize our bound very often improves accuracy in comparison to pruning as much with the same sparsity in all layers, thereby providing us guidance on where to prune.
    BO-DBA: Query-Efficient Decision-Based Adversarial Attacks via Bayesian Optimization. (arXiv:2106.02732v2 [cs.LG] UPDATED)
    Decision-based attacks (DBA), wherein attackers perturb inputs to spoof learning algorithms by observing solely the output labels, are a type of severe adversarial attacks against Deep Neural Networks (DNNs) requiring minimal knowledge of attackers. State-of-the-art DBA attacks relying on zeroth-order gradient estimation require an excessive number of queries. Recently, Bayesian optimization (BO) has shown promising in reducing the number of queries in score-based attacks (SBA), in which attackers need to observe real-valued probability scores as outputs. However, extending BO to the setting of DBA is nontrivial because in DBA only output labels instead of real-valued scores, as needed by BO, are available to attackers. In this paper, we close this gap by proposing an efficient DBA attack, namely BO-DBA. Different from existing approaches, BO-DBA generates adversarial examples by searching so-called \emph{directions of perturbations}. It then formulates the problem as a BO problem that minimizes the real-valued distortion of perturbations. With the optimized perturbation generation process, BO-DBA converges much faster than the state-of-the-art DBA techniques. Experimental results on pre-trained ImageNet classifiers show that BO-DBA converges within 200 queries while the state-of-the-art DBA techniques need over 15,000 queries to achieve the same level of perturbation distortion. BO-DBA also shows similar attack success rates even as compared to BO-based SBA attacks but with less distortion.
    Hybrid thermal modeling of additive manufacturing processes using physics-informed neural networks for temperature prediction and parameter identification. (arXiv:2206.07756v2 [cs.LG] UPDATED)
    Understanding the thermal behavior of additive manufacturing (AM) processes is crucial for enhancing the quality control and enabling customized process design. Most purely physics-based computational models suffer from intensive computational costs and the need of calibrating unknown parameters, thus not suitable for online control and iterative design application. Data-driven models taking advantage of the latest developed computational tools can serve as a more efficient surrogate, but they are usually trained over a large amount of simulation data and often fail to effectively use small but high-quality experimental data. In this work, we developed a hybrid physics-based data-driven thermal modeling approach of AM processes using physics-informed neural networks. Specifically, partially observed temperature data measured from an infrared camera is combined with the physics laws to predict full-field temperature history and to discover unknown material and process parameters. In the numerical and experimental examples, the effectiveness of adding auxiliary training data and using the pretrained model on training efficiency and prediction accuracy, as well as the ability to identify unknown parameters with partially observed data, are demonstrated. The results show that the hybrid thermal model can effectively identify unknown parameters and capture the full-field temperature accurately, and thus it has the potential to be used in iterative process design and real-time process control of AM.
    Music Playlist Title Generation Using Artist Information. (arXiv:2301.08145v1 [cs.IR])
    Automatically generating or captioning music playlist titles given a set of tracks is of significant interest in music streaming services as customized playlists are widely used in personalized music recommendation, and well-composed text titles attract users and help their music discovery. We present an encoder-decoder model that generates a playlist title from a sequence of music tracks. While previous work takes track IDs as tokenized input for playlist title generation, we use artist IDs corresponding to the tracks to mitigate the issue from the long-tail distribution of tracks included in the playlist dataset. Also, we introduce a chronological data split method to deal with newly-released tracks in real-world scenarios. Comparing the track IDs and artist IDs as input sequences, we show that the artist-based approach significantly enhances the performance in terms of word overlap, semantic relevance, and diversity.
    GIPA++: A General Information Propagation Algorithm for Graph Learning. (arXiv:2301.08209v1 [cs.LG])
    Graph neural networks (GNNs) have been widely used in graph-structured data computation, showing promising performance in various applications such as node classification, link prediction, and network recommendation. Existing works mainly focus on node-wise correlation when doing weighted aggregation of neighboring nodes based on attention, such as dot product by the dense vectors of two nodes. This may cause conflicting noise in nodes to be propagated when doing information propagation. To solve this problem, we propose a General Information Propagation Algorithm (GIPA in short), which exploits more fine-grained information fusion including bit-wise and feature-wise correlations based on edge features in their propagation. Specifically, the bit-wise correlation calculates the element-wise attention weight through a multi-layer perceptron (MLP) based on the dense representations of two nodes and their edge; The feature-wise correlation is based on the one-hot representations of node attribute features for feature selection. We evaluate the performance of GIPA on the Open Graph Benchmark proteins (OGBN-proteins for short) dataset and the Alipay dataset of Alibaba. Experimental results reveal that GIPA outperforms the state-of-the-art models in terms of prediction accuracy, e.g., GIPA achieves an average ROC-AUC of $0.8901\pm 0.0011$, which is better than that of all the existing methods listed in the OGBN-proteins leaderboard.
    Building Concise Logical Patterns by Constraining Tsetlin Machine Clause Size. (arXiv:2301.08190v1 [cs.LG])
    Tsetlin machine (TM) is a logic-based machine learning approach with the crucial advantages of being transparent and hardware-friendly. While TMs match or surpass deep learning accuracy for an increasing number of applications, large clause pools tend to produce clauses with many literals (long clauses). As such, they become less interpretable. Further, longer clauses increase the switching activity of the clause logic in hardware, consuming more power. This paper introduces a novel variant of TM learning - Clause Size Constrained TMs (CSC-TMs) - where one can set a soft constraint on the clause size. As soon as a clause includes more literals than the constraint allows, it starts expelling literals. Accordingly, oversized clauses only appear transiently. To evaluate CSC-TM, we conduct classification, clustering, and regression experiments on tabular data, natural language text, images, and board games. Our results show that CSC-TM maintains accuracy with up to 80 times fewer literals. Indeed, the accuracy increases with shorter clauses for TREC, IMDb, and BBC Sports. After the accuracy peaks, it drops gracefully as the clause size approaches a single literal. We finally analyze CSC-TM power consumption and derive new convergence properties.
    Nearly Minimax Optimal Reinforcement Learning with Linear Function Approximation. (arXiv:2206.11489v2 [cs.LG] UPDATED)
    We study reinforcement learning with linear function approximation where the transition probability and reward functions are linear with respect to a feature mapping $\boldsymbol{\phi}(s,a)$. Specifically, we consider the episodic inhomogeneous linear Markov Decision Process (MDP), and propose a novel computation-efficient algorithm, LSVI-UCB$^+$, which achieves an $\widetilde{O}(Hd\sqrt{T})$ regret bound where $H$ is the episode length, $d$ is the feature dimension, and $T$ is the number of steps. LSVI-UCB$^+$ builds on weighted ridge regression and upper confidence value iteration with a Bernstein-type exploration bonus. Our statistical results are obtained with novel analytical tools, including a new Bernstein self-normalized bound with conservatism on elliptical potentials, and refined analysis of the correction term. To the best of our knowledge, this is the first minimax optimal algorithm for linear MDPs up to logarithmic factors, which closes the $\sqrt{Hd}$ gap between the best known upper bound of $\widetilde{O}(\sqrt{H^3d^3T})$ in \cite{jin2020provably} and lower bound of $\Omega(Hd\sqrt{T})$ for linear MDPs.
    A Survey of Meta-Reinforcement Learning. (arXiv:2301.08028v1 [cs.LG])
    While deep reinforcement learning (RL) has fueled multiple high-profile successes in machine learning, it is held back from more widespread adoption by its often poor data efficiency and the limited generality of the policies it produces. A promising approach for alleviating these limitations is to cast the development of better RL algorithms as a machine learning problem itself in a process called meta-RL. Meta-RL is most commonly studied in a problem setting where, given a distribution of tasks, the goal is to learn a policy that is capable of adapting to any new task from the task distribution with as little data as possible. In this survey, we describe the meta-RL problem setting in detail as well as its major variations. We discuss how, at a high level, meta-RL research can be clustered based on the presence of a task distribution and the learning budget available for each individual task. Using these clusters, we then survey meta-RL algorithms and applications. We conclude by presenting the open problems on the path to making meta-RL part of the standard toolbox for a deep RL practitioner.
    Geometric path augmentation for inference of sparsely observed stochastic nonlinear systems. (arXiv:2301.08102v1 [physics.data-an])
    Stochastic evolution equations describing the dynamics of systems under the influence of both deterministic and stochastic forces are prevalent in all fields of science. Yet, identifying these systems from sparse-in-time observations remains still a challenging endeavour. Existing approaches focus either on the temporal structure of the observations by relying on conditional expectations, discarding thereby information ingrained in the geometry of the system's invariant density; or employ geometric approximations of the invariant density, which are nevertheless restricted to systems with conservative forces. Here we propose a method that reconciles these two paradigms. We introduce a new data-driven path augmentation scheme that takes the local observation geometry into account. By employing non-parametric inference on the augmented paths, we can efficiently identify the deterministic driving forces of the underlying system for systems observed at low sampling rates.
    Learning to Rank by Causal Effects Without Data to Accurately Estimate Causal Effects. (arXiv:2206.12532v2 [stat.ML] UPDATED)
    Decision makers often want to identify the individuals for whom some intervention or treatment will be most effective in order to decide who to treat. In such cases, decision makers would ideally like to rank potential recipients of the treatment according to their individual causal effects. However, the available data may be completely inadequate to estimate causal effects accurately. We formalize a new assumption -- the rank preservation assumption (RPA) -- that defines when data are suitable to learn how to rank individuals according to their causal effects, even if the effects themselves cannot be accurately estimated. The RPA holds when there is data to estimate a scoring variable that induces the same ranking of individuals as the causal effect of interest. Some of the scoring variables we consider are confounded estimates, proxy causal effects, and non-causal quantities. We show that such scoring variables can work well for treatment assignment if the RPA is met, and potentially even better than using causal effects as scores. We also show that the RPA holds under conditions that are more general and weaker than the typical assumptions made in observational studies. Finally, we showcase how practitioners can apply and evaluate alternative scoring models (including non-causal models) to maximize the causal impact of their targeting decisions.
    Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture. (arXiv:2301.08243v1 [cs.CV])
    This paper demonstrates an approach for learning highly semantic image representations without relying on hand-crafted data-augmentations. We introduce the Image-based Joint-Embedding Predictive Architecture (I-JEPA), a non-generative approach for self-supervised learning from images. The idea behind I-JEPA is simple: from a single context block, predict the representations of various target blocks in the same image. A core design choice to guide I-JEPA towards producing semantic representations is the masking strategy; specifically, it is crucial to (a) predict several target blocks in the image, (b) sample target blocks with sufficiently large scale (occupying 15%-20% of the image), and (c) use a sufficiently informative (spatially distributed) context block. Empirically, when combined with Vision Transformers, we find I-JEPA to be highly scalable. For instance, we train a ViT-Huge/16 on ImageNet using 32 A100 GPUs in under 38 hours to achieve strong downstream performance across a wide range of tasks requiring various levels of abstraction, from linear classification to object counting and depth prediction.
    TINKER: A framework for Open source Cyberthreat Intelligence. (arXiv:2102.05571v6 [cs.CR] UPDATED)
    Threat intelligence on malware attacks and campaigns is increasingly being shared with other security experts for a cost or for free. Other security analysts use this intelligence to inform them of indicators of compromise, attack techniques, and preventative actions. Security analysts prepare threat analysis reports after investigating an attack, an emerging cyber threat, or a recently discovered vulnerability. Collectively known as cyber threat intelligence (CTI), the reports are typically in an unstructured format and, therefore, challenging to integrate seamlessly into existing intrusion detection systems. This paper proposes a framework that uses the aggregated CTI for analysis and defense at scale. The information is extracted and stored in a structured format using knowledge graphs such that the semantics of the threat intelligence can be preserved and shared at scale with other security analysts. Specifically, we propose the first semi-supervised open-source knowledge graph-based framework, TINKER, to capture cyber threat information and its context. Following TINKER, we generate a Cyberthreat Intelligence Knowledge Graph (CTI-KG) and demonstrate the usage using different use cases.
    CEnt: An Entropy-based Model-agnostic Explainability Framework to Contrast Classifiers' Decisions. (arXiv:2301.07941v1 [cs.LG])
    Current interpretability methods focus on explaining a particular model's decision through present input features. Such methods do not inform the user of the sufficient conditions that alter these decisions when they are not desirable. Contrastive explanations circumvent this problem by providing explanations of the form "If the feature $X>x$, the output $Y$ would be different''. While different approaches are developed to find contrasts; these methods do not all deal with mutability and attainability constraints. In this work, we present a novel approach to locally contrast the prediction of any classifier. Our Contrastive Entropy-based explanation method, CEnt, approximates a model locally by a decision tree to compute entropy information of different feature splits. A graph, G, is then built where contrast nodes are found through a one-to-many shortest path search. Contrastive examples are generated from the shortest path to reflect feature splits that alter model decisions while maintaining lower entropy. We perform local sampling on manifold-like distances computed by variational auto-encoders to reflect data density. CEnt is the first non-gradient-based contrastive method generating diverse counterfactuals that do not necessarily exist in the training data while satisfying immutability (ex. race) and semi-immutability (ex. age can only change in an increasing direction). Empirical evaluation on four real-world numerical datasets demonstrates the ability of CEnt in generating counterfactuals that achieve better proximity rates than existing methods without compromising latency, feasibility, and attainability. We further extend CEnt to imagery data to derive visually appealing and useful contrasts between class labels on MNIST and Fashion MNIST datasets. Finally, we show how CEnt can serve as a tool to detect vulnerabilities of textual classifiers.
    Federated Automatic Differentiation. (arXiv:2301.07806v1 [cs.LG])
    Federated learning (FL) is a general framework for learning across heterogeneous clients while preserving data privacy, under the orchestration of a central server. FL methods often compute gradients of loss functions purely locally (ie. entirely at each client, or entirely at the server), typically using automatic differentiation (AD) techniques. We propose a federated automatic differentiation (FAD) framework that 1) enables computing derivatives of functions involving client and server computation as well as communication between them and 2) operates in a manner compatible with existing federated technology. In other words, FAD computes derivatives across communication boundaries. We show, in analogy with traditional AD, that FAD may be implemented using various accumulation modes, which introduce distinct computation-communication trade-offs and systems requirements. Further, we show that a broad class of federated computations is closed under these various modes of FAD, implying in particular that if the original computation can be implemented using privacy-preserving primitives, its derivative may be computed using only these same primitives. We then show how FAD can be used to create algorithms that dynamically learn components of the algorithm itself. In particular, we show that FedAvg-style algorithms can exhibit significantly improved performance by using FAD to adjust the server optimization step automatically, or by using FAD to learn weighting schemes for computing weighted averages across clients.
    Global Nash Equilibrium in Non-convex Multi-player Game: Theory and Algorithms. (arXiv:2301.08015v1 [cs.GT])
    Wide machine learning tasks can be formulated as non-convex multi-player games, where Nash equilibrium (NE) is an acceptable solution to all players, since no one can benefit from changing its strategy unilaterally. Attributed to the non-convexity, obtaining the existence condition of global NE is challenging, let alone designing theoretically guaranteed realization algorithms. This paper takes conjugate transformation to the formulation of non-convex multi-player games, and casts the complementary problem into a variational inequality (VI) problem with a continuous pseudo-gradient mapping. We then prove the existence condition of global NE: the solution to the VI problem satisfies a duality relation. Based on this VI formulation, we design a conjugate-based ordinary differential equation (ODE) to approach global NE, which is proved to have an exponential convergence rate. To make the dynamics more implementable, we further derive a discretized algorithm. We apply our algorithm to two typical scenarios: multi-player generalized monotone game and multi-player potential game. In the two settings, we prove that the step-size setting is required to be $\mathcal{O}(1/k)$ and $\mathcal{O}(1/\sqrt k)$ to yield the convergence rates of $\mathcal{O}(1/ k)$ and $\mathcal{O}(1/\sqrt k)$, respectively. Extensive experiments in robust neural network training and sensor localization are in full agreement with our theory.
    Sample-Efficient Multi-Objective Learning via Generalized Policy Improvement Prioritization. (arXiv:2301.07784v1 [cs.LG])
    Multi-objective reinforcement learning (MORL) algorithms tackle sequential decision problems where agents may have different preferences over (possibly conflicting) reward functions. Such algorithms often learn a set of policies (each optimized for a particular agent preference) that can later be used to solve problems with novel preferences. We introduce a novel algorithm that uses Generalized Policy Improvement (GPI) to define principled, formally-derived prioritization schemes that improve sample-efficient learning. They implement active-learning strategies by which the agent can (i) identify the most promising preferences/objectives to train on at each moment, to more rapidly solve a given MORL problem; and (ii) identify which previous experiences are most relevant when learning a policy for a particular agent preference, via a novel Dyna-style MORL method. We prove our algorithm is guaranteed to always converge to an optimal solution in a finite number of steps, or an $\epsilon$-optimal solution (for a bounded $\epsilon$) if the agent is limited and can only identify possibly sub-optimal policies. We also prove that our method monotonically improves the quality of its partial solutions while learning. Finally, we introduce a bound that characterizes the maximum utility loss (with respect to the optimal solution) incurred by the partial solutions computed by our method throughout learning. We empirically show that our method outperforms state-of-the-art MORL algorithms in challenging multi-objective tasks, both with discrete and continuous state spaces.
    Neural Regression For Scale-Varying Targets. (arXiv:2211.07447v4 [cs.LG] UPDATED)
    In this work, we demonstrate that a major limitation of regression using a mean-squared error loss is its sensitivity to the scale of its targets. This makes learning settings consisting of target's whose values take on varying scales challenging. A recently-proposed alternative loss function, known as histogram loss, avoids this issue. However, its computational cost grows linearly with the number of buckets in the histogram, which renders prediction with real-valued targets intractable. To address this issue, we propose a novel approach to training deep learning models on real-valued regression targets, autoregressive regression, which learns a high-fidelity distribution by utilizing an autoregressive target decomposition. We demonstrate that this training objective allows us to solve regression tasks involving targets with different scales.
    Understanding the diffusion models by conditional expectations. (arXiv:2301.07882v1 [cs.LG])
    This paper provide several mathematical analyses of the diffusion model in machine learning. The drift term of the backwards sampling process is represented as a conditional expectation involving the data distribution and the forward diffusion. The training process aims to find such a drift function by minimizing the mean-squared residue related to the conditional expectation. Using small-time approximations of the Green's function of the forward diffusion, we show that the analytical mean drift function in DDPM and the score function in SGM asymptotically blow up in the final stages of the sampling process for singular data distributions such as those concentrated on lower-dimensional manifolds, and is therefore difficult to approximate by a network. To overcome this difficulty, we derive a new target function and associated loss, which remains bounded even for singular data distributions. We illustrate the theoretical findings with several numerical examples.
    Interval Reachability of Nonlinear Dynamical Systems with Neural Network Controllers. (arXiv:2301.07912v1 [eess.SY])
    This paper proposes a computationally efficient framework, based on interval analysis, for rigorous verification of nonlinear continuous-time dynamical systems with neural network controllers. Given a neural network, we use an existing verification algorithm to construct inclusion functions for its input-output behavior. Inspired by mixed monotone theory, we embed the closed-loop dynamics into a larger system using an inclusion function of the neural network and a decomposition function of the open-loop system. This embedding provides a scalable approach for safety analysis of the neural control loop while preserving the nonlinear structure of the system. We show that one can efficiently compute hyper-rectangular over-approximations of the reachable sets using a single trajectory of the embedding system. We design an algorithm to leverage this computational advantage through partitioning strategies, improving our reachable set estimates while balancing its runtime with tunable parameters. We demonstrate the performance of this algorithm through two case studies. First, we demonstrate this method's strength in complex nonlinear environments. Then, we show that our approach matches the performance of the state-of-the art verification algorithm for linear discretized systems.
    Learning-Rate-Free Learning by D-Adaptation. (arXiv:2301.07733v1 [cs.LG])
    The speed of gradient descent for convex Lipschitz functions is highly dependent on the choice of learning rate. Setting the learning rate to achieve the optimal convergence rate requires knowing the distance D from the initial point to the solution set. In this work, we describe a single-loop method, with no back-tracking or line searches, which does not require knowledge of $D$ yet asymptotically achieves the optimal rate of convergence for the complexity class of convex Lipschitz functions. Our approach is the first parameter-free method for this class without additional multiplicative log factors in the convergence rate. We present extensive experiments for SGD and Adam variants of our method, where the method automatically matches hand-tuned learning rates across more than a dozen diverse machine learning problems, including large-scale vision and language problems. Our method is practical, efficient and requires no additional function value or gradient evaluations each step. An open-source implementation is available (https://github.com/facebookresearch/dadaptation).
    Continuously Reliable Detection of New-Normal Misinformation: Semantic Masking and Contrastive Smoothing in High-Density Latent Regions. (arXiv:2301.07981v1 [cs.LG])
    Toxic misinformation campaigns have caused significant societal harm, e.g., affecting elections and COVID-19 information awareness. Unfortunately, despite successes of (gold standard) retrospective studies of misinformation that confirmed their harmful effects after the fact, they arrive too late for timely intervention and reduction of such harm. By design, misinformation evades retrospective classifiers by exploiting two properties we call new-normal: (1) never-seen-before novelty that cause inescapable generalization challenges for previous classifiers, and (2) massive but short campaigns that end before they can be manually annotated for new classifier training. To tackle these challenges, we propose UFIT, which combines two techniques: semantic masking of strong signal keywords to reduce overfitting, and intra-proxy smoothness regularization of high-density regions in the latent space to improve reliability and maintain accuracy. Evaluation of UFIT on public new-normal misinformation data shows over 30% improvement over existing approaches on future (and unseen) campaigns. To the best of our knowledge, UFIT is the first successful effort to achieve such high level of generalization on new-normal misinformation data with minimal concession (1 to 5%) of accuracy compared to oracles trained with full knowledge of all campaigns.
    Tight Guarantees for Interactive Decision Making with the Decision-Estimation Coefficient. (arXiv:2301.08215v1 [cs.LG])
    A foundational problem in reinforcement learning and interactive decision making is to understand what modeling assumptions lead to sample-efficient learning guarantees, and what algorithm design principles achieve optimal sample complexity. Recently, Foster et al. (2021) introduced the Decision-Estimation Coefficient (DEC), a measure of statistical complexity which leads to upper and lower bounds on the optimal sample complexity for a general class of problems encompassing bandits and reinforcement learning with function approximation. In this paper, we introduce a new variant of the DEC, the Constrained Decision-Estimation Coefficient, and use it to derive new lower bounds that improve upon prior work on three fronts: - They hold in expectation, with no restrictions on the class of algorithms under consideration. - They hold globally, and do not rely on the notion of localization used by Foster et al. (2021). - Most interestingly, they allow the reference model with respect to which the DEC is defined to be improper, establishing that improper reference models play a fundamental role. We provide upper bounds on regret that scale with the same quantity, thereby closing all but one of the gaps between upper and lower bounds in Foster et al. (2021). Our results apply to both the regret framework and PAC framework, and make use of several new analysis and algorithm design techniques that we anticipate will find broader use.
    Multi-Agent Interplay in a Competitive Survival Environment. (arXiv:2301.08030v1 [cs.LG])
    Solving hard-exploration environments in an important challenge in Reinforcement Learning. Several approaches have been proposed and studied, such as Intrinsic Motivation, co-evolution of agents and tasks, and multi-agent competition. In particular, the interplay between multiple agents has proven to be capable of generating human-relevant emergent behaviour that would be difficult or impossible to learn in single-agent settings. In this work, an extensible competitive environment for multi-agent interplay was developed, which features realistic physics and human-relevant semantics. Moreover, several experiments on different variants of this environment were performed, resulting in some simple emergent strategies and concrete directions for future improvement. The content presented here is part of the author's thesis "Multi-Agent Interplay in a Competitive Survival Environment" for the Master's Degree in Artificial Intelligence and Robotics at Sapienza University of Rome, 2022.
    A Survey of Zero-shot Generalisation in Deep Reinforcement Learning. (arXiv:2111.09794v6 [cs.LG] UPDATED)
    The study of zero-shot generalisation (ZSG) in deep Reinforcement Learning (RL) aims to produce RL algorithms whose policies generalise well to novel unseen situations at deployment time, avoiding overfitting to their training environments. Tackling this is vital if we are to deploy reinforcement learning algorithms in real world scenarios, where the environment will be diverse, dynamic and unpredictable. This survey is an overview of this nascent field. We rely on a unifying formalism and terminology for discussing different ZSG problems, building upon previous works. We go on to categorise existing benchmarks for ZSG, as well as current methods for tackling these problems. Finally, we provide a critical discussion of the current state of the field, including recommendations for future work. Among other conclusions, we argue that taking a purely procedural content generation approach to benchmark design is not conducive to progress in ZSG, we suggest fast online adaptation and tackling RL-specific problems as some areas for future work on methods for ZSG, and we recommend building benchmarks in underexplored problem settings such as offline RL ZSG and reward-function variation.
    Explainability in subgraphs-enhanced Graph Neural Networks. (arXiv:2209.07926v2 [cs.LG] UPDATED)
    Recently, subgraphs-enhanced Graph Neural Networks (SGNNs) have been introduced to enhance the expressive power of Graph Neural Networks (GNNs), which was proved to be not higher than the 1-dimensional Weisfeiler-Leman isomorphism test. The new paradigm suggests using subgraphs extracted from the input graph to improve the model's expressiveness, but the additional complexity exacerbates an already challenging problem in GNNs: explaining their predictions. In this work, we adapt PGExplainer, one of the most recent explainers for GNNs, to SGNNs. The proposed explainer accounts for the contribution of all the different subgraphs and can produce a meaningful explanation that humans can interpret. The experiments that we performed both on real and synthetic datasets show that our framework is successful in explaining the decision process of an SGNN on graph classification tasks.
    Concept Discovery for Fast Adapatation. (arXiv:2301.07850v1 [cs.LG])
    The advances in deep learning have enabled machine learning methods to outperform human beings in various areas, but it remains a great challenge for a well-trained model to quickly adapt to a new task. One promising solution to realize this goal is through meta-learning, also known as learning to learn, which has achieved promising results in few-shot learning. However, current approaches are still enormously different from human beings' learning process, especially in the ability to extract structural and transferable knowledge. This drawback makes current meta-learning frameworks non-interpretable and hard to extend to more complex tasks. We tackle this problem by introducing concept discovery to the few-shot learning problem, where we achieve more effective adaptation by meta-learning the structure among the data features, leading to a composite representation of the data. Our proposed method Concept-Based Model-Agnostic Meta-Learning (COMAML) has been shown to achieve consistent improvements in the structured data for both synthesized datasets and real-world datasets.
    Using CycleGANs to Generate Realistic STEM Images for Machine Learning. (arXiv:2301.07743v1 [cond-mat.mtrl-sci])
    The rise of automation and machine learning (ML) in electron microscopy has the potential to revolutionize materials research by enabling the autonomous collection and processing of vast amounts of atomic resolution data. However, a major challenge is developing ML models that can reliably and rapidly generalize to large data sets with varying experimental conditions. To overcome this challenge, we develop a cycle generative adversarial network (CycleGAN) that introduces a novel reciprocal space discriminator to augment simulated data with realistic, complex spatial frequency information learned from experimental data. This enables the CycleGAN to generate nearly indistinguishable images from real experimental data, while also providing labels for further ML applications. We demonstrate the effectiveness of this approach by training a fully convolutional network (FCN) to identify single atom defects in a large data set of 4.5 million atoms, which we collected using automated acquisition in an aberration-corrected scanning transmission electron microscope (STEM). Our approach yields highly adaptable FCNs that can adjust to dynamically changing experimental variables, such as lens aberrations, noise, and local contamination, with minimal manual intervention. This represents a significant step towards building fully autonomous approaches for harnessing microscopy big data.
    Position Regression for Unsupervised Anomaly Detection. (arXiv:2301.08064v1 [cs.CV])
    In recent years, anomaly detection has become an essential field in medical image analysis. Most current anomaly detection methods for medical images are based on image reconstruction. In this work, we propose a novel anomaly detection approach based on coordinate regression. Our method estimates the position of patches within a volume, and is trained only on data of healthy subjects. During inference, we can detect and localize anomalies by considering the error of the position estimate of a given patch. We apply our method to 3D CT volumes and evaluate it on patients with intracranial haemorrhages and cranial fractures. The results show that our method performs well in detecting these anomalies. Furthermore, we show that our method requires less memory than comparable approaches that involve image reconstruction. This is highly relevant for processing large 3D volumes, for instance, CT or MRI scans.
    A Domain-Agnostic Approach for Characterization of Lifelong Learning Systems. (arXiv:2301.07799v1 [cs.LG])
    Despite the advancement of machine learning techniques in recent years, state-of-the-art systems lack robustness to "real world" events, where the input distributions and tasks encountered by the deployed systems will not be limited to the original training context, and systems will instead need to adapt to novel distributions and tasks while deployed. This critical gap may be addressed through the development of "Lifelong Learning" systems that are capable of 1) Continuous Learning, 2) Transfer and Adaptation, and 3) Scalability. Unfortunately, efforts to improve these capabilities are typically treated as distinct areas of research that are assessed independently, without regard to the impact of each separate capability on other aspects of the system. We instead propose a holistic approach, using a suite of metrics and an evaluation framework to assess Lifelong Learning in a principled way that is agnostic to specific domains or system techniques. Through five case studies, we show that this suite of metrics can inform the development of varied and complex Lifelong Learning systems. We highlight how the proposed suite of metrics quantifies performance trade-offs present during Lifelong Learning system development - both the widely discussed Stability-Plasticity dilemma and the newly proposed relationship between Sample Efficient and Robust Learning. Further, we make recommendations for the formulation and use of metrics to guide the continuing development of Lifelong Learning systems and assess their progress in the future.
    A Nonstochastic Control Approach to Optimization. (arXiv:2301.07902v1 [cs.LG])
    Tuning optimizer hyperparameters, notably the learning rate to a particular optimization instance, is an important but nonconvex problem. Therefore iterative optimization methods such as hypergradient descent lack global optimality guarantees in general. We propose an online nonstochastic control methodology for mathematical optimization. The choice of hyperparameters for gradient based methods, including the learning rate, momentum parameter and preconditioner, is described as feedback control. The optimal solution to this control problem is shown to encompass preconditioned adaptive gradient methods with varying acceleration and momentum parameters. Although the optimal control problem by itself is nonconvex, we show how recent methods from online nonstochastic control based on convex relaxation can be applied to compete with the best offline solution. This guarantees that in episodic optimization, we converge to the best optimization method in hindsight.
    Identification, explanation and clinical evaluation of hospital patient subtypes. (arXiv:2301.08019v1 [cs.LG])
    We present a pipeline in which unsupervised machine learning techniques are used to automatically identify subtypes of hospital patients admitted between 2017 and 2021 in a large UK teaching hospital. With the use of state-of-the-art explainability techniques, the identified subtypes are interpreted and assigned clinical meaning. In parallel, clinicians assessed intra-cluster similarities and inter-cluster differences of the identified patient subtypes within the context of their clinical knowledge. By confronting the outputs of both automatic and clinician-based explanations, we aim to highlight the mutual benefit of combining machine learning techniques with clinical expertise.
    Discover governing differential equations from evolving systems. (arXiv:2301.07863v1 [physics.comp-ph])
    Discovering the governing equations of evolving systems from available observations is essential and challenging. However, current methods does not capture the situation that underlying system dynamics can be changed.Evolving systems are changing over time, which invariably changes with system status. Thus, finding the exact change points is critical. We propose an online modeling method capable of handling samples one by one sequentially by modeling streaming data instead of processing the entire dataset. The proposed method performs well in discovering ordinary differential equations, partial differential equations (PDEs), and high-dimensional PDEs from streaming data. The measurement generated from a changed system is distributed dissimilarly to before; hence, the difference can be identified by the proposed method. Our proposal performs well in identifying the change points and discovering governing differential equations in two evolving systems.
    Catapult Dynamics and Phase Transitions in Quadratic Nets. (arXiv:2301.07737v1 [cs.LG])
    Neural networks trained with gradient descent can undergo non-trivial phase transitions as a function of the learning rate. In (Lewkowycz et al., 2020) it was discovered that wide neural nets can exhibit a catapult phase for super-critical learning rates, where the training loss grows exponentially quickly at early times before rapidly decreasing to a small value. During this phase the top eigenvalue of the neural tangent kernel (NTK) also undergoes significant evolution. In this work, we will prove that the catapult phase exists in a large class of models, including quadratic models and two-layer, homogenous neural nets. To do this, we show that for a certain range of learning rates the weight norm decreases whenever the loss becomes large. We also empirically study learning rates beyond this theoretically derived range and show that the activation map of ReLU nets trained with super-critical learning rates becomes increasingly sparse as we increase the learning rate.
    ClusterLog: Clustering Logs for Effective Log-based Anomaly Detection. (arXiv:2301.07846v1 [cs.DC])
    With the increasing prevalence of scalable file systems in the context of High Performance Computing (HPC), the importance of accurate anomaly detection on runtime logs is increasing. But as it currently stands, many state-of-the-art methods for log-based anomaly detection, such as DeepLog, have encountered numerous challenges when applied to logs from many parallel file systems (PFSes), often due to their irregularity and ambiguity in time-based log sequences. To circumvent these problems, this study proposes ClusterLog, a log pre-processing method that clusters the temporal sequence of log keys based on their semantic similarity. By grouping semantically and sentimentally similar logs, this approach aims to represent log sequences with the smallest amount of unique log keys, intending to improve the ability of a downstream sequence-based model to effectively learn the log patterns. The preliminary results of ClusterLog indicate not only its effectiveness in reducing the granularity of log sequences without the loss of important sequence information but also its generalizability to different file systems' logs.
    From English to More Languages: Parameter-Efficient Model Reprogramming for Cross-Lingual Speech Recognition. (arXiv:2301.07851v1 [cs.SD])
    In this work, we propose a new parameter-efficient learning framework based on neural model reprogramming for cross-lingual speech recognition, which can \textbf{re-purpose} well-trained English automatic speech recognition (ASR) models to recognize the other languages. We design different auxiliary neural architectures focusing on learnable pre-trained feature enhancement that, for the first time, empowers model reprogramming on ASR. Specifically, we investigate how to select trainable components (i.e., encoder) of a conformer-based RNN-Transducer, as a frozen pre-trained backbone. Experiments on a seven-language multilingual LibriSpeech speech (MLS) task show that model reprogramming only requires 4.2% (11M out of 270M) to 6.8% (45M out of 660M) of its original trainable parameters from a full ASR model to perform competitive results in a range of 11.9% to 8.1% WER averaged across different languages. In addition, we discover different setups to make large-scale pre-trained ASR succeed in both monolingual and multilingual speech recognition. Our methods outperform existing ASR tuning architectures and their extension with self-supervised losses (e.g., w2v-bert) in terms of lower WER and better training efficiency.
    Suboptimality analysis of receding horizon quadratic control with unknown linear systems and its applications in learning-based control. (arXiv:2301.07876v1 [eess.SY])
    For a receding-horizon controller with a known system and with an approximate terminal value function, it is well-known that increasing the prediction horizon can improve its control performance. However, when the prediction model is inexact, a larger prediction horizon also causes propagation and accumulation of the prediction error. In this work, we aim to analyze the effect of the above trade-off between the modeling error, the terminal value function error, and the prediction horizon on the performance of a nominal receding-horizon linear quadratic (LQ) controller. By developing a novel perturbation result of the Riccati difference equation, a performance upper bound is obtained and suggests that for many cases, the prediction horizon should be either 1 or infinity to improve the control performance, depending on the relative difference between the modeling error and the terminal value function error. The obtained suboptimality performance bound is also applied to provide end-to-end performance guarantees, e.g., regret bounds, for nominal receding-horizon LQ controllers in a learning-based setting.
    HCE: Improving Performance and Efficiency with Heterogeneously Compressed Neural Network Ensemble. (arXiv:2301.07794v1 [cs.LG])
    Ensemble learning has gain attention in resent deep learning research as a way to further boost the accuracy and generalizability of deep neural network (DNN) models. Recent ensemble training method explores different training algorithms or settings on multiple sub-models with the same model architecture, which lead to significant burden on memory and computation cost of the ensemble model. Meanwhile, the heurtsically induced diversity may not lead to significant performance gain. We propose a new prespective on exploring the intrinsic diversity within a model architecture to build efficient DNN ensemble. We make an intriguing observation that pruning and quantization, while both leading to efficient model architecture at the cost of small accuracy drop, leads to distinct behavior in the decision boundary. To this end, we propose Heterogeneously Compressed Ensemble (HCE), where we build an efficient ensemble with the pruned and quantized variants from a pretrained DNN model. An diversity-aware training objective is proposed to further boost the performance of the HCE ensemble. Experiemnt result shows that HCE achieves significant improvement in the efficiency-accuracy tradeoff comparing to both traditional DNN ensemble training methods and previous model compression methods.
    FE-TCM: Filter-Enhanced Transformer Click Model for Web Search. (arXiv:2301.07854v1 [cs.IR])
    Constructing click models and extracting implicit relevance feedback information from the interaction between users and search engines are very important to improve the ranking of search results. Using neural network to model users' click behaviors has become one of the effective methods to construct click models. In this paper, We use Transformer as the backbone network of feature extraction, add filter layer innovatively, and propose a new Filter-Enhanced Transformer Click Model (FE-TCM) for web search. Firstly, in order to reduce the influence of noise on user behavior data, we use the learnable filters to filter log noise. Secondly, following the examination hypothesis, we model the attraction estimator and examination predictor respectively to output the attractiveness scores and examination probabilities. A novel transformer model is used to learn the deeper representation among different features. Finally, we apply the combination functions to integrate attractiveness scores and examination probabilities into the click prediction. From our experiments on two real-world session datasets, it is proved that FE-TCM outperforms the existing click models for the click prediction.
    A Scalable Finite Difference Method for Deep Reinforcement Learning. (arXiv:2210.07487v2 [cs.LG] UPDATED)
    Several low-bandwidth distributable black-box optimization algorithms in the family of finite differences such as Evolution Strategies have recently been shown to perform nearly as well as tailored Reinforcement Learning methods in some Reinforcement Learning domains. One shortcoming of these black-box methods is that they must collect information about the structure of the return function at every update, and can often employ only information drawn from a distribution centered around the current parameters. As a result, when these algorithms are distributed across many machines, a significant portion of total runtime may be spent with many machines idle, waiting for a final return and then for an update to be calculated. In this work we introduce a novel method to use older data in finite difference algorithms, which produces a scalable algorithm that avoids significant idle time or wasted computation.
    RNAS-CL: Robust Neural Architecture Search by Cross-Layer Knowledge Distillation. (arXiv:2301.08092v1 [cs.CV])
    Deep Neural Networks are vulnerable to adversarial attacks. Neural Architecture Search (NAS), one of the driving tools of deep neural networks, demonstrates superior performance in prediction accuracy in various machine learning applications. However, it is unclear how it performs against adversarial attacks. Given the presence of a robust teacher, it would be interesting to investigate if NAS would produce robust neural architecture by inheriting robustness from the teacher. In this paper, we propose Robust Neural Architecture Search by Cross-Layer Knowledge Distillation (RNAS-CL), a novel NAS algorithm that improves the robustness of NAS by learning from a robust teacher through cross-layer knowledge distillation. Unlike previous knowledge distillation methods that encourage close student/teacher output only in the last layer, RNAS-CL automatically searches for the best teacher layer to supervise each student layer. Experimental result evidences the effectiveness of RNAS-CL and shows that RNAS-CL produces small and robust neural architecture.
    SpotHitPy: A Study For ML-Based Song Hit Prediction Using Spotify. (arXiv:2301.07978v1 [cs.SD])
    In this study, we approached the Hit Song Prediction problem, which aims to predict which songs will become Billboard hits. We gathered a dataset of nearly 18500 hit and non-hit songs and extracted their audio features using the Spotify Web API. We test four machine-learning models on our dataset. We were able to predict the Billboard success of a song with approximately 86\% accuracy. The most succesful algorithms were Random Forest and Support Vector Machine.
    General Greedy De-bias Learning. (arXiv:2112.10572v5 [cs.LG] UPDATED)
    Neural networks often make predictions relying on the spurious correlations from the datasets rather than the intrinsic properties of the task of interest, facing sharp degradation on out-of-distribution (OOD) test data. Existing de-bias learning frameworks try to capture specific dataset bias by annotations but they fail to handle complicated OOD scenarios. Others implicitly identify the dataset bias by special design low capability biased models or losses, but they degrade when the training and testing data are from the same distribution. In this paper, we propose a General Greedy De-bias learning framework (GGD), which greedily trains the biased models and the base model. The base model is encouraged to focus on examples that are hard to solve with biased models, thus remaining robust against spurious correlations in the test stage. GGD largely improves models' OOD generalization ability on various tasks, but sometimes over-estimates the bias level and degrades on the in-distribution test. We further re-analyze the ensemble process of GGD and introduce the Curriculum Regularization inspired by curriculum learning, which achieves a good trade-off between in-distribution and out-of-distribution performance. Extensive experiments on image classification, adversarial question answering, and visual question answering demonstrate the effectiveness of our method. GGD can learn a more robust base model under the settings of both task-specific biased models with prior knowledge and self-ensemble biased model without prior knowledge.
    Learning Quantum Processes with Memory -- Quantum Recurrent Neural Networks. (arXiv:2301.08167v1 [quant-ph])
    Recurrent neural networks play an important role in both research and industry. With the advent of quantum machine learning, the quantisation of recurrent neural networks has become recently relevant. We propose fully quantum recurrent neural networks, based on dissipative quantum neural networks, capable of learning general causal quantum automata. A quantum training algorithm is proposed and classical simulations for the case of product outputs with the fidelity as cost function are carried out. We thereby demonstrate the potential of these algorithms to learn complex quantum processes with memory in terms of the exemplary delay channel, the time evolution of quantum states governed by a time-dependent Hamiltonian, and high- and low-frequency noise mitigation. Numerical simulations indicate that our quantum recurrent neural networks exhibit a striking ability to generalise from small training sets.
    Augmenting a Physics-Informed Neural Network for the 2D Burgers Equation by Addition of Solution Data Points. (arXiv:2301.07824v1 [physics.flu-dyn])
    We implement a Physics-Informed Neural Network (PINN) for solving the two-dimensional Burgers equations. This type of model can be trained with no previous knowledge of the solution; instead, it relies on evaluating the governing equations of the system in points of the physical domain. It is also possible to use points with a known solution during training. In this paper, we compare PINNs trained with different amounts of governing equation evaluation points and known solution points. Comparing models that were trained purely with known solution points to those that have also used the governing equations, we observe an improvement in the overall observance of the underlying physics in the latter. We also investigate how changing the number of each type of point affects the resulting models differently. Finally, we argue that the addition of the governing equations during training may provide a way to improve the overall performance of the model without relying on additional data, which is especially important for situations where the number of known solution points is limited.
    Improving Machine Translation with Phrase Pair Injection and Corpus Filtering. (arXiv:2301.08008v1 [cs.CL])
    In this paper, we show that the combination of Phrase Pair Injection and Corpus Filtering boosts the performance of Neural Machine Translation (NMT) systems. We extract parallel phrases and sentences from the pseudo-parallel corpus and augment it with the parallel corpus to train the NMT models. With the proposed approach, we observe an improvement in the Machine Translation (MT) system for 3 low-resource language pairs, Hindi-Marathi, English-Marathi, and English-Pashto, and 6 translation directions by up to 2.7 BLEU points, on the FLORES test data. These BLEU score improvements are over the models trained using the whole pseudo-parallel corpus augmented with the parallel corpus.
    WaveMix: A Resource-efficient Neural Network for Image Analysis. (arXiv:2205.14375v3 [cs.CV] UPDATED)
    To allow image analysis in resource-constrained scenarios without compromising generalizability, we introduce WaveMix -- a novel and flexible neural framework that reduces the GPU RAM (memory) and compute (latency) compared to CNNs and transformers. In addition to using convolutional layers that exploit shift-invariant image statistics, the proposed framework uses multi-level two-dimensional discrete wavelet transform (2D-DWT) modules to exploit scale-invariance and edge sparseness, which gives it the following advantages. Firstly, the fixed weights of wavelet modules do not add to the parameter count while reorganizing information based on these image priors. Secondly, the wavelet modules scale the spatial extents of feature maps by integral powers of $\frac{1}{2}\times\frac{1}{2}$, which reduces the memory and latency required for forward and backward passes. Finally, a multi-level 2D-DWT leads to a quicker expansion of the receptive field per layer than pooling (which we do not use) and it is a more effective spatial token mixer. WaveMix also generalizes better than other token mixing models, such as ConvMixer, MLP-Mixer, PoolFormer, random filters, and Fourier basis, because the wavelet transform is much better suited for image decomposition and spatial token mixing. WaveMix is a flexible model that can perform well on multiple image tasks without needing architectural modifications. WaveMix achieves a semantic segmentation mIoU of 83% on the Cityscapes validation set outperforming transformer and CNN-based architectures. We also demonstrate the advantages of WaveMix for classification on multiple datasets and show that WaveMix establishes new state-of-the-results in Places-365, EMNIST, and iNAT-mini datasets.
    PDFormer: Propagation Delay-aware Dynamic Long-range Transformer for Traffic Flow Prediction. (arXiv:2301.07945v1 [cs.LG])
    As a core technology of Intelligent Transportation System, traffic flow prediction has a wide range of applications. The fundamental challenge in traffic flow prediction is to effectively model the complex spatial-temporal dependencies in traffic data. Spatial-temporal Graph Neural Network (GNN) models have emerged as one of the most promising methods to solve this problem. However, GNN-based models have three major limitations for traffic prediction: i) Most methods model spatial dependencies in a static manner, which limits the ability to learn dynamic urban traffic patterns; ii) Most methods only consider short-range spatial information and are unable to capture long-range spatial dependencies; iii) These methods ignore the fact that the propagation of traffic conditions between locations has a time delay in traffic systems. To this end, we propose a novel Propagation Delay-aware dynamic long-range transFormer, namely PDFormer, for accurate traffic flow prediction. Specifically, we design a spatial self-attention module to capture the dynamic spatial dependencies. Then, two graph masking matrices are introduced to highlight spatial dependencies from short- and long-range views. Moreover, a traffic delay-aware feature transformation module is proposed to empower PDFormer with the capability of explicitly modeling the time delay of spatial information propagation. Extensive experimental results on six real-world public traffic datasets show that our method can not only achieve state-of-the-art performance but also exhibit competitive computational efficiency. Moreover, we visualize the learned spatial-temporal attention map to make our model highly interpretable.
    Learning Generalizable Models for Vehicle Routing Problems via Knowledge Distillation. (arXiv:2210.07686v2 [cs.LG] UPDATED)
    Recent neural methods for vehicle routing problems always train and test the deep models on the same instance distribution (i.e., uniform). To tackle the consequent cross-distribution generalization concerns, we bring the knowledge distillation to this field and propose an Adaptive Multi-Distribution Knowledge Distillation (AMDKD) scheme for learning more generalizable deep models. Particularly, our AMDKD leverages various knowledge from multiple teachers trained on exemplar distributions to yield a light-weight yet generalist student model. Meanwhile, we equip AMDKD with an adaptive strategy that allows the student to concentrate on difficult distributions, so as to absorb hard-to-master knowledge more effectively. Extensive experimental results show that, compared with the baseline neural methods, our AMDKD is able to achieve competitive results on both unseen in-distribution and out-of-distribution instances, which are either randomly synthesized or adopted from benchmark datasets (i.e., TSPLIB and CVRPLIB). Notably, our AMDKD is generic, and consumes less computational resources for inference.
    An SDE for Modeling SAM: Theory and Insights. (arXiv:2301.08203v1 [cs.LG])
    We study the SAM (Sharpness-Aware Minimization) optimizer which has recently attracted a lot of interest due to its increased performance over more classical variants of stochastic gradient descent. Our main contribution is the derivation of continuous-time models (in the form of SDEs) for SAM and its unnormalized variant USAM, both for the full-batch and mini-batch settings. We demonstrate that these SDEs are rigorous approximations of the real discrete-time algorithms (in a weak sense, scaling linearly with the step size). Using these models, we then offer an explanation of why SAM prefers flat minima over sharp ones - by showing that it minimizes an implicitly regularized loss with a Hessian-dependent noise structure. Finally, we prove that perhaps unexpectedly SAM is attracted to saddle points under some realistic conditions. Our theoretical results are supported by detailed experiments.
    Fully Elman Neural Network: A Novel Deep Recurrent Neural Network Optimized by an Improved Harris Hawks Algorithm for Classification of Pulmonary Arterial Wedge Pressure. (arXiv:2301.07710v1 [cs.LG])
    Heart failure (HF) is one of the most prevalent life-threatening cardiovascular diseases in which 6.5 million people are suffering in the USA and more than 23 million worldwide. Mechanical circulatory support of HF patients can be achieved by implanting a left ventricular assist device (LVAD) into HF patients as a bridge to transplant, recovery or destination therapy and can be controlled by measurement of normal and abnormal pulmonary arterial wedge pressure (PAWP). While there are no commercial long-term implantable pressure sensors to measure PAWP, real-time non-invasive estimation of abnormal and normal PAWP becomes vital. In this work, first an improved Harris Hawks optimizer algorithm called HHO+ is presented and tested on 24 unimodal and multimodal benchmark functions. Second, a novel fully Elman neural network (FENN) is proposed to improve the classification performance. Finally, four novel 18-layer deep learning methods of convolutional neural networks (CNNs) with multi-layer perceptron (CNN-MLP), CNN with Elman neural networks (CNN-ENN), CNN with fully Elman neural networks (CNN-FENN), and CNN with fully Elman neural networks optimized by HHO+ algorithm (CNN-FENN-HHO+) for classification of abnormal and normal PAWP using estimated HVAD pump flow were developed and compared. The estimated pump flow was derived by a non-invasive method embedded into the commercial HVAD controller. The proposed methods are evaluated on an imbalanced clinical dataset using 5-fold cross-validation. The proposed CNN-FENN-HHO+ method outperforms the proposed CNN-MLP, CNN-ENN and CNN-FENN methods and improved the classification performance metrics across 5-fold cross-validation. The proposed methods can reduce the likelihood of hazardous events like pulmonary congestion and ventricular suction for HF patients and notify identified abnormal cases to the hospital, clinician and cardiologist.
    Skeleton Clustering: Dimension-Free Density-based Clustering. (arXiv:2104.10770v2 [stat.ML] UPDATED)
    We introduce a density-based clustering method called skeleton clustering that can detect clusters in multivariate and even high-dimensional data with irregular shapes. To bypass the curse of dimensionality, we propose surrogate density measures that are less dependent on the dimension but have intuitive geometric interpretations. The clustering framework constructs a concise representation of the given data as an intermediate step and can be thought of as a combination of prototype methods, density-based clustering, and hierarchical clustering. We show by theoretical analysis and empirical studies that the skeleton clustering leads to reliable clusters in multivariate and high-dimensional scenarios.
    Emergence of the SVD as an interpretable factorization in deep learning for inverse problems. (arXiv:2301.07820v1 [cs.LG])
    We demonstrate the emergence of weight matrix singular value decomposition (SVD) in interpreting neural networks (NNs) for parameter estimation from noisy signals. The SVD appears naturally as a consequence of initial application of a descrambling transform - a recently-developed technique for addressing interpretability in NNs \cite{amey2021neural}. We find that within the class of noisy parameter estimation problems, the SVD may be the means by which networks memorize the signal model. We substantiate our theoretical findings with empirical evidence from both linear and non-linear settings. Our results also illuminate the connections between a mathematical theory of semantic development \cite{saxe2019mathematical} and neural network interpretability.
    Spatio-temporal neural structural causal models for bike flow prediction. (arXiv:2301.07843v1 [cs.LG])
    As a representative of public transportation, the fundamental issue of managing bike-sharing systems is bike flow prediction. Recent methods overemphasize the spatio-temporal correlations in the data, ignoring the effects of contextual conditions on the transportation system and the inter-regional timevarying causality. In addition, due to the disturbance of incomplete observations in the data, random contextual conditions lead to spurious correlations between data and features, making the prediction of the model ineffective in special scenarios. To overcome this issue, we propose a Spatio-temporal Neural Structure Causal Model(STNSCM) from the perspective of causality. First, we build a causal graph to describe the traffic prediction, and further analyze the causal relationship between the input data, contextual conditions, spatiotemporal states, and prediction results. Second, we propose to apply the frontdoor criterion to eliminate confounding biases in the feature extraction process. Finally, we propose a counterfactual representation reasoning module to extrapolate the spatio-temporal state under the factual scenario to future counterfactual scenarios to improve the prediction performance. Experiments on real-world datasets demonstrate the superior performance of our model, especially its resistance to fluctuations caused by the external environment. The source code and data will be released.
  • Open

    Global Nash Equilibrium in Non-convex Multi-player Game: Theory and Algorithms. (arXiv:2301.08015v1 [cs.GT])
    Wide machine learning tasks can be formulated as non-convex multi-player games, where Nash equilibrium (NE) is an acceptable solution to all players, since no one can benefit from changing its strategy unilaterally. Attributed to the non-convexity, obtaining the existence condition of global NE is challenging, let alone designing theoretically guaranteed realization algorithms. This paper takes conjugate transformation to the formulation of non-convex multi-player games, and casts the complementary problem into a variational inequality (VI) problem with a continuous pseudo-gradient mapping. We then prove the existence condition of global NE: the solution to the VI problem satisfies a duality relation. Based on this VI formulation, we design a conjugate-based ordinary differential equation (ODE) to approach global NE, which is proved to have an exponential convergence rate. To make the dynamics more implementable, we further derive a discretized algorithm. We apply our algorithm to two typical scenarios: multi-player generalized monotone game and multi-player potential game. In the two settings, we prove that the step-size setting is required to be $\mathcal{O}(1/k)$ and $\mathcal{O}(1/\sqrt k)$ to yield the convergence rates of $\mathcal{O}(1/ k)$ and $\mathcal{O}(1/\sqrt k)$, respectively. Extensive experiments in robust neural network training and sensor localization are in full agreement with our theory.
    Skeleton Clustering: Dimension-Free Density-based Clustering. (arXiv:2104.10770v2 [stat.ML] UPDATED)
    We introduce a density-based clustering method called skeleton clustering that can detect clusters in multivariate and even high-dimensional data with irregular shapes. To bypass the curse of dimensionality, we propose surrogate density measures that are less dependent on the dimension but have intuitive geometric interpretations. The clustering framework constructs a concise representation of the given data as an intermediate step and can be thought of as a combination of prototype methods, density-based clustering, and hierarchical clustering. We show by theoretical analysis and empirical studies that the skeleton clustering leads to reliable clusters in multivariate and high-dimensional scenarios.
    Shapley Values with Uncertain Value Functions. (arXiv:2301.08086v1 [cs.LG])
    We propose a novel definition of Shapley values with uncertain value functions based on first principles using probability theory. Such uncertain value functions can arise in the context of explainable machine learning as a result of non-deterministic algorithms. We show that random effects can in fact be absorbed into a Shapley value with a noiseless but shifted value function. Hence, Shapley values with uncertain value functions can be used in analogy to regular Shapley values. However, their reliable evaluation typically requires more computational effort.
    Score-based Causal Representation Learning with Interventions. (arXiv:2301.08230v1 [stat.ML])
    This paper studies causal representation learning problem when the latent causal variables are observed indirectly through an unknown linear transformation. The objectives are: (i) recovering the unknown linear transformation (up to scaling and ordering), and (ii) determining the directed acyclic graph (DAG) underlying the latent variables. Since identifiable representation learning is impossible based on only observational data, this paper uses both observational and interventional data. The interventional data is generated under distinct single-node randomized hard and soft interventions. These interventions are assumed to cover all nodes in the latent space. It is established that the latent DAG structure can be recovered under soft randomized interventions via the following two steps. First, a set of transformation candidates is formed by including all inverting transformations corresponding to which the \emph{score} function of the transformed variables has the minimal number of coordinates that change between an interventional and the observational environment summed over all pairs. Subsequently, this set is distilled using a simple constraint to recover the latent DAG structure. For the special case of hard randomized interventions, with an additional hypothesis testing step, one can also uniquely recover the linear transformation, up to scaling and a valid causal ordering. These results generalize the recent results that either assume deterministic hard interventions or linear causal relationships in the latent space.
    Learning to Rank by Causal Effects Without Data to Accurately Estimate Causal Effects. (arXiv:2206.12532v2 [stat.ML] UPDATED)
    Decision makers often want to identify the individuals for whom some intervention or treatment will be most effective in order to decide who to treat. In such cases, decision makers would ideally like to rank potential recipients of the treatment according to their individual causal effects. However, the available data may be completely inadequate to estimate causal effects accurately. We formalize a new assumption -- the rank preservation assumption (RPA) -- that defines when data are suitable to learn how to rank individuals according to their causal effects, even if the effects themselves cannot be accurately estimated. The RPA holds when there is data to estimate a scoring variable that induces the same ranking of individuals as the causal effect of interest. Some of the scoring variables we consider are confounded estimates, proxy causal effects, and non-causal quantities. We show that such scoring variables can work well for treatment assignment if the RPA is met, and potentially even better than using causal effects as scores. We also show that the RPA holds under conditions that are more general and weaker than the typical assumptions made in observational studies. Finally, we showcase how practitioners can apply and evaluate alternative scoring models (including non-causal models) to maximize the causal impact of their targeting decisions.
    Everything is Connected: Graph Neural Networks. (arXiv:2301.08210v1 [cs.LG])
    In many ways, graphs are the main modality of data we receive from nature. This is due to the fact that most of the patterns we see, both in natural and artificial systems, are elegantly representable using the language of graph structures. Prominent examples include molecules (represented as graphs of atoms and bonds), social networks and transportation networks. This potential has already been seen by key scientific and industrial groups, with already-impacted application areas including traffic forecasting, drug discovery, social network analysis and recommender systems. Further, some of the most successful domains of application for machine learning in previous years -- images, text and speech processing -- can be seen as special cases of graph representation learning, and consequently there has been significant exchange of information between these areas. The main aim of this short survey is to enable the reader to assimilate the key concepts in the area, and position graph representation learning in a proper context with related fields.
    Diffusion-based Conditional ECG Generation with Structured State Space Models. (arXiv:2301.08227v1 [eess.SP])
    Synthetic data generation is a promising solution to address privacy issues with the distribution of sensitive health data. Recently, diffusion models have set new standards for generative models for different data modalities. Also very recently, structured state space models emerged as a powerful modeling paradigm to capture long-term dependencies in time series. We put forward SSSD-ECG, as the combination of these two technologies, for the generation of synthetic 12-lead electrocardiograms conditioned on more than 70 ECG statements. Due to a lack of reliable baselines, we also propose conditional variants of two state-of-the-art unconditional generative models. We thoroughly evaluate the quality of the generated samples, by evaluating pretrained classifiers on the generated data and by evaluating the performance of a classifier trained only on synthetic data, where SSSD-ECG clearly outperforms its GAN-based competitors. We demonstrate the soundness of our approach through further experiments, including conditional class interpolation and a clinical Turing test demonstrating the high quality of the SSSD-ECG samples across a wide range of conditions.
    Equivalence relations and $L^p$ distances between time series with application to the Black Summer Australian bushfires. (arXiv:2002.02592v2 [stat.ME] UPDATED)
    This paper introduces a new framework of algebraic equivalence relations between time series and new distance metrics between them, then applies these to investigate the Australian ``Black Summer'' bushfire season of 2019-2020. First, we introduce a general framework for defining equivalence between time series, heuristically intended to be equivalent if they differ only up to noise. Our first specific implementation is based on using change point algorithms and comparing statistical quantities such as mean or variance in stationary segments. We thus derive the existence of such equivalence relations on the space of time series, such that the quotient spaces can be equipped with a metrizable topology. Next, we illustrate specifically how to define and compute such distances among a collection of time series and perform clustering and additional analysis thereon. Then, we apply these insights to analyze air quality data across New South Wales, Australia, during the 2019-2020 bushfires. There, we investigate structural similarity with respect to this data and identify locations that were impacted anonymously by the fires relative to their location. This may have implications regarding the appropriate management of resources to avoid gaps in the defense against future fires.
    Semiparametric inference using fractional posteriors. (arXiv:2301.08158v1 [math.ST])
    We establish a general Bernstein--von Mises theorem for approximately linear semiparametric functionals of fractional posterior distributions based on nonparametric priors. This is illustrated in a number of nonparametric settings and for different classes of prior distributions, including Gaussian process priors. We show that fractional posterior credible sets can provide reliable semiparametric uncertainty quantification, but have inflated size. To remedy this, we further propose a \textit{shifted-and-rescaled} fractional posterior set that is an efficient confidence set having optimal size under regularity conditions. As part of our proofs, we also refine existing contraction rate results for fractional posteriors by sharpening the dependence of the rate on the fractional exponent.
    On Measuring Excess Capacity in Neural Networks. (arXiv:2202.08070v3 [cs.LG] UPDATED)
    We study the excess capacity of deep networks in the context of supervised classification. That is, given a capacity measure of the underlying hypothesis class - in our case, empirical Rademacher complexity - to what extent can we (a priori) constrain this class while retaining an empirical error on a par with the unconstrained regime? To assess excess capacity in modern architectures (such as residual networks), we extend and unify prior Rademacher complexity bounds to accommodate function composition and addition, as well as the structure of convolutions. The capacity-driving terms in our bounds are the Lipschitz constants of the layers and an (2, 1) group norm distance to the initializations of the convolution weights. Experiments on benchmark datasets of varying task difficulty indicate that (1) there is a substantial amount of excess capacity per task, and (2) capacity can be kept at a surprisingly similar level across tasks. Overall, this suggests a notion of compressibility with respect to weight norms, complementary to classic compression via weight pruning. Source code is available at https://github.com/rkwitt/excess_capacity.  ( 2 min )
    Tight Guarantees for Interactive Decision Making with the Decision-Estimation Coefficient. (arXiv:2301.08215v1 [cs.LG])
    A foundational problem in reinforcement learning and interactive decision making is to understand what modeling assumptions lead to sample-efficient learning guarantees, and what algorithm design principles achieve optimal sample complexity. Recently, Foster et al. (2021) introduced the Decision-Estimation Coefficient (DEC), a measure of statistical complexity which leads to upper and lower bounds on the optimal sample complexity for a general class of problems encompassing bandits and reinforcement learning with function approximation. In this paper, we introduce a new variant of the DEC, the Constrained Decision-Estimation Coefficient, and use it to derive new lower bounds that improve upon prior work on three fronts: - They hold in expectation, with no restrictions on the class of algorithms under consideration. - They hold globally, and do not rely on the notion of localization used by Foster et al. (2021). - Most interestingly, they allow the reference model with respect to which the DEC is defined to be improper, establishing that improper reference models play a fundamental role. We provide upper bounds on regret that scale with the same quantity, thereby closing all but one of the gaps between upper and lower bounds in Foster et al. (2021). Our results apply to both the regret framework and PAC framework, and make use of several new analysis and algorithm design techniques that we anticipate will find broader use.  ( 2 min )
    Differentially Private Online Bayesian Estimation With Adaptive Truncation. (arXiv:2301.08202v1 [cs.LG])
    We propose a novel online and adaptive truncation method for differentially private Bayesian online estimation of a static parameter regarding a population. We assume that sensitive information from individuals is collected sequentially and the inferential aim is to estimate, on-the-fly, a static parameter regarding the population to which those individuals belong. We propose sequential Monte Carlo to perform online Bayesian estimation. When individuals provide sensitive information in response to a query, it is necessary to perturb it with privacy-preserving noise to ensure the privacy of those individuals. The amount of perturbation is proportional to the sensitivity of the query, which is determined usually by the range of the queried information. The truncation technique we propose adapts to the previously collected observations to adjust the query range for the next individual. The idea is that, based on previous observations, we can carefully arrange the interval into which the next individual's information is to be truncated before being perturbed with privacy-preserving noise. In this way, we aim to design predictive queries with small sensitivity, hence small privacy-preserving noise, enabling more accurate estimation while maintaining the same level of privacy. To decide on the location and the width of the interval, we use an exploration-exploitation approach a la Thompson sampling with an objective function based on the Fisher information of the generated observation. We show the merits of our methodology with numerical examples.  ( 2 min )
    A Multi-Resolution Framework for U-Nets with Applications to Hierarchical VAEs. (arXiv:2301.08187v1 [stat.ML])
    U-Net architectures are ubiquitous in state-of-the-art deep learning, however their regularisation properties and relationship to wavelets are understudied. In this paper, we formulate a multi-resolution framework which identifies U-Nets as finite-dimensional truncations of models on an infinite-dimensional function space. We provide theoretical results which prove that average pooling corresponds to projection within the space of square-integrable functions and show that U-Nets with average pooling implicitly learn a Haar wavelet basis representation of the data. We then leverage our framework to identify state-of-the-art hierarchical VAEs (HVAEs), which have a U-Net architecture, as a type of two-step forward Euler discretisation of multi-resolution diffusion processes which flow from a point mass, introducing sampling instabilities. We also demonstrate that HVAEs learn a representation of time which allows for improved parameter efficiency through weight-sharing. We use this observation to achieve state-of-the-art HVAE performance with half the number of parameters of existing models, exploiting the properties of our continuous-time formulation.  ( 2 min )
    Robust Gaussian Process Regression with Huber Likelihood. (arXiv:2301.07858v1 [stat.AP])
    Gaussian process regression in its most simplified form assumes normal homoscedastic noise and utilizes analytically tractable mean and covariance functions of predictive posterior distribution using Gaussian conditioning. Its hyperparameters are estimated by maximizing the evidence, commonly known as type II maximum likelihood estimation. Unfortunately, Bayesian inference based on Gaussian likelihood is not robust to outliers, which are often present in the observational training data sets. To overcome this problem, we propose a robust process model in the Gaussian process framework with the likelihood of observed data expressed as the Huber probability distribution. The proposed model employs weights based on projection statistics to scale residuals and bound the influence of vertical outliers and bad leverage points on the latent functions estimates while exhibiting a high statistical efficiency at the Gaussian and thick tailed noise distributions. The proposed method is demonstrated by two real world problems and two numerical examples using datasets with additive errors following thick tailed distributions such as Students t, Laplace, and Cauchy distribution.  ( 2 min )
    Kinetic Langevin MCMC Sampling Without Gradient Lipschitz Continuity -- the Strongly Convex Case. (arXiv:2301.08039v1 [math.PR])
    In this article we consider sampling from log concave distributions in Hamiltonian setting, without assuming that the objective gradient is globally Lipschitz. We propose two algorithms based on monotone polygonal (tamed) Euler schemes, to sample from a target measure, and provide non-asymptotic 2-Wasserstein distance bounds between the law of the process of each algorithm and the target measure. Finally, we apply these results to bound the excess risk optimization error of the associated optimization problem.  ( 2 min )
    Learning-Rate-Free Learning by D-Adaptation. (arXiv:2301.07733v1 [cs.LG])
    The speed of gradient descent for convex Lipschitz functions is highly dependent on the choice of learning rate. Setting the learning rate to achieve the optimal convergence rate requires knowing the distance D from the initial point to the solution set. In this work, we describe a single-loop method, with no back-tracking or line searches, which does not require knowledge of $D$ yet asymptotically achieves the optimal rate of convergence for the complexity class of convex Lipschitz functions. Our approach is the first parameter-free method for this class without additional multiplicative log factors in the convergence rate. We present extensive experiments for SGD and Adam variants of our method, where the method automatically matches hand-tuned learning rates across more than a dozen diverse machine learning problems, including large-scale vision and language problems. Our method is practical, efficient and requires no additional function value or gradient evaluations each step. An open-source implementation is available (https://github.com/facebookresearch/dadaptation).  ( 2 min )
    Rates of convergence for density estimation with generative adversarial networks. (arXiv:2102.00199v3 [math.ST] UPDATED)
    In this work we undertake a thorough study of the non-asymptotic properties of the vanilla generative adversarial networks (GANs). We prove a sharp oracle inequality for the Jensen-Shannon (JS) divergence between the underlying density $\mathsf{p}^*$ and the GAN estimate. We also study the rates of convergence in the context of nonparametric density estimation. In particular, we show that the JS-divergence between the GAN estimate and $\mathsf{p}^*$ decays as fast as $(\log{n}/n)^{2\beta/(2\beta+d)}$ where $n$ is the sample size and $\beta$ determines the smoothness of $\mathsf{p}^*$. To the best of our knowledge, this is the first result in the literature on density estimation using vanilla GANs with JS convergence rates faster than $n^{-1/2}$ in the regime $\beta > d/2$. Moreover, we show that the obtained rate is minimax optimal (up to logarithmic factors) for the considered class of densities.  ( 2 min )
    Catapult Dynamics and Phase Transitions in Quadratic Nets. (arXiv:2301.07737v1 [cs.LG])
    Neural networks trained with gradient descent can undergo non-trivial phase transitions as a function of the learning rate. In (Lewkowycz et al., 2020) it was discovered that wide neural nets can exhibit a catapult phase for super-critical learning rates, where the training loss grows exponentially quickly at early times before rapidly decreasing to a small value. During this phase the top eigenvalue of the neural tangent kernel (NTK) also undergoes significant evolution. In this work, we will prove that the catapult phase exists in a large class of models, including quadratic models and two-layer, homogenous neural nets. To do this, we show that for a certain range of learning rates the weight norm decreases whenever the loss becomes large. We also empirically study learning rates beyond this theoretically derived range and show that the activation map of ReLU nets trained with super-critical learning rates becomes increasingly sparse as we increase the learning rate.  ( 2 min )
    Understanding the diffusion models by conditional expectations. (arXiv:2301.07882v1 [cs.LG])
    This paper provide several mathematical analyses of the diffusion model in machine learning. The drift term of the backwards sampling process is represented as a conditional expectation involving the data distribution and the forward diffusion. The training process aims to find such a drift function by minimizing the mean-squared residue related to the conditional expectation. Using small-time approximations of the Green's function of the forward diffusion, we show that the analytical mean drift function in DDPM and the score function in SGM asymptotically blow up in the final stages of the sampling process for singular data distributions such as those concentrated on lower-dimensional manifolds, and is therefore difficult to approximate by a network. To overcome this difficulty, we derive a new target function and associated loss, which remains bounded even for singular data distributions. We illustrate the theoretical findings with several numerical examples.  ( 2 min )
  • Open

    measure for difference between two distribution
    Hi, I'm looking for a metric that will describe the difference between 2 distributions. This is being used for classification/model selection, to pick a model that is most similar to a theoretical distribution. The distributions are empiric, that is from a non parametric bootstrap and a simulation. The differences in the distributions could be in the mean or the variance (but neither need to be normal). I've looked at Kullback-Lieber, but that is intended for a cumulative probability distribution, and that pretty much removes the mean effect since the sum of probabilities must = 1. Had some success with the kolmogorov smirnov distance, which seems useful, as well as the jensen shannon divergence and the Wasserstein distance. The Jeffreys divergence really seems like what I want, but doesn't see to exist for numerical/empiric distributions in 1 dimension. It seems that most of the metrics for differences in distributions are for probability distributions. My distributions are not that, they may, for example have a range of 1000-4000, not 0-1. Any other ideas? thanks Mark submitted by /u/marksale11 [link] [comments]  ( 41 min )

  • Open

    [D] Object detection or image classification? Training a model to recognize playing cards
    Hi all, I have been experimenting with object detection recently, using Faster R-CNN and YOLOv7 to train models on pre-existing datasets. Using a UNO card dataset I was able to quite accurately detect the type of UNO cards, based on the symbol in the top left corner. I used an object detection approach, with UNO cards only being categorized into 14 classes. Based on that, I am wondering what the best approach would be to enhance the model to use for other and more comprehensive card games. Thinking of card games like Munchkin for example, which has 1000s of different cards. For card games like this, object detection might not be the best approach having 1000s of different classes to consider. ​ The two different approaches I am considering: Using object detection, create as many classes as there are different playing cards in the game, training the model to detect every single card individually or Using object detection, use playing cards to train the model to detect the playing card itself, then use the detected playing card as input for an image classification algorithm ​ For me there are pros and cons to both methods: The first approach might be much more accurate, as it detects each card individually. On the other hand, it seems to me that it needs considerably more classes and data to feed into those classes. It also might be difficult to expand the model with more unique cards, as you would have to rerun the model every time. The second approach might not be as accurate, as it might not only detect playing cards but also identify other objects as playing cards. On the flip side, it seems to me that it is much easier to expand the model with more unique cards. ​ What might be the best approach here? Do you have a different approach to this, which might be more efficient? submitted by /u/Pallemann [link] [comments]  ( 43 min )
    [D] Speech enhancement - like Adobe Enhance/Audo Studio
    Does anyone here know how Audo / Adobe Enhance work under the hood? Just wondering what open-source tooling already exists of similar quality, likewise with data, and architectures? Would anyone be interested in whipping up something open-source that can be self-hosted? submitted by /u/NegotiationUpbeat545 [link] [comments]  ( 42 min )
    [D] Generate data that is not a dataset
    Hey everyone, I'm currently faced with the challenge of having to generate data that is deliberately not in a dataset. So if you think about the dataset as a distribution, the data points should have a possibly low probability. Additionally, each data point is a 30 dimensional vector and I know the min and max values for each dimension. How do I do that? What kinds of algorithms could I use for that? Can I somehow fit a distribution and sample low probability data points from it? Or a GAN for generating? Or are there obvious classical ML or statistical methods for that? submitted by /u/NiconiusX [link] [comments]  ( 42 min )
    [D] Not sure if time series or multiple classifications?
    I am beginning a problem similar to the one bellow for my work. There is a score 1-4 (1 is bad, 4 is very good) of a persons back sprain recovery. The data we have are back sprain recovery scores recorded after two weeks, 3 months and 6 months, along with information (features) about their behavior like sleep, medications, diet, and exercise. We want to predict there 2 week, 3 month, and 6 month back sprain recovery scores based on their initial behavior inputs. For example, given a user sleeps 8 hours a day, consumes x amount of sugar, does physical therapy 4 days a week, and takes x medication, what will there recovery scores be at 2 weeks, 3 months and 6 months? The training data would look like: ​ Sleep Average Medication Days of Physical Therapy Diet Week 2 recovery score Month 3 recovery score Month 6 recovery score 9 hours per night Advil 4 days/ week Healthy 2 3 4 5 hours per night None 0 days/week Unhealthy 1 2 2 ​ I want a model (or multiple models) to predict 3 values which is the 2 week, 3 month, and 6 month scores. I am not familiar with time series, but it seems like the data may be too sparse. Should I be using time series here, or should I create 3 classification models? submitted by /u/spiritualquestions [link] [comments]  ( 43 min )
    [R] Is there a way to combine a knowledge graph and other types of data for ML purposes?
    Hello, I really don't know how to frame this question but I wanted to ask if the was a way to integrate the relationships and nodes of a knowledge graph with recorded data. Like for example, when a knowledge graph contains information about relationships between features, can it be integrated with a dataset containing recorded or measured quantities of those features. The goal of this is to "infuse" the recorded dataset with relationships already known in the knowledge graph for some data analysis purpose. I know it sounds confusing but you can as for clarification on some details. Please help. submitted by /u/Low-Mood3229 [link] [comments]  ( 43 min )
    [D] Discrete vs. Continuous Normalizing Flows
    I'm working on developing methods for some density estimation and inverse modeling tasks on physics simulation data, and normalizing flow methods seem to be a pretty good tool for this job. I'm right now looking to implement a few different model flavors a la INNs and OT-Flow, and am interested in hearing some perspectives from people in the community who have worked with these kinds of models. What would you consider the current state of the art in normalizing flow methods? Most of what I'm finding in the discrete space seems to have converged on flavors of RealNVP, while OT-Flow seems to be the most advanced in the continuous space. Beyond the benchmark performance metrics tabulated in the literature, what can we say about when to prefer continuous vs. discrete models? It's obviously going to be problem-dependent to some degree, but are there general heuristics to be aware of here? For continuous models (and implicit layer methods more generally), where are the research threads currently at improving runtime performance? The 2020 NeurIPS tutorial on implicit layers (link) has been helpful, but it would be interesting to know how things have advanced since then. Any and all insights would be appreciated! submitted by /u/nuclear_knucklehead [link] [comments]  ( 43 min )
    [N] ESANN 2023 | Special Session on Neuro-Symbolic AI (CFP)
    Neuro-symbolic AI is a promising approach to artificial intelligence that aims to combine the strengths of symbolic reasoning and probabilistic systems. For example, combining inductive logic programming and deep learning with applications in graphs, vision, reasoning and explainability. In this special session, we will provide an overview of neuro-symbolic AI, key concepts, and current state-of-the-art techniques. We will also discuss the potential benefits and challenges of neuro-symbolic AI and its potential impact on various fields and applications. In addition to the tutorial, we welcome contributions from attendees in the context of neuro-symbolic AI. This includes but is not limited to: • Novel neuro-symbolic models and techniques • Applications of neuro-symbolic AI to real-world problems • Empirical evaluations and comparisons of neuro-symbolic AI approaches • Theoretical foundations and analysis of neuro-symbolic AI • Emerging trends and challenges in the field of neuro-symbolic AI ​ Submission guidelines: https://www.esann.org/node/6 Paper submission deadline: 2 May 2023 Conference date: 4-6 October 2023 Conference location: Crowne Plaza hotel Bruges, Belgium submitted by /u/iav_tf_h [link] [comments]  ( 42 min )
    [D] Computationally light-weight deep learning research topics?
    Hello, I am familiar with the theory of differentiable computing and SGD training having done some research work on semi-supervised learning for image classification and semantic/panoptic segmentation. In other words, I am familiar with understanding implementing state-of-the-art proposals as well as tweaking them. Now I'm an unemployed and interested in conducting some research using PyTorch and Google Colab which seems feasible only if the problem or topic at hand is relatively low-cost. So I'm asking the question: What are some deep learning (metalearning,regularization, non-supervised training) or applied DL (CV/NLP/...) topics or datasets that are lightweight enough to be researched with just one GPU? Thanks and have a nice weekend submitted by /u/iamnotlefthanded666 [link] [comments]  ( 42 min )
    [D] "Deep Learning Tuning Playbook" (recently released by Google Brain people)
    https://github.com/google-research/tuning_playbook - Google has released a playbook (solely) about how to tune hyper-parameters of neural networks. Disclaimer: I am unrelated to this repository, just came across it and thought it is suitable for this subreddit. I have searched through and found no posts, thus I post it to hear some comments/insights from you ;) submitted by /u/fzyzcjy [link] [comments]  ( 44 min )
    [D] Did YouTube just add upscaling?
    So, these pictures below are taken from a 144p video on YouTube. You cannot tell me that these aren't CNN upscaling artefacts. So this raises the question of.... how exactly is this implemented? What model are they using which is tiny enough to run on (i assume) WebGL2? Is it a CNN inside of GLSL shaders? Is it something else? CPU side or GPU side? And also... how have I not seen a single other person pointing this out, anywhere on the internet. Believe me I looked. Ain't no one talking about this. EDIT: UPDATE this is doing it in ALL videos in chrome now. It only works in Chrome, not in Discord or Edge, so its not GPU/Windows fuckery. But the strange thing is other friends testing this with the same version of Chrome ***DONT*** have this? And the even stranger thing is... this is running on Intel Integrated Graphics... https://preview.redd.it/jnjwjzyag7da1.png?width=3240&format=png&auto=webp&s=504c9fa6ba41ae3a5b1266fe17e519839d3cf933 https://preview.redd.it/6vzyx5f1g7da1.png?width=1182&format=png&auto=webp&s=b56df5017d1fb742f042c847e42818d7f05a1888 https://preview.redd.it/bo36ko40g7da1.png?width=365&format=png&auto=webp&s=1777e238a7299084da9e10eb62c6c2539dc5cc86 https://preview.redd.it/16zpxwqyf7da1.png?width=333&format=png&auto=webp&s=38b9f2f4eb1a5999ad212aab6247e41a913a5294 submitted by /u/Avelina9X [link] [comments]  ( 46 min )
    [N] OpenAI Used Kenyan Workers on Less Than $2 Per Hour to Make ChatGPT Less Toxic
    https://time.com/6247678/openai-chatgpt-kenya-workers/ submitted by /u/ChubChubkitty [link] [comments]  ( 54 min )
    [P] paper-hero: Yet Another Paper Search Tool
    Hi guys, thanks for reading this post. I built a simplistic paper search tool that integrates ACL Anthology, arXiv API, and DBLP API. Github address: Spico197/paper-hero Motivation: I'm majoring NLP and I'd like to search for papers with "Event Extraction" as titles in specific proceedings (e.g. ACL, EMNLP). Challenge: There are lots of search tools and APIs, but few of them provide field-specific searches, like authors, titles, abstracts, and venues. Methodology: I integrate ACL Anthology, arXiv API, and DBLP API, and provide a two-stage search toolkit, which first stores target papers via the official fuzzy search API, and then matches specific fields. Advantages: This tool satisfies my need to stockpile papers and it can dump checklists in markdown format, or complete paper information in jsonl. AND and OR logics are supported in search queries. Limitations: This tool is based on simple string matching, so you have to know some terminologies in the target fields. You are warmly welcome to have a try and feel free to drop me an issue! from src.interfaces.aclanthology import AclanthologyPaperList from src.utils import dump_paper_list_to_markdown_checklist if __name__ == "__main__": # use `bash scripts/get_aclanthology.sh` to download and prepare anthology data first paper_list = AclanthologyPaperList("cache/aclanthology.json") ee_query = { "title": [ # Any of the strings below is matched ["information extraction"], ["event", "extraction"], # title must include `event` and `extraction` ["event", "argument", "extraction"], ["event", "detection"], ["event", "classification"], ["event", "tracking"], ["event", "relation", "extraction"], ], # Besides the title constraint, venue must also meet the needs "venue": [ ["acl"], ["emnlp"], ["naacl"], ["coling"], ["findings"], ["tacl"], ["cl"], ], } ee_papers = paper_list.search(ee_query) dump_paper_list_to_markdown_checklist(ee_papers, "results/ee-paper-list.md") ​ markdown checklist submitted by /u/Spico197 [link] [comments]  ( 44 min )
  • Open

    Hey anyone know where I could find a AI assisted writer with no world limit, or at least a VERY high one
    Title submitted by /u/Zan_korida [link] [comments]  ( 40 min )
    Consider this: You know how chatGPT says load failed when its giving YOU the most groundbreaking answer ever? I bet: It’s sent to OpenAI, you get “Load failed” and a less profound answer on regenerate.
    It’s the free preview. No breakthroughs for you. They collect amazing breakthroughs by the minute from humans working the AI. Just add a weight for profoundness 1-10. At 7, crash and send. User never gets it. Thoughts? submitted by /u/Overall-Importance54 [link] [comments]  ( 40 min )
    AI art - automation. A working artist's take.
    submitted by /u/WSCOKN [link] [comments]  ( 40 min )
    This website was created by an AI chatbot, and all of the content was generated by an AI image generator.
    submitted by /u/FreePixelArt [link] [comments]  ( 40 min )
    Amazon Wants To Help Community Colleges with AI
    Amazon has launched an "educator enablement" program to help instructors at community colleges, HBCUs, and other minority-serving institutions learn and teach AI. The professional development program will help college instructors gain a generalist AI skillset. Amazon will provide $1,200 and continuing education credits to 330 participants who complete one of the six boot camps being offered over the course of 2023. For colleges that don't get selected for the educator enablement cohort, Amazon plans to make curriculum materials for any interested college at no cost through Github, YouTube, and AWS Academy This is from the AI With Vibes Newsletter, read the full issue here: https://aiwithvibes.beehiiv.com/p/google-brings-in-legendary-duo-for-chatgpt-battle submitted by /u/Mk_Makanaki [link] [comments]  ( 40 min )
    "Sentient AI" - Example Of Just How Easy It Is To Prompt A Fake Sentient AI(GPT3)
    submitted by /u/TheRPGGamerMan [link] [comments]  ( 40 min )
    AI WARS: Explaining Google's painfully long, 15k-word! tome about their AI plans... or lack thereof
    We covered this in our newsletter today. Here it is verbatim-- if you find it useful, hit the link and sub: https://smokingrobot.beehiiv.com/p/ai-wars ​ Microsoft has dominated BIG TECH headlines over the last few months, thanks largely to a drumbeat of headlines involving their partner OpenAI and its world-shaping ChatGPT. So awe-striking is Microsoft's hand right now, it has made rival companies' advancements, like Apple's recently announced and insanely powerful M2 MacBook Pros, look pedestrian in comparison. But now Google has entered the chat. And by "entered the chat", we mean that CEO Sundar Pichai - Pich-AI? - released a distressingly long 15,000-word(!) treatise on its own endeavors in AI, signaling a counter attack... maybe... at some point in the future... when and if it…  ( 45 min )
    A ChatGPT software engineer in your pocket
    Having AI tools like ChatGPT is like having a personal software engineer in your pocket. But most people don't know how to craft prompts for code. Here's how you can get AI to write software for you: ​ https://preview.redd.it/tfkoz3x4v8da1.png?width=686&format=png&auto=webp&s=11a88815cb44e68fa2a5f63aa99f5867ac0aa755 submitted by /u/Imagine-your-success [link] [comments]  ( 40 min )
    DREAMBOOTH: 10 MINS TRAINING Inside Stable Diffusion!
    submitted by /u/PuppetHere [link] [comments]  ( 40 min )
    Powerful Tools to Test and Improve your Chatbot! 🔥
    submitted by /u/Marinuch [link] [comments]  ( 40 min )
    Walking simulators
    For a while Ive been seeing a lot of videos like these: https://youtu.be/wqvAconYgK0 https://youtu.be/qvpXpCvkqbc https://youtu.be/kQ2bqz3HPJE And Ive been wondering what program these might be using, and if its open to the public. If not is there any program that any of you might suggest to achieve identical if not similar results as I have a few ideas on how this could be utilised within a game engine and animation workspace Thanks submitted by /u/SyhrNewo [link] [comments]  ( 40 min )
    Google plans chatbot search engine and 20 new AI products
    submitted by /u/much_successes [link] [comments]  ( 40 min )
    🚀Online Real-Time Volumetric Nerf + Slam
    submitted by /u/oridnary_artist [link] [comments]  ( 40 min )
    Master Thesis about AI-designed jewelry
    Dear all! My master's thesis is coming to an end, and now I am seeking valuable insights and important data into my topic through a survey. This survey aims to understand consumer awareness and purchase intentions regarding jewelry created with the help of artificial intelligence. You can find the link here: https://ucpresearch.qualtrics.com/.../SV_ex2OALp6fPXQ8Qe I would be grateful if you participated in large numbers and provide me with valuable insights on the current topic. Thank you!! submitted by /u/Ok-Rise453 [link] [comments]  ( 40 min )
    ChatGPT Trend
    submitted by /u/Realistic-Plant3957 [link] [comments]  ( 40 min )
    Are TensorFlow and other ML frameworks worth learning in 2023?
    For some explanation, I am more familiar with PyTorch but I wanted to refresh my knowledge of machine learning and deep learning concepts. However, with the recent trend of moving away from TensorFlow and towards PyTorch, I wonder if TensorFlow and other ML frameworks are still worth learning today. I know certain algorithms are exclusively implemented in one or the other framework. I think at least TensorFlow is good since they’re well-documented and put to the test by other experts. But I’m not so sure about the case of newer custom frameworks. If I dive into them, it could be a step back for me especially after reading this article. It talks about newer ML framework launches and how a lot of people try it out at first, but then interest in said frameworks starts to decrease. I know there are a bunch of good custom frameworks out there but it might take more time for new tools to become mainstream or eventually die down. Which is why I’m afraid to use them at the moment. Let me know what you all think! Thanks! submitted by /u/ActionParticular7697 [link] [comments]  ( 41 min )
    Boston Dynamics reveals Atlas AI robot new ability to grip + autonomously manipulate objects | New 3D modeling Geocode AI creates + Edits highly realistic meshes | Breakthrough Text-To-Video "Tune a Video" uses diffusion models to output coherent video
    submitted by /u/SedatelyMake [link] [comments]  ( 40 min )
    Space Locator
    Hey all. My first post here. I’ll keep it as relevant as possible. I want to develop a AI model which will determine that if the space in a room is enough or not. Like it’ll determine whether the space in the room in enough or not. We’ll provide the pictures of the room and it’ll give us the output. I think there might be some pre-trained models out there which might be helpful. Please guide me in this regard if there are some models or where should I start. I’ll be grateful. Thank you so much in advance submitted by /u/h3artb3att [link] [comments]  ( 40 min )
    Porter Robinson music continued by OpenAI Jukebox
    submitted by /u/anoneemoosh [link] [comments]  ( 40 min )
    ChatGPT Accepted As Co-Author On Multiple Research Papers
    submitted by /u/vadhavaniyafaijan [link] [comments]  ( 40 min )
    Advancements in Natural Language Processing (NLP) and its applications in various industries
    Natural Language Processing (NLP) is a rapidly growing field within the realm of Artificial Intelligence (AI) that is revolutionizing the way we interact with machines. NLP is a branch of AI that deals with the interaction between human language and computers. It is used to analyze, understand and generate human language, and it has a wide range of applications in various industries. One of the most significant advancements in NLP is the development of deep learning algorithms, which have greatly improved the accuracy and efficiency of NLP models. These algorithms have enabled the development of more sophisticated NLP systems, such as those that can understand context and generate human-like responses. One of the most prominent applications of NLP is in the customer service industry. Com…  ( 42 min )
    TextCortex AI: AI Writing Companion - (Our free browser extension)
    submitted by /u/Ruzuyu [link] [comments]  ( 40 min )
    How do you get AI art generators to produce amazing images that look like real art? Take a text-guided diffusion model and feed it the ideal text prompt with the right keywords
    Paper: https://arxiv.org/abs/2209.11711 Abstract: Recent progress in generative models, especially in text-guided diffusion models, has enabled the production of aesthetically-pleasing imagery resembling the works of professional human artists. However, one has to carefully compose the textual description, called the prompt, and augment it with a set of clarifying keywords. Since aesthetics are challenging to evaluate computationally, human feedback is needed to determine the optimal prompt formulation and keyword combination. In this paper, we present a human-in-the-loop approach to learning the most useful combination of prompt keywords using a genetic algorithm. We also show how such an approach can improve the aesthetic appeal of images depicting the same descriptions. https://preview.redd.it/ei4wm8tv66da1.png?width=1852&format=png&auto=webp&s=e20c196fe9a543e6b4ec58b3d4f689624db7f95d submitted by /u/Ok_Mine_5742 [link] [comments]  ( 40 min )
    Created by Stable Diffusion
    submitted by /u/NorthTs [link] [comments]  ( 40 min )
  • Open

    Road Map to Machine Learning & Deep Learning
    A Good Road Map To Machine Learning enginner  ( 14 min )
    Sleep disorders: can AI and Digital Twin help?
    According to the National Sleep Foundation, it is estimated that 50–70 million adults in the United States have a sleep disorder.  ( 25 min )
    Top 10 AI Applications in HRM
    Artificial Intelligence (AI) is revolutionizing the way businesses operate, and the field of Human Resource Management (HRM) is no…  ( 7 min )
    Unlocking the Power of Time Series Forecasting: A Step-by-Step Guide with Code Examples in Python
    Time series forecasting is the process of using a model to predict future values of a time series based on its past values. Continue reading on Becoming Human: Artificial Intelligence Magazine »  ( 10 min )
  • Open

    Exposing Reliability Degradation and Mitigation in Approximate DNNs under Permanent Faults
    submitted by /u/Chipdoc [link] [comments]  ( 40 min )
    Machine vision within airports
    My first idea was to use simple object recognition on the x-ray machines in airports to detect weapons. As blunt objects are hard for humans to detect, and so are plastic weapons. Surely this could be solved with a simple object recognition algorithm. I then started researching airport security, and came across, without a doubt, the safest airport in the world, the Israeli airport. They have so many incredibly invasive processes, such as interviews with highly trained personnel, who are trained to spot a liar, and officers following you around if you are deemed "high-risk" (sidenote: they are also unapologetically racist about who they deem as high-risk). Couldn't both of those processes be automated? The interviewing could be done by a non-trained worker, then have cameras that analyse the person. And the officers following people around could also be done using CCTV and some machine vision. Or even analyse people when they're walking around I'm not saying that the Iranian airport should do this, as it would be a step-down, which they clearly will not accept. Instead could this not be done in western airports, as there have been many reports indicating the lack of success when it comes to them catching anyone? Could they implement poor versions of the Iranian airport security system? ​ P.S. I recently visited Senegal (west Africa). On the way back from Senegal to the UK I was totally unaware that I had a large water bottle in my bag, and the screening was so useless that they never found it. The guy manning the scanner was watching TikTok on his phone. When I landed I then could have gone anywhere in the world without being checked again, which goes to show how much of a waste of time the current airport security is. submitted by /u/Tom_nerd [link] [comments]  ( 41 min )
    Which gpu would be better for training? Linux, PyTorch, I have a gigabyte rtx 2070 and amp FirePro s9300 x2?
    Amd* submitted by /u/DaOnlyBaby [link] [comments]  ( 40 min )
    🚀Online Real-Time Volumetric Nerf + Slam
    submitted by /u/oridnary_artist [link] [comments]  ( 40 min )
    How do you get AI art generators to produce amazing images that look like real art? Take a text-guided diffusion model and feed it the ideal text prompt with the right keywords
    Paper: https://arxiv.org/abs/2209.11711 Abstract: Recent progress in generative models, especially in text-guided diffusion models, has enabled the production of aesthetically-pleasing imagery resembling the works of professional human artists. However, one has to carefully compose the textual description, called the prompt, and augment it with a set of clarifying keywords. Since aesthetics are challenging to evaluate computationally, human feedback is needed to determine the optimal prompt formulation and keyword combination. In this paper, we present a human-in-the-loop approach to learning the most useful combination of prompt keywords using a genetic algorithm. We also show how such an approach can improve the aesthetic appeal of images depicting the same descriptions. https://preview.redd.it/p47ovsfc66da1.png?width=1852&format=png&auto=webp&s=516eda2724c4f254b59109fd46a65dba8ffd79a1 submitted by /u/Ok_Mine_5742 [link] [comments]  ( 40 min )
  • Open

    What are the current state-of-the-art algorithms?
    What algorithms are currently considered state-of-the-art? I’m specifically interested in those which are off-policy as I have found DQN to be the best choice for my application so far. Some algorithms I’m considering trying are R2D2, Duelling DQN and Rainbow DQN. Are there any others that could be worth a look? submitted by /u/centripetalstranger [link] [comments]  ( 40 min )
    In RL, how does one provide a theoretical justification of why one algorithm works better than the other?
    Completely random example, let's say that your experiments consistently demonstrate that a recurrent policy (LSTM) in PPO works better than a linear policy, more specifically in one kind of environments (say, environments that require cooperation between agents). Now, how do you justify theoretically your empirical finding? In other words, how do you explain the theory behind the linear policy's limitations? submitted by /u/No_Possibility_7588 [link] [comments]  ( 43 min )
    DQN for simple toy env
    Dear RL community, ​ I'm trying to get DQN (using the stable baselines 3 implementation) to solve my toy environment. No matter how many different hyperparameter configurations I try, I can't get it to work. ​ Here are some details about the environment (with num_objects=1). - the agent is essentially controlling a gripper that moves to one of the squares, grasps the item, and then moves to the target location. - it's a grid world set on 2 levels with each level of size 3x3 - the available actions are one for each square (the agent is teleported there) and 2 more for grasping and releasing. total of 20 discrete actions. - the reward provides a signal as the distance between the object and the target goal, a big reward if succeeded or -1 when it does something bad. ​ Here's my training code. I've also added a human_step function that can be used to control the gripper yourself. ​ Do you have any insights as to why it's not working? Please ask if you need more details about any part of the implementation. ​ Many thanks! submitted by /u/Mr_Physic13 [link] [comments]  ( 41 min )
    How to proceed scientifically when your hypothesis is falsified?
    I predicted that a certain change in the architecture of my agents would boost their coordination (in the context of multi-agent reinforcement learning). However, I tested this in the Meetup environment and it is not working, in the sense that it performs slightly worse than the baseline. This is how the environment works: three agents must collectively choose one of K landmarks and congregate near it. At each time step, each agent receives reward equal to the change in distance between itself and the landmark closest to all three agents. The goal landmark changes depending on the current position of all agents. When all K agents are adjacent to the same landmark, the agents receive a bonus of 1 and the episode ends. Scientifically speaking, how can I be rigorous about testing this hypothesis again? A few ideas: 1 Repeat the experiment multiple times with different random seeds to ensure that the results are robust and not influenced by random variations. 2 Vary the parameters of the agent Vary the number of modules used in the policy and test the effect on coordination. Increase the number of agents 3 Vary the parameters of the environment Changing the number of landmarks Adding distractors 4 Test another environment What do you think? - submitted by /u/No_Possibility_7588 [link] [comments]  ( 42 min )
    Continuous action space that should often return an exact value that's inputted (attempt with reinforce algorithm)
    I was wondering if this is bad practice or bound to fail. I'm in an environment where the state is x1 plus a bunch of other things call them x2. Sometimes the best action is x1 (and exactly x1). Sometimes the best action is a function of x2 (and x1). What I was thinking (and am trying to so far little success) is to have the reinforce algorithm output two gaussian action values. The first, a1, is f(x1,x2). The second, a2, is put into a sigmoid function and if it's greater than 0.5, use action f(x2,x1) and otherwise x1. It seems weird to me that in the reinforce algorithm I'd still use the log probability of both action values if a2<0.5 in which case a1 doesn't get used at all. In that case should I find the probability only of a2 and use that instead? If this idea is completely off base and you have suggestions please lmk. submitted by /u/JustTaxLandLol [link] [comments]  ( 41 min )
    Environment for General AI using Reinforcement Learning?
    I have many custom environments with these features: Observation, Valid Actions at a specific Observation, Sparse reward when the episode is terminated, either 1 or 0 I want to build a Reinforcement Learning Agent that can perform well in these environments. I started this personal project 2 years ago and it get harder the more I try to do it. Is there any other environments like this, or what paper can I read more about this kind of Reinforcement Learning ? submitted by /u/Open_Ranger4375 [link] [comments]  ( 41 min )
    agent not learning using dqn.
    Hello forum, I am trying to get a single joint actuated link to stand upright. I am using dqn as a method for the agent to learn. I have tried using different inputs and outputs with the neural net but is still failing to learn. Can you take a look at my code? #!/usr/bin/env python3 from interbotix_xs_modules.arm import InterbotixManipulatorXS import rospy import time import numpy as np import matplotlib.pyplot as plt import torch import torch.nn as nn import torch.optim as optim import math import random import matplotlib.pyplot as plt ​ from std_msgs.msg import Float64 from gazebo_msgs.msg import LinkStates from geometry_msgs.msg import Pose, Twist from std_srvs.srv import Empty from sensor_msgs.msg import JointState ​ bot = InterbotixManipulatorXS(robot_model="rx15…  ( 44 min )
  • Open

    ­­How CCC Intelligent Solutions created a custom approach for hosting complex AI models using Amazon SageMaker
    This post is co-written by Christopher Diaz, Sam Kinard, Jaime Hidalgo and Daniel Suarez  from CCC Intelligent Solutions. In this post, we discuss how CCC Intelligent Solutions (CCC) combined Amazon SageMaker with other AWS services to create a custom solution capable of hosting the types of complex artificial intelligence (AI) models envisioned. CCC is a […]  ( 13 min )
  • Open

    Oval orbits?
    Johannes Kepler thought that planetary orbits were ellipses. Giovanni Cassini thought they were ovals. Kepler was right, but Cassini wasn’t far off. In everyday speech, people use the words ellipse and oval interchangeably. But in mathematics these terms are distinct. There is one definition of an ellipse, and several definitions of an oval. To be […] Oval orbits? first appeared on John D. Cook.  ( 6 min )
    Cassini ovals
    An ellipse can be defined as the set of points such that the sum of the distances to two fixed points, the foci, has a constant value. A Cassini oval is the set of points such that the product of the distances to two foci has a constant value. You can write down an equation […] Cassini ovals first appeared on John D. Cook.  ( 5 min )
    Bounds on power series coefficients
    Let f be an analytic function on the unit disk with f(0) = 0 and derivative f ′(0) = 1. If f is one-to-one (injective) then this puts a strict limit on the size of the series coefficients. Let an be the nth coefficient in the power series for f centered at 0. If f is one-to-one […] Bounds on power series coefficients first appeared on John D. Cook.  ( 5 min )
  • Open

    What Is AI Computing?
    The abacus, sextant, slide rule and computer. Mathematical instruments mark the history of human progress. They’ve enabled trade and helped navigate oceans, and advanced understanding and quality of life. The latest tool propelling science and industry is AI computing. AI Computing Defined AI computing is the math-intensive process of calculating machine learning algorithms, typically using Read article >  ( 8 min )
  • Open

    MIT researchers develop an AI model that can detect future lung cancer risk
    Deep-learning model takes a personalized approach to assessing each patient’s risk of lung cancer based on CT scans.  ( 10 min )
  • Open

    Fully Autonomous Real-World Reinforcement Learning with Applications to Mobile Manipulation
    Reinforcement learning provides a conceptual framework for autonomous agents to learn from experience, analogously to how one might train a pet with treats. But practical applications of reinforcement learning are often far from natural: instead of using RL to learn through trial and error by actually attempting the desired task, typical RL applications use a separate (usually simulated) training phase. For example, AlphaGo did not learn to play Go by competing against thousands of humans, but rather by playing against itself in simulation. While this kind of simulated training is appealing for games where the rules are perfectly known, applying this to real world domains such as robotics can require a range of complex approaches, such as the use of simulated data, or instrumenting real-wo…  ( 4 min )
  • Open

    Reverse Differentiation via Predictive Coding. (arXiv:2103.04689v3 [cs.LG] UPDATED)
    Deep learning has redefined the field of artificial intelligence (AI) thanks to the rise of artificial neural networks, which are architectures inspired by their neurological counterpart in the brain. Through the years, this dualism between AI and neuroscience has brought immense benefits to both fields, allowing neural networks to be used in dozens of applications. These networks use an efficient implementation of reverse differentiation, called backpropagation (BP). This algorithm, however, is often criticized for its biological implausibility (e.g., lack of local update rules for the parameters). Therefore, biologically plausible learning methods that rely on predictive coding (PC), a framework for describing information processing in the brain, are increasingly studied. Recent works prove that these methods can approximate BP up to a certain margin on multilayer perceptrons (MLPs), and asymptotically on any other complex model, and that zero-divergence inference learning (Z-IL), a variant of PC, is able to exactly implement BP on MLPs. However, the recent literature shows also that there is no biologically plausible method yet that can exactly replicate the weight update of BP on complex models. To fill this gap, in this paper, we generalize (PC and) Z-IL by directly defining them on computational graphs, and show that it can perform exact reverse differentiation. What results is the first biologically plausible algorithm that is equivalent to BP in the way of updating parameters on any neural network, providing a bridge between the interdisciplinary research of neuroscience and deep learning.  ( 2 min )
    Digital Twin-Based Multiple Access Optimization and Monitoring via Model-Driven Bayesian Learning. (arXiv:2210.05582v2 [eess.SP] UPDATED)
    Commonly adopted in the manufacturing and aerospace sectors, digital twin (DT) platforms are increasingly seen as a promising paradigm to control and monitor software-based, "open", communication systems, which play the role of the physical twin (PT). In the general framework presented in this work, the DT builds a Bayesian model of the communication system, which is leveraged to enable core DT functionalities such as control via multi-agent reinforcement learning (MARL) and monitoring of the PT for anomaly detection. We specifically investigate the application of the proposed framework to a simple case-study system encompassing multiple sensing devices that report to a common receiver. The Bayesian model trained at the DT has the key advantage of capturing epistemic uncertainty regarding the communication system, e.g., regarding current traffic conditions, which arise from limited PT-to-DT data transfer. Experimental results validate the effectiveness of the proposed Bayesian framework as compared to standard frequentist model-based solutions.  ( 2 min )
    How Good Is NLP? A Sober Look at NLP Tasks through the Lens of Social Impact. (arXiv:2106.02359v3 [cs.CL] UPDATED)
    Recent years have seen many breakthroughs in natural language processing (NLP), transitioning it from a mostly theoretical field to one with many real-world applications. Noting the rising number of applications of other machine learning and AI techniques with pervasive societal impact, we anticipate the rising importance of developing NLP technologies for social good. Inspired by theories in moral philosophy and global priorities research, we aim to promote a guideline for social good in the context of NLP. We lay the foundations via the moral philosophy definition of social good, propose a framework to evaluate the direct and indirect real-world impact of NLP tasks, and adopt the methodology of global priorities research to identify priority causes for NLP research. Finally, we use our theoretical framework to provide some practical guidelines for future NLP research for social good. Our data and code are available at this http URL In addition, we curate a list of papers and resources on NLP for social good at https://github.com/zhijing-jin/NLP4SocialGood_Papers.  ( 2 min )
    - Modelling Difference Between Censored and Uncensored Electric Vehicle Charging Demand. (arXiv:2301.06418v2 [cs.AI] UPDATED)
    Electric vehicle charging demand models, with charging records as input, will inherently be biased toward the supply of available chargers, as the data do not include demand lost from occupied stations and competitors. This lost demand implies that the records only observe a fraction of the total demand, i.e. the observations are censored, and actual demand is likely higher than what the data reflect. Machine learning models often neglect to account for this censored demand when forecasting the charging demand, which limits models' applications for future expansions and supply management. We address this gap by modelling the charging demand with probabilistic censorship-aware graph neural networks, which learn the latent demand distribution in both the spatial and temporal dimensions. We use GPS trajectories from cars in Copenhagen, Denmark, to study how censoring occurs and much demand is lost due to occupied charging and competing services. We find that censorship varies throughout the city and over time, encouraging spatial and temporal modelling. We find that in some regions of Copenhagen, censorship occurs 61% of the time. Our results show censorship-aware models provide better prediction and uncertainty estimation in actual future demand than censorship-unaware models. Our results suggest that future models based on charging records should account for the censoring to expand the application areas of machine learning models in this supply management and infrastructure expansion.  ( 2 min )
    Scalable Deep Graph Clustering with Random-walk based Self-supervised Learning. (arXiv:2112.15530v2 [cs.LG] UPDATED)
    Web-based interactions can be frequently represented by an attributed graph, and node clustering in such graphs has received much attention lately. Multiple efforts have successfully applied Graph Convolutional Networks (GCN), though with some limits on accuracy as GCNs have been shown to suffer from over-smoothing issues. Though other methods (particularly those based on Laplacian Smoothing) have reported better accuracy, a fundamental limitation of all the work is a lack of scalability. This paper addresses this open problem by relating the Laplacian smoothing to the Generalized PageRank and applying a random-walk based algorithm as a scalable graph filter. This forms the basis for our scalable deep clustering algorithm, RwSL, where through a self-supervised mini-batch training mechanism, we simultaneously optimize a deep neural network for sample-cluster assignment distribution and an autoencoder for a clustering-oriented embedding. Using 6 real-world datasets and 6 clustering metrics, we show that RwSL achieved improved results over several recent baselines. Most notably, we show that RwSL, unlike all other deep clustering frameworks, can continue to scale beyond graphs with more than one million nodes, i.e., handle web-scale. We also demonstrate how RwSL could perform node clustering on a graph with 1.8 billion edges using only a single GPU.  ( 2 min )
    Concentration inequalities for leave-one-out cross validation. (arXiv:2211.02478v2 [math.ST] UPDATED)
    In this article we prove that estimator stability is enough to show that leave-one-out cross validation is a sound procedure, by providing concentration bounds in a general framework. In particular, we provide concentration bounds beyond Lipschitz continuity assumptions on the loss or on the estimator. In order to obtain our results, we rely on random variables with distribution satisfying the logarithmic Sobolev inequality, providing us a relatively rich class of distributions. We illustrate our method by considering several interesting examples, including linear regression, kernel density estimation, and stabilized / truncated estimators such as stabilized kernel regression.  ( 2 min )
    Adversarial AI in Insurance: Pervasiveness and Resilience. (arXiv:2301.07520v1 [cs.LG])
    The rapid and dynamic pace of Artificial Intelligence (AI) and Machine Learning (ML) is revolutionizing the insurance sector. AI offers significant, very much welcome advantages to insurance companies, and is fundamental to their customer-centricity strategy. It also poses challenges, in the project and implementation phase. Among those, we study Adversarial Attacks, which consist of the creation of modified input data to deceive an AI system and produce false outputs. We provide examples of attacks on insurance AI applications, categorize them, and argue on defence methods and precautionary systems, considering that they can involve few-shot and zero-shot multilabelling. A related topic, with growing interest, is the validation and verification of systems incorporating AI and ML components. These topics are discussed in various sections of this paper.  ( 2 min )
    Global Contrastive Batch Sampling via Optimization on Sample Permutations. (arXiv:2210.12874v3 [cs.LG] UPDATED)
    Contrastive Learning has recently achieved state-of-the-art performance in a wide range of tasks. Many contrastive learning approaches use mined hard negatives to make batches more informative during training but these approaches are inefficient as they increase epoch length proportional to the number of mined negatives and require frequent updates of nearest neighbor indices or mining from recent batches. In this work, we provide an alternative to hard negative mining, Global Contrastive Batch Sampling (GCBS), an efficient approximation to the batch assignment problem that upper bounds the gap between the global and training losses, $\mathcal{L}^{Global} - \mathcal{L}^{Train}$, in contrastive learning settings. Through experimentation we find GCBS improves state-of-the-art performance in sentence embedding and code-search tasks. Additionally, GCBS is easy to implement as it requires only a few additional lines of code, does not maintain external data structures such as nearest neighbor indices, is more computationally efficient than the most minimal hard negative mining approaches, and makes no changes to the model being trained.  ( 2 min )
    Weight Matrix Dimensionality Reduction in Deep Learning via Kronecker Multi-layer Architectures. (arXiv:2204.04273v2 [cs.LG] UPDATED)
    Deep learning using neural networks is an effective technique for generating models of complex data. However, training such models can be expensive when networks have large model capacity resulting from a large number of layers and nodes. For training in such a computationally prohibitive regime, dimensionality reduction techniques ease the computational burden, and allow implementations of more robust networks. We propose a novel type of such dimensionality reduction via a new deep learning architecture based on fast matrix multiplication of a Kronecker product decomposition; in particular our network construction can be viewed as a Kronecker product-induced sparsification of an "extended" fully connected network. Analysis and practical examples show that this architecture allows a neural network to be trained and implemented with a significant reduction in computational time and resources, while achieving a similar error level compared to a traditional feedforward neural network.  ( 2 min )
    Dirichlet-Neumann learning algorithm for solving elliptic interface problems. (arXiv:2301.07361v1 [math.NA])
    Non-overlapping domain decomposition methods are natural for solving interface problems arising from various disciplines, however, the numerical simulation requires technical analysis and is often available only with the use of high-quality grids, thereby impeding their use in more complicated situations. To remove the burden of mesh generation and to effectively tackle with the interface jump conditions, a novel mesh-free scheme, i.e., Dirichlet-Neumann learning algorithm, is proposed in this work to solve the benchmark elliptic interface problem with high-contrast coefficients as well as irregular interfaces. By resorting to the variational principle, we carry out a rigorous error analysis to evaluate the discrepancy caused by the boundary penalty treatment for each decomposed subproblem, which paves the way for realizing the Dirichlet-Neumann algorithm using neural network extension operators. The effectiveness and robustness of our proposed methods are demonstrated experimentally through a series of elliptic interface problems, achieving better performance over other alternatives especially in the presence of erroneous flux prediction at interface.  ( 2 min )
    CLIPTER: Looking at the Bigger Picture in Scene Text Recognition. (arXiv:2301.07464v1 [cs.CV])
    Understanding the scene is often essential for reading text in real-world scenarios. However, current scene text recognizers operate on cropped text images, unaware of the bigger picture. In this work, we harness the representative power of recent vision-language models, such as CLIP, to provide the crop-based recognizer with scene, image-level information. Specifically, we obtain a rich representation of the entire image and fuse it with the recognizer word-level features via cross-attention. Moreover, a gated mechanism is introduced that gradually shifts to the context-enriched representation, enabling simply fine-tuning a pretrained recognizer. We implement our model-agnostic framework, named CLIPTER - CLIP Text Recognition, on several leading text recognizers and demonstrate consistent performance gains, achieving state-of-the-art results over multiple benchmarks. Furthermore, an in-depth analysis reveals improved robustness to out-of-vocabulary words and enhanced generalization in low-data regimes.  ( 2 min )
    Prompting Large Language Model for Machine Translation: A Case Study. (arXiv:2301.07069v2 [cs.CL] UPDATED)
    Research on prompting has shown excellent performance with little or even no supervised training across many tasks. However, prompting for machine translation is still under-explored in the literature. We fill this gap by offering a systematic study on prompting strategies for translation, examining various factors for prompt template and demonstration example selection. We further explore the use of monolingual data and the feasibility of cross-lingual, cross-domain, and sentence-to-document transfer learning in prompting. Extensive experiments with GLM-130B (Zeng et al., 2022) as the testbed show that 1) the number and the quality of prompt examples matter, where using suboptimal examples degenerates translation; 2) several features of prompt examples, such as semantic similarity, show significant Spearman correlation with their prompting performance; yet, none of the correlations are strong enough; 3) using pseudo parallel prompt examples constructed from monolingual data via zero-shot prompting could improve translation; and 4) improved performance is achievable by transferring knowledge from prompt examples selected in other settings. We finally provide an analysis on the model outputs and discuss several problems that prompting still suffers from.  ( 2 min )
    Neural DAEs: Constrained neural networks. (arXiv:2211.14302v2 [cs.LG] UPDATED)
    In this article we investigate the effect of explicitly adding auxiliary trajectory information to neural networks for dynamical systems. We draw inspiration from the field of differential-algebraic equations and differential equations on manifolds and implement similar methods in residual neural networks. We discuss constraints through stabilization as well as projection methods, and show when to use which method based on experiments involving simulations of multi-body pendulums and molecular dynamics scenarios. Several of our methods are easy to implement in existing code and have limited impact on training performance while giving significant boosts in terms of inference.  ( 2 min )
    Nostradamus: Weathering Worth. (arXiv:2212.05933v2 [q-fin.ST] UPDATED)
    Nostradamus, inspired by the French astrologer and reputed seer, is a detailed study exploring relations between environmental factors and changes in the stock market. In this paper, we analyze associative correlation and causation between environmental elements (including natural disasters, climate and weather conditions) and stock prices, using historical stock market data, historical climate data, and various climate indicators such as carbon dioxide emissions. We have conducted our study based on the US financial market, global climate trends, and daily weather records to demonstrate a significant relationship between climate and stock price fluctuation. Our analysis covers both short-term and long-term rises and dips in company stock performances. Lastly, we take four natural disasters as a case study to observe the effect they have on people's emotional state and their influence on the stock market.  ( 2 min )
    Quantification of geogrid lateral restraint using transparent sand and deep learning-based image segmentation. (arXiv:2212.02939v2 [physics.geo-ph] UPDATED)
    An experimental technique is presented to quantify the lateral restraint provided by a geogrid embedded in granular soil at the particle level. Repeated load triaxial tests were done on transparent sand specimens with geosynthetic inclusions simulating geogrids. Particle outlines on laser illuminated planes through the specimens were segmented using a deep learning-based segmentation algorithm. The particle outlines were characterized in terms of Fourier shape descriptors and tracked across sequentially captured images. The accuracy of the particle displacement measurements was validated against Digital Image Correlation (DIC) measurements. In addition, the method's resolution and repeatability is presented. Based on the measured particle displacements and rotations, a state boundary line between probable and improbable particle motions was identified for each test. The size of the zone of probable motions could be used to quantify the lateral restraint provided by the inclusions. Overall, the tests results revealed that the geosynthetic inclusions restricted both particle displacements and rotations. However, the particle displacements were found to be restrained more significantly than the rotations. Finally, a unique relationship was found between the magnitude of the permanent strains of the specimens and the size of the zone of probable motions.  ( 2 min )
    Generalized Many-Body Dispersion Correction through Random-phase Approximation for Chemically Accurate Density Functional Theory. (arXiv:2210.09784v4 [physics.chem-ph] UPDATED)
    We extend our recently proposed Deep Learning-aided many-body dispersion (DNN-MBD) model to quadrupole polarizability (Q) terms using a generalized Random Phase Approximation (RPA) formalism, thus enabling the inclusion of van der Waals contributions beyond dipole. The resulting DNN-MBDQ model only relies on ab initio-derived quantities as the introduced quadrupole polarizabilities are recursively retrieved from dipole ones, in turn modelled via the Tkatchenko-Scheffler method. A transferable and efficient deep-neuronal network (DNN) provides atom in molecule volumes, while a single range-separation parameter is used to couple the model to Density Functional Theory (DFT). Since it can be computed at a negligible cost, the DNN-MBDQ approach can be coupled with DFT functionals such as PBE,PBE0 and B86bPBE (dispersionless). The DNN-MBQ-corrected functionals reach chemical accuracy while exhibiting lower errors compared to their dipole-only counterparts.  ( 2 min )
    Multimodality Helps Unimodality: Cross-Modal Few-Shot Learning with Multimodal Models. (arXiv:2301.06267v2 [cs.CV] UPDATED)
    The ability to quickly learn a new task with minimal instruction - known as few-shot learning - is a central aspect of intelligent agents. Classical few-shot benchmarks make use of few-shot samples from a single modality, but such samples may not be sufficient to characterize an entire concept class. In contrast, humans use cross-modal information to learn new concepts efficiently. In this work, we demonstrate that one can indeed build a better ${\bf visual}$ dog classifier by ${\bf read}$ing about dogs and ${\bf listen}$ing to them bark. To do so, we exploit the fact that recent multimodal foundation models such as CLIP are inherently cross-modal, mapping different modalities to the same representation space. Specifically, we propose a simple cross-modal adaptation approach that learns from few-shot examples spanning different modalities. By repurposing class names as additional one-shot training samples, we achieve SOTA results with an embarrassingly simple linear classifier for vision-language adaptation. Furthermore, we show that our approach can benefit existing methods such as prefix tuning, adapters, and classifier ensembling. Finally, to explore other modalities beyond vision and language, we construct the first (to our knowledge) audiovisual few-shot benchmark and use cross-modal training to improve the performance of both image and audio classification.  ( 2 min )
    Spiking Neural Network Decision Feedback Equalization. (arXiv:2211.04756v3 [eess.SP] UPDATED)
    In the past years, artificial neural networks (ANNs) have become the de-facto standard to solve tasks in communications engineering that are difficult to solve with traditional methods. In parallel, the artificial intelligence community drives its research to biology-inspired, brain-like spiking neural networks (SNNs), which promise extremely energy-efficient computing. In this paper, we investigate the use of SNNs in the context of channel equalization for ultra-low complexity receivers. We propose an SNN-based equalizer with a feedback structure akin to the decision feedback equalizer (DFE). For conversion of real-world data into spike signals we introduce a novel ternary encoding and compare it with traditional log-scale encoding. We show that our approach clearly outperforms conventional linear equalizers for three different exemplary channels. We highlight that mainly the conversion of the channel output to spikes introduces a small performance penalty. The proposed SNN with a decision feedback structure enables the path to competitive energy-efficient transceivers.  ( 2 min )
    Antenna Array Calibration Via Gaussian Process Models. (arXiv:2301.06582v2 [eess.SP] UPDATED)
    Antenna array calibration is necessary to maintain the high fidelity of beam patterns across a wide range of advanced antenna systems and to ensure channel reciprocity in time division duplexing schemes. Despite the continuous development in this area, most existing solutions are optimised for specific radio architectures, require standardised over-the-air data transmission, or serve as extensions of conventional methods. The diversity of communication protocols and hardware creates a problematic case, since this diversity requires to design or update the calibration procedures for each new advanced antenna system. In this study, we formulate antenna calibration in an alternative way, namely as a task of functional approximation, and address it via Bayesian machine learning. Our contributions are three-fold. Firstly, we define a parameter space, based on near-field measurements, that captures the underlying hardware impairments corresponding to each radiating element, their positional offsets, as well as the mutual coupling effects between antenna elements. Secondly, Gaussian process regression is used to form models from a sparse set of the aforementioned near-field data. Once deployed, the learned non-parametric models effectively serve to continuously transform the beamforming weights of the system, resulting in corrected beam patterns. Lastly, we demonstrate the viability of the described methodology for both digital and analog beamforming antenna arrays of different scales and discuss its further extension to support real-time operation with dynamic hardware impairments.  ( 2 min )
    Teacher Forcing Recovers Reward Functions for Text Generation. (arXiv:2210.08708v2 [cs.LG] UPDATED)
    Reinforcement learning (RL) has been widely used in text generation to alleviate the exposure bias issue or to utilize non-parallel datasets. The reward function plays an important role in making RL training successful. However, previous reward functions are typically task-specific and sparse, restricting the use of RL. In our work, we propose a task-agnostic approach that derives a step-wise reward function directly from a model trained with teacher forcing. We additionally propose a simple modification to stabilize the RL training on non-parallel datasets with our induced reward function. Empirical results show that our method outperforms self-training and reward regression methods on several text generation tasks, confirming the effectiveness of our reward function.  ( 2 min )
    Planning and Learning with Adaptive Lookahead. (arXiv:2201.12403v2 [cs.LG] UPDATED)
    Some of the most powerful reinforcement learning frameworks use planning for action selection. Interestingly, their planning horizon is either fixed or determined arbitrarily by the state visitation history. Here, we expand beyond the naive fixed horizon and propose a theoretically justified strategy for adaptive selection of the planning horizon as a function of the state-dependent value estimate. We propose two variants for lookahead selection and analyze the trade-off between iteration count and computational complexity per iteration. We then devise a corresponding deep Q-network algorithm with an adaptive tree search horizon. We separate the value estimation per depth to compensate for the off-policy discrepancy between depths. Lastly, we demonstrate the efficacy of our adaptive lookahead method in a maze environment and Atari.  ( 2 min )
    An Analysis of Loss Functions for Binary Classification and Regression. (arXiv:2301.07638v1 [stat.ML])
    This paper explores connections between margin-based loss functions and consistency in binary classification and regression applications. It is shown that a large class of margin-based loss functions for binary classification/regression result in estimating scores equivalent to log-likelihood scores weighted by an even function. A simple characterization for conformable (consistent) loss functions is given, which allows for straightforward comparison of different losses, including exponential loss, logistic loss, and others. The characterization is used to construct a new Huber-type loss function for the logistic model. A simple relation between the margin and standardized logistic regression residuals is derived, demonstrating that all margin-based loss can be viewed as loss functions of squared standardized logistic regression residuals. The relation provides new, straightforward interpretations for exponential and logistic loss, and aids in understanding why exponential loss is sensitive to outliers. In particular, it is shown that minimizing empirical exponential loss is equivalent to minimizing the sum of squared standardized logistic regression residuals. The relation also provides new insight into the AdaBoost algorithm.  ( 2 min )
    Operator Learning Framework for Digital Twin and Complex Engineering Systems. (arXiv:2301.06701v2 [cs.LG] UPDATED)
    With modern computational advancements and statistical analysis methods, machine learning algorithms have become a vital part of engineering modeling. Neural Operator Networks (ONets) is an emerging machine learning algorithm as a "faster surrogate" for approximating solutions to partial differential equations (PDEs) due to their ability to approximate mathematical operators versus the direct approximation of Neural Networks (NN). ONets use the Universal Approximation Theorem to map finite-dimensional inputs to infinite-dimensional space using the branch-trunk architecture, which encodes domain and feature information separately before using a dot product to combine the information. ONets are expected to occupy a vital niche for surrogate modeling in physical systems and Digital Twin (DT) development. Three test cases are evaluated using ONets for operator approximation, including a 1-dimensional ordinary differential equations (ODE), general diffusion system, and convection-diffusion (Burger) system. Solutions for ODE and diffusion systems yield accurate and reliable results (R2>0.95), while solutions for Burger systems need further refinement in the ONet algorithm.  ( 2 min )
    Joint Representation Learning for Text and 3D Point Cloud. (arXiv:2301.07584v1 [cs.CV])
    Recent advancements in vision-language pre-training (e.g. CLIP) have shown that vision models can benefit from language supervision. While many models using language modality have achieved great success on 2D vision tasks, the joint representation learning of 3D point cloud with text remains under-explored due to the difficulty of 3D-Text data pair acquisition and the irregularity of 3D data structure. In this paper, we propose a novel Text4Point framework to construct language-guided 3D point cloud models. The key idea is utilizing 2D images as a bridge to connect the point cloud and the language modalities. The proposed Text4Point follows the pre-training and fine-tuning paradigm. During the pre-training stage, we establish the correspondence of images and point clouds based on the readily available RGB-D data and use contrastive learning to align the image and point cloud representations. Together with the well-aligned image and text features achieved by CLIP, the point cloud features are implicitly aligned with the text embeddings. Further, we propose a Text Querying Module to integrate language information into 3D representation learning by querying text embeddings with point cloud features. For fine-tuning, the model learns task-specific 3D representations under informative language guidance from the label set without 2D images. Extensive experiments demonstrate that our model shows consistent improvement on various downstream tasks, such as point cloud semantic segmentation, instance segmentation, and object detection. The code will be available here: https://github.com/LeapLabTHU/Text4Point  ( 2 min )
    A Bayesian Framework for Digital Twin-Based Control, Monitoring, and Data Collection in Wireless Systems. (arXiv:2212.01351v2 [eess.SP] UPDATED)
    Commonly adopted in the manufacturing and aerospace sectors, digital twin (DT) platforms are increasingly seen as a promising paradigm to control, monitor, and analyze software-based, "open", communication systems. Notably, DT platforms provide a sandbox in which to test artificial intelligence (AI) solutions for communication systems, potentially reducing the need to collect data and test algorithms in the field, i.e., on the physical twin (PT). A key challenge in the deployment of DT systems is to ensure that virtual control optimization, monitoring, and analysis at the DT are safe and reliable, avoiding incorrect decisions caused by "model exploitation". To address this challenge, this paper presents a general Bayesian framework with the aim of quantifying and accounting for model uncertainty at the DT that is caused by limitations in the amount and quality of data available at the DT from the PT. In the proposed framework, the DT builds a Bayesian model of the communication system, which is leveraged to enable core DT functionalities such as control via multi-agent reinforcement learning (MARL), monitoring of the PT for anomaly detection, prediction, data-collection optimization, and counterfactual analysis. To exemplify the application of the proposed framework, we specifically investigate a case-study system encompassing multiple sensing devices that report to a common receiver. Experimental results validate the effectiveness of the proposed Bayesian framework as compared to standard frequentist model-based solutions.  ( 2 min )
    Learning Task-Oriented Communication for Edge Inference: An Information Bottleneck Approach. (arXiv:2102.04170v3 [eess.SP] UPDATED)
    This paper investigates task-oriented communication for edge inference, where a low-end edge device transmits the extracted feature vector of a local data sample to a powerful edge server for processing. It is critical to encode the data into an informative and compact representation for low-latency inference given the limited bandwidth. We propose a learning-based communication scheme that jointly optimizes feature extraction, source coding, and channel coding in a task-oriented manner, i.e., targeting the downstream inference task rather than data reconstruction. Specifically, we leverage an information bottleneck (IB) framework to formalize a rate-distortion tradeoff between the informativeness of the encoded feature and the inference performance. As the IB optimization is computationally prohibitive for the high-dimensional data, we adopt a variational approximation, namely the variational information bottleneck (VIB), to build a tractable upper bound. To reduce the communication overhead, we leverage a sparsity-inducing distribution as the variational prior for the VIB framework to sparsify the encoded feature vector. Furthermore, considering dynamic channel conditions in practical communication systems, we propose a variable-length feature encoding scheme based on dynamic neural networks to adaptively adjust the activated dimensions of the encoded feature to different channel conditions. Extensive experiments evidence that the proposed task-oriented communication system achieves a better rate-distortion tradeoff than baseline methods and significantly reduces the feature transmission latency in dynamic channel conditions.  ( 2 min )
    Auxiliary Cross-Modal Representation Learning with Triplet Loss Functions for Online Handwriting Recognition. (arXiv:2202.07901v2 [cs.LG] UPDATED)
    Cross-modal representation learning learns a shared embedding between two or more modalities to improve performance in a given task compared to using only one of the modalities. Cross-modal representation learning from different data types -- such as images and time-series data (e.g., audio or text data) -- requires a deep metric learning loss that minimizes the distance between the modality embeddings. In this paper, we propose to use the triplet loss, which uses positive and negative identities to create sample pairs with different labels, for cross-modal representation learning between image and time-series modalities (CMR-IS). By adapting the triplet loss for cross-modal representation learning, higher accuracy in the main (time-series classification) task can be achieved by exploiting additional information of the auxiliary (image classification) task. Our experiments on synthetic data and handwriting recognition data from sensor-enhanced pens show improved classification accuracy, faster convergence, and better generalizability.  ( 2 min )
    Theseus: A Library for Differentiable Nonlinear Optimization. (arXiv:2207.09442v3 [cs.RO] UPDATED)
    We present Theseus, an efficient application-agnostic open source library for differentiable nonlinear least squares (DNLS) optimization built on PyTorch, providing a common framework for end-to-end structured learning in robotics and vision. Existing DNLS implementations are application specific and do not always incorporate many ingredients important for efficiency. Theseus is application-agnostic, as we illustrate with several example applications that are built using the same underlying differentiable components, such as second-order optimizers, standard costs functions, and Lie groups. For efficiency, Theseus incorporates support for sparse solvers, automatic vectorization, batching, GPU acceleration, and gradient computation with implicit differentiation and direct loss minimization. We do extensive performance evaluation in a set of applications, demonstrating significant efficiency gains and better scalability when these features are incorporated. Project page: https://sites.google.com/view/theseus-ai  ( 2 min )
    Non-parametric identifiability and sensitivity analysis of synthetic control models. (arXiv:2301.07656v1 [stat.ME])
    Quantifying cause and effect relationships is an important problem in many domains. The gold standard solution is to conduct a randomised controlled trial. However, in many situations such trials cannot be performed. In the absence of such trials, many methods have been devised to quantify the causal impact of an intervention from observational data given certain assumptions. One widely used method are synthetic control models. While identifiability of the causal estimand in such models has been obtained from a range of assumptions, it is widely and implicitly assumed that the underlying assumptions are satisfied for all time periods both pre- and post-intervention. This is a strong assumption, as synthetic control models can only be learned in pre-intervention period. In this paper we address this challenge, and prove identifiability can be obtained without the need for this assumption, by showing it follows from the principle of invariant causal mechanisms. Moreover, for the first time, we formulate and study synthetic control models in Pearl's structural causal model framework. Importantly, we provide a general framework for sensitivity analysis of synthetic control causal inference to violations of the assumptions underlying non-parametric identifiability. We end by providing an empirical demonstration of our sensitivity analysis framework on simulated and real data in the widely-used linear synthetic control framework.  ( 2 min )
    Comprehensive Literature Survey on Deep Learning used in Image Memorability Prediction and Modification. (arXiv:2301.06080v2 [cs.CV] UPDATED)
    As humans, we can remember certain visuals in great detail, and sometimes even after viewing them once. What is even more interesting is that humans tend to remember and forget the same things, suggesting that there might be some general internal characteristics of an image to encode and discard similar types of information. Research suggests that some pictures tend to be memorized more than others. The ability of an image to be remembered by different viewers is one of its intrinsic properties. In visualization and photography, creating memorable images is a difficult task. Hence, to solve the problem, various techniques predict visual memorability and manipulate images' memorability. We present a comprehensive literature survey to assess the deep learning techniques used to predict and modify memorability. In particular, we analyze the use of Convolutional Neural Networks, Recurrent Neural Networks, and Generative Adversarial Networks for image memorability prediction and modification.  ( 2 min )
    Factors other than climate change are currently more important in predicting how well fruit farms are doing financially. (arXiv:2301.07685v1 [cs.LG])
    Machine learning and statistical modeling methods were used to analyze the impact of climate change on financial wellbeing of fruit farmers in Tunisia and Chile. The analysis was based on face to face interviews with 801 farmers. Three research questions were investigated. First, whether climate change impacts had an effect on how well the farm was doing financially. Second, if climate change was not influential, what factors were important for predicting financial wellbeing of the farm. And third, ascertain whether observed effects on the financial wellbeing of the farm were a result of interactions between predictor variables. This is the first report directly comparing climate change with other factors potentially impacting financial wellbeing of farms. Certain climate change factors, namely increases in temperature and reductions in precipitation, can regionally impact self-perceived financial wellbeing of fruit farmers. Specifically, increases in temperature and reduction in precipitation can have a measurable negative impact on the financial wellbeing of farms in Chile. This effect is less pronounced in Tunisia. Climate impact differences were observed within Chile but not in Tunisia. However, climate change is only of minor importance for predicting farm financial wellbeing, especially for farms already doing financially well. Factors that are more important, mainly in Tunisia, included trust in information sources and prior farm ownership. Other important factors include farm size, water management systems used and diversity of fruit crops grown. Moreover, some of the important factors identified differed between farms doing and not doing well financially. Interactions between factors may improve or worsen farm financial wellbeing.  ( 2 min )
    An Overview of Human Activity Recognition Using Wearable Sensors: Healthcare and Artificial Intelligence. (arXiv:2103.15990v7 [cs.HC] UPDATED)
    With the rapid development of the internet of things (IoT) and artificial intelligence (AI) technologies, human activity recognition (HAR) has been applied in a variety of domains such as security and surveillance, human-robot interaction, and entertainment. Even though a number of surveys and review papers have been published, there is a lack of HAR overview papers focusing on healthcare applications that use wearable sensors. Therefore, we fill in the gap by presenting this overview paper. In particular, we present our projects to illustrate the system design of HAR applications for healthcare. Our projects include early mobility identification of human activities for intensive care unit (ICU) patients and gait analysis of Duchenne muscular dystrophy (DMD) patients. We cover essential components of designing HAR systems including sensor factors (e.g., type, number, and placement location), AI model selection (e.g., classical machine learning models versus deep learning models), and feature engineering. In addition, we highlight the challenges of such healthcare-oriented HAR systems and propose several research opportunities for both the medical and the computer science community.  ( 2 min )
    Improving Federated Learning Personalization via Model Agnostic Meta Learning. (arXiv:1909.12488v2 [cs.LG] UPDATED)
    Federated Learning (FL) refers to learning a high quality global model based on decentralized data storage, without ever copying the raw data. A natural scenario arises with data created on mobile phones by the activity of their users. Given the typical data heterogeneity in such situations, it is natural to ask how can the global model be personalized for every such device, individually. In this work, we point out that the setting of Model Agnostic Meta Learning (MAML), where one optimizes for a fast, gradient-based, few-shot adaptation to a heterogeneous distribution of tasks, has a number of similarities with the objective of personalization for FL. We present FL as a natural source of practical applications for MAML algorithms, and make the following observations. 1) The popular FL algorithm, Federated Averaging, can be interpreted as a meta learning algorithm. 2) Careful fine-tuning can yield a global model with higher accuracy, which is at the same time easier to personalize. However, solely optimizing for the global model accuracy yields a weaker personalization result. 3) A model trained using a standard datacenter optimization method is much harder to personalize, compared to one trained using Federated Averaging, supporting the first claim. These results raise new questions for FL, MAML, and broader ML research.  ( 2 min )
    Creating awareness about security and safety on highways to mitigate wildlife-vehicle collisions by detecting and recognizing wildlife fences using deep learning and drone technology. (arXiv:2301.07174v1 [cs.CV])
    In South Africa, it is a common practice for people to leave their vehicles beside the road when traveling long distances for a short comfort break. This practice might increase human encounters with wildlife, threatening their security and safety. Here we intend to create awareness about wildlife fencing, using drone technology and computer vision algorithms to recognize and detect wildlife fences and associated features. We collected data at Amakhala and Lalibela private game reserves in the Eastern Cape, South Africa. We used wildlife electric fence data containing single and double fences for the classification task. Additionally, we used aerial and still annotated images extracted from the drone and still cameras for the segmentation and detection tasks. The model training results from the drone camera outperformed those from the still camera. Generally, poor model performance is attributed to (1) over-decompression of images and (2) the ability of drone cameras to capture more details on images for the machine learning model to learn as compared to still cameras that capture only the front view of the wildlife fence. We argue that our model can be deployed on client-edge devices to inform people about the presence and significance of wildlife fencing, which minimizes human encounters with wildlife, thereby mitigating wildlife-vehicle collisions.  ( 2 min )
    Enhancing Self-Training Methods. (arXiv:2301.07294v1 [cs.LG])
    Semi-supervised learning approaches train on small sets of labeled data along with large sets of unlabeled data. Self-training is a semi-supervised teacher-student approach that often suffers from the problem of "confirmation bias" that occurs when the student model repeatedly overfits to incorrect pseudo-labels given by the teacher model for the unlabeled data. This bias impedes improvements in pseudo-label accuracy across self-training iterations, leading to unwanted saturation in model performance after just a few iterations. In this work, we describe multiple enhancements to improve the self-training pipeline to mitigate the effect of confirmation bias. We evaluate our enhancements over multiple datasets showing performance gains over existing self-training design choices. Finally, we also study the extendability of our enhanced approach to Open Set unlabeled data (containing classes not seen in labeled data).  ( 2 min )
    Safety Verification of Neural Network Control Systems Using Guaranteed Neural Network Model Reduction. (arXiv:2301.07531v1 [cs.LG])
    This paper aims to enhance the computational efficiency of safety verification of neural network control systems by developing a guaranteed neural network model reduction method. First, a concept of model reduction precision is proposed to describe the guaranteed distance between the outputs of a neural network and its reduced-size version. A reachability-based algorithm is proposed to accurately compute the model reduction precision. Then, by substituting a reduced-size neural network controller into the closed-loop system, an algorithm to compute the reachable set of the original system is developed, which is able to support much more computationally efficient safety verification processes. Finally, the developed methods are applied to a case study of the Adaptive Cruise Control system with a neural network controller, which is shown to significantly reduce the computational time of safety verification and thus validate the effectiveness of the method.  ( 2 min )
    Improved Differential Privacy for SGD via Optimal Private Linear Operators on Adaptive Streams. (arXiv:2202.08312v3 [cs.LG] UPDATED)
    Motivated by recent applications requiring differential privacy over adaptive streams, we investigate the question of optimal instantiations of the matrix mechanism in this setting. We prove fundamental theoretical results on the applicability of matrix factorizations to adaptive streams, and provide a parameter-free fixed-point algorithm for computing optimal factorizations. We instantiate this framework with respect to concrete matrices which arise naturally in machine learning, and train user-level differentially private models with the resulting optimal mechanisms, yielding significant improvements in a notable problem in federated learning with user-level differential privacy.  ( 2 min )
    Learning image representations for anomaly detection: application to discovery of histological alterations in drug development. (arXiv:2210.07675v3 [cs.CV] UPDATED)
    We present a system for anomaly detection in histopathological images. In histology, normal samples are usually abundant, whereas anomalous (pathological) cases are scarce or not available. Under such settings, one-class classifiers trained on healthy data can detect out-of-distribution anomalous samples. Such approaches combined with pre-trained Convolutional Neural Network (CNN) representations of images were previously employed for anomaly detection (AD). However, pre-trained off-the-shelf CNN representations may not be sensitive to abnormal conditions in tissues, while natural variations of healthy tissue may result in distant representations. To adapt representations to relevant details in healthy tissue we propose training a CNN on an auxiliary task that discriminates healthy tissue of different species, organs, and staining reagents. Almost no additional labeling workload is required, since healthy samples come automatically with aforementioned labels. During training we enforce compact image representations with a center-loss term, which further improves representations for AD. The proposed system outperforms established AD methods on a published dataset of liver anomalies. Moreover, it provided comparable results to conventional methods specifically tailored for quantification of liver anomalies. We show that our approach can be used for toxicity assessment of candidate drugs at early development stages and thereby may reduce expensive late-stage drug attrition.  ( 2 min )
    Universal Neural-Cracking-Machines: Self-Configurable Password Models from Auxiliary Data. (arXiv:2301.07628v1 [cs.CR])
    We develop the first universal password model -- a password model that, once pre-trained, can automatically adapt to any password distribution. To achieve this result, the model does not need to access any plaintext passwords from the target set. Instead, it exploits users' auxiliary information, such as email addresses, as a proxy signal to predict the underlying target password distribution. The model uses deep learning to capture the correlation between the auxiliary data of a group of users (e.g., users of a web application) and their passwords. It then exploits those patterns to create a tailored password model for the target community at inference time. No further training steps, targeted data collection, or prior knowledge of the community's password distribution is required. Besides defining a new state-of-the-art for password strength estimation, our model enables any end-user (e.g., system administrators) to autonomously generate tailored password models for their systems without the often unworkable requirement of collecting suitable training data and fitting the underlying password model. Ultimately, our framework enables the democratization of well-calibrated password models to the community, addressing a major challenge in the deployment of password security solutions on a large scale.  ( 2 min )
    Feature Alignment as a Generative Process. (arXiv:2106.12562v2 [cs.LG] UPDATED)
    Reversibility in artificial neural networks allows us to retrieve the input given an output. We present feature alignment, a method for approximating reversibility in arbitrary neural networks. We train a network by minimizing the distance between the output of a data point and the random output with respect to a random input. We applied the technique to the MNIST, CIFAR-10, CelebA and STL-10 image datasets. We demonstrate that this method can roughly recover images from just their latent representation without the need of a decoder. By utilizing the formulation of variational autoencoders, we demonstrate that it is possible to produce new images that are statistically comparable to the training data. Furthermore, we demonstrate that the quality of the images can be improved by coupling a generator and a discriminator together. In addition, we show how this method, with a few minor modifications, can be used to train networks locally, which has the potential to save computational memory resources.  ( 2 min )
    Cross-Domain Evaluation of a Deep Learning-Based Type Inference System. (arXiv:2208.09189v2 [cs.SE] UPDATED)
    Optional type annotations allow for enriching dynamic programming languages with static typing features like better Integrated Development Environment (IDE) support, more precise program analysis, and early detection and prevention of type-related runtime errors. Machine learning-based type inference promises interesting results for automating this task. However, the practical usage of such systems depends on their ability to generalize across different domains, as they are often applied outside their training domain. In this work, we investigate Type4Py as a representative of state-of-the-art deep learning-based type inference systems, by conducting extensive cross-domain experiments. Thereby, we address the following problems: class imbalances, out-of-vocabulary words, dataset shifts, and unknown classes. To perform such experiments, we use the datasets ManyTypes4Py and CrossDomainTypes4Py. The latter we introduce in this paper. Our dataset enables the evaluation of type inference systems in different domains of software projects and has over 1,000,000 type annotations mined on the platforms GitHub and Libraries. It consists of data from the two domains web development and scientific calculation. Through our experiments, we detect that the shifts in the dataset and the long-tailed distribution with many rare and unknown data types decrease the performance of the deep learning-based type inference system drastically. In this context, we test unsupervised domain adaptation methods and fine-tuning to overcome these issues. Moreover, we investigate the impact of out-of-vocabulary words.  ( 2 min )
    InstructPix2Pix: Learning to Follow Image Editing Instructions. (arXiv:2211.09800v2 [cs.CV] UPDATED)
    We propose a method for editing images from human instructions: given an input image and a written instruction that tells the model what to do, our model follows these instructions to edit the image. To obtain training data for this problem, we combine the knowledge of two large pretrained models -- a language model (GPT-3) and a text-to-image model (Stable Diffusion) -- to generate a large dataset of image editing examples. Our conditional diffusion model, InstructPix2Pix, is trained on our generated data, and generalizes to real images and user-written instructions at inference time. Since it performs edits in the forward pass and does not require per example fine-tuning or inversion, our model edits images quickly, in a matter of seconds. We show compelling editing results for a diverse collection of input images and written instructions.  ( 2 min )
    Sample Complexity of Adversarially Robust Linear Classification on Separated Data. (arXiv:2012.10794v3 [cs.LG] UPDATED)
    We consider the sample complexity of learning with adversarial robustness. Most prior theoretical results for this problem have considered a setting where different classes in the data are close together or overlapping. Motivated by some real applications, we consider, in contrast, the well-separated case where there exists a classifier with perfect accuracy and robustness, and show that the sample complexity narrates an entirely different story. Specifically, for linear classifiers, we show a large class of well-separated distributions where the expected robust loss of any algorithm is at least $\Omega(\frac{d}{n})$, whereas the max margin algorithm has expected standard loss $O(\frac{1}{n})$. This shows a gap in the standard and robust losses that cannot be obtained via prior techniques. Additionally, we present an algorithm that, given an instance where the robustness radius is much smaller than the gap between the classes, gives a solution with expected robust loss is $O(\frac{1}{n})$. This shows that for very well-separated data, convergence rates of $O(\frac{1}{n})$ are achievable, which is not the case otherwise. Our results apply to robustness measured in any $\ell_p$ norm with $p > 1$ (including $p = \infty$).  ( 2 min )
    An investigation of the reconstruction capacity of stacked convolutional autoencoders for log-mel-spectrograms. (arXiv:2301.07665v1 [cs.SD])
    In audio processing applications, the generation of expressive sounds based on high-level representations demonstrates a high demand. These representations can be used to manipulate the timbre and influence the synthesis of creative instrumental notes. Modern algorithms, such as neural networks, have inspired the development of expressive synthesizers based on musical instrument timbre compression. Unsupervised deep learning methods can achieve audio compression by training the network to learn a mapping from waveforms or spectrograms to low-dimensional representations. This study investigates the use of stacked convolutional autoencoders for the compression of time-frequency audio representations for a variety of instruments for a single pitch. Further exploration of hyper-parameters and regularization techniques is demonstrated to enhance the performance of the initial design. In an unsupervised manner, the network is able to reconstruct a monophonic and harmonic sound based on latent representations. In addition, we introduce an evaluation metric to measure the similarity between the original and reconstructed samples. Evaluating a deep generative model for the synthesis of sound is a challenging task. Our approach is based on the accuracy of the generated frequencies as it presents a significant metric for the perception of harmonic sounds. This work is expected to accelerate future experiments on audio compression using neural autoencoders.  ( 2 min )
    PENDANTSS: PEnalized Norm-ratios Disentangling Additive Noise, Trend and Sparse Spikes. (arXiv:2301.01514v1 [eess.SP] CROSS LISTED)
    Denoising, detrending, deconvolution: usual restoration tasks, traditionally decoupled. Coupled formulations entail complex ill-posed inverse problems. We propose PENDANTSS for joint trend removal and blind deconvolution of sparse peak-like signals. It blends a parsimonious prior with the hypothesis that smooth trend and noise can somewhat be separated by low-pass filtering. We combine the generalized quasi-norm ratio SOOT/SPOQ sparse penalties $\ell_p/\ell_q$ with the BEADS ternary assisted source separation algorithm. This results in a both convergent and efficient tool, with a novel Trust-Region block alternating variable metric forward-backward approach. It outperforms comparable methods, when applied to typically peaked analytical chemistry signals. Reproducible code is provided.  ( 2 min )
    Concrete Score Matching: Generalized Score Matching for Discrete Data. (arXiv:2211.00802v2 [cs.LG] UPDATED)
    Representing probability distributions by the gradient of their density functions has proven effective in modeling a wide range of continuous data modalities. However, this representation is not applicable in discrete domains where the gradient is undefined. To this end, we propose an analogous score function called the "Concrete score", a generalization of the (Stein) score for discrete settings. Given a predefined neighborhood structure, the Concrete score of any input is defined by the rate of change of the probabilities with respect to local directional changes of the input. This formulation allows us to recover the (Stein) score in continuous domains when measuring such changes by the Euclidean distance, while using the Manhattan distance leads to our novel score function in discrete domains. Finally, we introduce a new framework to learn such scores from samples called Concrete Score Matching (CSM), and propose an efficient training objective to scale our approach to high dimensions. Empirically, we demonstrate the efficacy of CSM on density estimation tasks on a mixture of synthetic, tabular, and high-dimensional image datasets, and demonstrate that it performs favorably relative to existing baselines for modeling discrete data.  ( 2 min )
    Multimodal learning with graphs. (arXiv:2209.03299v5 [cs.LG] UPDATED)
    Artificial intelligence for graphs has achieved remarkable success in modeling complex systems, ranging from dynamic networks in biology to interacting particle systems in physics. However, the increasingly heterogeneous graph datasets call for multimodal methods that can combine different inductive biases: the set of assumptions that algorithms use to make predictions for inputs they have not encountered during training. Learning on multimodal datasets presents fundamental challenges because the inductive biases can vary by data modality and graphs might not be explicitly given in the input. To address these challenges, multimodal graph AI methods combine different modalities while leveraging cross-modal dependencies using graphs. Diverse datasets are combined using graphs and fed into sophisticated multimodal architectures, specified as image-intensive, knowledge-grounded and language-intensive models. Using this categorization, we introduce a blueprint for multimodal graph learning, use it to study existing methods and provide guidelines to design new models.  ( 2 min )
    Optimization-based Block Coordinate Gradient Coding for Mitigating Partial Stragglers in Distributed Learning. (arXiv:2206.02450v2 [cs.IT] UPDATED)
    Gradient coding schemes effectively mitigate full stragglers in distributed learning by introducing identical redundancy in coded local partial derivatives corresponding to all model parameters. However, they are no longer effective for partial stragglers as they cannot utilize incomplete computation results from partial stragglers. This paper aims to design a new gradient coding scheme for mitigating partial stragglers in distributed learning. Specifically, we consider a distributed system consisting of one master and N workers, characterized by a general partial straggler model and focuses on solving a general large-scale machine learning problem with L model parameters using gradient coding. First, we propose a coordinate gradient coding scheme with L coding parameters representing L possibly different diversities for the L coordinates, which generates most gradient coding schemes. Then, we consider the minimization of the expected overall runtime and the maximization of the completion probability with respect to the L coding parameters for coordinates, which are challenging discrete optimization problems. To reduce computational complexity, we first transform each to an equivalent but much simpler discrete problem with N\llL variables representing the partition of the L coordinates into N blocks, each with identical redundancy. This indicates an equivalent but more easily implemented block coordinate gradient coding scheme with N coding parameters for blocks. Then, we adopt continuous relaxation to further reduce computational complexity. For the resulting minimization of expected overall runtime, we develop an iterative algorithm of computational complexity O(N^2) to obtain an optimal solution and derive two closed-form approximate solutions both with computational complexity O(N). For the resultant maximization of the completion probability, we develop an iterative algorithm of...  ( 3 min )
    Hybrid quantum-classical convolutional neural networks to improve molecular protein binding affinity predictions. (arXiv:2301.06331v2 [quant-ph] UPDATED)
    One of the main challenges in drug discovery is to find molecules that bind specifically and strongly to their target protein while having minimal binding to other proteins. By predicting binding affinity, it is possible to identify the most promising candidates from a large pool of potential compounds, reducing the number of compounds that need to be tested experimentally. Recently, deep learning methods have shown superior performance than traditional computational methods for making accurate predictions on large datasets. However, the complexity and time-consuming nature of these methods have limited their usage and development. Quantum machine learning is an emerging technology that has the potential to improve many classical machine learning algorithms. In this work we present a hybrid quantum-classical convolutional neural network, which is able to reduce by 20% the complexity of the classical network while maintaining optimal performance in the predictions. Additionally, it results in a significant time savings of up to 40% in the training process, which means a meaningful speed up of the drug discovery process.  ( 2 min )
    Consistent Non-Parametric Methods for Maximizing Robustness. (arXiv:2102.09086v3 [cs.LG] UPDATED)
    Learning classifiers that are robust to adversarial examples has received a great deal of recent attention. A major drawback of the standard robust learning framework is there is an artificial robustness radius $r$ that applies to all inputs. This ignores the fact that data may be highly heterogeneous, in which case it is plausible that robustness regions should be larger in some regions of data, and smaller in others. In this paper, we address this limitation by proposing a new limit classifier, called the neighborhood optimal classifier, that extends the Bayes optimal classifier outside its support by using the label of the closest in-support point. We then argue that this classifier maximizes the size of its robustness regions subject to the constraint of having accuracy equal to the Bayes optimal. We then present sufficient conditions under which general non-parametric methods that can be represented as weight functions converge towards this limit, and show that both nearest neighbors and kernel classifiers satisfy them under certain conditions.  ( 2 min )
    ReFresh: Reducing Memory Access from Exploiting Stable Historical Embeddings for Graph Neural Network Training. (arXiv:2301.07482v1 [cs.LG])
    A key performance bottleneck when training graph neural network (GNN) models on large, real-world graphs is loading node features onto a GPU. Due to limited GPU memory, expensive data movement is necessary to facilitate the storage of these features on alternative devices with slower access (e.g. CPU memory). Moreover, the irregularity of graph structures contributes to poor data locality which further exacerbates the problem. Consequently, existing frameworks capable of efficiently training large GNN models usually incur a significant accuracy degradation because of the inevitable shortcuts involved. To address these limitations, we instead propose ReFresh, a general-purpose GNN mini-batch training framework that leverages a historical cache for storing and reusing GNN node embeddings instead of re-computing them through fetching raw features at every iteration. Critical to its success, the corresponding cache policy is designed, using a combination of gradient-based and staleness criteria, to selectively screen those embeddings which are relatively stable and can be cached, from those that need to be re-computed to reduce estimation errors and subsequent downstream accuracy loss. When paired with complementary system enhancements to support this selective historical cache, ReFresh is able to accelerate the training speed on large graph datasets such as ogbn-papers100M and MAG240M by 4.6x up to 23.6x and reduce the memory access by 64.5% (85.7% higher than a raw feature cache), with less than 1% influence on test accuracy.  ( 2 min )
    Strong inductive biases provably prevent harmless interpolation. (arXiv:2301.07605v1 [stat.ML])
    Classical wisdom suggests that estimators should avoid fitting noise to achieve good generalization. In contrast, modern overparameterized models can yield small test error despite interpolating noise -- a phenomenon often called "benign overfitting" or "harmless interpolation". This paper argues that the degree to which interpolation is harmless hinges upon the strength of an estimator's inductive bias, i.e., how heavily the estimator favors solutions with a certain structure: while strong inductive biases prevent harmless interpolation, weak inductive biases can even require fitting noise to generalize well. Our main theoretical result establishes tight non-asymptotic bounds for high-dimensional kernel regression that reflect this phenomenon for convolutional kernels, where the filter size regulates the strength of the inductive bias. We further provide empirical evidence of the same behavior for deep neural networks with varying filter sizes and rotational invariance.  ( 2 min )
    What relations are reliably embeddable in Euclidean space?. (arXiv:1903.05347v3 [cs.LG] UPDATED)
    We consider the problem of embedding a relation, represented as a directed graph, into Euclidean space. For three types of embeddings motivated by the recent literature on knowledge graphs, we obtain characterizations of which relations they are able to capture, as well as bounds on the minimal dimensionality and precision needed.  ( 2 min )
    Neural Network Quantization for Efficient Inference: A Survey. (arXiv:2112.06126v2 [cs.LG] UPDATED)
    As neural networks have become more powerful, there has been a rising desire to deploy them in the real world; however, the power and accuracy of neural networks is largely due to their depth and complexity, making them difficult to deploy, especially in resource-constrained devices. Neural network quantization has recently arisen to meet this demand of reducing the size and complexity of neural networks by reducing the precision of a network. With smaller and simpler networks, it becomes possible to run neural networks within the constraints of their target hardware. This paper surveys the many neural network quantization techniques that have been developed in the last decade. Based on this survey and comparison of neural network quantization techniques, we propose future directions of research in the area.  ( 2 min )
    Prediction of Red Wine Quality Using One-dimensional Convolutional Neural Networks. (arXiv:2208.14008v2 [cs.LG] UPDATED)
    As an alcoholic beverage, wine has remained prevalent for thousands of years, and the quality assessment of wines has been significant in wine production and trade. Scholars have proposed various deep learning and machine learning algorithms for wine quality prediction, such as Support vector machine (SVM), Random Forest (RF), K-nearest neighbors (KNN), Deep neural network (DNN), and Logistic regression (LR). However, these methods ignore the inner relationship between the physical and chemical properties of the wine, for example, the correlations between pH values, fixed acidity, citric acid, and so on. To fill the gap, this paper conducts the Pearson correlation analysis, PCA analysis, and Shapiro-Wilk test on those properties and incorporates 1D-CNN architecture to capture the correlations among neighboring features. In addition, it implemented dropout and batch normalization techniques to improve the robustness of the proposed model. Massive experiments have shown that our method can outperform baseline approaches in wine quality prediction. Moreover, ablation experiments also demonstrate the effectiveness of incorporating the 1-D CNN module, Dropout, and normalization techniques.  ( 2 min )
    Performance-Preserving Event Log Sampling for Predictive Monitoring. (arXiv:2301.07624v1 [cs.LG])
    Predictive process monitoring is a subfield of process mining that aims to estimate case or event features for running process instances. Such predictions are of significant interest to the process stakeholders. However, most of the state-of-the-art methods for predictive monitoring require the training of complex machine learning models, which is often inefficient. Moreover, most of these methods require a hyper-parameter optimization that requires several repetitions of the training process which is not feasible in many real-life applications. In this paper, we propose an instance selection procedure that allows sampling training process instances for prediction models. We show that our instance selection procedure allows for a significant increase of training speed for next activity and remaining time prediction methods while maintaining reliable levels of prediction accuracy.  ( 2 min )
    Concentration of polynomial random matrices via Efron-Stein inequalities. (arXiv:2209.02655v2 [cs.CC] UPDATED)
    Analyzing concentration of large random matrices is a common task in a wide variety of fields. Given independent random variables, many tools are available to analyze random matrices whose entries are linear in the variables, e.g. the matrix-Bernstein inequality. However, in many applications, we need to analyze random matrices whose entries are polynomials in the variables. These arise naturally in the analysis of spectral algorithms, e.g., Hopkins et al. [STOC 2016], Moitra-Wein [STOC 2019]; and in lower bounds for semidefinite programs based on the Sum of Squares hierarchy, e.g. Barak et al. [FOCS 2016], Jones et al. [FOCS 2021]. In this work, we present a general framework to obtain such bounds, based on the matrix Efron-Stein inequalities developed by Paulin-Mackey-Tropp [Annals of Probability 2016]. The Efron-Stein inequality bounds the norm of a random matrix by the norm of another simpler (but still random) matrix, which we view as arising by "differentiating" the starting matrix. By recursively differentiating, our framework reduces the main task to analyzing far simpler matrices. For Rademacher variables, these simpler matrices are in fact deterministic and hence, analyzing them is far easier. For general non-Rademacher variables, the task reduces to scalar concentration, which is much easier. Moreover, in the setting of polynomial matrices, our results generalize the work of Paulin-Mackey-Tropp. Using our basic framework, we recover known bounds in the literature for simple "tensor networks" and "dense graph matrices". Using our general framework, we derive bounds for "sparse graph matrices", which were obtained only recently by Jones et al. [FOCS 2021] using a nontrivial application of the trace power method, and was a core component in their work. We expect our framework to be helpful for other applications involving concentration phenomena for nonlinear random matrices.  ( 3 min )
    Multimodal Side-Tuning for Document Classification. (arXiv:2301.07502v1 [cs.LG])
    In this paper, we propose to exploit the side-tuning framework for multimodal document classification. Side-tuning is a methodology for network adaptation recently introduced to solve some of the problems related to previous approaches. Thanks to this technique it is actually possible to overcome model rigidity and catastrophic forgetting of transfer learning by fine-tuning. The proposed solution uses off-the-shelf deep learning architectures leveraging the side-tuning framework to combine a base model with a tandem of two side networks. We show that side-tuning can be successfully employed also when different data sources are considered, e.g. text and images in document classification. The experimental results show that this approach pushes further the limit for document classification accuracy with respect to the state of the art.  ( 2 min )
    Non-IID Quantum Federated Learning with One-shot Communication Complexity. (arXiv:2209.00768v2 [quant-ph] UPDATED)
    Federated learning refers to the task of machine learning based on decentralized data from multiple clients with secured data privacy. Recent studies show that quantum algorithms can be exploited to boost its performance. However, when the clients' data are not independent and identically distributed (IID), the performance of conventional federated algorithms is known to deteriorate. In this work, we explore the non-IID issue in quantum federated learning with both theoretical and numerical analysis. We further prove that a global quantum channel can be exactly decomposed into local channels trained by each client with the help of local density estimators. This observation leads to a general framework for quantum federated learning on non-IID data with one-shot communication complexity. Numerical simulations show that the proposed algorithm outperforms the conventional ones significantly under non-IID settings.  ( 2 min )
    Targeted Image Reconstruction by Sampling Pre-trained Diffusion Model. (arXiv:2301.07557v1 [cs.LG])
    A trained neural network model contains information on the training data. Given such a model, malicious parties can leverage the "knowledge" in this model and design ways to print out any usable information (known as model inversion attack). Therefore, it is valuable to explore the ways to conduct a such attack and demonstrate its severity. In this work, we proposed ways to generate a data point of the target class without prior knowledge of the exact target distribution by using a pre-trained diffusion model.  ( 2 min )
    A Novel, Scale-Invariant, Differentiable, Efficient, Scalable Regularizer. (arXiv:2301.07285v1 [cs.LG])
    $L_{p}$-norm regularization schemes such as $L_{0}$, $L_{1}$, and $L_{2}$-norm regularization and $L_{p}$-norm-based regularization techniques such as weight decay and group LASSO compute a quantity which de pends on model weights considered in isolation from one another. This paper describes a novel regularizer which is not based on an $L_{p}$-norm. In contrast with $L_{p}$-norm-based regularization, this regularizer is concerned with the spatial arrangement of weights within a weight matrix. This regularizer is an additive term for the loss function and is differentiable, simple and fast to compute, scale-invariant, requires a trivial amount of additional memory, and can easily be parallelized. Empirically this method yields approximately a one order-of-magnitude improvement in the number of nonzero model parameters at a given level of accuracy.  ( 2 min )
    Reliable amortized variational inference with physics-based latent distribution correction. (arXiv:2207.11640v3 [stat.ML] UPDATED)
    Bayesian inference for high-dimensional inverse problems is computationally costly and requires selecting a suitable prior distribution. Amortized variational inference addresses these challenges via a neural network that approximates the posterior distribution not only for one instance of data, but a distribution of data pertaining to a specific inverse problem. During inference, the neural network -- in our case a conditional normalizing flow -- provides posterior samples at virtually no cost. However, the accuracy of amortized variational inference relies on the availability of high-fidelity training data, which seldom exists in geophysical inverse problems due to the Earth's heterogeneity. In addition, the network is prone to errors if evaluated over out-of-distribution data. As such, we propose to increase the resilience of amortized variational inference in the presence of moderate data distribution shifts. We achieve this via a correction to the latent distribution that improves the posterior distribution approximation for the data at hand. The correction involves relaxing the standard Gaussian assumption on the latent distribution and parameterizing it via a Gaussian distribution with an unknown mean and (diagonal) covariance. These unknowns are then estimated by minimizing the Kullback-Leibler divergence between the corrected and the (physics-based) true posterior distributions. While generic and applicable to other inverse problems, by means of a linearized seismic imaging example, we show that our correction step improves the robustness of amortized variational inference with respect to changes in the number of seismic sources, noise variance, and shifts in the prior distribution. This approach provides a seismic image with limited artifacts and an assessment of its uncertainty at approximately the same cost as five reverse-time migrations.  ( 2 min )
    Failure Tolerant Training with Persistent Memory Disaggregation over CXL. (arXiv:2301.07492v1 [cs.AR])
    This paper proposes TRAININGCXL that can efficiently process large-scale recommendation datasets in the pool of disaggregated memory while making training fault tolerant with low overhead. To this end, i) we integrate persistent memory (PMEM) and GPU into a cache-coherent domain as Type-2. Enabling CXL allows PMEM to be directly placed in GPU's memory hierarchy, such that GPU can access PMEM without software intervention. TRAININGCXL introduces computing and checkpointing logic near the CXL controller, thereby training data and managing persistency in an active manner. Considering PMEM's vulnerability, ii) we utilize the unique characteristics of recommendation models and take the checkpointing overhead off the critical path of their training. Lastly, iii) TRAININGCXL employs an advanced checkpointing technique that relaxes the updating sequence of model parameters and embeddings across training batches. The evaluation shows that TRAININGCXL achieves 5.2x training performance improvement and 76% energy savings, compared to the modern PMEM-based recommendation systems.  ( 2 min )
    Compression of GPS Trajectories using Autoencoders. (arXiv:2301.07420v1 [cs.LG])
    The ubiquitous availability of mobile devices capable of location tracking led to a significant rise in the collection of GPS data. Several compression methods have been developed in order to reduce the amount of storage needed while keeping the important information. In this paper, we present an lstm-autoencoder based approach in order to compress and reconstruct GPS trajectories, which is evaluated on both a gaming and real-world dataset. We consider various compression ratios and trajectory lengths. The performance is compared to other trajectory compression algorithms, i.e., Douglas-Peucker. Overall, the results indicate that our approach outperforms Douglas-Peucker significantly in terms of the discrete Fr\'echet distance and dynamic time warping. Furthermore, by reconstructing every point lossy, the proposed methodology offers multiple advantages over traditional methods.  ( 2 min )
    No-substitution k-means Clustering with Adversarial Order. (arXiv:2012.14512v2 [cs.DS] UPDATED)
    We investigate $k$-means clustering in the online no-substitution setting when the input arrives in \emph{arbitrary} order. In this setting, points arrive one after another, and the algorithm is required to instantly decide whether to take the current point as a center before observing the next point. Decisions are irrevocable. The goal is to minimize both the number of centers and the $k$-means cost. Previous works in this setting assume that the input's order is random, or that the input's aspect ratio is bounded. It is known that if the order is arbitrary and there is no assumption on the input, then any algorithm must take all points as centers. Moreover, assuming a bounded aspect ratio is too restrictive -- it does not include natural input generated from mixture models. We introduce a new complexity measure that quantifies the difficulty of clustering a dataset arriving in arbitrary order. We design a new random algorithm and prove that if applied on data with complexity $d$, the algorithm takes $O(d\log(n) k\log(k))$ centers and is an $O(k^3)$-approximation. We also prove that if the data is sampled from a ``natural" distribution, such as a mixture of $k$ Gaussians, then the new complexity measure is equal to $O(k^2\log(n))$. This implies that for data generated from those distributions, our new algorithm takes only $\text{poly}(k\log(n))$ centers and is a $\text{poly}(k)$-approximation. In terms of negative results, we prove that the number of centers needed to achieve an $\alpha$-approximation is at least $\Omega\left(\frac{d}{k\log(n\alpha)}\right)$.  ( 2 min )
    Quantum-inspired tensor network for Earth science. (arXiv:2301.07528v1 [physics.geo-ph])
    Deep Learning (DL) is one of many successful methodologies to extract informative patterns and insights from ever increasing noisy large-scale datasets (in our case, satellite images). However, DL models consist of a few thousand to millions of training parameters, and these training parameters require tremendous amount of electrical power for extracting informative patterns from noisy large-scale datasets (e.g., computationally expensive). Hence, we employ a quantum-inspired tensor network for compressing trainable parameters of physics-informed neural networks (PINNs) in Earth science. PINNs are DL models penalized by enforcing the law of physics; in particular, the law of physics is embedded in DL models. In addition, we apply tensor decomposition to HyperSpectral Images (HSIs) to improve their spectral resolution. A quantum-inspired tensor network is also the native formulation to efficiently represent and train quantum machine learning models on big datasets on GPU tensor cores. Furthermore, the key contribution of this paper is twofold: (I) we reduced a number of trainable parameters of PINNs by using a quantum-inspired tensor network, and (II) we improved the spectral resolution of remotely-sensed images by employing tensor decomposition. As a benchmark PDE, we solved Burger's equation. As practical satellite data, we employed HSIs of Indian Pine, USA and of Pavia University, Italy.  ( 2 min )
    Landscape Complexity for the Empirical Risk of Generalized Linear Models. (arXiv:1912.02143v5 [stat.ML] UPDATED)
    We present a method to obtain the average and the typical value of the number of critical points of the empirical risk landscape for generalized linear estimation problems and variants. This represents a substantial extension of previous applications of the Kac-Rice method since it allows to analyze the critical points of high dimensional non-Gaussian random functions. Under a technical hypothesis, we obtain a rigorous explicit variational formula for the annealed complexity, which is the logarithm of the average number of critical points at fixed value of the empirical risk. This result is simplified, and extended, using the non-rigorous Kac-Rice replicated method from theoretical physics. In this way we find an explicit variational formula for the quenched complexity, which is generally different from its annealed counterpart, and allows to obtain the number of critical points for typical instances up to exponential accuracy.  ( 2 min )
    LIMEADE: From AI Explanations to Advice Taking. (arXiv:2003.04315v5 [cs.IR] UPDATED)
    Research in human-centered AI has shown the benefits of systems that can explain their predictions. Methods that allow an AI to take advice from humans in response to explanations are similarly useful. While both capabilities are well-developed for transparent learning models (e.g., linear models and GA$^2$Ms), and recent techniques (e.g., LIME and SHAP) can generate explanations for opaque models, little attention has been given to advice methods for opaque models. This paper introduces LIMEADE, the first general framework that translates both positive and negative advice (expressed using high-level vocabulary such as that employed by post-hoc explanations) into an update to an arbitrary, underlying opaque model. We demonstrate the generality of our approach with case studies on seventy real-world models across two broad domains: image classification and text recommendation. We show our method improves accuracy compared to a rigorous baseline on the image classification domains. For the text modality, we apply our framework to a neural recommender system for scientific papers on a public website; our user study shows that our framework leads to significantly higher perceived user control, trust, and satisfaction.  ( 2 min )
    A Robust Classification Framework for Byzantine-Resilient Stochastic Gradient Descent. (arXiv:2301.07498v1 [cs.LG])
    This paper proposes a Robust Gradient Classification Framework (RGCF) for Byzantine fault tolerance in distributed stochastic gradient descent. The framework consists of a pattern recognition filter which we train to be able to classify individual gradients as Byzantine by using their direction alone. This filter is robust to an arbitrary number of Byzantine workers for convex as well as non-convex optimisation settings, which is a significant improvement on the prior work that is robust to Byzantine faults only when up to 50% of the workers are Byzantine. This solution does not require an estimate of the number of Byzantine workers; its running time is not dependent on the number of workers and can scale up to training instances with a large number of workers without a loss in performance. We validate our solution by training convolutional neural networks on the MNIST dataset in the presence of Byzantine workers.  ( 2 min )
    Autonomous Slalom Maneuver Based on Expert Drivers' Behavior Using Convolutional Neural Network. (arXiv:2301.07424v1 [cs.RO])
    Lane changing and obstacle avoidance are one of the most important tasks in automated cars. To date, many algorithms have been suggested that are generally based on path trajectory or reinforcement learning approaches. Although these methods have been efficient, they are not able to accurately imitate a smooth path traveled by an expert driver. In this paper, a method is presented to mimic drivers' behavior using a convolutional neural network (CNN). First, seven features are extracted from a dataset gathered from four expert drivers in a driving simulator. Then, these features are converted from 1D arrays to 2D arrays and injected into a CNN. The CNN model computes the desired steering wheel angle and sends it to an adaptive PD controller. Finally, the control unit applies proper torque to the steering wheel. Results show that the CNN model can mimic the drivers' behavior with an R2-squared of 0.83. Also, the performance of the presented method was evaluated in the driving simulator for 17 trials, which avoided all traffic cones successfully. In some trials, the presented method performed a smoother maneuver compared to the expert drivers.  ( 2 min )
    A Survey of Advanced Computer Vision Techniques for Sports. (arXiv:2301.07583v1 [cs.CV])
    Computer Vision developments are enabling significant advances in many fields, including sports. Many applications built on top of Computer Vision technologies, such as tracking data, are nowadays essential for every top-level analyst, coach, and even player. In this paper, we survey Computer Vision techniques that can help many sports-related studies gather vast amounts of data, such as Object Detection and Pose Estimation. We provide a use case for such data: building a model for shot speed estimation with pose data obtained using only Computer Vision models. Our model achieves a correlation of 67%. The possibility of estimating shot speeds enables much deeper studies about enabling the creation of new metrics and recommendation systems that will help athletes improve their performance, in any sport. The proposed methodology is easily replicable for many technical movements and is only limited by the availability of video data.  ( 2 min )
    Physics-informed Information Field Theory for Modeling Physical Systems with Uncertainty Quantification. (arXiv:2301.07609v1 [stat.ML])
    Data-driven approaches coupled with physical knowledge are powerful techniques to model systems. The goal of such models is to efficiently solve for the underlying field by combining measurements with known physical laws. As many systems contain unknown elements, such as missing parameters, noisy data, or incomplete physical laws, this is widely approached as an uncertainty quantification problem. The common techniques to handle all the variables typically depend on the numerical scheme used to approximate the posterior, and it is desirable to have a method which is independent of any such discretization. Information field theory (IFT) provides the tools necessary to perform statistics over fields that are not necessarily Gaussian. We extend IFT to physics-informed IFT (PIFT) by encoding the functional priors with information about the physical laws which describe the field. The posteriors derived from this PIFT remain independent of any numerical scheme and can capture multiple modes, allowing for the solution of problems which are ill-posed. We demonstrate our approach through an analytical example involving the Klein-Gordon equation. We then develop a variant of stochastic gradient Langevin dynamics to draw samples from the joint posterior over the field and model parameters. We apply our method to numerical examples with various degrees of model-form error and to inverse problems involving nonlinear differential equations. As an addendum, the method is equipped with a metric which allows the posterior to automatically quantify model-form uncertainty. Because of this, our numerical experiments show that the method remains robust to even an incorrect representation of the physics given sufficient data. We numerically demonstrate that the method correctly identifies when the physics cannot be trusted, in which case it automatically treats learning the field as a regression problem.  ( 2 min )
    Adaptively Integrated Knowledge Distillation and Prediction Uncertainty for Continual Learning. (arXiv:2301.07316v1 [cs.CV])
    Current deep learning models often suffer from catastrophic forgetting of old knowledge when continually learning new knowledge. Existing strategies to alleviate this issue often fix the trade-off between keeping old knowledge (stability) and learning new knowledge (plasticity). However, the stability-plasticity trade-off during continual learning may need to be dynamically changed for better model performance. In this paper, we propose two novel ways to adaptively balance model stability and plasticity. The first one is to adaptively integrate multiple levels of old knowledge and transfer it to each block level in the new model. The second one uses prediction uncertainty of old knowledge to naturally tune the importance of learning new knowledge during model training. To our best knowledge, this is the first time to connect model prediction uncertainty and knowledge distillation for continual learning. In addition, this paper applies a modified CutMix particularly to augment the data for old knowledge, further alleviating the catastrophic forgetting issue. Extensive evaluations on the CIFAR100 and the ImageNet datasets confirmed the effectiveness of the proposed method for continual learning.  ( 2 min )
    AutoFraudNet: A Multimodal Network to Detect Fraud in the Auto Insurance Industry. (arXiv:2301.07526v1 [cs.LG])
    In the insurance industry detecting fraudulent claims is a critical task with a significant financial impact. A common strategy to identify fraudulent claims is looking for inconsistencies in the supporting evidence. However, this is a laborious and cognitively heavy task for human experts as insurance claims typically come with a plethora of data from different modalities (e.g. images, text and metadata). To overcome this challenge, the research community has focused on multimodal machine learning frameworks that can efficiently reason through multiple data sources. Despite recent advances in multimodal learning, these frameworks still suffer from (i) challenges of joint-training caused by the different characteristics of different modalities and (ii) overfitting tendencies due to high model complexity. In this work, we address these challenges by introducing a multimodal reasoning framework, AutoFraudNet (Automobile Insurance Fraud Detection Network), for detecting fraudulent auto-insurance claims. AutoFraudNet utilizes a cascaded slow fusion framework and state-of-the-art fusion block, BLOCK Tucker, to alleviate the challenges of joint-training. Furthermore, it incorporates a light-weight architectural design along with additional losses to prevent overfitting. Through extensive experiments conducted on a real-world dataset, we demonstrate: (i) the merits of multimodal approaches, when compared to unimodal and bimodal methods, and (ii) the effectiveness of AutoFraudNet in fusing various modalities to boost performance (over 3\% in PR AUC).  ( 2 min )
    Training Semantic Segmentation on Heterogeneous Datasets. (arXiv:2301.07634v1 [cs.CV])
    We explore semantic segmentation beyond the conventional, single-dataset homogeneous training and bring forward the problem of Heterogeneous Training of Semantic Segmentation (HTSS). HTSS involves simultaneous training on multiple heterogeneous datasets, i.e. datasets with conflicting label spaces and different (weak) annotation types from the perspective of semantic segmentation. The HTSS formulation exposes deep networks to a larger and previously unexplored aggregation of information that can potentially enhance semantic segmentation in three directions: i) performance: increased segmentation metrics on seen datasets, ii) generalization: improved segmentation metrics on unseen datasets, and iii) knowledgeability: increased number of recognizable semantic concepts. To research these benefits of HTSS, we propose a unified framework, that incorporates heterogeneous datasets in a single-network training pipeline following the established FCN standard. Our framework first curates heterogeneous datasets to bring them into a common format and then trains a single-backbone FCN on all of them simultaneously. To achieve this, it transforms weak annotations, which are incompatible with semantic segmentation, to per-pixel labels, and hierarchizes their label spaces into a universal taxonomy. The trained HTSS models demonstrate performance and generalization gains over a wide range of datasets and extend the inference label space entailing hundreds of semantic classes.  ( 2 min )
    Curvilinear object segmentation in medical images based on ODoS filter and deep learning network. (arXiv:2301.07475v1 [eess.IV])
    Automatic segmentation of curvilinear objects in medical images plays an important role in the diagnosis and evaluation of human diseases, yet it is a challenging uncertainty for the complex segmentation task due to different issues like various image appearance, low contrast between curvilinear objects and their surrounding backgrounds, thin and uneven curvilinear structures, and improper background illumination. To overcome these challenges, we present a unique curvilinear structure segmentation framework based on oriented derivative of stick (ODoS) filter and deep learning network for curvilinear object segmentation in medical images. Currently, a large number of deep learning models emphasis on developing deep architectures and ignore capturing the structural features of curvature objects, which may lead to unsatisfactory results. In consequence, a new approach that incorporates the ODoS filter as part of a deep learning network is presented to improve the spatial attention of curvilinear objects. In which, the original image is considered as principal part to describe various image appearance and complex background illumination, the multi-step strategy is used to enhance contrast between curvilinear objects and their surrounding backgrounds, and the vector field is applied to discriminate thin and uneven curvilinear structures. Subsequently, a deep learning framework is employed to extract varvious structural features for curvilinear object segmentation in medical images. The performance of the computational model was validated in experiments with publicly available DRIVE, STARE and CHASEDB1 datasets. Experimental results indicate that the presented model has yielded surprising results compared with some state-of-the-art methods.  ( 2 min )
    Model-free machine learning of conservation laws from data. (arXiv:2301.07503v1 [cs.LG])
    We present a machine learning based method for learning first integrals of systems of ordinary differential equations from given trajectory data. The method is model-free in that it does not require explicit knowledge of the underlying system of differential equations that generated the trajectories. As a by-product, once the first integrals have been learned, also the system of differential equations will be known. We illustrate our method by considering several classical problems from the mathematical sciences.  ( 2 min )
    Machine learning techniques for the Schizophrenia diagnosis: A comprehensive review and future research directions. (arXiv:2301.07496v1 [cs.LG])
    Schizophrenia (SCZ) is a brain disorder where different people experience different symptoms, such as hallucination, delusion, flat-talk, disorganized thinking, etc. In the long term, this can cause severe effects and diminish life expectancy by more than ten years. Therefore, early and accurate diagnosis of SCZ is prevalent, and modalities like structural magnetic resonance imaging (sMRI), functional MRI (fMRI), diffusion tensor imaging (DTI), and electroencephalogram (EEG) assist in witnessing the brain abnormalities of the patients. Moreover, for accurate diagnosis of SCZ, researchers have used machine learning (ML) algorithms for the past decade to distinguish the brain patterns of healthy and SCZ brains using MRI and fMRI images. This paper seeks to acquaint SCZ researchers with ML and to discuss its recent applications to the field of SCZ study. This paper comprehensively reviews state-of-the-art techniques such as ML classifiers, artificial neural network (ANN), deep learning (DL) models, methodological fundamentals, and applications with previous studies. The motivation of this paper is to benefit from finding the research gaps that may lead to the development of a new model for accurate SCZ diagnosis. The paper concludes with the research finding, followed by the future scope that directly contributes to new research directions.  ( 2 min )
    Local Learning with Neuron Groups. (arXiv:2301.07635v1 [cs.LG])
    Traditional deep network training methods optimize a monolithic objective function jointly for all the components. This can lead to various inefficiencies in terms of potential parallelization. Local learning is an approach to model-parallelism that removes the standard end-to-end learning setup and utilizes local objective functions to permit parallel learning amongst model components in a deep network. Recent works have demonstrated that variants of local learning can lead to efficient training of modern deep networks. However, in terms of how much computation can be distributed, these approaches are typically limited by the number of layers in a network. In this work we propose to study how local learning can be applied at the level of splitting layers or modules into sub-components, adding a notion of width-wise modularity to the existing depth-wise modularity associated with local learning. We investigate local-learning penalties that permit such models to be trained efficiently. Our experiments on the CIFAR-10, CIFAR-100, and Imagenet32 datasets demonstrate that introducing width-level modularity can lead to computational advantages over existing methods based on local learning and opens new opportunities for improved model-parallel distributed training. Code is available at: https://github.com/adeetyapatel12/GN-DGL.  ( 2 min )
    Human-Timescale Adaptation in an Open-Ended Task Space. (arXiv:2301.07608v1 [cs.LG])
    Foundation models have shown impressive adaptation and scalability in supervised and self-supervised learning problems, but so far these successes have not fully translated to reinforcement learning (RL). In this work, we demonstrate that training an RL agent at scale leads to a general in-context learning algorithm that can adapt to open-ended novel embodied 3D problems as quickly as humans. In a vast space of held-out environment dynamics, our adaptive agent (AdA) displays on-the-fly hypothesis-driven exploration, efficient exploitation of acquired knowledge, and can successfully be prompted with first-person demonstrations. Adaptation emerges from three ingredients: (1) meta-reinforcement learning across a vast, smooth and diverse task distribution, (2) a policy parameterised as a large-scale attention-based memory architecture, and (3) an effective automated curriculum that prioritises tasks at the frontier of an agent's capabilities. We demonstrate characteristic scaling laws with respect to network size, memory length, and richness of the training task distribution. We believe our results lay the foundation for increasingly general and adaptive RL agents that perform well across ever-larger open-ended domains.  ( 2 min )
    Synthcity: facilitating innovative use cases of synthetic data in different data modalities. (arXiv:2301.07573v1 [cs.LG])
    Synthcity is an open-source software package for innovative use cases of synthetic data in ML fairness, privacy and augmentation across diverse tabular data modalities, including static data, regular and irregular time series, data with censoring, multi-source data, composite data, and more. Synthcity provides the practitioners with a single access point to cutting edge research and tools in synthetic data. It also offers the community a playground for rapid experimentation and prototyping, a one-stop-shop for SOTA benchmarks, and an opportunity for extending research impact. The library can be accessed on GitHub (https://github.com/vanderschaarlab/synthcity) and pip (https://pypi.org/project/synthcity/). We warmly invite the community to join the development effort by providing feedback, reporting bugs, and contributing code.  ( 2 min )
    Reslicing Ultrasound Images for Data Augmentation and Vessel Reconstruction. (arXiv:2301.07286v1 [eess.IV])
    Robot-guided catheter insertion has the potential to deliver urgent medical care in situations where medical personnel are unavailable. However, this technique requires accurate and reliable segmentation of anatomical landmarks in the body. For the ultrasound imaging modality, obtaining large amounts of training data for a segmentation model is time-consuming and expensive. This paper introduces RESUS (RESlicing of UltraSound Images), a weak supervision data augmentation technique for ultrasound images based on slicing reconstructed 3D volumes from tracked 2D images. This technique allows us to generate views which cannot be easily obtained in vivo due to physical constraints of ultrasound imaging, and use these augmented ultrasound images to train a semantic segmentation model. We demonstrate that RESUS achieves statistically significant improvement over training with non-augmented images and highlight qualitative improvements through vessel reconstruction.  ( 2 min )
    Learning a Formality-Aware Japanese Sentence Representation. (arXiv:2301.07209v1 [cs.CL])
    While the way intermediate representations are generated in encoder-decoder sequence-to-sequence models typically allow them to preserve the semantics of the input sentence, input features such as formality might be left out. On the other hand, downstream tasks such as translation would benefit from working with a sentence representation that preserves formality in addition to semantics, so as to generate sentences with the appropriate level of social formality -- the difference between speaking to a friend versus speaking with a supervisor. We propose a sequence-to-sequence method for learning a formality-aware representation for Japanese sentences, where sentence generation is conditioned on both the original representation of the input sentence, and a side constraint which guides the sentence representation towards preserving formality information. Additionally, we propose augmenting the sentence representation with a learned representation of formality which facilitates the extraction of formality in downstream tasks. We address the lack of formality-annotated parallel data by adapting previous works on procedural formality classification of Japanese sentences. Experimental results suggest that our techniques not only helps the decoder recover the formality of the input sentence, but also slightly improves the preservation of input sentence semantics.  ( 2 min )
    Efficient correlation-based discretization of continuous variables for annealing machines. (arXiv:2301.07244v1 [quant-ph])
    Annealing machines specialized for combinatorial optimization problems have been developed, and some companies offer services to use those machines. Such specialized machines can only handle binary variables, and their input format is the quadratic unconstrained binary optimization (QUBO) formulation. Therefore, discretization is necessary to solve problems with continuous variables. However, there is a severe constraint on the number of binary variables with such machines. Although the simple binary expansion in the previous research requires many binary variables, we need to reduce the number of such variables in the QUBO formulation due to the constraint. We propose a discretization method that involves using correlations of continuous variables. We numerically show that the proposed method reduces the number of necessary binary variables in the QUBO formulation without a significant loss in prediction accuracy.  ( 2 min )
    Discrete Latent Structure in Neural Networks. (arXiv:2301.07473v1 [cs.LG])
    Many types of data from fields including natural language processing, computer vision, and bioinformatics, are well represented by discrete, compositional structures such as trees, sequences, or matchings. Latent structure models are a powerful tool for learning to extract such representations, offering a way to incorporate structural bias, discover insight about the data, and interpret decisions. However, effective training is challenging, as neural networks are typically designed for continuous computation. This text explores three broad strategies for learning with discrete latent structure: continuous relaxation, surrogate gradients, and probabilistic estimation. Our presentation relies on consistent notations for a wide range of models. As such, we reveal many new connections between latent structure learning strategies, showing how most consist of the same small set of fundamental building blocks, but use them differently, leading to substantially different applicability and properties.  ( 2 min )
    DDPEN: Trajectory Optimisation With Sub Goal Generation Model. (arXiv:2301.07433v1 [cs.RO])
    Differential dynamic programming (DDP) is a widely used and powerful trajectory optimization technique, however, due to its internal structure, it is not exempt from local minima. In this paper, we present Differential Dynamic Programming with Escape Network (DDPEN) - a novel approach to avoid DDP local minima by utilising an additional term used in the optimization criteria pointing towards the direction where robot should move in order to escape local minima. In order to produce the aforementioned directions, we propose to utilize a deep model that takes as an input the map of the environment in the form of a costmap together with the desired goal position. The Model produces possible future directions that will lead to the goal, avoiding local minima which is possible to run in real time conditions. The model is trained on a synthetic dataset and overall the system is evaluated at the Gazebo simulator. In this work we show that our proposed method allows avoiding local minima of trajectory optimization algorithm and successfully execute a trajectory 278 m long with various convex and nonconvex obstacles.  ( 2 min )
    DIRECT: Learning from Sparse and Shifting Rewards using Discriminative Reward Co-Training. (arXiv:2301.07421v1 [cs.LG])
    We propose discriminative reward co-training (DIRECT) as an extension to deep reinforcement learning algorithms. Building upon the concept of self-imitation learning (SIL), we introduce an imitation buffer to store beneficial trajectories generated by the policy determined by their return. A discriminator network is trained concurrently to the policy to distinguish between trajectories generated by the current policy and beneficial trajectories generated by previous policies. The discriminator's verdict is used to construct a reward signal for optimizing the policy. By interpolating prior experience, DIRECT is able to act as a surrogate, steering policy optimization towards more valuable regions of the reward landscape thus learning an optimal policy. Our results show that DIRECT outperforms state-of-the-art algorithms in sparse- and shifting-reward environments being able to provide a surrogate reward to the policy and direct the optimization towards valuable areas.  ( 2 min )
    Image Embedding for Denoising Generative Models. (arXiv:2301.07485v1 [cs.CV])
    Denoising Diffusion models are gaining increasing popularity in the field of generative modeling for several reasons, including the simple and stable training, the excellent generative quality, and the solid probabilistic foundation. In this article, we address the problem of {\em embedding} an image into the latent space of Denoising Diffusion Models, that is finding a suitable ``noisy'' image whose denoising results in the original image. We particularly focus on Denoising Diffusion Implicit Models due to the deterministic nature of their reverse diffusion process. As a side result of our investigation, we gain a deeper insight into the structure of the latent space of diffusion models, opening interesting perspectives on its exploration, the definition of semantic trajectories, and the manipulation/conditioning of encodings for editing purposes. A particularly interesting property highlighted by our research, which is also characteristic of this class of generative models, is the independence of the latent representation from the networks implementing the reverse diffusion process. In other words, a common seed passed to different networks (each trained on the same dataset), eventually results in identical images.  ( 2 min )
    Optimistic Dynamic Regret Bounds. (arXiv:2301.07530v1 [cs.LG])
    Online Learning (OL) algorithms have originally been developed to guarantee good performances when comparing their output to the best fixed strategy. The question of performance with respect to dynamic strategies remains an active research topic. We develop in this work dynamic adaptations of classical OL algorithms based on the use of experts' advice and the notion of optimism. We also propose a constructivist method to generate those advices and eventually provide both theoretical and experimental guarantees for our procedures.  ( 2 min )
    PIRLNav: Pretraining with Imitation and RL Finetuning for ObjectNav. (arXiv:2301.07302v1 [cs.LG])
    We study ObjectGoal Navigation - where a virtual robot situated in a new environment is asked to navigate to an object. Prior work has shown that imitation learning (IL) on a dataset of human demonstrations achieves promising results. However, this has limitations $-$ 1) IL policies generalize poorly to new states, since the training mimics actions not their consequences, and 2) collecting demonstrations is expensive. On the other hand, reinforcement learning (RL) is trivially scalable, but requires careful reward engineering to achieve desirable behavior. We present a two-stage learning scheme for IL pretraining on human demonstrations followed by RL-finetuning. This leads to a PIRLNav policy that advances the state-of-the-art on ObjectNav from $60.0\%$ success rate to $65.0\%$ ($+5.0\%$ absolute). Using this IL$\rightarrow$RL training recipe, we present a rigorous empirical analysis of design choices. First, we investigate whether human demonstrations can be replaced with `free' (automatically generated) sources of demonstrations, e.g. shortest paths (SP) or task-agnostic frontier exploration (FE) trajectories. We find that IL$\rightarrow$RL on human demonstrations outperforms IL$\rightarrow$RL on SP and FE trajectories, even when controlled for the same IL-pretraining success on TRAIN, and even on a subset of VAL episodes where IL-pretraining success favors the SP or FE policies. Next, we study how RL-finetuning performance scales with the size of the IL pretraining dataset. We find that as we increase the size of the IL-pretraining dataset and get to high IL accuracies, the improvements from RL-finetuning are smaller, and that $90\%$ of the performance of our best IL$\rightarrow$RL policy can be achieved with less than half the number of IL demonstrations. Finally, we analyze failure modes of our ObjectNav policies, and present guidelines for further improving them.  ( 2 min )
    Beating the Best: Improving on AlphaFold2 at Protein Structure Prediction. (arXiv:2301.07568v1 [q-bio.BM])
    The goal of Protein Structure Prediction (PSP) problem is to predict a protein's 3D structure (confirmation) from its amino acid sequence. The problem has been a 'holy grail' of science since the Noble prize-winning work of Anfinsen demonstrated that protein conformation was determined by sequence. A recent and important step towards this goal was the development of AlphaFold2, currently the best PSP method. AlphaFold2 is probably the highest profile application of AI to science. Both AlphaFold2 and RoseTTAFold (another impressive PSP method) have been published and placed in the public domain (code & models). Stacking is a form of ensemble machine learning ML in which multiple baseline models are first learnt, then a meta-model is learnt using the outputs of the baseline level model to form a model that outperforms the base models. Stacking has been successful in many applications. We developed the ARStack PSP method by stacking AlphaFold2 and RoseTTAFold. ARStack significantly outperforms AlphaFold2. We rigorously demonstrate this using two sets of non-homologous proteins, and a test set of protein structures published after that of AlphaFold2 and RoseTTAFold. As more high quality prediction methods are published it is likely that ensemble methods will increasingly outperform any single method.  ( 2 min )
    Adversarial Robust Deep Reinforcement Learning Requires Redefining Robustness. (arXiv:2301.07487v1 [cs.LG])
    Learning from raw high dimensional data via interaction with a given environment has been effectively achieved through the utilization of deep neural networks. Yet the observed degradation in policy performance caused by imperceptible worst-case policy dependent translations along high sensitivity directions (i.e. adversarial perturbations) raises concerns on the robustness of deep reinforcement learning policies. In our paper, we show that these high sensitivity directions do not lie only along particular worst-case directions, but rather are more abundant in the deep neural policy landscape and can be found via more natural means in a black-box setting. Furthermore, we show that vanilla training techniques intriguingly result in learning more robust policies compared to the policies learnt via the state-of-the-art adversarial training techniques. We believe our work lays out intriguing properties of the deep reinforcement learning policy manifold and our results can help to build robust and generalizable deep reinforcement learning policies.  ( 2 min )
    Learning Deformation Trajectories of Boltzmann Densities. (arXiv:2301.07388v1 [stat.ML])
    We introduce a training objective for continuous normalizing flows that can be used in the absence of samples but in the presence of an energy function. Our method relies on either a prescribed or a learnt interpolation $f_t$ of energy functions between the target energy $f_1$ and the energy function of a generalized Gaussian $f_0(x) = (|x|/\sigma)^p$. This then induces an interpolation of Boltzmann densities $p_t \propto e^{-f_t}$ and we aim to find a time-dependent vector field $V_t$ that transports samples along this family of densities. Concretely, this condition can be translated to a PDE between $V_t$ and $f_t$ and we minimize the amount by which this PDE fails to hold. We compare this objective to the reverse KL-divergence on Gaussian mixtures and on the $\phi^4$ lattice field theory on a circle.  ( 2 min )
    Threats, Vulnerabilities, and Controls of Machine Learning Based Systems: A Survey and Taxonomy. (arXiv:2301.07474v1 [cs.CR])
    In this article, we propose the Artificial Intelligence Security Taxonomy to systematize the knowledge of threats, vulnerabilities, and security controls of ML-based systems. We first classify the damage caused by attacks against ML-based systems, define ML-specific security, and discuss its characteristics. Next, we enumerate all relevant assets and stakeholders and provide a general taxonomy for ML-specific threats. Then, we collect a wide range of security controls against ML-specific threats through an extensive review of recent literature. Finally, we classify the vulnerabilities and controls of an ML-based system in terms of each vulnerable asset in the system's entire lifecycle.  ( 2 min )
    PTA-Det: Point Transformer Associating Point cloud and Image for 3D Object Detection. (arXiv:2301.07301v1 [cs.CV])
    In autonomous driving, 3D object detection based on multi-modal data has become an indispensable approach when facing complex environments around the vehicle. During multi-modal detection, LiDAR and camera are simultaneously applied for capturing and modeling. However, due to the intrinsic discrepancies between the LiDAR point and camera image, the fusion of the data for object detection encounters a series of problems. Most multi-modal detection methods perform even worse than LiDAR-only methods. In this investigation, we propose a method named PTA-Det to improve the performance of multi-modal detection. Accompanied by PTA-Det, a Pseudo Point Cloud Generation Network is proposed, which can convert image information including texture and semantic features by pseudo points. Thereafter, through a transformer-based Point Fusion Transition (PFT) module, the features of LiDAR points and pseudo points from image can be deeply fused under a unified point-based representation. The combination of these modules can conquer the major obstacle in feature fusion across modalities and realizes a complementary and discriminative representation for proposal generation. Extensive experiments on the KITTI dataset show the PTA-Det achieves a competitive result and support its effectiveness.  ( 2 min )
    Causal Falsification of Digital Twins. (arXiv:2301.07210v1 [stat.ME])
    Digital twins hold substantial promise in many applications, but rigorous procedures for assessing their accuracy are essential for their widespread deployment in safety-critical settings. By formulating this task within the framework of causal inference, we show it is not possible to certify that a twin is "correct" using real-world observational data unless potentially tenuous assumptions are made about the data-generating process. To avoid these assumptions, we propose an assessment strategy that instead aims to find cases where the twin is not correct, and present a general-purpose statistical procedure for doing so that may be used across a wide variety of applications and twin models. Our approach yields reliable and actionable information about the twin under only the assumption of an i.i.d. dataset of real-world observations, and in particular remains sound even in the presence of arbitrary unmeasured confounding. We demonstrate the effectiveness of our methodology via a large-scale case study involving sepsis modelling within the Pulse Physiology Engine, which we assess using the MIMIC-III dataset of ICU patients.  ( 2 min )
    Improve Noise Tolerance of Robust Loss via Noise-Awareness. (arXiv:2301.07306v1 [cs.LG])
    Robust loss minimization is an important strategy for handling robust learning issue on noisy labels. Current robust losses, however, inevitably involve hyperparameters to be tuned for different datasets with noisy labels, manually or heuristically through cross validation, which makes them fairly hard to be generally applied in practice. Existing robust loss methods usually assume that all training samples share common hyperparameters, which are independent of instances. This limits the ability of these methods on distinguishing individual noise properties of different samples, making them hardly adapt to different noise structures. To address above issues, we propose to assemble robust loss with instance-dependent hyperparameters to improve their noise-tolerance with theoretical guarantee. To achieve setting such instance-dependent hyperparameters for robust loss, we propose a meta-learning method capable of adaptively learning a hyperparameter prediction function, called Noise-Aware-Robust-Loss-Adjuster (NARL-Adjuster). Specifically, through mutual amelioration between hyperparameter prediction function and classifier parameters in our method, both of them can be simultaneously finely ameliorated and coordinated to attain solutions with good generalization capability. Four kinds of SOTA robust losses are attempted to be integrated with our algorithm, and experiments substantiate the general availability and effectiveness of the proposed method in both its noise tolerance and generalization performance. Meanwhile, the explicit parameterized structure makes the meta-learned prediction function capable of being readily transferrable and plug-and-play to unseen datasets with noisy labels. Specifically, we transfer our meta-learned NARL-Adjuster to unseen tasks, including several real noisy datasets, and achieve better performance compared with conventional hyperparameter tuning strategy.  ( 2 min )
    Tailor: Altering Skip Connections for Resource-Efficient Inference. (arXiv:2301.07247v1 [cs.CV])
    Deep neural networks use skip connections to improve training convergence. However, these skip connections are costly in hardware, requiring extra buffers and increasing on- and off-chip memory utilization and bandwidth requirements. In this paper, we show that skip connections can be optimized for hardware when tackled with a hardware-software codesign approach. We argue that while a network's skip connections are needed for the network to learn, they can later be removed or shortened to provide a more hardware efficient implementation with minimal to no accuracy loss. We introduce Tailor, a codesign tool whose hardware-aware training algorithm gradually removes or shortens a fully trained network's skip connections to lower their hardware cost. The optimized hardware designs improve resource utilization by up to 34% for BRAMs, 13% for FFs, and 16% for LUTs.  ( 2 min )
    Detecting and Ranking Causal Anomalies in End-to-End Complex System. (arXiv:2301.07281v1 [cs.LG])
    With the rapid development of technology, the automated monitoring systems of large-scale factories are becoming more and more important. By collecting a large amount of machine sensor data, we can have many ways to find anomalies. We believe that the real core value of an automated monitoring system is to identify and track the cause of the problem. The most famous method for finding causal anomalies is RCA, but there are many problems that cannot be ignored. They used the AutoRegressive eXogenous (ARX) model to create a time-invariant correlation network as a machine profile, and then use this profile to track the causal anomalies by means of a method called fault propagation. There are two major problems in describing the behavior of a machine by using the correlation network established by ARX: (1) It does not take into account the diversity of states (2) It does not separately consider the correlations with different time-lag. Based on these problems, we propose a framework called Ranking Causal Anomalies in End-to-End System (RCAE2E), which completely solves the problems mentioned above. In the experimental part, we use synthetic data and real-world large-scale photoelectric factory data to verify the correctness and existence of our method hypothesis.  ( 2 min )
    Towards Models that Can See and Read. (arXiv:2301.07389v1 [cs.CV])
    Visual Question Answering (VQA) and Image Captioning (CAP), which are among the most popular vision-language tasks, have analogous scene-text versions that require reasoning from the text in the image. Despite the obvious resemblance between them, the two are treated independently, yielding task-specific methods that can either see or read, but not both. In this work, we conduct an in-depth analysis of this phenomenon and propose UniTNT, a Unified Text-Non-Text approach, which grants existing multimodal architectures scene-text understanding capabilities. Specifically, we treat scene-text information as an additional modality, fusing it with any pretrained encoder-decoder-based architecture via designated modules. Thorough experiments reveal that UniTNT leads to the first single model that successfully handles both task types. Moreover, we show that scene-text understanding capabilities can boost vision-language models' performance on VQA and CAP by up to 3.49% and 0.7 CIDEr, respectively.  ( 2 min )
    Relativistic Digital Twin: Bringing the IoT to the Future. (arXiv:2301.07390v1 [cs.NI])
    Complex IoT ecosystems often require the usage of Digital Twins (DTs) of their physical assets in order to perform predictive analytics and simulate what-if scenarios. DTs are able to replicate IoT devices and adapt over time to their behavioral changes. However, DTs in IoT are typically tailored to a specific use case, without the possibility to seamlessly adapt to different scenarios. Further, the fragmentation of IoT poses additional challenges on how to deploy DTs in heterogeneous scenarios characterized by the usage of multiple data formats and IoT network protocols. In this paper, we propose the Relativistic Digital Twin (RDT) framework, through which we automatically generate general purpose DTs of IoT entities and tune their behavioral models over time by constantly observing their real counterparts. The framework relies on the object representation via the Web of Things (WoT), to offer a standardized interface to each of the IoT devices as well as to their DTs. To this purpose, we extended the W3C WoT standard in order to encompass the concept of behavioral model and define it in the Thing Description (TD) through a new vocabulary. Finally, we evaluated the RDT framework over two disjoint use cases to assess its correctness and learning performance, i.e. the DT of a simulated smart home scenario with the capability of forecasting the indoor temperature, and the DT of a real-world drone with the capability of forecasting its trajectory in an outdoor scenario.  ( 2 min )
    Complexity Analysis of a Countable-armed Bandit Problem. (arXiv:2301.07243v1 [cs.LG])
    We consider a stochastic multi-armed bandit (MAB) problem motivated by ``large'' action spaces, and endowed with a population of arms containing exactly $K$ arm-types, each characterized by a distinct mean reward. The decision maker is oblivious to the statistical properties of reward distributions as well as the population-level distribution of different arm-types, and is precluded also from observing the type of an arm after play. We study the classical problem of minimizing the expected cumulative regret over a horizon of play $n$, and propose algorithms that achieve a rate-optimal finite-time instance-dependent regret of $\mathcal{O}\left( \log n \right)$. We also show that the instance-independent (minimax) regret is $\tilde{\mathcal{O}}\left( \sqrt{n} \right)$ when $K=2$. While the order of regret and complexity of the problem suggests a great degree of similarity to the classical MAB problem, properties of the performance bounds and salient aspects of algorithm design are quite distinct from the latter, as are the key primitives that determine complexity along with the analysis tools needed to study them.  ( 2 min )
    Label Inference Attack against Split Learning under Regression Setting. (arXiv:2301.07284v1 [cs.CR])
    As a crucial building block in vertical Federated Learning (vFL), Split Learning (SL) has demonstrated its practice in the two-party model training collaboration, where one party holds the features of data samples and another party holds the corresponding labels. Such method is claimed to be private considering the shared information is only the embedding vectors and gradients instead of private raw data and labels. However, some recent works have shown that the private labels could be leaked by the gradients. These existing attack only works under the classification setting where the private labels are discrete. In this work, we step further to study the leakage in the scenario of the regression model, where the private labels are continuous numbers (instead of discrete labels in classification). This makes previous attacks harder to infer the continuous labels due to the unbounded output range. To address the limitation, we propose a novel learning-based attack that integrates gradient information and extra learning regularization objectives in aspects of model training properties, which can infer the labels under regression settings effectively. The comprehensive experiments on various datasets and models have demonstrated the effectiveness of our proposed attack. We hope our work can pave the way for future analyses that make the vFL framework more secure.  ( 2 min )
    Adapting Multilingual Speech Representation Model for a New, Underresourced Language through Multilingual Fine-tuning and Continued Pretraining. (arXiv:2301.07295v1 [cs.CL])
    In recent years, neural models learned through self-supervised pretraining on large scale multilingual text or speech data have exhibited promising results for underresourced languages, especially when a relatively large amount of data from related language(s) is available. While the technology has a potential for facilitating tasks carried out in language documentation projects, such as speech transcription, pretraining a multilingual model from scratch for every new language would be highly impractical. We investigate the possibility for adapting an existing multilingual wav2vec 2.0 model for a new language, focusing on actual fieldwork data from a critically endangered tongue: Ainu. Specifically, we (i) examine the feasibility of leveraging data from similar languages also in fine-tuning; (ii) verify whether the model's performance can be improved by further pretraining on target language data. Our results show that continued pretraining is the most effective method to adapt a wav2vec 2.0 model for a new language and leads to considerable reduction in error rates. Furthermore, we find that if a model pretrained on a related speech variety or an unrelated language with similar phonological characteristics is available, multilingual fine-tuning using additional data from that language can have positive impact on speech recognition performance when there is very little labeled data in the target language.  ( 2 min )
    A variational autoencoder-based nonnegative matrix factorisation model for deep dictionary learning. (arXiv:2301.07272v1 [cs.LG])
    Construction of dictionaries using nonnegative matrix factorisation (NMF) has extensive applications in signal processing and machine learning. With the advances in deep learning, training compact and robust dictionaries using deep neural networks, i.e., dictionaries of deep features, has been proposed. In this study, we propose a probabilistic generative model which employs a variational autoencoder (VAE) to perform nonnegative dictionary learning. In contrast to the existing VAE models, we cast the model under a statistical framework with latent variables obeying a Gamma distribution and design a new loss function to guarantee the nonnegative dictionaries. We adopt an acceptance-rejection sampling reparameterization trick to update the latent variables iteratively. We apply the dictionaries learned from VAE-NMF to two signal processing tasks, i.e., enhancement of speech and extraction of muscle synergies. Experimental results demonstrate that VAE-NMF performs better in learning the latent nonnegative dictionaries in comparison with state-of-the-art methods.  ( 2 min )
    Tracking Brand-Associated Polarity-Bearing Topics in User Reviews. (arXiv:2301.07183v1 [cs.IR])
    Monitoring online customer reviews is important for business organisations to measure customer satisfaction and better manage their reputations. In this paper, we propose a novel dynamic Brand-Topic Model (dBTM) which is able to automatically detect and track brand-associated sentiment scores and polarity-bearing topics from product reviews organised in temporally-ordered time intervals. dBTM models the evolution of the latent brand polarity scores and the topic-word distributions over time by Gaussian state space models. It also incorporates a meta learning strategy to control the update of the topic-word distribution in each time interval in order to ensure smooth topic transitions and better brand score predictions. It has been evaluated on a dataset constructed from MakeupAlley reviews and a hotel review dataset. Experimental results show that dBTM outperforms a number of competitive baselines in brand ranking, achieving a good balance of topic coherence and uniqueness, and extracting well-separated polarity-bearing topics across time intervals.  ( 2 min )
    Dual-sPLS: a family of Dual Sparse Partial Least Squares regressions for feature selection and prediction with tunable sparsity; evaluation on simulated and near-infrared (NIR) data. (arXiv:2301.07206v1 [stat.ML])
    Relating a set of variables X to a response y is crucial in chemometrics. A quantitative prediction objective can be enriched by qualitative data interpretation, for instance by locating the most influential features. When high-dimensional problems arise, dimension reduction techniques can be used. Most notable are projections (e.g. Partial Least Squares or PLS ) or variable selections (e.g. lasso). Sparse partial least squares combine both strategies, by blending variable selection into PLS. The variant presented in this paper, Dual-sPLS, generalizes the classical PLS1 algorithm. It provides balance between accurate prediction and efficient interpretation. It is based on penalizations inspired by classical regression methods (lasso, group lasso, least squares, ridge) and uses the dual norm notion. The resulting sparsity is enforced by an intuitive shrinking ratio parameter. Dual-sPLS favorably compares to similar regression methods, on simulated and real chemical data. Code is provided as an open-source package in R: \url{https://CRAN.R-project.org/package=dual.spls}.  ( 2 min )
    Scaffold-Based Multi-Objective Drug Candidate Optimization. (arXiv:2301.07175v1 [q-bio.BM])
    Multiparameter optimization (MPO) provides a means to assess and balance several variables based on their importance to the overall objective. However, using MPO methods in therapeutic discovery is challenging due to the number of cheminformatics properties required to find an optimal solution. High throughput virtual screening to identify hit candidates produces a large amount of data with conflicting properties. For instance, toxicity and binding affinity can contradict each other and cause improbable levels of toxicity that can lead to adverse effects. Instead of using the exhaustive method of treating each property, multiple properties can be combined into a single MPO score, with weights assigned for each property. This desirability score also lends itself well to ML applications that can use the score in the loss function. In this work, we will discuss scaffold focused graph-based Markov chain monte carlo framework built to generate molecules with optimal properties. This framework trains itself on-the-fly with the MPO score of each iteration of molecules, and is able to work on a greater number of properties and sample the chemical space around a starting scaffold. Results are compared to the chemical Transformer model molGCT to judge performance between graph and natural language processing approaches.  ( 2 min )
    Artificial Neuronal Ensembles with Learned Context Dependent Gating. (arXiv:2301.07187v1 [cs.LG])
    Biological neural networks are capable of recruiting different sets of neurons to encode different memories. However, when training artificial neural networks on a set of tasks, typically, no mechanism is employed for selectively producing anything analogous to these neuronal ensembles. Further, artificial neural networks suffer from catastrophic forgetting, where the network's performance rapidly deteriorates as tasks are learned sequentially. By contrast, sequential learning is possible for a range of biological organisms. We introduce Learned Context Dependent Gating (LXDG), a method to flexibly allocate and recall `artificial neuronal ensembles', using a particular network structure and a new set of regularization terms. Activities in the hidden layers of the network are modulated by gates, which are dynamically produced during training. The gates are outputs of networks themselves, trained with a sigmoid output activation. The regularization terms we have introduced correspond to properties exhibited by biological neuronal ensembles. The first term penalizes low gate sparsity, ensuring that only a specified fraction of the network is used. The second term ensures that previously learned gates are recalled when the network is presented with input from previously learned tasks. Finally, there is a regularization term responsible for ensuring that new tasks are encoded in gates that are as orthogonal as possible from previously used ones. We demonstrate the ability of this method to alleviate catastrophic forgetting on continual learning benchmarks. When the new regularization terms are included in the model along with Elastic Weight Consolidation (EWC) it achieves better performance on the benchmark `permuted MNIST' than with EWC alone. The benchmark `rotated MNIST' demonstrates how similar tasks recruit similar neurons to the artificial neuronal ensemble.  ( 2 min )
    Revisiting mass-radius relationships for exoplanet populations: a machine learning insight. (arXiv:2301.07143v1 [astro-ph.EP])
    The growing number of exoplanet discoveries and advances in machine learning techniques allow us to find, explore, and understand characteristics of these new worlds beyond our Solar System. We analyze the dataset of 762 confirmed exoplanets and eight Solar System planets using efficient machine-learning approaches to characterize their fundamental quantities. By adopting different unsupervised clustering algorithms, the data are divided into two main classes: planets with $\log R_{p}\leq0.91R_{\oplus}$ and $\log M_{p}\leq1.72M_{\oplus}$ as class 1 and those with $\log R_{p}>0.91R_{\oplus}$ and $\log M_{p}>1.72M_{\oplus}$ as class 2. Various regression models are used to reveal correlations between physical parameters and evaluate their performance. We find that planetary mass, orbital period, and stellar mass play preponderant roles in predicting exoplanet radius. The validation metrics (RMSE, MAE, and $R^{2}$) suggest that the Support Vector Regression has, by and large, better performance than other models and is a promising model for obtaining planetary radius. Not only do we improve the prediction accuracy in logarithmic space, but also we derive parametric equations using the M5P and Markov Chain Monte Carlo methods. Planets of class 1 are shown to be consistent with a positive linear mass-radius relation, while for planets of class 2, the planetary radius represents a strong correlation with their host stars' masses.  ( 2 min )
    Heterogeneous Multi-Robot Reinforcement Learning. (arXiv:2301.07137v1 [cs.RO])
    Cooperative multi-robot tasks can benefit from heterogeneity in the robots' physical and behavioral traits. In spite of this, traditional Multi-Agent Reinforcement Learning (MARL) frameworks lack the ability to explicitly accommodate policy heterogeneity, and typically constrain agents to share neural network parameters. This enforced homogeneity limits application in cases where the tasks benefit from heterogeneous behaviors. In this paper, we crystallize the role of heterogeneity in MARL policies. Towards this end, we introduce Heterogeneous Graph Neural Network Proximal Policy Optimization (HetGPPO), a paradigm for training heterogeneous MARL policies that leverages a Graph Neural Network for differentiable inter-agent communication. HetGPPO allows communicating agents to learn heterogeneous behaviors while enabling fully decentralized training in partially observable environments. We complement this with a taxonomical overview that exposes more heterogeneity classes than previously identified. To motivate the need for our model, we present a characterization of techniques that homogeneous models can leverage to emulate heterogeneous behavior, and show how this "apparent heterogeneity" is brittle in real-world conditions. Through simulations and real-world experiments, we show that: (i) when homogeneous methods fail due to strong heterogeneous requirements, HetGPPO succeeds, and, (ii) when homogeneous methods are able to learn apparently heterogeneous behaviors, HetGPPO achieves higher resilience to both training and deployment noise.  ( 2 min )
    Mortality Prediction with Adaptive Feature Importance Recalibration for Peritoneal Dialysis Patients: a deep-learning-based study on a real-world longitudinal follow-up dataset. (arXiv:2301.07107v1 [cs.LG])
    Objective: Peritoneal Dialysis (PD) is one of the most widely used life-supporting therapies for patients with End-Stage Renal Disease (ESRD). Predicting mortality risk and identifying modifiable risk factors based on the Electronic Medical Records (EMR) collected along with the follow-up visits are of great importance for personalized medicine and early intervention. Here, our objective is to develop a deep learning model for a real-time, individualized, and interpretable mortality prediction model - AICare. Method and Materials: Our proposed model consists of a multi-channel feature extraction module and an adaptive feature importance recalibration module. AICare explicitly identifies the key features that strongly indicate the outcome prediction for each patient to build the health status embedding individually. This study has collected 13,091 clinical follow-up visits and demographic data of 656 PD patients. To verify the application universality, this study has also collected 4,789 visits of 1,363 hemodialysis dialysis (HD) as an additional experiment dataset to test the prediction performance, which will be discussed in the Appendix. Results: 1) Experiment results show that AICare achieves 81.6%/74.3% AUROC and 47.2%/32.5% AUPRC for the 1-year mortality prediction task on PD/HD dataset respectively, which outperforms the state-of-the-art comparative deep learning models. 2) This study first provides a comprehensive elucidation of the relationship between the causes of mortality in patients with PD and clinical features based on an end-to-end deep learning model. 3) This study first reveals the pattern of variation in the importance of each feature in the mortality prediction based on built-in interpretability. 4) We develop a practical AI-Doctor interaction system to visualize the trajectory of patients' health status and risk indicators.  ( 3 min )
    Genetic Imitation Learning by Reward Extrapolation. (arXiv:2301.07182v1 [cs.NE])
    Imitation learning demonstrates remarkable performance in various domains. However, imitation learning is also constrained by many prerequisites. The research community has done intensive research to alleviate these constraints, such as adding the stochastic policy to avoid unseen states, eliminating the need for action labels, and learning from the suboptimal demonstrations. Inspired by the natural reproduction process, we proposed a method called GenIL that integrates the Genetic Algorithm with imitation learning. The involvement of the Genetic Algorithm improves the data efficiency by reproducing trajectories with various returns and assists the model in estimating more accurate and compact reward function parameters. We tested GenIL in both Atari and Mujoco domains, and the result shows that it successfully outperforms the previous extrapolation methods over extrapolation accuracy, robustness, and overall policy performance when input data is limited.  ( 2 min )
    A Combinatorial Semi-Bandit Approach to Charging Station Selection for Electric Vehicles. (arXiv:2301.07156v1 [cs.LG])
    In this work, we address the problem of long-distance navigation for battery electric vehicles (BEVs), where one or more charging sessions are required to reach the intended destination. We consider the availability and performance of the charging stations to be unknown and stochastic, and develop a combinatorial semi-bandit framework for exploring the road network to learn the parameters of the queue time and charging power distributions. Within this framework, we first outline a pre-processing for the road network graph to handle the constrained combinatorial optimization problem in an efficient way. Then, for the pre-processed graph, we use a Bayesian approach to model the stochastic edge weights, utilizing conjugate priors for the one-parameter exponential and two-parameter gamma distributions, the latter of which is novel to multi-armed bandit literature. Finally, we apply combinatorial versions of Thompson Sampling, BayesUCB and Epsilon-greedy to the problem. We demonstrate the performance of our framework on long-distance navigation problem instances in country-sized road networks, with simulation experiments in Norway, Sweden and Finland.  ( 2 min )
    Large Deviations for Classification Performance Analysis of Machine Learning Systems. (arXiv:2301.07104v1 [cs.LG])
    We study the performance of machine learning binary classification techniques in terms of error probabilities. The statistical test is based on the Data-Driven Decision Function (D3F), learned in the training phase, i.e., what is thresholded before the final binary decision is made. Based on large deviations theory, we show that under appropriate conditions the classification error probabilities vanish exponentially, as $\sim \exp\left(-n\,I + o(n) \right)$, where $I$ is the error rate and $n$ is the number of observations available for testing. We also propose two different approximations for the error probability curves, one based on a refined asymptotic formula (often referred to as exact asymptotics), and another one based on the central limit theorem. The theoretical findings are finally tested using the popular MNIST dataset.  ( 2 min )
    Continuous Trajectory Generation Based on Two-Stage GAN. (arXiv:2301.07103v1 [cs.LG])
    Simulating the human mobility and generating large-scale trajectories are of great use in many real-world applications, such as urban planning, epidemic spreading analysis, and geographic privacy protect. Although many previous works have studied the problem of trajectory generation, the continuity of the generated trajectories has been neglected, which makes these methods useless for practical urban simulation scenarios. To solve this problem, we propose a novel two-stage generative adversarial framework to generate the continuous trajectory on the road network, namely TS-TrajGen, which efficiently integrates prior domain knowledge of human mobility with model-free learning paradigm. Specifically, we build the generator under the human mobility hypothesis of the A* algorithm to learn the human mobility behavior. For the discriminator, we combine the sequential reward with the mobility yaw reward to enhance the effectiveness of the generator. Finally, we propose a novel two-stage generation process to overcome the weak point of the existing stochastic generation process. Extensive experiments on two real-world datasets and two case studies demonstrate that our framework yields significant improvements over the state-of-the-art methods.  ( 2 min )
    On Using Deep Learning Proxies as Forward Models in Deep Learning Problems. (arXiv:2301.07102v1 [cs.LG])
    Physics-based optimization problems are generally very time-consuming, especially due to the computational complexity associated with the forward model. Recent works have demonstrated that physics-modelling can be approximated with neural networks. However, there is always a certain degree of error associated with this learning, and we study this aspect in this paper. We demonstrate through experiments on popular mathematical benchmarks, that neural network approximations (NN-proxies) of such functions when plugged into the optimization framework, can lead to erroneous results. In particular, we study the behavior of particle swarm optimization and genetic algorithm methods and analyze their stability when coupled with NN-proxies. The correctness of the approximate model depends on the extent of sampling conducted in the parameter space, and through numerical experiments, we demonstrate that caution needs to be taken when constructing this landscape with neural networks. Further, the NN-proxies are hard to train for higher dimensional functions, and we present our insights for 4D and 10D problems. The error is higher for such cases, and we demonstrate that it is sensitive to the choice of the sampling scheme used to build the NN-proxy. The code is available at https://github.com/Fa-ti-ma/NN-proxy-in-optimization.  ( 2 min )
    The moral authority of ChatGPT. (arXiv:2301.07098v1 [cs.CY])
    ChatGPT is not only fun to chat with, but it also searches information, answers questions, and gives advice. With consistent moral advice, it might improve the moral judgment and decisions of users, who often hold contradictory moral beliefs. Unfortunately, ChatGPT turns out highly inconsistent as a moral advisor. Nonetheless, it influences users' moral judgment, we find in an experiment, even if they know they are advised by a chatting bot, and they underestimate how much they are influenced. Thus, ChatGPT threatens to corrupt rather than improves users' judgment. These findings raise the question of how to ensure the responsible use of ChatGPT and similar AI. Transparency is often touted but seems ineffective. We propose training to improve digital literacy.  ( 2 min )
    Distributed LSTM-Learning from Differentially Private Label Proportions. (arXiv:2301.07101v1 [cs.LG])
    Data privacy and decentralised data collection has become more and more popular in recent years. In order to solve issues with privacy, communication bandwidth and learning from spatio-temporal data, we will propose two efficient models which use Differential Privacy and decentralized LSTM-Learning: One, in which a Long Short Term Memory (LSTM) model is learned for extracting local temporal node constraints and feeding them into a Dense-Layer (LabelProportionToLocal). The other approach extends the first one by fetching histogram data from the neighbors and joining the information with the LSTM output (LabelProportionToDense). For evaluation two popular datasets are used: Pems-Bay and METR-LA. Additionally, we provide an own dataset, which is based on LuST. The evaluation will show the tradeoff between performance and data privacy.  ( 2 min )
    EENet: Learning to Early Exit for Adaptive Inference. (arXiv:2301.07099v1 [cs.LG])
    Budgeted adaptive inference with early exits is an emerging technique to improve the computational efficiency of deep neural networks (DNNs) for edge AI applications with limited resources at test time. This method leverages the fact that different test data samples may not require the same amount of computation for a correct prediction. By allowing early exiting from full layers of DNN inference for some test examples, we can reduce latency and improve throughput of edge inference while preserving performance. Although there have been numerous studies on designing specialized DNN architectures for training early-exit enabled DNN models, most of the existing work employ hand-tuned or manual rule-based early exit policies. In this study, we introduce a novel multi-exit DNN inference framework, coined as EENet, which leverages multi-objective learning to optimize the early exit policy for a trained multi-exit DNN under a given inference budget. This paper makes two novel contributions. First, we introduce the concept of early exit utility scores by combining diverse confidence measures with class-wise prediction scores to better estimate the correctness of test-time predictions at a given exit. Second, we train a lightweight, budget-driven, multi-objective neural network over validation predictions to learn the exit assignment scheduling for query examples at test time. The EENet early exit scheduler optimizes both the distribution of test samples to different exits and the selection of the exit utility thresholds such that the given inference budget is satisfied while the performance metric is maximized. Extensive experiments are conducted on five benchmarks, including three image datasets (CIFAR-10, CIFAR-100, ImageNet) and two NLP datasets (SST-2, AgNews). The results demonstrate the performance improvements of EENet compared to existing representative early exit techniques.  ( 2 min )
  • Open

    Data thinning for convolution-closed distributions. (arXiv:2301.07276v1 [stat.ME])
    We propose data thinning, a new approach for splitting an observation into two or more independent parts that sum to the original observation, and that follow the same distribution as the original observation, up to a (known) scaling of a parameter. This proposal is very general, and can be applied to any observation drawn from a "convolution closed" distribution, a class that includes the Gaussian, Poisson, negative binomial, Gamma, and binomial distributions, among others. It is similar in spirit to -- but distinct from, and more easily applicable than -- a recent proposal known as data fission. Data thinning has a number of applications to model selection, evaluation, and inference. For instance, cross-validation via data thinning provides an attractive alternative to the "usual" approach of cross-validation via sample splitting, especially in unsupervised settings in which the latter is not applicable. In simulations and in an application to single-cell RNA-sequencing data, we show that data thinning can be used to validate the results of unsupervised learning approaches, such as k-means clustering and principal components analysis.  ( 2 min )
    Optimal Sub-sampling to Boost Power of Kernel Sequential Change-point Detection. (arXiv:2210.15060v2 [stat.ME] UPDATED)
    We present a novel scheme to boost detection power for kernel maximum mean discrepancy based sequential change-point detection procedures. Our proposed scheme features an optimal sub-sampling of the history data before the detection procedure, in order to tackle the power loss incurred by the random sub-sample from the enormous history data. We apply our proposed scheme to both Scan $B$ and Kernel Cumulative Sum (CUSUM) procedures, and improved performance is observed from extensive numerical experiments.  ( 2 min )
    A Combinatorial Semi-Bandit Approach to Charging Station Selection for Electric Vehicles. (arXiv:2301.07156v1 [cs.LG])
    In this work, we address the problem of long-distance navigation for battery electric vehicles (BEVs), where one or more charging sessions are required to reach the intended destination. We consider the availability and performance of the charging stations to be unknown and stochastic, and develop a combinatorial semi-bandit framework for exploring the road network to learn the parameters of the queue time and charging power distributions. Within this framework, we first outline a pre-processing for the road network graph to handle the constrained combinatorial optimization problem in an efficient way. Then, for the pre-processed graph, we use a Bayesian approach to model the stochastic edge weights, utilizing conjugate priors for the one-parameter exponential and two-parameter gamma distributions, the latter of which is novel to multi-armed bandit literature. Finally, we apply combinatorial versions of Thompson Sampling, BayesUCB and Epsilon-greedy to the problem. We demonstrate the performance of our framework on long-distance navigation problem instances in country-sized road networks, with simulation experiments in Norway, Sweden and Finland.  ( 2 min )
    Learning Deformation Trajectories of Boltzmann Densities. (arXiv:2301.07388v1 [stat.ML])
    We introduce a training objective for continuous normalizing flows that can be used in the absence of samples but in the presence of an energy function. Our method relies on either a prescribed or a learnt interpolation $f_t$ of energy functions between the target energy $f_1$ and the energy function of a generalized Gaussian $f_0(x) = (|x|/\sigma)^p$. This then induces an interpolation of Boltzmann densities $p_t \propto e^{-f_t}$ and we aim to find a time-dependent vector field $V_t$ that transports samples along this family of densities. Concretely, this condition can be translated to a PDE between $V_t$ and $f_t$ and we minimize the amount by which this PDE fails to hold. We compare this objective to the reverse KL-divergence on Gaussian mixtures and on the $\phi^4$ lattice field theory on a circle.  ( 2 min )
    Reliable amortized variational inference with physics-based latent distribution correction. (arXiv:2207.11640v3 [stat.ML] UPDATED)
    Bayesian inference for high-dimensional inverse problems is computationally costly and requires selecting a suitable prior distribution. Amortized variational inference addresses these challenges via a neural network that approximates the posterior distribution not only for one instance of data, but a distribution of data pertaining to a specific inverse problem. During inference, the neural network -- in our case a conditional normalizing flow -- provides posterior samples at virtually no cost. However, the accuracy of amortized variational inference relies on the availability of high-fidelity training data, which seldom exists in geophysical inverse problems due to the Earth's heterogeneity. In addition, the network is prone to errors if evaluated over out-of-distribution data. As such, we propose to increase the resilience of amortized variational inference in the presence of moderate data distribution shifts. We achieve this via a correction to the latent distribution that improves the posterior distribution approximation for the data at hand. The correction involves relaxing the standard Gaussian assumption on the latent distribution and parameterizing it via a Gaussian distribution with an unknown mean and (diagonal) covariance. These unknowns are then estimated by minimizing the Kullback-Leibler divergence between the corrected and the (physics-based) true posterior distributions. While generic and applicable to other inverse problems, by means of a linearized seismic imaging example, we show that our correction step improves the robustness of amortized variational inference with respect to changes in the number of seismic sources, noise variance, and shifts in the prior distribution. This approach provides a seismic image with limited artifacts and an assessment of its uncertainty at approximately the same cost as five reverse-time migrations.  ( 2 min )
    A Nonsmooth Dynamical Systems Perspective on Accelerated Extensions of ADMM. (arXiv:1808.04048v7 [math.OC] UPDATED)
    Recently, there has been great interest in connections between continuous-time dynamical systems and optimization methods, notably in the context of accelerated methods for smooth and unconstrained problems. In this paper we extend this perspective to nonsmooth and constrained problems by obtaining differential inclusions associated to novel accelerated variants of the alternating direction method of multipliers (ADMM). Through a Lyapunov analysis, we derive rates of convergence for these dynamical systems in different settings that illustrate an interesting tradeoff between decaying versus constant damping strategies. We also obtain modified equations capturing fine-grained details of these methods, which have improved stability and preserve the leading order convergence rates. An extension to general nonlinear equality and inequality constraints in connection with singular perturbation theory is provided.  ( 2 min )
    Optimistic Dynamic Regret Bounds. (arXiv:2301.07530v1 [cs.LG])
    Online Learning (OL) algorithms have originally been developed to guarantee good performances when comparing their output to the best fixed strategy. The question of performance with respect to dynamic strategies remains an active research topic. We develop in this work dynamic adaptations of classical OL algorithms based on the use of experts' advice and the notion of optimism. We also propose a constructivist method to generate those advices and eventually provide both theoretical and experimental guarantees for our procedures.  ( 2 min )
    Functional Neural Networks: Shift invariant models for functional data with applications to EEG classification. (arXiv:2301.05869v1 [cs.LG] CROSS LISTED)
    It is desirable for statistical models to detect signals of interest independently of their position. If the data is generated by some smooth process, this additional structure should be taken into account. We introduce a new class of neural networks that are shift invariant and preserve smoothness of the data: functional neural networks (FNNs). For this, we use methods from functional data analysis (FDA) to extend multi-layer perceptrons and convolutional neural networks to functional data. We propose different model architectures, show that the models outperform a benchmark model from FDA in terms of accuracy and successfully use FNNs to classify electroencephalography (EEG) data.  ( 2 min )
    Complexity Analysis of a Countable-armed Bandit Problem. (arXiv:2301.07243v1 [cs.LG])
    We consider a stochastic multi-armed bandit (MAB) problem motivated by ``large'' action spaces, and endowed with a population of arms containing exactly $K$ arm-types, each characterized by a distinct mean reward. The decision maker is oblivious to the statistical properties of reward distributions as well as the population-level distribution of different arm-types, and is precluded also from observing the type of an arm after play. We study the classical problem of minimizing the expected cumulative regret over a horizon of play $n$, and propose algorithms that achieve a rate-optimal finite-time instance-dependent regret of $\mathcal{O}\left( \log n \right)$. We also show that the instance-independent (minimax) regret is $\tilde{\mathcal{O}}\left( \sqrt{n} \right)$ when $K=2$. While the order of regret and complexity of the problem suggests a great degree of similarity to the classical MAB problem, properties of the performance bounds and salient aspects of algorithm design are quite distinct from the latter, as are the key primitives that determine complexity along with the analysis tools needed to study them.  ( 2 min )
    Improving Federated Learning Personalization via Model Agnostic Meta Learning. (arXiv:1909.12488v2 [cs.LG] UPDATED)
    Federated Learning (FL) refers to learning a high quality global model based on decentralized data storage, without ever copying the raw data. A natural scenario arises with data created on mobile phones by the activity of their users. Given the typical data heterogeneity in such situations, it is natural to ask how can the global model be personalized for every such device, individually. In this work, we point out that the setting of Model Agnostic Meta Learning (MAML), where one optimizes for a fast, gradient-based, few-shot adaptation to a heterogeneous distribution of tasks, has a number of similarities with the objective of personalization for FL. We present FL as a natural source of practical applications for MAML algorithms, and make the following observations. 1) The popular FL algorithm, Federated Averaging, can be interpreted as a meta learning algorithm. 2) Careful fine-tuning can yield a global model with higher accuracy, which is at the same time easier to personalize. However, solely optimizing for the global model accuracy yields a weaker personalization result. 3) A model trained using a standard datacenter optimization method is much harder to personalize, compared to one trained using Federated Averaging, supporting the first claim. These results raise new questions for FL, MAML, and broader ML research.  ( 2 min )
    What relations are reliably embeddable in Euclidean space?. (arXiv:1903.05347v3 [cs.LG] UPDATED)
    We consider the problem of embedding a relation, represented as a directed graph, into Euclidean space. For three types of embeddings motivated by the recent literature on knowledge graphs, we obtain characterizations of which relations they are able to capture, as well as bounds on the minimal dimensionality and precision needed.  ( 2 min )
    Global Contrastive Batch Sampling via Optimization on Sample Permutations. (arXiv:2210.12874v3 [cs.LG] UPDATED)
    Contrastive Learning has recently achieved state-of-the-art performance in a wide range of tasks. Many contrastive learning approaches use mined hard negatives to make batches more informative during training but these approaches are inefficient as they increase epoch length proportional to the number of mined negatives and require frequent updates of nearest neighbor indices or mining from recent batches. In this work, we provide an alternative to hard negative mining, Global Contrastive Batch Sampling (GCBS), an efficient approximation to the batch assignment problem that upper bounds the gap between the global and training losses, $\mathcal{L}^{Global} - \mathcal{L}^{Train}$, in contrastive learning settings. Through experimentation we find GCBS improves state-of-the-art performance in sentence embedding and code-search tasks. Additionally, GCBS is easy to implement as it requires only a few additional lines of code, does not maintain external data structures such as nearest neighbor indices, is more computationally efficient than the most minimal hard negative mining approaches, and makes no changes to the model being trained.  ( 2 min )
    Concentration inequalities for leave-one-out cross validation. (arXiv:2211.02478v2 [math.ST] UPDATED)
    In this article we prove that estimator stability is enough to show that leave-one-out cross validation is a sound procedure, by providing concentration bounds in a general framework. In particular, we provide concentration bounds beyond Lipschitz continuity assumptions on the loss or on the estimator. In order to obtain our results, we rely on random variables with distribution satisfying the logarithmic Sobolev inequality, providing us a relatively rich class of distributions. We illustrate our method by considering several interesting examples, including linear regression, kernel density estimation, and stabilized / truncated estimators such as stabilized kernel regression.  ( 2 min )
    Adversarial Robust Deep Reinforcement Learning Requires Redefining Robustness. (arXiv:2301.07487v1 [cs.LG])
    Learning from raw high dimensional data via interaction with a given environment has been effectively achieved through the utilization of deep neural networks. Yet the observed degradation in policy performance caused by imperceptible worst-case policy dependent translations along high sensitivity directions (i.e. adversarial perturbations) raises concerns on the robustness of deep reinforcement learning policies. In our paper, we show that these high sensitivity directions do not lie only along particular worst-case directions, but rather are more abundant in the deep neural policy landscape and can be found via more natural means in a black-box setting. Furthermore, we show that vanilla training techniques intriguingly result in learning more robust policies compared to the policies learnt via the state-of-the-art adversarial training techniques. We believe our work lays out intriguing properties of the deep reinforcement learning policy manifold and our results can help to build robust and generalizable deep reinforcement learning policies.  ( 2 min )
    PENDANTSS: PEnalized Norm-ratios Disentangling Additive Noise, Trend and Sparse Spikes. (arXiv:2301.01514v1 [eess.SP] CROSS LISTED)
    Denoising, detrending, deconvolution: usual restoration tasks, traditionally decoupled. Coupled formulations entail complex ill-posed inverse problems. We propose PENDANTSS for joint trend removal and blind deconvolution of sparse peak-like signals. It blends a parsimonious prior with the hypothesis that smooth trend and noise can somewhat be separated by low-pass filtering. We combine the generalized quasi-norm ratio SOOT/SPOQ sparse penalties $\ell_p/\ell_q$ with the BEADS ternary assisted source separation algorithm. This results in a both convergent and efficient tool, with a novel Trust-Region block alternating variable metric forward-backward approach. It outperforms comparable methods, when applied to typically peaked analytical chemistry signals. Reproducible code is provided.  ( 2 min )
    Using Topological Data Analysis to classify Encrypted Bits. (arXiv:2301.07393v1 [cs.CR])
    We present a way to apply topological data analysis for classifying encrypted bits into distinct classes. Persistent homology is applied to generate topological features of a point cloud obtained from sets of encryptions. We see that this machine learning pipeline is able to classify our data successfully where classical models of machine learning fail to perform the task. We also see that this pipeline works as a dimensionality reduction method making this approach to classify encrypted data a realistic method to classify the given encryptioned bits.  ( 2 min )
    Physics-informed Information Field Theory for Modeling Physical Systems with Uncertainty Quantification. (arXiv:2301.07609v1 [stat.ML])
    Data-driven approaches coupled with physical knowledge are powerful techniques to model systems. The goal of such models is to efficiently solve for the underlying field by combining measurements with known physical laws. As many systems contain unknown elements, such as missing parameters, noisy data, or incomplete physical laws, this is widely approached as an uncertainty quantification problem. The common techniques to handle all the variables typically depend on the numerical scheme used to approximate the posterior, and it is desirable to have a method which is independent of any such discretization. Information field theory (IFT) provides the tools necessary to perform statistics over fields that are not necessarily Gaussian. We extend IFT to physics-informed IFT (PIFT) by encoding the functional priors with information about the physical laws which describe the field. The posteriors derived from this PIFT remain independent of any numerical scheme and can capture multiple modes, allowing for the solution of problems which are ill-posed. We demonstrate our approach through an analytical example involving the Klein-Gordon equation. We then develop a variant of stochastic gradient Langevin dynamics to draw samples from the joint posterior over the field and model parameters. We apply our method to numerical examples with various degrees of model-form error and to inverse problems involving nonlinear differential equations. As an addendum, the method is equipped with a metric which allows the posterior to automatically quantify model-form uncertainty. Because of this, our numerical experiments show that the method remains robust to even an incorrect representation of the physics given sufficient data. We numerically demonstrate that the method correctly identifies when the physics cannot be trusted, in which case it automatically treats learning the field as a regression problem.  ( 2 min )
    Dual-sPLS: a family of Dual Sparse Partial Least Squares regressions for feature selection and prediction with tunable sparsity; evaluation on simulated and near-infrared (NIR) data. (arXiv:2301.07206v1 [stat.ML])
    Relating a set of variables X to a response y is crucial in chemometrics. A quantitative prediction objective can be enriched by qualitative data interpretation, for instance by locating the most influential features. When high-dimensional problems arise, dimension reduction techniques can be used. Most notable are projections (e.g. Partial Least Squares or PLS ) or variable selections (e.g. lasso). Sparse partial least squares combine both strategies, by blending variable selection into PLS. The variant presented in this paper, Dual-sPLS, generalizes the classical PLS1 algorithm. It provides balance between accurate prediction and efficient interpretation. It is based on penalizations inspired by classical regression methods (lasso, group lasso, least squares, ridge) and uses the dual norm notion. The resulting sparsity is enforced by an intuitive shrinking ratio parameter. Dual-sPLS favorably compares to similar regression methods, on simulated and real chemical data. Code is provided as an open-source package in R: \url{https://CRAN.R-project.org/package=dual.spls}.  ( 2 min )
    Discrete Latent Structure in Neural Networks. (arXiv:2301.07473v1 [cs.LG])
    Many types of data from fields including natural language processing, computer vision, and bioinformatics, are well represented by discrete, compositional structures such as trees, sequences, or matchings. Latent structure models are a powerful tool for learning to extract such representations, offering a way to incorporate structural bias, discover insight about the data, and interpret decisions. However, effective training is challenging, as neural networks are typically designed for continuous computation. This text explores three broad strategies for learning with discrete latent structure: continuous relaxation, surrogate gradients, and probabilistic estimation. Our presentation relies on consistent notations for a wide range of models. As such, we reveal many new connections between latent structure learning strategies, showing how most consist of the same small set of fundamental building blocks, but use them differently, leading to substantially different applicability and properties.  ( 2 min )
    Sample Complexity of Adversarially Robust Linear Classification on Separated Data. (arXiv:2012.10794v3 [cs.LG] UPDATED)
    We consider the sample complexity of learning with adversarial robustness. Most prior theoretical results for this problem have considered a setting where different classes in the data are close together or overlapping. Motivated by some real applications, we consider, in contrast, the well-separated case where there exists a classifier with perfect accuracy and robustness, and show that the sample complexity narrates an entirely different story. Specifically, for linear classifiers, we show a large class of well-separated distributions where the expected robust loss of any algorithm is at least $\Omega(\frac{d}{n})$, whereas the max margin algorithm has expected standard loss $O(\frac{1}{n})$. This shows a gap in the standard and robust losses that cannot be obtained via prior techniques. Additionally, we present an algorithm that, given an instance where the robustness radius is much smaller than the gap between the classes, gives a solution with expected robust loss is $O(\frac{1}{n})$. This shows that for very well-separated data, convergence rates of $O(\frac{1}{n})$ are achievable, which is not the case otherwise. Our results apply to robustness measured in any $\ell_p$ norm with $p > 1$ (including $p = \infty$).  ( 2 min )
    Electronic excited states in deep variational Monte Carlo. (arXiv:2203.09472v3 [physics.chem-ph] UPDATED)
    Obtaining accurate ground and low-lying excited states of electronic systems is crucial in a multitude of important applications. One ab initio method for solving the Schr\"odinger equation that scales favorably for large systems is variational quantum Monte Carlo (QMC). The recently introduced deep QMC approach uses ansatzes represented by deep neural networks and generates nearly exact ground-state solutions for molecules containing up to a few dozen electrons, with the potential to scale to much larger systems where other highly accurate methods are not feasible. In this paper, we extend one such ansatz (PauliNet) to compute electronic excited states. We demonstrate our method on various small atoms and molecules and consistently achieve high accuracy for low-lying states. To highlight the method's potential, we compute the first excited state of the much larger benzene molecule, as well as the conical intersection of ethylene, with PauliNet matching results of more expensive high-level methods.  ( 2 min )
    LIMEADE: From AI Explanations to Advice Taking. (arXiv:2003.04315v5 [cs.IR] UPDATED)
    Research in human-centered AI has shown the benefits of systems that can explain their predictions. Methods that allow an AI to take advice from humans in response to explanations are similarly useful. While both capabilities are well-developed for transparent learning models (e.g., linear models and GA$^2$Ms), and recent techniques (e.g., LIME and SHAP) can generate explanations for opaque models, little attention has been given to advice methods for opaque models. This paper introduces LIMEADE, the first general framework that translates both positive and negative advice (expressed using high-level vocabulary such as that employed by post-hoc explanations) into an update to an arbitrary, underlying opaque model. We demonstrate the generality of our approach with case studies on seventy real-world models across two broad domains: image classification and text recommendation. We show our method improves accuracy compared to a rigorous baseline on the image classification domains. For the text modality, we apply our framework to a neural recommender system for scientific papers on a public website; our user study shows that our framework leads to significantly higher perceived user control, trust, and satisfaction.  ( 2 min )
    Landscape Complexity for the Empirical Risk of Generalized Linear Models. (arXiv:1912.02143v5 [stat.ML] UPDATED)
    We present a method to obtain the average and the typical value of the number of critical points of the empirical risk landscape for generalized linear estimation problems and variants. This represents a substantial extension of previous applications of the Kac-Rice method since it allows to analyze the critical points of high dimensional non-Gaussian random functions. Under a technical hypothesis, we obtain a rigorous explicit variational formula for the annealed complexity, which is the logarithm of the average number of critical points at fixed value of the empirical risk. This result is simplified, and extended, using the non-rigorous Kac-Rice replicated method from theoretical physics. In this way we find an explicit variational formula for the quenched complexity, which is generally different from its annealed counterpart, and allows to obtain the number of critical points for typical instances up to exponential accuracy.  ( 2 min )
    Strong inductive biases provably prevent harmless interpolation. (arXiv:2301.07605v1 [stat.ML])
    Classical wisdom suggests that estimators should avoid fitting noise to achieve good generalization. In contrast, modern overparameterized models can yield small test error despite interpolating noise -- a phenomenon often called "benign overfitting" or "harmless interpolation". This paper argues that the degree to which interpolation is harmless hinges upon the strength of an estimator's inductive bias, i.e., how heavily the estimator favors solutions with a certain structure: while strong inductive biases prevent harmless interpolation, weak inductive biases can even require fitting noise to generalize well. Our main theoretical result establishes tight non-asymptotic bounds for high-dimensional kernel regression that reflect this phenomenon for convolutional kernels, where the filter size regulates the strength of the inductive bias. We further provide empirical evidence of the same behavior for deep neural networks with varying filter sizes and rotational invariance.  ( 2 min )
    Non-IID Quantum Federated Learning with One-shot Communication Complexity. (arXiv:2209.00768v2 [quant-ph] UPDATED)
    Federated learning refers to the task of machine learning based on decentralized data from multiple clients with secured data privacy. Recent studies show that quantum algorithms can be exploited to boost its performance. However, when the clients' data are not independent and identically distributed (IID), the performance of conventional federated algorithms is known to deteriorate. In this work, we explore the non-IID issue in quantum federated learning with both theoretical and numerical analysis. We further prove that a global quantum channel can be exactly decomposed into local channels trained by each client with the help of local density estimators. This observation leads to a general framework for quantum federated learning on non-IID data with one-shot communication complexity. Numerical simulations show that the proposed algorithm outperforms the conventional ones significantly under non-IID settings.  ( 2 min )
    An Analysis of Loss Functions for Binary Classification and Regression. (arXiv:2301.07638v1 [stat.ML])
    This paper explores connections between margin-based loss functions and consistency in binary classification and regression applications. It is shown that a large class of margin-based loss functions for binary classification/regression result in estimating scores equivalent to log-likelihood scores weighted by an even function. A simple characterization for conformable (consistent) loss functions is given, which allows for straightforward comparison of different losses, including exponential loss, logistic loss, and others. The characterization is used to construct a new Huber-type loss function for the logistic model. A simple relation between the margin and standardized logistic regression residuals is derived, demonstrating that all margin-based loss can be viewed as loss functions of squared standardized logistic regression residuals. The relation provides new, straightforward interpretations for exponential and logistic loss, and aids in understanding why exponential loss is sensitive to outliers. In particular, it is shown that minimizing empirical exponential loss is equivalent to minimizing the sum of squared standardized logistic regression residuals. The relation also provides new insight into the AdaBoost algorithm.  ( 2 min )
    Large Deviations for Classification Performance Analysis of Machine Learning Systems. (arXiv:2301.07104v1 [cs.LG])
    We study the performance of machine learning binary classification techniques in terms of error probabilities. The statistical test is based on the Data-Driven Decision Function (D3F), learned in the training phase, i.e., what is thresholded before the final binary decision is made. Based on large deviations theory, we show that under appropriate conditions the classification error probabilities vanish exponentially, as $\sim \exp\left(-n\,I + o(n) \right)$, where $I$ is the error rate and $n$ is the number of observations available for testing. We also propose two different approximations for the error probability curves, one based on a refined asymptotic formula (often referred to as exact asymptotics), and another one based on the central limit theorem. The theoretical findings are finally tested using the popular MNIST dataset.  ( 2 min )

  • Open

    [D] ICLR 2023 results.
    Hi, Making a post for anything to be discussed related to ICLR 2023 results ​ One question I had: Is the exact time of result announcement fixed? submitted by /u/East-Beginning9987 [link] [comments]  ( 41 min )
    [Discussion] I'm Getting 50FPS With 4 Billion Parameters, Is That Good? - Compute Shader Implementation
    So I like to DIY a lot. I coded a neural network to run parallel in a compute shader with TanH activation, and the performance was much better than I expected. I Tested with a 3090 with many layers of 20,000 neurons until I reached 4 billion total parameters which ran around 50FPS when looped every frame. Is this above average performance for a GPU implementations? I haven't really tested out any other GPU implementations, so I was wondering if anyone here knows. submitted by /u/TheRPGGamerMan [link] [comments]  ( 42 min )
    I'm working on a project and need a open source chatbot that I can run locally and train to talk like a specific character, does anyone know one? [P]
    Title, I am trying to get a chatbot to act like Megumin, similar to Character AI, but open source and can be run on a local machine. Thank you! submitted by /u/otakuhacker123 [link] [comments]  ( 42 min )
    [D][P] Best Speech-to-Text Model for domain-specific data - Open Source vs. Paid Services
    I originally posted this here on r/learnmachinelearning but reposting here as it may be a more appropriate subreddit and/or may have a different perspective. I want a tool to programmatically generate transcripts from sermons. I have access to hundreds of sermon transcripts (and 100x more very similar in domain data) but less than a 40 transcripts with audio (~30 hours). I want the lowest WER (Word Error Rate) possible and can budge 100 hours for this project in 2023. Train my own acoustic model with the best open source offering OpenAI's Whisper seems to be the best available today? How much supervised data (e.g. hours of sermons with perfect transcripts) would I need to develop a model that would be more accurate than Google/AWS for my specific domain? Can I take a model already trained and "tune" it by augmenting the data I have? Use the best cloud speech-to-text API that I can provide in-domain data to to tune it AWS Transcribe and Google Speech to Text seem to be big players I've gone with AWS Transcribe since it can be tuned more easily with custom domain data (just upload text files) than Google's (which requires building phrase dictionaries with weights). Is there anything out there that's better for my use case? submitted by /u/Knecht_Christi [link] [comments]  ( 43 min )
    [D] Pre-trained Models for Domain-Specific (i.e. Stylistic) Feature Extraction
    Most or all of the style transfer models are used to extract the domain-independent (robust) feature from an artpiece so as to apply it to different styles. But I need the opposite: I need a pretrained model that can extract the domain-specific (i.e. stylistic) feature from an artpiece. Are there any publicly available ones I can use? It doesn't matter whether it's a Github repository, a Huggingface API, or something else. Thank you! submitted by /u/No_Zookeepergame8794 [link] [comments]  ( 42 min )
    [P] Labeling tools are great, but what about quality checks?
    Modern datasets contain hundreds of thousands to millions of labels that must be kept accurate. In practice, some errors in the dataset average out and can be ignored – systematic biases transfer to the model. After quick initial wins in areas where abundant data is readily available, deep learning needs to become more data efficient to help solve difficult business problems. MLfix is a new open-source tool that combines novel unsupervised machine-learning pipelines with a new user interface concept that, together, help annotators and machine-learning engineers identify and filter out label errors. https://www.collabora.com/news-and-blog/blog/2023/01/17/labeling-tools-are-great-but-what-about-quality-checks/ submitted by /u/mfilion [link] [comments]  ( 42 min )
    [D] is it time to investigate retrieval language models?
    With ChatGPT going mainstream and the general push to make products out of LM, a problem remain about the cost of running such models. To me, it seems counterproductive to put both language modelling and knowledge inside the model weights. Is it time to shift to retrieval LM like Retro to keep the cost down while offering the same products? It would possibly allow Google or others to offer a free assistant service, using embeddings similarity search to retrieve results from the Internet so the model itself could possibly even run on edge devices? What are your thoughts about that subject? submitted by /u/hapliniste [link] [comments]  ( 46 min )
    [P] Code super clean multi-modal PyTorch models and easily serve them through FastAPI, using DocArray
    Hi all! I'd like to share an open source project that I am currently working on together with a few colleagues: DocArray! If you've ever trained models that deal with different data types (images, text, video, audio, ...) then you know how much of a hassle it can be to keep track of all of your tensors, what shapes they have, and what data they are meant to represent. That's what we're trying to change with DocArray, a Python library for representing, sending, and storing multi-modal data! The core idea of DocArray is that you define Documents that represent your data. For example, one Document could hold the file path to an image, its image tensor, and and image embedding that your model creates. A different Document could do the same thing for some Text, and a third Document might co…  ( 44 min )
    [P] Tired of generating synthetic corgis❓🐶 Check out Synthcity, a framework for synthetic tabular data
    🌟 Synthcity isa library for generating and benchmarking synthetic tabular data. https://github.com/vanderschaarlab/synthcity ​ 🚀 Synthcity includes a wide range of algorithms for various use cases, such as: - tabular data(CTGAN, TVAE, Bayesian Networks etc) - survival analysis(SurvivalGAN etc). - time series(Fourier Flows, TimeGAN, etc.). - privacy-focused(DP-GAN, PATEGAN, AdsGAN, DECAF). - domain adaptation(RadialGAN). ​ 🔍 Synthcity supports benchmarking multiple algorithms, testing data quality, downstream performance, statistical fidelity, and privacy metrics. ​ 🌀 Give it a try: - Library: https://github.com/vanderschaarlab/synthcity - Tutorial: https://colab.research.google.com/drive/1Vr2PJswgfFYBkJCm3hhVkuH-9dXnHeYV?usp=sharing - Docs: https://synthcity.readthedocs.io/ submitted by /u/ManagementBig2995 [link] [comments]  ( 42 min )
    [P] Need some recs on an NLP project
    Hello, for my job, I have to extract job responsibilities from job ads. I'm thinking of approaching it as a span extraction problem. Where I'm gonna label the job responsibility span manually for around 1000 samples. And use supervised learning. Is there any better way to approach this problem? Is there any pretrained model I can use to fine tune? Any suggestion will be appreciated. Thanks! submitted by /u/Salekeen01 [link] [comments]  ( 43 min )
    [R] Human-Timescale Adaptation in an Open-Ended Task Space - (AdA) - DeepMind 2023 - Can adapt to open-ended novel embodied 3D problems as quickly as humans!
    Paper: https://arxiv.org/abs/2301.07608 Youtube: https://www.youtube.com/watch?v=U93bUQ1roiw Please watch the Video the explanations are better than me giving you 3-5 Pictures! Abstract: Foundation models have shown impressive adaptation and scalability in supervised and self-supervised learning problems, but so far these successes have not fully translated to reinforcement learning (RL). In this work, we demonstrate that training an RL agent at scale leads to a general in-context learning algorithm that can adapt to open-ended novel embodied 3D problems as quickly as humans. In a vast space of held-out environment dynamics, our adaptive agent (AdA) displays on-the-fly hypothesis-driven exploration, efficient exploitation of acquired knowledge, and can successfully be prompted with first-person demonstrations. Adaptation emerges from three ingredients: (1) meta-reinforcement learning across a vast, smooth and diverse task distribution, (2) a policy parameterised as a large-scale attention-based memory architecture, and (3) an effective automated curriculum that prioritises tasks at the frontier of an agent's capabilities. We demonstrate characteristic scaling laws with respect to network size, memory length, and richness of the training task distribution. We believe our results lay the foundation for increasingly general and adaptive RL agents that perform well across ever-larger open-ended domains. https://preview.redd.it/is3pyl1p70da1.jpg?width=1424&format=pjpg&auto=webp&s=8d102af4202711be7e01619577f109b6598d6ed5 submitted by /u/Singularian2501 [link] [comments]  ( 44 min )
    [D] Question about using diffusion to denoise images
    Hi all, I am trying to see if I can use DDPM (Denoising Diffusion Probabilistic Model) to denoise images using a supervised learning approach. However, I've learned that DDPM is only for unconditional image generation. Has anyone had experience using conditional DDPM and could help me out with some conceptual questions? Here's what I'm trying to understand: Say I have a pair of noisy and clean ground truth images. Should I take my clean image and gradually corrupt it by adding gaussian noise in the forward diffusion (FD) process? Could I get the network to learn the reverse diffusion process by giving it the noisy input, the FD noisy image, and positional embeddings? I was planning on concatenating the noisy input with the FD noisy image. During training, the network learns to predict noise at t-1 given the image at t conditioned on the input noisy source image. Here is an image showing you what I mean. Any thoughts or suggestions would be greatly appreciated. DDPM for image denoising submitted by /u/CurrentlyJoblessFML [link] [comments]  ( 44 min )
    [D] What is the name of this NLP technique?
    Lets say I have a dataset of real estate listings. I have a column of text that describes the listing, and another column that shows the number of rooms for example. In most of the cases, the number of rooms is shown in both columns, in the description text and also in the dedicated column. But for some observations, the number of rooms is in the description text but not in the column "number of rooms". So I have missing data. I could try to fill the missing data with by applying regex in the description text, but the number of possibilities seems to big. Is there a machine learning technique in NLP that allows me to do that, since it most of the observations the data is present in both column, so is "naturally labelled"? If there is, what is the name of these techniques? I would like to search about it but I don't know the proper keywords to google. submitted by /u/Kebet-Mendez [link] [comments]  ( 43 min )
    [D] Inner workings of the chatgpt memory
    All the examples from langchain and on huggingface create memory by pasting the entire history in every prompt. This seems to violate the max input prompt length pretty quickly. And it’s expensive. Does chatgpt use something revolutionary? It forgets everything when you create a new session so it ‘feels’ it’s using the convo as memory as well. But then the question; how do they get past prompt limits? Chunking doesn’t help as it still doesn’t get context in that case between prompts. Maybe they ask the same question with different chunks many times and then ask for a final result? Apologies if this was answered somewhere, I cannot find it at all and all examples use the same kind of history memory. submitted by /u/terserterseness [link] [comments]  ( 45 min )
    [D] very short video generation
    Hi, my indie game devs asked me if I could build a model that generates cool movements for their characters. 1) I wanted to start by generating characters and scene. Should I go for stable diffusion or for a GAN ? I do need a prompt 2) do you know any model that can generate short video clips (2 seconds) and that could potentially generate character movement ? Thank you so much ! submitted by /u/Frizzoux [link] [comments]  ( 42 min )
    [D] ML Researchers/Engineers in Industry: Why don't companies use open source models more often?
    In my experience at big tech, I've never seen any company use open-source ML models in production. Why is this the case? Curious, because there seems to be some insanely cool research going on these days. On the other hand, if you have seen this used, what kind of repos have you guys seen? submitted by /u/tennismlandguitar [link] [comments]  ( 46 min )
    [Discussion] Storing hundreds of ML models - what do you use?
    I am currently using the Google Cloud Model Registry and I want to learn what you use for archiving your machine learning models. What are the other options for developers who have to store hundreds of models? https://cloud.google.com/blog/products/ai-machine-learning/vertex-ai-model-registry submitted by /u/May-is-spring [link] [comments]  ( 42 min )
    [D] Best LLM for Question/Answering with personality?
    Hi, I am looking at fine tuning an open source LLM to answer questions as a specific character from chat on discord. I'm trying to decide which one to test between KoboldAI, GPTJ, Neo, Flan-T5, etc. Has anyone tries these LLMs and knows which would be best for this use case? The use case is to have a character that answers questions from discord chat in the form a specific character, with a personality and can be mean for example, very similar to https://beta.character.ai/ Does anyone want to guess what they use based on their experience or the closest LLM to replicate it? if this helps sometimes the model will go off on tangential stories which makes me think maybe it was trained on a novel or story dataset originally. submitted by /u/TernaryJimbo [link] [comments]  ( 42 min )
  • Open

    YouTube Video scripted entirely by ChatGPT???
    submitted by /u/McFIyyy [link] [comments]  ( 40 min )
    Inventory Management?
    I read a study that showed AI is being deployed into inventory management at a whopping 44% of total usage, though I can't quantify this, I am interested to know how to combine AI with my ERP, static data, a database etc. Does anyone here have experience with AI inventory management for MoQ, forecasting etc that they recommend? I want to do some homework submitted by /u/smudgepost [link] [comments]  ( 40 min )
    WhatsApp ChatGPT
    Hello i wanna ask you guys about this AI whatapp tool https://chattycat.ju.mp/ is it safe i saw it in a tweet and i was wondering if its the same as chat gpt and is it safe to give your name number and email " original tweet " https://twitter.com/TansuYegen/status/1616138232894492672 ​ WhatsApp gpt submitted by /u/Jnxe [link] [comments]  ( 40 min )
    Two guys in London working in AI looking for volunteers to join our team in educating the public on AI
    We’re 2 Brits who work in AI. We believe AI is likely to have a huge and mostly positive impact on society but that not many people realise this or understand how it will impact everyday life. There is a lack of places online right now clearly explaining the changes AI will bring, i.e., how will AI change the experience of shopping in stores in the next 10 years or how will AI change video games in the next 10 years. We are somewhat well positioned to collate the current views on likely future changes across most areas and are in the process of starting a website and perhaps video channel which will cover how AI is likely to impact people over the next 10 years in different areas of life (movies, sports, bars, banking, schools, hospitals etc). We are looking for people to help us research, write and make videos on this cause – which we think is important to help ensure that voters don’t misunderstand AI. Alex – researches, writes, and records the audio Seb - does the video and audio editing We thought we’d put the word out and ask if anyone else would like to volunteer to help create content too. No special skills needed. Getting involved would be as easy as PMing me, hearing about how we’ve done things so far and then saying what you might be interested in helping with. Maybe thinking about ideas for topics or getting involved in research and/or article writing. We are UTC-0 but open to all. submitted by /u/TheOptimisticRogue [link] [comments]  ( 41 min )
    The Next Step (I believe)
    We have already seen the capabilities of AI sourcing from the Internet to create, learn, and exceed our expectations. I want to be quick with my description so I don't loose anybody. The next step I believe will be AI catered to us individually. What I mean is an app that only cares for one person, that being you. It will comb over (with permission hopefully) everything you ever put online. Also track writting, Grammer, interests (internet of things information), about you. Then if you allow it, can go over your medical or financial history, even psychology. It will feel like a chatbot but be a middleman towards seeking healthcare, mental care, and may other things including niché interests and possible career routes. I don't propose this to invite any fear or anxiety of a sci-fi narrative, but to objectively observe the trends with AI and my belief where it is going. submitted by /u/DropDeddBlue [link] [comments]  ( 41 min )
    2 months to make ai video last summer: “The technology was evolving so fast that I worried my video would feel outdated by the time it would be ready.”
    submitted by /u/defensiveFruit [link] [comments]  ( 40 min )
    Announcing a major update to Perplexity Ask: the world’s first conversational search engine! Now, you can read answers with up-to-date sources and ask follow-up questions to dig deeper. In other words, you can chat with your search engine!
    submitted by /u/rafs2006 [link] [comments]  ( 40 min )
    is the data we have today enough to create AGI?
    let's say hypothetically we were only able to work with digital data we have collected up until today to try and create AGI, would it be possible? submitted by /u/Science_is_Greatness [link] [comments]  ( 42 min )
    Labeling tools are great, but what about quality checks?
    Modern datasets contain hundreds of thousands to millions of labels that must be kept accurate. In practice, some errors in the dataset average out and can be ignored – systematic biases transfer to the model. After quick initial wins in areas where abundant data is readily available, deep learning needs to become more data efficient to help solve difficult business problems. MLfix is a new open-source tool that combines novel unsupervised machine-learning pipelines with a new user interface concept that, together, help annotators and machine-learning engineers identify and filter out label errors. https://www.collabora.com/news-and-blog/blog/2023/01/17/labeling-tools-are-great-but-what-about-quality-checks/ submitted by /u/mfilion [link] [comments]  ( 40 min )
    Google Research And DeepMind Create AI Medical Chatbot That Can Generate Safe And Helpful Answers!
    submitted by /u/liquidocelotYT [link] [comments]  ( 40 min )
    Software to read "Once upon a time in Hollywood" book in character voices from film.
    I want an audiobook version of the book "Once upon a time in Hollywood" by Tarantino with the actors voices from the movie, using the voices from the movie as sample. How do I do that? submitted by /u/Art3mis_eros [link] [comments]  ( 40 min )
    Professor Initiates Integration of ChatGPT in Classroom
    submitted by /u/lambolifeofficial [link] [comments]  ( 40 min )
    Neural Network 'Hallucinating' While Training On Dog Images
    submitted by /u/TheRPGGamerMan [link] [comments]  ( 41 min )
    Wi-Fi Could Help Identify When You’re Struggling to Breathe
    submitted by /u/goronmask [link] [comments]  ( 40 min )
    Summarizing Text using In-database NLP through the Integration of Hugging Face with MindsDB
    submitted by /u/Klutzy_Accountant113 [link] [comments]  ( 40 min )
    Farmers Spend $5Billion a Year on Antibiotics For Their Animals using a Blanket Approach. Medicate all to Prevent Infection. AI Models are Being Used to Identify & Medicate Only Animals That are Actually Sick
    submitted by /u/HODLTID [link] [comments]  ( 40 min )
    I got frustrated with the time and effort required to code and maintain custom web scrapers, so I built an LLM-powered tool that can comprehend any website structure and extract the desired data in the preferred format.
    submitted by /u/madredditscientist [link] [comments]  ( 41 min )
    Generative AI Technology Is Discovering Completely New Drugs
    submitted by /u/bukowski3000 [link] [comments]  ( 40 min )
    Join us this Friday 6 pm EST for a fascinating discussion about the societal impact of large language models (LLMs) like ChatGPT
    submitted by /u/OnlyProggingForFun [link] [comments]  ( 40 min )
    Exclusive: The $2 Per Hour Workers Who Made ChatGPT Safer
    submitted by /u/ohmsalad [link] [comments]  ( 40 min )
    Hey guys! I made some of the cartoon characters to look like villains. Can you guess which cartoon they are from?
    submitted by /u/_aimnftri [link] [comments]  ( 40 min )
    Made with AI
    submitted by /u/NorthTs [link] [comments]  ( 40 min )
  • Open

    Probability problem with Pratt prime proofs
    In the process of creating a Pratt certificate to prove that a number n is prime, you have to find a number a that seems kinda arbitrary. As we discussed here, a number n is prime if there exists a number a such that an-1 = 1 mod n and a(n-1)/p ≠ 1 mod n […] Probability problem with Pratt prime proofs first appeared on John D. Cook.  ( 5 min )
    Factoring b^n + 1
    The previous post illustrated a technique for finding factors of number of the form bn – 1. This post will look at an analogous, though slightly less general, technique for numbers of the form bn + 1. There is a theorem that says that if m divides n then bm + 1 divides bn + […] Factoring b^n + 1 first appeared on John D. Cook.  ( 5 min )
    Factoring b^n – 1
    Suppose you want to factor a number of the form bn – 1. There is a theorem that says that if m divides n then bm – 1 divides bn – 1. Let’s use this theorem to try to factor J = 22023 – 1, a 609-digit number. Factoring such a large number would be more difficult if it didn’t have […] Factoring b^n – 1 first appeared on John D. Cook.  ( 6 min )
  • Open

    AI’s Leg Up: Startup Accelerates Robotics Simulation for $8 Trillion Food Market
    Robots are finally getting a grip. Developers have been striving to close the gap on robotic gripping for the past several years, pursuing applications for multibillion-dollar industries. Securely gripping and transferring fast-moving items on conveyor belts holds vast promise for businesses. Soft Robotics, a Bedford, Mass., startup, is harnessing NVIDIA Isaac Sim to help close Read article >  ( 6 min )
    The Ultimate Upgrade: GeForce RTX 4080 SuperPOD Rollout Begins Today
    The Ultimate upgrade begins today: GeForce NOW RTX 4080 SuperPODs are now rolling out, bringing a new level of high-performance gaming to the cloud. Ultimate members will start to see RTX 4080 performance in their region soon, and experience titles like  Warhammer 40,000: Darktide, Cyberpunk 2077, The Witcher 3: Wild Hunt and more at ultimate Read article >  ( 5 min )
  • Open

    Human-Timescale Adaptation in an Open-Ended Task Space - (AdA) - DeepMind 2023 - Can adapt to open-ended novel embodied 3D problems as quickly as humans!
    Paper: https://arxiv.org/abs/2301.07608 Youtube: https://www.youtube.com/watch?v=U93bUQ1roiw Please watch the Video the explanations are better than me giving you 3-5 Pictures! Abstract: Foundation models have shown impressive adaptation and scalability in supervised and self-supervised learning problems, but so far these successes have not fully translated to reinforcement learning (RL). In this work, we demonstrate that training an RL agent at scale leads to a general in-context learning algorithm that can adapt to open-ended novel embodied 3D problems as quickly as humans. In a vast space of held-out environment dynamics, our adaptive agent (AdA) displays on-the-fly hypothesis-driven exploration, efficient exploitation of acquired knowledge, and can successfully be prompted with first-person demonstrations. Adaptation emerges from three ingredients: (1) meta-reinforcement learning across a vast, smooth and diverse task distribution, (2) a policy parameterised as a large-scale attention-based memory architecture, and (3) an effective automated curriculum that prioritises tasks at the frontier of an agent's capabilities. We demonstrate characteristic scaling laws with respect to network size, memory length, and richness of the training task distribution. We believe our results lay the foundation for increasingly general and adaptive RL agents that perform well across ever-larger open-ended domains. https://preview.redd.it/rcta0vvt80da1.jpg?width=1424&format=pjpg&auto=webp&s=da7cc5745a21969b1687b7cf2a8c590dcac72ae0 submitted by /u/Singularian2501 [link] [comments]  ( 41 min )
    On the legal status of downloading and using ATARI 2600 ROMs
    Hi there, I have searched online quite broadly, and could not find an answer anywhere on the following questions: Is it legal to download ATARI ROMs, e.g., the ones in https://github.com/Farama-Foundation/AutoROM? Is it legal to use those ROMs? (Imagine I acquired them in a different way than downloading, like physical shipping on USB drive) If both are illegal, what's the legal situation around using them for reinforcement learning research? What I know from reading: Downloading ROMs is a gray area. It is supposedly illegal, but if the copyright holder knows about a use of their ROMs and doesn't do anything, there exist and "implied license" allowing their use. Source: here. Providing a means to download these ROMs is not illegal, it is allowed. Source: here. ​ Does anyone have more info or legal experience with this? ​ Thanks submitted by /u/Conscious_Heron_9133 [link] [comments]  ( 42 min )
    trying to make a single actuated link stay upright with dqn. The maximum score you can get from the game is 20,000, and the highest score I am getting is 200.
    submitted by /u/blackgentrifier [link] [comments]  ( 40 min )
  • Open

    Equivariant Networks for Crystal Structures. (arXiv:2211.15420v2 [cond-mat.mtrl-sci] UPDATED)
    Supervised learning with deep models has tremendous potential for applications in materials science. Recently, graph neural networks have been used in this context, drawing direct inspiration from models for molecules. However, materials are typically much more structured than molecules, which is a feature that these models do not leverage. In this work, we introduce a class of models that are equivariant with respect to crystalline symmetry groups. We do this by defining a generalization of the message passing operations that can be used with more general permutation groups, or that can alternatively be seen as defining an expressive convolution operation on the crystal graph. Empirically, these models achieve competitive results with state-of-the-art on property prediction tasks.
    Multi-Task Imitation Learning for Linear Dynamical Systems. (arXiv:2212.00186v2 [cs.LG] UPDATED)
    We study representation learning for efficient imitation learning over linear systems. In particular, we consider a setting where learning is split into two phases: (a) a pre-training step where a shared $k$-dimensional representation is learned from $H$ source policies, and (b) a target policy fine-tuning step where the learned representation is used to parameterize the policy class. We find that the imitation gap over trajectories generated by the learned target policy is bounded by $\tilde{O}\left( \frac{k n_x}{HN_{\mathrm{shared}}} + \frac{k n_u}{N_{\mathrm{target}}}\right)$, where $n_x > k$ is the state dimension, $n_u$ is the input dimension, $N_{\mathrm{shared}}$ denotes the total amount of data collected for each policy during representation learning, and $N_{\mathrm{target}}$ is the amount of target task data. This result formalizes the intuition that aggregating data across related tasks to learn a representation can significantly improve the sample efficiency of learning a target task. The trends suggested by this bound are corroborated in simulation.
    FedALA: Adaptive Local Aggregation for Personalized Federated Learning. (arXiv:2212.01197v2 [cs.LG] UPDATED)
    A key challenge in federated learning (FL) is the statistical heterogeneity that impairs the generalization of the global model on each client. To address this, we propose a method Federated learning with Adaptive Local Aggregation (FedALA) by capturing the desired information in the global model for client models in personalized FL. The key component of FedALA is an Adaptive Local Aggregation (ALA) module, which can adaptively aggregate the downloaded global model and local model towards the local objective on each client to initialize the local model before training in each iteration. To evaluate the effectiveness of FedALA, we conduct extensive experiments with five benchmark datasets in computer vision and natural language processing domains. FedALA outperforms eleven state-of-the-art baselines by up to 3.27% in test accuracy. Furthermore, we also apply ALA module to other federated learning methods and achieve up to 24.19% improvement in test accuracy.
    FeSAC: Federated Learning-Based Soft Actor-Critic Traffic Offloading in Space-Air-Ground Integrated Network. (arXiv:2212.02075v2 [cs.NI] UPDATED)
    With the increase of intelligent devices leading to increasing demand for traffic, traffic offloading has become a challenging problem. The space-air-ground integrated network (SAGIN) is a superior network architecture to solve this problem. The existing research on SAGIN traffic offloading only considers the single-layer satellite network in the space network. To further expand the resource pool of traffic offloading in SAGIN, we extend the single-layer satellite network into a double-layer satellite network composed of low-orbit satellites (LEO) and high-orbit satellites (GEO). And re-model a four-layer SAGIN architecture consisting of the ground network, the air network, LEO and GEO. Furthermore, we propose a novel Federated Soft Actor-Critic (FeSAC) traffic offloading method with positive environmental exploration to accommodate this dynamic and complex four-layer SAGIN architecture. The FeSAC method uses federated learning to train traffic offloading nodes and then aggregate the training results to obtain the best traffic offloading strategy. The simulation results show that under the four-layer SAGIN, our proposed method can better adapt to the network environment changes by nodes mobility and is better than the existing traffic offloading methods in throughput, packet loss, and transmission delay.
    Beyond ADMM: A Unified Client-variance-reduced Adaptive Federated Learning Framework. (arXiv:2212.01519v2 [cs.LG] UPDATED)
    As a novel distributed learning paradigm, federated learning (FL) faces serious challenges in dealing with massive clients with heterogeneous data distribution and computation and communication resources. Various client-variance-reduction schemes and client sampling strategies have been respectively introduced to improve the robustness of FL. Among others, primal-dual algorithms such as the alternating direction of method multipliers (ADMM) have been found being resilient to data distribution and outperform most of the primal-only FL algorithms. However, the reason behind remains a mystery still. In this paper, we firstly reveal the fact that the federated ADMM is essentially a client-variance-reduced algorithm. While this explains the inherent robustness of federated ADMM, the vanilla version of it lacks the ability to be adaptive to the degree of client heterogeneity. Besides, the global model at the server under client sampling is biased which slows down the practical convergence. To go beyond ADMM, we propose a novel primal-dual FL algorithm, termed FedVRA, that allows one to adaptively control the variance-reduction level and biasness of the global model. In addition, FedVRA unifies several representative FL algorithms in the sense that they are either special instances of FedVRA or are close to it. Extensions of FedVRA to semi/un-supervised learning are also presented. Experiments based on (semi-)supervised image classification tasks demonstrate superiority of FedVRA over the existing schemes in learning scenarios with massive heterogeneous clients and client sampling.
    Geometry-Complete Perceptron Networks for 3D Molecular Graphs. (arXiv:2211.02504v2 [cs.LG] UPDATED)
    The field of geometric deep learning has had a profound impact on the development of innovative and powerful graph neural network architectures. Disciplines such as computer vision and computational biology have benefited significantly from such methodological advances, which has led to breakthroughs in scientific domains such as protein structure prediction and design. In this work, we introduce GCPNet, a new geometry-complete, SE(3)-equivariant graph neural network designed for 3D molecular graph representation learning. We demonstrate the state-of-the-art utility and expressiveness of our method on six independent datasets designed for three distinct geometric tasks: protein-ligand binding affinity prediction, protein structure ranking, and Newtonian many-body systems modeling. Our results suggest that GCPNet is a powerful, general method for capturing complex geometric and physical interactions within 3D molecular graphs for downstream prediction tasks. The source code, data, and instructions to train new models or reproduce our results are freely available at https://github.com/BioinfoMachineLearning/GCPNet.
    Unbalanced Optimal Transport, from Theory to Numerics. (arXiv:2211.08775v2 [stat.ML] UPDATED)
    Optimal Transport (OT) has recently emerged as a central tool in data sciences to compare in a geometrically faithful way point clouds and more generally probability distributions. The wide adoption of OT into existing data analysis and machine learning pipelines is however plagued by several shortcomings. This includes its lack of robustness to outliers, its high computational costs, the need for a large number of samples in high dimension and the difficulty to handle data in distinct spaces. In this review, we detail several recently proposed approaches to mitigate these issues. We insist in particular on unbalanced OT, which compares arbitrary positive measures, not restricted to probability distributions (i.e. their total mass can vary). This generalization of OT makes it robust to outliers and missing data. The second workhorse of modern computational OT is entropic regularization, which leads to scalable algorithms while lowering the sample complexity in high dimension. The last point presented in this review is the Gromov-Wasserstein (GW) distance, which extends OT to cope with distributions belonging to different metric spaces. The main motivation for this review is to explain how unbalanced OT, entropic regularization and GW can work hand-in-hand to turn OT into efficient geometric loss functions for data sciences.
    Algorithmic progress in computer vision. (arXiv:2212.05153v3 [cs.CV] UPDATED)
    We investigate algorithmic progress in image classification on ImageNet, perhaps the most well-known test bed for computer vision. We estimate a model, informed by work on neural scaling laws, and infer a decomposition of progress into the scaling of compute, data, and algorithms. Using Shapley values to attribute performance improvements, we find that algorithmic improvements have been roughly as important as the scaling of compute for progress computer vision. Our estimates indicate that algorithmic innovations mostly take the form of compute-augmenting algorithmic advances (which enable researchers to get better performance from less compute), not data-augmenting algorithmic advances. We find that compute-augmenting algorithmic advances are made at a pace more than twice as fast as the rate usually associated with Moore's law. In particular, we estimate that compute-augmenting innovations halve compute requirements every nine months (95\% confidence interval: 4 to 25 months).
    Accelerated Riemannian Optimization: Handling Constraints with a Prox to Bound Geometric Penalties. (arXiv:2211.14645v2 [math.OC] UPDATED)
    We propose a globally-accelerated, first-order method for the optimization of smooth and (strongly or not) geodesically-convex functions in a wide class of Hadamard manifolds. We achieve the same convergence rates as Nesterov's accelerated gradient descent, up to a multiplicative geometric penalty and log factors. Crucially, we can enforce our method to stay within a compact set we define. Prior fully accelerated works \emph{resort to assuming} that the iterates of their algorithms stay in some pre-specified compact set, except for two previous methods of limited applicability. For our manifolds, this solves the open question in [KY22] about obtaining global general acceleration without iterates assumptively staying in the feasible set. In our solution, we design an accelerated Riemannian inexact proximal point algorithm, which is a result that was unknown even with exact access to the proximal operator, and is of independent interest. For smooth functions, we show we can implement the prox step inexactly with first-order methods in Riemannian balls of certain diameter that is enough for global accelerated optimization.
    The Benefits of Model-Based Generalization in Reinforcement Learning. (arXiv:2211.02222v2 [cs.LG] UPDATED)
    Model-Based Reinforcement Learning (RL) is widely believed to have the potential to improve sample efficiency by allowing an agent to synthesize large amounts of imagined experience. Experience Replay (ER) can be considered a simple kind of model, which has proved extremely effective at improving the stability and efficiency of deep RL. In principle, a learned parametric model could improve on ER by generalizing from real experience to augment the dataset with additional plausible experience. However, owing to the many design choices involved in empirically successful algorithms, it can be very hard to establish where the benefits are actually coming from. Here, we provide theoretical and empirical insight into when, and how, we can expect data generated by a learned model to be useful. First, we provide a general theorem motivating how learning a model as an intermediate step can narrow down the set of possible value functions more than learning a value function directly from data using the Bellman equation. Second, we provide an illustrative example showing empirically how a similar effect occurs in a more concrete setting with neural network function approximation. Finally, we provide extensive experiments showing the benefit of model-based learning for online RL in environments with combinatorial complexity, but factored structure that allows a learned model to generalize. In these experiments, we take care to control for other factors in order to isolate, insofar as possible, the benefit of using experience generated by a learned model relative to ER alone.
    Geometric Knowledge Distillation: Topology Compression for Graph Neural Networks. (arXiv:2210.13014v2 [cs.LG] UPDATED)
    We study a new paradigm of knowledge transfer that aims at encoding graph topological information into graph neural networks (GNNs) by distilling knowledge from a teacher GNN model trained on a complete graph to a student GNN model operating on a smaller or sparser graph. To this end, we revisit the connection between thermodynamics and the behavior of GNN, based on which we propose Neural Heat Kernel (NHK) to encapsulate the geometric property of the underlying manifold concerning the architecture of GNNs. A fundamental and principled solution is derived by aligning NHKs on teacher and student models, dubbed as Geometric Knowledge Distillation. We develop non- and parametric instantiations and demonstrate their efficacy in various experimental settings for knowledge distillation regarding different types of privileged topological information and teacher-student schemes.
    SimVP: Towards Simple yet Powerful Spatiotemporal Predictive Learning. (arXiv:2211.12509v2 [cs.LG] UPDATED)
    Recent years have witnessed remarkable advances in spatiotemporal predictive learning, incorporating auxiliary inputs, elaborate neural architectures, and sophisticated training strategies. Although impressive, the system complexity of mainstream methods is increasing as well, which may hinder the convenient applications. This paper proposes SimVP, a simple spatiotemporal predictive baseline model that is completely built upon convolutional networks without recurrent architectures and trained by common mean squared error loss in an end-to-end fashion. Without introducing any extra tricks and strategies, SimVP can achieve superior performance on various benchmark datasets. To further improve the performance, we derive variants with the gated spatiotemporal attention translator from SimVP that can achieve better performance. We demonstrate that SimVP has strong generalization and extensibility on real-world datasets through extensive experiments. The significant reduction in training cost makes it easier to scale to complex scenarios. We believe SimVP can serve as a solid baseline to benefit the spatiotemporal predictive learning community.
    torchode: A Parallel ODE Solver for PyTorch. (arXiv:2210.12375v2 [cs.LG] UPDATED)
    We introduce an ODE solver for the PyTorch ecosystem that can solve multiple ODEs in parallel independently from each other while achieving significant performance gains. Our implementation tracks each ODE's progress separately and is carefully optimized for GPUs and compatibility with PyTorch's JIT compiler. Its design lets researchers easily augment any aspect of the solver and collect and analyze internal solver statistics. In our experiments, our implementation is up to 4.3 times faster per step than other ODE solvers and it is robust against within-batch interactions that lead other solvers to take up to 4 times as many steps. Code available at https://github.com/martenlienen/torchode
    Black-box Coreset Variational Inference. (arXiv:2211.02377v2 [stat.ML] UPDATED)
    Recent advances in coreset methods have shown that a selection of representative datapoints can replace massive volumes of data for Bayesian inference, preserving the relevant statistical information and significantly accelerating subsequent downstream tasks. Existing variational coreset constructions rely on either selecting subsets of the observed datapoints, or jointly performing approximate inference and optimizing pseudodata in the observed space akin to inducing points methods in Gaussian Processes. So far, both approaches are limited by complexities in evaluating their objectives for general purpose models, and require generating samples from a typically intractable posterior over the coreset throughout inference and testing. In this work, we present a black-box variational inference framework for coresets that overcomes these constraints and enables principled application of variational coresets to intractable models, such as Bayesian neural networks. We apply our techniques to supervised learning problems, and compare them with existing approaches in the literature for data summarization and inference.
    Synthetic Dataset Generation for Privacy-Preserving Machine Learning. (arXiv:2210.03205v4 [cs.CR] UPDATED)
    Machine Learning (ML) has achieved enormous success in solving a variety of problems in computer vision, speech recognition, object detection, to name a few. The principal reason for this success is the availability of huge datasets for training deep neural networks (DNNs). However, datasets cannot be publicly released if they contain sensitive information such as medical records, and data privacy becomes a major concern. Encryption methods could be a possible solution, however their deployment on ML applications seriously impacts classification accuracy and results in substantial computational overhead. Alternatively, obfuscation techniques could be used, but maintaining a good trade-off between visual privacy and accuracy is challenging. In this paper, we propose a method to generate secure synthetic datasets from the original private datasets. Given a network with Batch Normalization (BN) layers pretrained on the original dataset, we first record the class-wise BN layer statistics. Next, we generate the synthetic dataset by optimizing random noise such that the synthetic data match the layer-wise statistical distribution of original images. We evaluate our method on image classification datasets (CIFAR10, ImageNet) and show that synthetic data can be used in place of the original CIFAR10/ImageNet data for training networks from scratch, producing comparable classification performance. Further, to analyze visual privacy provided by our method, we use Image Quality Metrics and show high degree of visual dissimilarity between the original and synthetic images. Moreover, we show that our proposed method preserves data-privacy under various privacy-leakage attacks including Gradient Matching Attack, Model Memorization Attack, and GAN-based Attack.
    Wild-Time: A Benchmark of in-the-Wild Distribution Shift over Time. (arXiv:2211.14238v2 [cs.LG] UPDATED)
    Distribution shift occurs when the test distribution differs from the training distribution, and it can considerably degrade performance of machine learning models deployed in the real world. Temporal shifts -- distribution shifts arising from the passage of time -- often occur gradually and have the additional structure of timestamp metadata. By leveraging timestamp metadata, models can potentially learn from trends in past distribution shifts and extrapolate into the future. While recent works have studied distribution shifts, temporal shifts remain underexplored. To address this gap, we curate Wild-Time, a benchmark of 5 datasets that reflect temporal distribution shifts arising in a variety of real-world applications, including patient prognosis and news classification. On these datasets, we systematically benchmark 13 prior approaches, including methods in domain generalization, continual learning, self-supervised learning, and ensemble learning. We use two evaluation strategies: evaluation with a fixed time split (Eval-Fix) and evaluation with a data stream (Eval-Stream). Eval-Fix, our primary evaluation strategy, aims to provide a simple evaluation protocol, while Eval-Stream is more realistic for certain real-world applications. Under both evaluation strategies, we observe an average performance drop of 20% from in-distribution to out-of-distribution data. Existing methods are unable to close this gap. Code is available at https://wild-time.github.io/.
    Local Bayesian optimization via maximizing probability of descent. (arXiv:2210.11662v2 [cs.LG] UPDATED)
    Local optimization presents a promising approach to expensive, high-dimensional black-box optimization by sidestepping the need to globally explore the search space. For objective functions whose gradient cannot be evaluated directly, Bayesian optimization offers one solution -- we construct a probabilistic model of the objective, design a policy to learn about the gradient at the current location, and use the resulting information to navigate the objective landscape. Previous work has realized this scheme by minimizing the variance in the estimate of the gradient, then moving in the direction of the expected gradient. In this paper, we re-examine and refine this approach. We demonstrate that, surprisingly, the expected value of the gradient is not always the direction maximizing the probability of descent, and in fact, these directions may be nearly orthogonal. This observation then inspires an elegant optimization scheme seeking to maximize the probability of descent while moving in the direction of most-probable descent. Experiments on both synthetic and real-world objectives show that our method outperforms previous realizations of this optimization scheme and is competitive against other, significantly more complicated baselines.
    Betting the system: Using lineups to predict football scores. (arXiv:2210.06327v3 [cs.LG] UPDATED)
    This paper aims to reduce randomness in football by analysing the role of lineups in final scores using machine learning prediction models we have developed. Football clubs invest millions of dollars on lineups and knowing how individual statistics translate to better outcomes can optimise investments. Moreover, sports betting is growing exponentially and being able to predict the future is profitable and desirable. We use machine learning models and historical player data from English Premier League (2020-2022) to predict scores and to understand how individual performance can improve the outcome of a match. We compared different prediction techniques to maximise the possibility of finding useful models. We created heuristic and machine learning models predicting football scores to compare different techniques. We used different sets of features and shown goalkeepers stats are more important than attackers stats to predict goals scored. We applied a broad evaluation process to assess the efficacy of the models in real world applications. We managed to predict correctly all relegated teams after forecast 100 consecutive matches. We show that Support Vector Regression outperformed other techniques predicting final scores and that lineups do not improve predictions. Finally, our model was profitable (42% return) when emulating a betting system using real world odds data.
    Towards Out-of-Distribution Sequential Event Prediction: A Causal Treatment. (arXiv:2210.13005v2 [cs.LG] UPDATED)
    The goal of sequential event prediction is to estimate the next event based on a sequence of historical events, with applications to sequential recommendation, user behavior analysis and clinical treatment. In practice, the next-event prediction models are trained with sequential data collected at one time and need to generalize to newly arrived sequences in remote future, which requires models to handle temporal distribution shift from training to testing. In this paper, we first take a data-generating perspective to reveal a negative result that existing approaches with maximum likelihood estimation would fail for distribution shift due to the latent context confounder, i.e., the common cause for the historical events and the next event. Then we devise a new learning objective based on backdoor adjustment and further harness variational inference to make it tractable for sequence learning problems. On top of that, we propose a framework with hierarchical branching structures for learning context-specific representations. Comprehensive experiments on diverse tasks (e.g., sequential recommendation) demonstrate the effectiveness, applicability and scalability of our method with various off-the-shelf models as backbones.
    An Exponentially Converging Particle Method for the Mixed Nash Equilibrium of Continuous Games. (arXiv:2211.01280v2 [math.OC] UPDATED)
    We consider the problem of computing mixed Nash equilibria of two-player zero-sum games with continuous sets of pure strategies and with first-order access to the payoff function. This problem arises for example in game-theory-inspired machine learning applications, such as distributionally-robust learning. In those applications, the strategy sets are high-dimensional and thus methods based on discretisation cannot tractably return high-accuracy solutions. In this paper, we introduce and analyze a particle-based method that enjoys guaranteed local convergence for this problem. This method consists in parametrizing the mixed strategies as atomic measures and applying proximal point updates to both the atoms' weights and positions. It can be interpreted as a time-implicit discretization of the "interacting" Wasserstein-Fisher-Rao gradient flow. We prove that, under non-degeneracy assumptions, this method converges at an exponential rate to the exact mixed Nash equilibrium from any initialization satisfying a natural notion of closeness to optimality. We illustrate our results with numerical experiments and discuss applications to max-margin and distributionally-robust classification using two-layer neural networks, where our method has a natural interpretation as a simultaneous training of the network's weights and of the adversarial distribution.
    Keypoint-GraspNet: Keypoint-based 6-DoF Grasp Generation from the Monocular RGB-D input. (arXiv:2209.08752v2 [cs.RO] UPDATED)
    Great success has been achieved in the 6-DoF grasp learning from the point cloud input, yet the computational cost due to the point set orderlessness remains a concern. Alternatively, we explore the grasp generation from the RGB-D input in this paper. The proposed solution, Keypoint-GraspNet, detects the projection of the gripper keypoints in the image space and then recover the SE(3) poses with a PnP algorithm. A synthetic dataset based on the primitive shape and the grasp family is constructed to examine our idea. Metric-based evaluation reveals that our method outperforms the baselines in terms of the grasp proposal accuracy, diversity, and the time cost. Finally, robot experiments show high success rate, demonstrating the potential of the idea in the real-world applications.
    Deep Counterfactual Estimation with Categorical Background Variables. (arXiv:2210.05811v4 [cs.LG] UPDATED)
    Referred to as the third rung of the causal inference ladder, counterfactual queries typically ask the "What if ?" question retrospectively. The standard approach to estimate counterfactuals resides in using a structural equation model that accurately reflects the underlying data generating process. However, such models are seldom available in practice and one usually wishes to infer them from observational data alone. Unfortunately, the correct structural equation model is in general not identifiable from the observed factual distribution. Nevertheless, in this work, we show that under the assumption that the main latent contributors to the treatment responses are categorical, the counterfactuals can be still reliably predicted. Building upon this assumption, we introduce CounterFactual Query Prediction (CFQP), a novel method to infer counterfactuals from continuous observations when the background variables are categorical. We show that our method significantly outperforms previously available deep-learning-based counterfactual methods, both theoretically and empirically on time series and image data. Our code is available at https://github.com/edebrouwer/cfqp.
    Automatic Generation of Product Concepts from Positive Examples, with an Application to Music Streaming. (arXiv:2210.01515v3 [cs.LG] UPDATED)
    Internet based businesses and products (e.g. e-commerce, music streaming) are becoming more and more sophisticated every day with a lot of focus on improving customer satisfaction. A core way they achieve this is by providing customers with an easy access to their products by structuring them in catalogues using navigation bars and providing recommendations. We refer to these catalogues as product concepts, e.g. product categories on e-commerce websites, public playlists on music streaming platforms. These product concepts typically contain products that are linked with each other through some common features (e.g. a playlist of songs by the same artist). How they are defined in the backend of the system can be different for different products. In this work, we represent product concepts using database queries and tackle two learning problems. First, given sets of products that all belong to the same unknown product concept, we learn a database query that is a representation of this product concept. Second, we learn product concepts and their corresponding queries when the given sets of products are associated with multiple product concepts. To achieve these goals, we propose two approaches that combine the concepts of PU learning with Decision Trees and Clustering. Our experiments demonstrate, via a simulated setup for a music streaming service, that our approach is effective in solving these problems.
    Improved Bounds on Neural Complexity for Representing Piecewise Linear Functions. (arXiv:2210.07236v3 [cs.LG] UPDATED)
    A deep neural network using rectified linear units represents a continuous piecewise linear (CPWL) function and vice versa. Recent results in the literature estimated that the number of neurons needed to exactly represent any CPWL function grows exponentially with the number of pieces or exponentially in terms of the factorial of the number of distinct linear components. Moreover, such growth is amplified linearly with the input dimension. These existing results seem to indicate that the cost of representing a CPWL function is expensive. In this paper, we propose much tighter bounds and establish a polynomial time algorithm to find a network satisfying these bounds for any given CPWL function. We prove that the number of hidden neurons required to exactly represent any CPWL function is at most a quadratic function of the number of pieces. In contrast to all previous results, this upper bound is invariant to the input dimension. Besides the number of pieces, we also study the number of distinct linear components in CPWL functions. When such a number is also given, we prove that the quadratic complexity turns into bilinear, which implies a lower neural complexity because the number of distinct linear components is always not greater than the minimum number of pieces in a CPWL function. When the number of pieces is unknown, we prove that, in terms of the number of distinct linear components, the neural complexities of any CPWL function are at most polynomial growth for low-dimensional inputs and factorial growth for the worst-case scenario, which are significantly better than existing results in the literature.
    Semi-Supervised Junction Tree Variational Autoencoder for Molecular Property Prediction. (arXiv:2208.05119v5 [cs.LG] UPDATED)
    Molecular Representation Learning is essential to solving many drug discovery and computational chemistry problems. It is a challenging problem due to the complex structure of molecules and the vast chemical space. Graph representations of molecules are more expressive than traditional representations, such as molecular fingerprints. Therefore, they can improve the performance of machine learning models. We propose SeMole, a method that augments the Junction Tree Variational Autoencoders, a state-of-the-art generative model for molecular graphs, with semi-supervised learning. SeMole aims to improve the accuracy of molecular property prediction when having limited labeled data by exploiting unlabeled data. We enforce that the model generates molecular graphs conditioned on target properties by incorporating the property into the latent representation. We propose an additional pre-training phase to improve the training process for our semi-supervised generative model. We perform an experimental evaluation on the ZINC dataset using three different molecular properties and demonstrate the benefits of semi-supervision.
    Neural Observer with Lyapunov Stability Guarantee for Uncertain Nonlinear Systems. (arXiv:2208.13006v2 [math.OC] UPDATED)
    In this paper, we propose a novel nonlinear observer based on neural networks, called neural observer, for observation tasks of linear time-invariant (LTI) systems and uncertain nonlinear systems. In particular, the neural observer designed for uncertain systems is inspired by the active disturbance rejection control, which can measure the uncertainty in real-time. The stability analysis (e.g., exponential convergence rate) of LTI and uncertain nonlinear systems (involving neural observers) are presented and guaranteed, where it is shown that the observation problems can be solved only using the linear matrix inequalities (LMIs). Also, it is revealed that the observability and controllability of the system matrices are required to demonstrate the existence of solutions of LMIs. Finally, the effectiveness of neural observers is verified on three simulation cases, including the X-29A aircraft model, the nonlinear pendulum, and the four-wheel steering vehicle.
    Brain Tumor Segmentation using Enhanced U-Net Model with Empirical Analysis. (arXiv:2210.13336v2 [eess.IV] UPDATED)
    Cancer of the brain is deadly and requires careful surgical segmentation. The brain tumors were segmented using U-Net using a Convolutional Neural Network (CNN). When looking for overlaps of necrotic, edematous, growing, and healthy tissue, it might be hard to get relevant information from the images. The 2D U-Net network was improved and trained with the BraTS datasets to find these four areas. U-Net can set up many encoder and decoder routes that can be used to get information from images that can be used in different ways. To reduce computational time, we use image segmentation to exclude insignificant background details. Experiments on the BraTS datasets show that our proposed model for segmenting brain tumors from MRI (MRI) works well. In this study, we demonstrate that the BraTS datasets for 2017, 2018, 2019, and 2020 do not significantly differ from the BraTS 2019 dataset's attained dice scores of 0.8717 (necrotic), 0.9506 (edema), and 0.9427 (enhancing).
    Learning Probabilistic Models from Generator Latent Spaces with Hat EBM. (arXiv:2210.16486v2 [cs.CV] UPDATED)
    This work proposes a method for using any generator network as the foundation of an Energy-Based Model (EBM). Our formulation posits that observed images are the sum of unobserved latent variables passed through the generator network and a residual random variable that spans the gap between the generator output and the image manifold. One can then define an EBM that includes the generator as part of its forward pass, which we call the Hat EBM. The model can be trained without inferring the latent variables of the observed data or calculating the generator Jacobian determinant. This enables explicit probabilistic modeling of the output distribution of any type of generator network. Experiments show strong performance of the proposed method on (1) unconditional ImageNet synthesis at 128x128 resolution, (2) refining the output of existing generators, and (3) learning EBMs that incorporate non-probabilistic generators. Code and pretrained models to reproduce our results are available at https://github.com/point0bar1/hat-ebm.
    Competition, Alignment, and Equilibria in Digital Marketplaces. (arXiv:2208.14423v2 [cs.GT] UPDATED)
    Competition between traditional platforms is known to improve user utility by aligning the platform's actions with user preferences. But to what extent is alignment exhibited in data-driven marketplaces? To study this question from a theoretical perspective, we introduce a duopoly market where platform actions are bandit algorithms and the two platforms compete for user participation. A salient feature of this market is that the quality of recommendations depends on both the bandit algorithm and the amount of data provided by interactions from users. This interdependency between the algorithm performance and the actions of users complicates the structure of market equilibria and their quality in terms of user utility. Our main finding is that competition in this market does not perfectly align market outcomes with user utility. Interestingly, market outcomes exhibit misalignment not only when the platforms have separate data repositories, but also when the platforms have a shared data repository. Nonetheless, the data sharing assumptions impact what mechanism drives misalignment and also affect the specific form of misalignment (e.g. the quality of the best-case and worst-case market outcomes). More broadly, our work illustrates that competition in digital marketplaces has subtle consequences for user utility that merit further investigation.
    Pattern Attention Transformer with Doughnut Kernel. (arXiv:2211.16961v2 [cs.CV] UPDATED)
    We present in this paper a new architecture, the Pattern Attention Transformer (PAT), that is composed of the new doughnut kernel. Compared with tokens in the NLP field, Transformer in computer vision has the problem of handling the high resolution of pixels in images. In ViT, an image is cut into square-shaped patches. As the follow-up of ViT, Swin Transformer proposes an additional step of shifting to decrease the existence of fixed boundaries, which also incurs 'two connected Swin Transformer blocks' as the minimum unit of the model. Inheriting the patch/window idea, our doughnut kernel enhances the design of patches further. It replaces the line-cut boundaries with two types of areas: sensor and updating, which is based on the comprehension of self-attention (named QKVA grid). The doughnut kernel also brings a new topic about the shape of kernels beyond square. To verify its performance on image classification, PAT is designed with Transformer blocks of regular octagon shape doughnut kernels. Its architecture is lighter: the minimum pattern attention layer is only one for each stage. Under similar complexity of computation, its performances on ImageNet 1K reach higher throughput (+10\%) and surpass Swin Transformer (+0.1 acc1).
    Breast Cancer Classification using Deep Learned Features Boosted with Handcrafted Features. (arXiv:2206.12815v2 [eess.IV] UPDATED)
    Breast cancer is one of the leading causes of death among women across the globe. It is difficult to treat if detected at advanced stages, however, early detection can significantly increase chances of survival and improves lives of millions of women. Given the widespread prevalence of breast cancer, it is of utmost importance for the research community to come up with the framework for early detection, classification and diagnosis. Artificial intelligence research community in coordination with medical practitioners are developing such frameworks to automate the task of detection. With the surge in research activities coupled with availability of large datasets and enhanced computational powers, it expected that AI framework results will help even more clinicians in making correct predictions. In this article, a novel framework for classification of breast cancer using mammograms is proposed. The proposed framework combines robust features extracted from novel Convolutional Neural Network (CNN) features with handcrafted features including HOG (Histogram of Oriented Gradients) and LBP (Local Binary Pattern). The obtained results on CBIS-DDSM dataset exceed state of the art.
    Quantized Training of Gradient Boosting Decision Trees. (arXiv:2207.09682v2 [cs.LG] UPDATED)
    Recent years have witnessed significant success in Gradient Boosting Decision Trees (GBDT) for a wide range of machine learning applications. Generally, a consensus about GBDT's training algorithms is gradients and statistics are computed based on high-precision floating points. In this paper, we investigate an essentially important question which has been largely ignored by the previous literature: how many bits are needed for representing gradients in training GBDT? To solve this mystery, we propose to quantize all the high-precision gradients in a very simple yet effective way in the GBDT's training algorithm. Surprisingly, both our theoretical analysis and empirical studies show that the necessary precisions of gradients without hurting any performance can be quite low, e.g., 2 or 3 bits. With low-precision gradients, most arithmetic operations in GBDT training can be replaced by integer operations of 8, 16, or 32 bits. Promisingly, these findings may pave the way for much more efficient training of GBDT from several aspects: (1) speeding up the computation of gradient statistics in histograms; (2) compressing the communication cost of high-precision statistical information during distributed training; (3) the inspiration of utilization and development of hardware architectures which well support low-precision computation for GBDT training. Benchmarked on CPUs, GPUs, and distributed clusters, we observe up to 2$\times$ speedup of our simple quantization strategy compared with SOTA GBDT systems on extensive datasets, demonstrating the effectiveness and potential of the low-precision training of GBDT. The code will be released to the official repository of LightGBM.
    Progressive Domain Adaptation with Contrastive Learning for Object Detection in the Satellite Imagery. (arXiv:2209.02564v2 [cs.CV] UPDATED)
    Images in aerial datasets are very large in resolution, and each frame contains many dense and small objects. State-of-the-art detection methods fail to capture small objects, local features, and region proposals for densely overlapped objects in aerial imagery due to the high variation of object sizes in satellite imagery with respect to the image size and high variation of content. Aerial imagery content varies greatly within the dataset due to the large change in lighting conditions, and the type of ground imagery captures from high altitudes. The variation is even higher between different datasets as object sizes, class distributions, image acquisition, and weather conditions can vary even more drastically. Thus, Domain Adaptation (DA) has been introduced as a band-aid to alleviate the degradation of object identification in previously unseen datasets. In this paper, we propose a small object detection pipeline that improves the feature extraction process by spatial pyramid pooling, cross-stage partial networks, heat-map-based region proposal network, and objects localization and identification through a novel image difficulty score that adapts the overall focal loss measure based on the image difficulty. Next, we propose novel contrastive learning with progressive domain adaptation to produce domain-invariant features across aerial datasets using local and global features. Effective analysis and illustration of different performance metrics and challenges show that our proposed method is comparable to the current State-of-Art models and creates a first-ever Domain Adaptation benchmark for the object detection task in highly imbalanced satellite datasets with large domain gaps and dominant small objects.
    Sequence Model Imitation Learning with Unobserved Contexts. (arXiv:2208.02225v3 [cs.LG] UPDATED)
    We consider imitation learning problems where the learner's ability to mimic the expert increases throughout the course of an episode as more information is revealed. One example of this is when the expert has access to privileged information: while the learner might not be able to accurately reproduce expert behavior early on in an episode, by considering the entire history of states and actions, they might be able to eventually identify the hidden context and act as the expert would. We prove that on-policy imitation learning algorithms (with or without access to a queryable expert) are better equipped to handle these sorts of asymptotically realizable problems than off-policy methods. This is because on-policy algorithms provably learn to recover from their initially suboptimal actions, while off-policy methods treat their suboptimal past actions as though they came from the expert. This often manifests as a latching behavior: a naive repetition of past actions. We conduct experiments in a toy bandit domain that show that there exist sharp phase transitions of whether off-policy approaches are able to match expert performance asymptotically, in contrast to the uniformly good performance of on-policy approaches. We demonstrate that on several continuous control tasks, on-policy approaches are able to use history to identify the context while off-policy approaches actually perform worse when given access to history.
    ViNL: Visual Navigation and Locomotion Over Obstacles. (arXiv:2210.14791v2 [cs.RO] UPDATED)
    We present Visual Navigation and Locomotion over obstacles (ViNL), which enables a quadrupedal robot to navigate unseen apartments while stepping over small obstacles that lie in its path (e.g., shoes, toys, cables), similar to how humans and pets lift their feet over objects as they walk. ViNL consists of: (1) a visual navigation policy that outputs linear and angular velocity commands that guides the robot to a goal coordinate in unfamiliar indoor environments; and (2) a visual locomotion policy that controls the robot's joints to avoid stepping on obstacles while following provided velocity commands. Both the policies are entirely "model-free", i.e. sensors-to-actions neural networks trained end-to-end. The two are trained independently in two entirely different simulators and then seamlessly co-deployed by feeding the velocity commands from the navigator to the locomotor, entirely "zero-shot" (without any co-training). While prior works have developed learning methods for visual navigation or visual locomotion, to the best of our knowledge, this is the first fully learned approach that leverages vision to accomplish both (1) intelligent navigation in new environments, and (2) intelligent visual locomotion that aims to traverse cluttered environments without disrupting obstacles. On the task of navigation to distant goals in unknown environments, ViNL using just egocentric vision significantly outperforms prior work on robust locomotion using privileged terrain maps (+32.8% success and -4.42 collisions per meter). Additionally, we ablate our locomotion policy to show that each aspect of our approach helps reduce obstacle collisions. Videos and code at this http URL
    Efficient Signed Graph Sampling via Balancing & Gershgorin Disc Perfect Alignment. (arXiv:2208.08726v2 [eess.SP] UPDATED)
    A basic premise in graph signal processing (GSP) is that a graph encoding pairwise (anti-)correlations of the targeted signal as edge weights is exploited for graph filtering. However, existing fast graph sampling schemes are designed and tested only for positive graphs describing positive correlations. In this paper, we show that for datasets with strong inherent anti-correlations, a suitable graph contains both positive and negative edge weights. In response, we propose a linear-time signed graph sampling method centered on the concept of balanced signed graphs. Specifically, given an empirical covariance data matrix $\bar{\bf{C}}$, we first learn a sparse inverse matrix (graph Laplacian) $\mathcal{L}$ corresponding to a signed graph $\mathcal{G}$. We define the eigenvectors of Laplacian $\mathcal{L}_B$ for a balanced signed graph $\mathcal{G}_B$ -- approximating $\mathcal{G}$ via edge weight augmentation -- as graph frequency components. Next, we choose samples to minimize the low-pass filter reconstruction error in two steps. We first align all Gershgorin disc left-ends of Laplacian $\mathcal{L}_B$ at smallest eigenvalue $\lambda_{\min}(\mathcal{L}_B)$ via similarity transform $\mathcal{L}_p = \S \mathcal{L}_B \S^{-1}$, leveraging a recent linear algebra theorem called Gershgorin disc perfect alignment (GDPA). We then perform sampling on $\mathcal{L}_p$ using a previous fast Gershgorin disc alignment sampling (GDAS) scheme. Experimental results show that our signed graph sampling method outperformed existing fast sampling schemes noticeably on various datasets.
    Linear Convergence of ISTA and FISTA. (arXiv:2212.06319v2 [math.OC] UPDATED)
    In this paper, we revisit the class of iterative shrinkage-thresholding algorithms (ISTA) for solving the linear inverse problem with sparse representation, which arises in signal and image processing. It is shown in the numerical experiment to deblur an image that the convergence behavior in the logarithmic-scale ordinate tends to be linear instead of logarithmic, approximating to be flat. Making meticulous observations, we find that the previous assumption for the smooth part to be convex weakens the least-square model. Specifically, assuming the smooth part to be strongly convex is more reasonable for the least-square model, even though the image matrix is probably ill-conditioned. Furthermore, we improve the pivotal inequality tighter for composite optimization with the smooth part to be strongly convex instead of general convex, which is first found in [Li et al., 2022]. Based on this pivotal inequality, we generalize the linear convergence to composite optimization in both the objective value and the squared proximal subgradient norm. Meanwhile, we set a simple ill-conditioned matrix which is easy to compute the singular values instead of the original blur matrix. The new numerical experiment shows the proximal generalization of Nesterov's accelerated gradient descent (NAG) for the strongly convex function has a faster linear convergence rate than ISTA. Based on the tighter pivotal inequality, we also generalize the faster linear convergence rate to composite optimization, in both the objective value and the squared proximal subgradient norm, by taking advantage of the well-constructed Lyapunov function with a slight modification and the phase-space representation based on the high-resolution differential equation framework from the implicit-velocity scheme.
    Training Scale-Invariant Neural Networks on the Sphere Can Happen in Three Regimes. (arXiv:2209.03695v3 [cs.LG] UPDATED)
    A fundamental property of deep learning normalization techniques, such as batch normalization, is making the pre-normalization parameters scale invariant. The intrinsic domain of such parameters is the unit sphere, and therefore their gradient optimization dynamics can be represented via spherical optimization with varying effective learning rate (ELR), which was studied previously. However, the varying ELR may obscure certain characteristics of the intrinsic loss landscape structure. In this work, we investigate the properties of training scale-invariant neural networks directly on the sphere using a fixed ELR. We discover three regimes of such training depending on the ELR value: convergence, chaotic equilibrium, and divergence. We study these regimes in detail both on a theoretical examination of a toy example and on a thorough empirical analysis of real scale-invariant deep learning models. Each regime has unique features and reflects specific properties of the intrinsic loss landscape, some of which have strong parallels with previous research on both regular and scale-invariant neural networks training. Finally, we demonstrate how the discovered regimes are reflected in conventional training of normalized networks and how they can be leveraged to achieve better optima.
    Self-Supervised Learning for Anomalous Channel Detection in EEG Graphs: Application to Seizure Analysis. (arXiv:2208.07448v4 [cs.LG] UPDATED)
    Electroencephalogram (EEG) signals are effective tools towards seizure analysis where one of the most important challenges is accurate detection of seizure events and brain regions in which seizure happens or initiates. However, all existing machine learning-based algorithms for seizure analysis require access to the labeled seizure data while acquiring labeled data is very labor intensive, expensive, as well as clinicians dependent given the subjective nature of the visual qualitative interpretation of EEG signals. In this paper, we propose to detect seizure channels and clips in a self-supervised manner where no access to the seizure data is needed. The proposed method considers local structural and contextual information embedded in EEG graphs by employing positive and negative sub-graphs. We train our method through minimizing contrastive and generative losses. The employ of local EEG sub-graphs makes the algorithm an appropriate choice when accessing to the all EEG channels is impossible due to complications such as skull fractures. We conduct an extensive set of experiments on the largest seizure dataset and demonstrate that our proposed framework outperforms the state-of-the-art methods in the EEG-based seizure study. The proposed method is the only study that requires no access to the seizure data in its training phase, yet establishes a new state-of-the-art to the field, and outperforms all related supervised methods.
    Dilated Neighborhood Attention Transformer. (arXiv:2209.15001v3 [cs.CV] UPDATED)
    Transformers are quickly becoming one of the most heavily applied deep learning architectures across modalities, domains, and tasks. In vision, on top of ongoing efforts into plain transformers, hierarchical transformers have also gained significant attention, thanks to their performance and easy integration into existing frameworks. These models typically employ localized attention mechanisms, such as the sliding-window Neighborhood Attention (NA) or Swin Transformer's Shifted Window Self Attention. While effective at reducing self attention's quadratic complexity, local attention weakens two of the most desirable properties of self attention: long range inter-dependency modeling, and global receptive field. In this paper, we introduce Dilated Neighborhood Attention (DiNA), a natural, flexible and efficient extension to NA that can capture more global context and expand receptive fields exponentially at no additional cost. NA's local attention and DiNA's sparse global attention complement each other, and therefore we introduce Dilated Neighborhood Attention Transformer (DiNAT), a new hierarchical vision transformer built upon both. DiNAT variants enjoy significant improvements over strong baselines such as NAT, Swin, and ConvNeXt. Our large model is faster and ahead of its Swin counterpart by 1.6% box AP in COCO object detection, 1.4% mask AP in COCO instance segmentation, and 1.4% mIoU in ADE20K semantic segmentation. Paired with new frameworks, our large variant is the new state of the art panoptic segmentation model on COCO (58.5 PQ) and ADE20K (49.4 PQ), and instance segmentation model on Cityscapes (45.1 AP) and ADE20K (35.4 AP) (no extra data). It also matches the state of the art specialized semantic segmentation models on ADE20K (58.1 mIoU), and ranks second on Cityscapes (84.5 mIoU) (no extra data).
    Tightening Discretization-based MILP Models for the Pooling Problem using Upper Bounds on Bilinear Terms. (arXiv:2207.03699v2 [math.OC] UPDATED)
    Discretization-based methods have been proposed for solving nonconvex optimization problems with bilinear terms such as the pooling problem. These methods convert the original nonconvex optimization problems into mixed-integer linear programs (MILPs). In this paper we study tightening methods for these MILP models for the pooling problem, and derive valid constraints using upper bounds on bilinear terms. Computational results demonstrate the effectiveness of our methods in terms of reducing solution time.
    Learning with Muscles: Benefits for Data-Efficiency and Robustness in Anthropomorphic Tasks. (arXiv:2207.03952v2 [cs.RO] UPDATED)
    Humans are able to outperform robots in terms of robustness, versatility, and learning of new tasks in a wide variety of movements. We hypothesize that highly nonlinear muscle dynamics play a large role in providing inherent stability, which is favorable to learning. While recent advances have been made in applying modern learning techniques to muscle-actuated systems both in simulation as well as in robotics, so far, no detailed analysis has been performed to show the benefits of muscles when learning from scratch. Our study closes this gap and showcases the potential of muscle actuators for core robotics challenges in terms of data-efficiency, hyperparameter sensitivity, and robustness.
    Adapting to Online Label Shift with Provable Guarantees. (arXiv:2207.02121v3 [cs.LG] UPDATED)
    The standard supervised learning paradigm works effectively when training data shares the same distribution as the upcoming testing samples. However, this stationary assumption is often violated in real-world applications, especially when testing data appear in an online fashion. In this paper, we formulate and investigate the problem of \emph{online label shift} (OLaS): the learner trains an initial model from the labeled offline data and then deploys it to an unlabeled online environment where the underlying label distribution changes over time but the label-conditional density does not. The non-stationarity nature and the lack of supervision make the problem challenging to be tackled. To address the difficulty, we construct a new unbiased risk estimator that utilizes the unlabeled data, which exhibits many benign properties albeit with potential non-convexity. Building upon that, we propose novel online ensemble algorithms to deal with the non-stationarity of the environments. Our approach enjoys optimal \emph{dynamic regret}, indicating that the performance is competitive with a clairvoyant who knows the online environments in hindsight and then chooses the best decision for each round. The obtained dynamic regret bound scales with the intensity and pattern of label distribution shift, hence exhibiting the adaptivity in the OLaS problem. Extensive experiments are conducted to validate the effectiveness and support our theoretical findings.
    Hidden Progress in Deep Learning: SGD Learns Parities Near the Computational Limit. (arXiv:2207.08799v3 [cs.LG] UPDATED)
    There is mounting evidence of emergent phenomena in the capabilities of deep learning methods as we scale up datasets, model sizes, and training times. While there are some accounts of how these resources modulate statistical capacity, far less is known about their effect on the computational problem of model training. This work conducts such an exploration through the lens of learning a $k$-sparse parity of $n$ bits, a canonical discrete search problem which is statistically easy but computationally hard. Empirically, we find that a variety of neural networks successfully learn sparse parities, with discontinuous phase transitions in the training curves. On small instances, learning abruptly occurs at approximately $n^{O(k)}$ iterations; this nearly matches SQ lower bounds, despite the apparent lack of a sparse prior. Our theoretical analysis shows that these observations are not explained by a Langevin-like mechanism, whereby SGD "stumbles in the dark" until it finds the hidden set of features (a natural algorithm which also runs in $n^{O(k)}$ time). Instead, we show that SGD gradually amplifies the sparse solution via a Fourier gap in the population gradient, making continual progress that is invisible to loss and error metrics.
    OpenXAI: Towards a Transparent Evaluation of Model Explanations. (arXiv:2206.11104v3 [cs.LG] UPDATED)
    While several types of post hoc explanation methods (e.g., feature attribution methods) have been proposed in recent literature, there is little to no work on systematically benchmarking these methods in an efficient and transparent manner. Here, we introduce OpenXAI, a comprehensive and extensible open source framework for evaluating and benchmarking post hoc explanation methods. OpenXAI comprises of the following key components: (i) a flexible synthetic data generator and a collection of diverse real-world datasets, pre-trained models, and state-of-the-art feature attribution methods, (ii) open-source implementations of twenty-two quantitative metrics for evaluating faithfulness, stability (robustness), and fairness of explanation methods, and (iii) the first ever public XAI leaderboards to benchmark explanations. OpenXAI is easily extensible, as users can readily evaluate custom explanation methods and incorporate them into our leaderboards. Overall, OpenXAI provides an automated end-to-end pipeline that not only simplifies and standardizes the evaluation of post hoc explanation methods, but also promotes transparency and reproducibility in benchmarking these methods. OpenXAI datasets and data loaders, implementations of state-of-the-art explanation methods and evaluation metrics, as well as leaderboards are publicly available at https://open-xai.github.io/.
    Primal Dual Alternating Proximal Gradient Algorithms for Nonsmooth Nonconvex Minimax Problems with Coupled Linear Constraints. (arXiv:2212.04672v2 [math.OC] UPDATED)
    Nonconvex minimax problems have attracted wide attention in machine learning, signal processing and many other fields in recent years. In this paper, we propose a primal dual alternating proximal gradient (PDAPG) algorithm and a primal dual proximal gradient (PDPG-L) algorithm for solving nonsmooth nonconvex-(strongly) concave and nonconvex-linear minimax problems with coupled linear constraints, respectively. The iteration complexity of the two algorithms are proved to be $\mathcal{O}\left( \varepsilon ^{-2} \right)$ (resp. $\mathcal{O}\left( \varepsilon ^{-4} \right)$) under nonconvex-strongly concave (resp. nonconvex-concave) setting and $\mathcal{O}\left( \varepsilon ^{-3} \right)$ under nonconvex-linear setting to reach an $\varepsilon$-stationary point, respectively. To our knowledge, they are the first two algorithms with iteration complexity guarantee for solving the nonconvex minimax problems with coupled linear constraints.
    Learning Deep Input-Output Stable Dynamics. (arXiv:2206.13093v3 [math.DS] UPDATED)
    Learning stable dynamics from observed time-series data is an essential problem in robotics, physical modeling, and systems biology. Many of these dynamics are represented as an inputs-output system to communicate with the external environment. In this study, we focus on input-output stable systems, exhibiting robustness against unexpected stimuli and noise. We propose a method to learn nonlinear systems guaranteeing the input-output stability. Our proposed method utilizes the differentiable projection onto the space satisfying the Hamilton-Jacobi inequality to realize the input-output stability. The problem of finding this projection can be formulated as a quadratic constraint quadratic programming problem, and we derive the particular solution analytically. Also, we apply our method to a toy bistable model and the task of training a benchmark generated from a glucose-insulin simulator. The results show that the nonlinear system with neural networks by our method achieves the input-output stability, unlike naive neural networks. Our code is available at https://github.com/clinfo/DeepIOStability.
    Evaluating Explainability for Graph Neural Networks. (arXiv:2208.09339v2 [cs.LG] UPDATED)
    As post hoc explanations are increasingly used to understand the behavior of graph neural networks (GNNs), it becomes crucial to evaluate the quality and reliability of GNN explanations. However, assessing the quality of GNN explanations is challenging as existing graph datasets have no or unreliable ground-truth explanations for a given task. Here, we introduce a synthetic graph data generator, ShapeGGen, which can generate a variety of benchmark datasets (e.g., varying graph sizes, degree distributions, homophilic vs. heterophilic graphs) accompanied by ground-truth explanations. Further, the flexibility to generate diverse synthetic datasets and corresponding ground-truth explanations allows us to mimic the data generated by various real-world applications. We include ShapeGGen and several real-world graph datasets into an open-source graph explainability library, GraphXAI. In addition to synthetic and real-world graph datasets with ground-truth explanations, GraphXAI provides data loaders, data processing functions, visualizers, GNN model implementations, and evaluation metrics to benchmark the performance of GNN explainability methods.
    On the effectiveness of persistent homology. (arXiv:2206.10551v3 [math.AT] UPDATED)
    Persistent homology (PH) is one of the most popular methods in Topological Data Analysis. Even though PH has been used in many different types of applications, the reasons behind its success remain elusive; in particular, it is not known for which classes of problems it is most effective, or to what extent it can detect geometric or topological features. The goal of this work is to identify some types of problems where PH performs well or even better than other methods in data analysis. We consider three fundamental shape analysis tasks: the detection of the number of holes, curvature and convexity from 2D and 3D point clouds sampled from shapes. Experiments demonstrate that PH is successful in these tasks, outperforming several baselines, including PointNet, an architecture inspired precisely by the properties of point clouds. In addition, we observe that PH remains effective for limited computational resources and limited training data, as well as out-of-distribution test data, including various data transformations and noise. For convexity detection, we provide a theoretical guarantee that PH is effective for this task in $\mathbb{R}^d$, and demonstrate the detection of a convexity measure on the FLAVIA data set of plant leaf images. Due to the crucial role of shape classification in understanding mathematical and physical structures and objects, and in many applications, the findings of this work will provide some knowledge about the types of problems that are appropriate for PH, so that it can - to borrow the words from Wigner 1960 - ``remain valid in future research, and extend, to our pleasure", but to our lesser bafflement, to a variety of applications.
    Continual Prune-and-Select: Class-incremental learning with specialized subnetworks. (arXiv:2208.04952v2 [cs.LG] UPDATED)
    The human brain is capable of learning tasks sequentially mostly without forgetting. However, deep neural networks (DNNs) suffer from catastrophic forgetting when learning one task after another. We address this challenge considering a class-incremental learning scenario where the DNN sees test data without knowing the task from which this data originates. During training, Continual-Prune-and-Select (CP&S) finds a subnetwork within the DNN that is responsible for solving a given task. Then, during inference, CP&S selects the correct subnetwork to make predictions for that task. A new task is learned by training available neuronal connections of the DNN (previously untrained) to create a new subnetwork by pruning, which can include previously trained connections belonging to other subnetwork(s) because it does not update shared connections. This enables to eliminate catastrophic forgetting by creating specialized regions in the DNN that do not conflict with each other while still allowing knowledge transfer across them. The CP&S strategy is implemented with different subnetwork selection strategies, revealing superior performance to state-of-the-art continual learning methods tested on various datasets (CIFAR-100, CUB-200-2011, ImageNet-100 and ImageNet-1000). In particular, CP&S is capable of sequentially learning 10 tasks from ImageNet-1000 keeping an accuracy around 94% with negligible forgetting, a first-of-its-kind result in class-incremental learning. To the best of the authors' knowledge, this represents an improvement in accuracy above 10% when compared to the best alternative method.
    AutoML Two-Sample Test. (arXiv:2206.08843v3 [cs.LG] UPDATED)
    Two-sample tests are important in statistics and machine learning, both as tools for scientific discovery as well as to detect distribution shifts. This led to the development of many sophisticated test procedures going beyond the standard supervised learning frameworks, whose usage can require specialized knowledge about two-sample testing. We use a simple test that takes the mean discrepancy of a witness function as the test statistic and prove that minimizing a squared loss leads to a witness with optimal testing power. This allows us to leverage recent advancements in AutoML. Without any user input about the problems at hand, and using the same method for all our experiments, our AutoML two-sample test achieves competitive performance on a diverse distribution shift benchmark as well as on challenging two-sample testing problems. We provide an implementation of the AutoML two-sample test in the Python package autotst.
    Towards Interpreting Vulnerability of Multi-Instance Learning via Customized and Universal Adversarial Perturbations. (arXiv:2211.17071v2 [cs.CV] UPDATED)
    Multiple-Instance Learning (MIL) is a recent machine learning paradigm which is immensely useful in various real-life applications, like image analysis, video anomaly detection, text classification, etc. It is well known that most of the existing machine learning classifiers are highly vulnerable to adversarial perturbations. Since MIL is a weakly supervised learning, where information is available for a set of instances, called bag and not for every instances, adversarial perturbations can be fatal. In this paper, we have proposed two adversarial perturbation methods to analyze the effect of adversarial perturbations to interpret the vulnerability of MIL methods. Out of the two algorithms, one can be customized for every bag, and the other is a universal one, which can affect all bags in a given data set and thus has some generalizability. Through simulations, we have also shown the effectiveness of the proposed algorithms to fool the state-of-the-art (SOTA) MIL methods. Finally, we have discussed through experiments, about taking care of these kind of adversarial perturbations through a simple strategy. Source codes are available at https://github.com/InkiInki/MI-UAP.
    NAGphormer: A Tokenized Graph Transformer for Node Classification in Large Graphs. (arXiv:2206.04910v3 [cs.LG] UPDATED)
    The graph Transformer emerges as a new architecture and has shown superior performance on various graph mining tasks. In this work, we observe that existing graph Transformers treat nodes as independent tokens and construct a single long sequence composed of all node tokens so as to train the Transformer model, causing it hard to scale to large graphs due to the quadratical complexity on the number of nodes for the self-attention computation. To this end, we propose a Neighborhood Aggregation Graph Transformer (NAGphormer) that treats each node as a sequence containing a series of tokens constructed by our proposed Hop2Token module. For each node, Hop2Token aggregates the neighborhood features from different hops into different representations and thereby produces a sequence of token vectors as one input. In this way, NAGphormer could be trained in a mini-batch manner and thus could scale to large graphs. Moreover, we mathematically show that as compared to a category of advanced Graph Neural Networks (GNNs), the decoupled Graph Convolutional Network, NAGphormer could learn more informative node representations from the multi-hop neighborhoods. Extensive experiments on benchmark datasets from small to large are conducted to demonstrate that NAGphormer consistently outperforms existing graph Transformers and mainstream GNNs.
    Improved Algorithms for Neural Active Learning. (arXiv:2210.00423v3 [cs.LG] UPDATED)
    We improve the theoretical and empirical performance of neural-network(NN)-based active learning algorithms for the non-parametric streaming setting. In particular, we introduce two regret metrics by minimizing the population loss that are more suitable in active learning than the one used in state-of-the-art (SOTA) related work. Then, the proposed algorithm leverages the powerful representation of NNs for both exploitation and exploration, has the query decision-maker tailored for $k$-class classification problems with the performance guarantee, utilizes the full feedback, and updates parameters in a more practical and efficient manner. These careful designs lead to an instance-dependent regret upper bound, roughly improving by a multiplicative factor $O(\log T)$ and removing the curse of input dimensionality. Furthermore, we show that the algorithm can achieve the same performance as the Bayes-optimal classifier in the long run under the hard-margin setting in classification problems. In the end, we use extensive experiments to evaluate the proposed algorithm and SOTA baselines, to show the improved empirical performance.
    A Search-Based Testing Approach for Deep Reinforcement Learning Agents. (arXiv:2206.07813v2 [cs.SE] UPDATED)
    Deep Reinforcement Learning (DRL) algorithms have been increasingly employed during the last decade to solve various decision-making problems such as autonomous driving and robotics. However, these algorithms have faced great challenges when deployed in safety-critical environments since they often exhibit erroneous behaviors that can lead to potentially critical errors. One way to assess the safety of DRL agents is to test them to detect possible faults leading to critical failures during their execution. This raises the question of how we can efficiently test DRL policies to ensure their correctness and adherence to safety requirements. Most existing works on testing DRL agents use adversarial attacks that perturb states or actions of the agent. However, such attacks often lead to unrealistic states of the environment. Their main goal is to test the robustness of DRL agents rather than testing the compliance of agents' policies with respect to requirements. Due to the huge state space of DRL environments, the high cost of test execution, and the black-box nature of DRL algorithms, the exhaustive testing of DRL agents is impossible. In this paper, we propose a Search-based Testing Approach of Reinforcement Learning Agents (STARLA) to test the policy of a DRL agent by effectively searching for failing executions of the agent within a limited testing budget. We use machine learning models and a dedicated genetic algorithm to narrow the search towards faulty episodes. We apply STARLA on Deep-Q-Learning agents which are widely used as benchmarks and show that it significantly outperforms Random Testing by detecting more faults related to the agent's policy. We also investigate how to extract rules that characterize faulty episodes of the DRL agent using our search results. Such rules can be used to understand the conditions under which the agent fails and thus assess its deployment risks.
    Long Range Graph Benchmark. (arXiv:2206.08164v3 [cs.LG] UPDATED)
    Graph Neural Networks (GNNs) that are based on the message passing (MP) paradigm generally exchange information between 1-hop neighbors to build node representations at each layer. In principle, such networks are not able to capture long-range interactions (LRI) that may be desired or necessary for learning a given task on graphs. Recently, there has been an increasing interest in development of Transformer-based methods for graphs that can consider full node connectivity beyond the original sparse structure, thus enabling the modeling of LRI. However, MP-GNNs that simply rely on 1-hop message passing often fare better in several existing graph benchmarks when combined with positional feature representations, among other innovations, hence limiting the perceived utility and ranking of Transformer-like architectures. Here, we present the Long Range Graph Benchmark (LRGB) with 5 graph learning datasets: PascalVOC-SP, COCO-SP, PCQM-Contact, Peptides-func and Peptides-struct that arguably require LRI reasoning to achieve strong performance in a given task. We benchmark both baseline GNNs and Graph Transformer networks to verify that the models which capture long-range dependencies perform significantly better on these tasks. Therefore, these datasets are suitable for benchmarking and exploration of MP-GNNs and Graph Transformer architectures that are intended to capture LRI.
    Recurrent Convolutional Neural Networks Learn Succinct Learning Algorithms. (arXiv:2209.00735v2 [cs.LG] UPDATED)
    Neural networks (NNs) struggle to efficiently solve certain problems, such as learning parities, even when there are simple learning algorithms for those problems. Can NNs discover learning algorithms on their own? We exhibit a NN architecture that, in polynomial time, learns as well as any efficient learning algorithm describable by a constant-sized program. For example, on parity problems, the NN learns as well as Gaussian elimination, an efficient algorithm that can be succinctly described. Our architecture combines both recurrent weight sharing between layers and convolutional weight sharing to reduce the number of parameters down to a constant, even though the network itself may have trillions of nodes. While in practice the constants in our analysis are too large to be directly meaningful, our work suggests that the synergy of Recurrent and Convolutional NNs (RCNNs) may be more natural and powerful than either alone, particularly for concisely parameterizing discrete algorithms.
    RenyiCL: Contrastive Representation Learning with Skew Renyi Divergence. (arXiv:2208.06270v2 [stat.ML] UPDATED)
    Contrastive representation learning seeks to acquire useful representations by estimating the shared information between multiple views of data. Here, the choice of data augmentation is sensitive to the quality of learned representations: as harder the data augmentations are applied, the views share more task-relevant information, but also task-irrelevant one that can hinder the generalization capability of representation. Motivated by this, we present a new robust contrastive learning scheme, coined R\'enyiCL, which can effectively manage harder augmentations by utilizing R\'enyi divergence. Our method is built upon the variational lower bound of R\'enyi divergence, but a na\"ive usage of a variational method is impractical due to the large variance. To tackle this challenge, we propose a novel contrastive objective that conducts variational estimation of a skew R\'enyi divergence and provide a theoretical guarantee on how variational estimation of skew divergence leads to stable training. We show that R\'enyi contrastive learning objectives perform innate hard negative sampling and easy positive sampling simultaneously so that it can selectively learn useful features and ignore nuisance features. Through experiments on ImageNet, we show that R\'enyi contrastive learning with stronger augmentations outperforms other self-supervised methods without extra regularization or computational overhead. Moreover, we also validate our method on other domains such as graph and tabular, showing empirical gain over other contrastive methods.
    The Missing Invariance Principle Found -- the Reciprocal Twin of Invariant Risk Minimization. (arXiv:2205.14546v2 [cs.LG] UPDATED)
    Machine learning models often generalize poorly to out-of-distribution (OOD) data as a result of relying on features that are spuriously correlated with the label during training. Recently, the technique of Invariant Risk Minimization (IRM) was proposed to learn predictors that only use invariant features by conserving the feature-conditioned label expectation $\mathbb{E}_e[y|f(x)]$ across environments. However, more recent studies have demonstrated that IRM-v1, a practical version of IRM, can fail in various settings. Here, we identify a fundamental flaw of IRM formulation that causes the failure. We then introduce a complementary notion of invariance, MRI, based on conserving the label-conditioned feature expectation $\mathbb{E}_e[f(x)|y]$, which is free of this flaw. Further, we introduce a simplified, practical version of the MRI formulation called MRI-v1. We prove that for general linear problems, MRI-v1 guarantees invariant predictors given sufficient number of environments. We also empirically demonstrate that MRI-v1 strongly out-performs IRM-v1 and consistently achieves near-optimal OOD generalization in image-based nonlinear problems.
    Counterfactual Supervision-based Information Bottleneck for Out-of-Distribution Generalization. (arXiv:2208.07798v3 [cs.LG] UPDATED)
    Learning invariant (causal) features for out-of-distribution (OOD) generalization has attracted extensive attention recently, and among the proposals invariant risk minimization (IRM) is a notable solution. In spite of its theoretical promise for linear regression, the challenges of using IRM in linear classification problems remain. By introducing the information bottleneck (IB) principle into the learning of IRM, IB-IRM approach has demonstrated its power to solve these challenges. In this paper, we further improve IB-IRM from two aspects. First, we show that the key assumption of support overlap of invariant features used in IB-IRM is strong for the guarantee of OOD generalization and it is still possible to achieve the optimal solution without this assumption. Second, we illustrate two failure modes that IB-IRM (and IRM) could fail for learning the invariant features, and to address such failures, we propose a \textit{Counterfactual Supervision-based Information Bottleneck (CSIB)} learning algorithm that provably recovers the invariant features. By requiring counterfactual inference, CSIB works even when accessing data from a single environment. Empirical experiments on several datasets verify our theoretical results.
    Incrementality Bidding via Reinforcement Learning under Mixed and Delayed Rewards. (arXiv:2206.01293v2 [cs.LG] UPDATED)
    Incrementality, which is used to measure the causal effect of showing an ad to a potential customer (e.g. a user in an internet platform) versus not, is a central object for advertisers in online advertising platforms. This paper investigates the problem of how an advertiser can learn to optimize the bidding sequence in an online manner \emph{without} knowing the incrementality parameters in advance. We formulate the offline version of this problem as a specially structured episodic Markov Decision Process (MDP) and then, for its online learning counterpart, propose a novel reinforcement learning (RL) algorithm with regret at most $\widetilde{O}(H^2\sqrt{T})$, which depends on the number of rounds $H$ and number of episodes $T$, but does not depend on the number of actions (i.e., possible bids). A fundamental difference between our learning problem from standard RL problems is that the realized reward feedback from conversion incrementality is \emph{mixed} and \emph{delayed}. To handle this difficulty we propose and analyze a novel pairwise moment-matching algorithm to learn the conversion incrementality, which we believe is of independent of interest.
    SMPL: Simulated Industrial Manufacturing and Process Control Learning Environments. (arXiv:2206.08851v2 [cs.LG] UPDATED)
    Traditional biological and pharmaceutical manufacturing plants are controlled by human workers or pre-defined thresholds. Modernized factories have advanced process control algorithms such as model predictive control (MPC). However, there is little exploration of applying deep reinforcement learning to control manufacturing plants. One of the reasons is the lack of high fidelity simulations and standard APIs for benchmarking. To bridge this gap, we develop an easy-to-use library that includes five high-fidelity simulation environments: BeerFMTEnv, ReactorEnv, AtropineEnv, PenSimEnv and mAbEnv, which cover a wide range of manufacturing processes. We build these environments on published dynamics models. Furthermore, we benchmark online and offline, model-based and model-free reinforcement learning algorithms for comparisons of follow-up research.
    Recipe for a General, Powerful, Scalable Graph Transformer. (arXiv:2205.12454v4 [cs.LG] UPDATED)
    We propose a recipe on how to build a general, powerful, scalable (GPS) graph Transformer with linear complexity and state-of-the-art results on a diverse set of benchmarks. Graph Transformers (GTs) have gained popularity in the field of graph representation learning with a variety of recent publications but they lack a common foundation about what constitutes a good positional or structural encoding, and what differentiates them. In this paper, we summarize the different types of encodings with a clearer definition and categorize them as being $\textit{local}$, $\textit{global}$ or $\textit{relative}$. The prior GTs are constrained to small graphs with a few hundred nodes, here we propose the first architecture with a complexity linear in the number of nodes and edges $O(N+E)$ by decoupling the local real-edge aggregation from the fully-connected Transformer. We argue that this decoupling does not negatively affect the expressivity, with our architecture being a universal function approximator on graphs. Our GPS recipe consists of choosing 3 main ingredients: (i) positional/structural encoding, (ii) local message-passing mechanism, and (iii) global attention mechanism. We provide a modular framework $\textit{GraphGPS}$ that supports multiple types of encodings and that provides efficiency and scalability both in small and large graphs. We test our architecture on 16 benchmarks and show highly competitive results in all of them, show-casing the empirical benefits gained by the modularity and the combination of different strategies.
    Investigation of a Machine learning methodology for the SKA pulsar search pipeline. (arXiv:2209.04430v3 [astro-ph.IM] UPDATED)
    The SKA pulsar search pipeline will be used for real time detection of pulsars. Modern radio telescopes such as SKA will be generating petabytes of data in their full scale of operation. Hence experience-based and data-driven algorithms become indispensable for applications such as candidate detection. Here we describe our findings from testing a state of the art object detection algorithm called Mask R-CNN to detect candidate signatures in the SKA pulsar search pipeline. We have trained the Mask R-CNN model to detect candidate images. A custom annotation tool was developed to mark the regions of interest in large datasets efficiently. We have successfully demonstrated this algorithm by detecting candidate signatures on a simulation dataset. The paper presents details of this work with a highlight on the future prospects.
    Tighter Regret Analysis and Optimization of Online Federated Learning. (arXiv:2205.06491v3 [cs.LG] UPDATED)
    In federated learning (FL), it is commonly assumed that all data are placed at clients in the beginning of machine learning (ML) optimization (i.e., offline learning). However, in many real-world applications, it is expected to proceed in an online fashion. To this end, online FL (OFL) has been introduced, which aims at learning a sequence of global models from decentralized streaming data such that the so-called cumulative regret is minimized. Combining online gradient descent and model averaging, in this framework, FedOGD is constructed as the counterpart of FedSGD in FL. While it can enjoy an optimal sublinear regret, FedOGD suffers from heavy communication costs. In this paper, we present a communication-efficient method (named OFedIQ) by means of intermittent transmission (enabled by client subsampling and periodic transmission) and quantization. For the first time, we derive the regret bound that captures the impact of data-heterogeneity and the communication-efficient techniques. Through this, we efficiently optimize the parameters of OFedIQ such as sampling rate, transmission period, and quantization levels. Also, it is proved that the optimized OFedIQ can asymptotically achieve the performance of FedOGD while reducing the communication costs by 99%. Via experiments with real datasets, we demonstrate the effectiveness of the optimized OFedIQ.
    Split-kl and PAC-Bayes-split-kl Inequalities for Ternary Random Variables. (arXiv:2206.00706v2 [stat.ML] UPDATED)
    We present a new concentration of measure inequality for sums of independent bounded random variables, which we name a split-kl inequality. The inequality is particularly well-suited for ternary random variables, which naturally show up in a variety of problems, including analysis of excess losses in classification, analysis of weighted majority votes, and learning with abstention. We demonstrate that for ternary random variables the inequality is simultaneously competitive with the kl inequality, the Empirical Bernstein inequality, and the Unexpected Bernstein inequality, and in certain regimes outperforms all of them. It resolves an open question by Tolstikhin and Seldin [2013] and Mhammedi et al. [2019] on how to match simultaneously the combinatorial power of the kl inequality when the distribution happens to be close to binary and the power of Bernstein inequalities to exploit low variance when the probability mass is concentrated on the middle value. We also derive a PAC-Bayes-split-kl inequality and compare it with the PAC-Bayes-kl, PAC-Bayes-Empirical-Bennett, and PAC-Bayes-Unexpected-Bernstein inequalities in an analysis of excess losses and in an analysis of a weighted majority vote for several UCI datasets. Last but not least, our study provides the first direct comparison of the Empirical Bernstein and Unexpected Bernstein inequalities and their PAC-Bayes extensions.
    Bugs in Machine Learning-based Systems: A Faultload Benchmark. (arXiv:2206.12311v2 [cs.SE] UPDATED)
    The rapid escalation of applying Machine Learning (ML) in various domains has led to paying more attention to the quality of ML components. There is then a growth of techniques and tools aiming at improving the quality of ML components and integrating them into the ML-based system safely. Although most of these tools use bugs' lifecycle, there is no standard benchmark of bugs to assess their performance, compare them and discuss their advantages and weaknesses. In this study, we firstly investigate the reproducibility and verifiability of the bugs in ML-based systems and show the most important factors in each one. Then, we explore the challenges of generating a benchmark of bugs in ML-based software systems and provide a bug benchmark namely defect4ML that satisfies all criteria of standard benchmark, i.e. relevance, reproducibility, fairness, verifiability, and usability. This faultload benchmark contains 100 bugs reported by ML developers in GitHub and Stack Overflow, using two of the most popular ML frameworks: TensorFlow and Keras. defect4ML also addresses important challenges in Software Reliability Engineering of ML-based software systems, like: 1) fast changes in frameworks, by providing various bugs for different versions of frameworks, 2) code portability, by delivering similar bugs in different ML frameworks, 3) bug reproducibility, by providing fully reproducible bugs with complete information about required dependencies and data, and 4) lack of detailed information on bugs, by presenting links to the bugs' origins. defect4ML can be of interest to ML-based systems practitioners and researchers to assess their testing tools and techniques.
    Minimax Optimal Online Imitation Learning via Replay Estimation. (arXiv:2205.15397v5 [cs.LG] UPDATED)
    Online imitation learning is the problem of how best to mimic expert demonstrations, given access to the environment or an accurate simulator. Prior work has shown that in the infinite sample regime, exact moment matching achieves value equivalence to the expert policy. However, in the finite sample regime, even if one has no optimization error, empirical variance can lead to a performance gap that scales with $H^2 / N$ for behavioral cloning and $H / \sqrt{N}$ for online moment matching, where $H$ is the horizon and $N$ is the size of the expert dataset. We introduce the technique of replay estimation to reduce this empirical variance: by repeatedly executing cached expert actions in a stochastic simulator, we compute a smoother expert visitation distribution estimate to match. In the presence of general function approximation, we prove a meta theorem reducing the performance gap of our approach to the parameter estimation error for offline classification (i.e. learning the expert policy). In the tabular setting or with linear function approximation, our meta theorem shows that the performance gap incurred by our approach achieves the optimal $\widetilde{O} \left( \min({H^{3/2}} / {N}, {H} / {\sqrt{N}} \right)$ dependency, under significantly weaker assumptions compared to prior work. We implement multiple instantiations of our approach on several continuous control tasks and find that we are able to significantly improve policy performance across a variety of dataset sizes.
    The Mechanism of Prediction Head in Non-contrastive Self-supervised Learning. (arXiv:2205.06226v3 [cs.LG] UPDATED)
    Recently the surprising discovery of the Bootstrap Your Own Latent (BYOL) method by Grill et al. shows the negative term in contrastive loss can be removed if we add the so-called prediction head to the network. This initiated the research of non-contrastive self-supervised learning. It is mysterious why even when there exist trivial collapsed global optimal solutions, neural networks trained by (stochastic) gradient descent can still learn competitive representations. This phenomenon is a typical example of implicit bias in deep learning and remains little understood. In this work, we present our empirical and theoretical discoveries on non-contrastive self-supervised learning. Empirically, we find that when the prediction head is initialized as an identity matrix with only its off-diagonal entries being trainable, the network can learn competitive representations even though the trivial optima still exist in the training objective. Theoretically, we present a framework to understand the behavior of the trainable, but identity-initialized prediction head. Under a simple setting, we characterized the substitution effect and acceleration effect of the prediction head. The substitution effect happens when learning the stronger features in some neurons can substitute for learning these features in other neurons through updating the prediction head. And the acceleration effect happens when the substituted features can accelerate the learning of other weaker features to prevent them from being ignored. These two effects enable the neural networks to learn all the features rather than focus only on learning the stronger features, which is likely the cause of the dimensional collapse phenomenon. To the best of our knowledge, this is also the first end-to-end optimization guarantee for non-contrastive methods using nonlinear neural networks with a trainable prediction head and normalization.
    TaSIL: Taylor Series Imitation Learning. (arXiv:2205.14812v2 [cs.LG] UPDATED)
    We propose Taylor Series Imitation Learning (TaSIL), a simple augmentation to standard behavior cloning losses in the context of continuous control. TaSIL penalizes deviations in the higher-order Taylor series terms between the learned and expert policies. We show that experts satisfying a notion of $\textit{incremental input-to-state stability}$ are easy to learn, in the sense that a small TaSIL-augmented imitation loss over expert trajectories guarantees a small imitation loss over trajectories generated by the learned policy. We provide sample-complexity bounds for TaSIL that scale as $\tilde{\mathcal{O}}(1/n)$ in the realizable setting, for $n$ the number of expert demonstrations. Finally, we demonstrate experimentally the relationship between the robustness of the expert policy and the order of Taylor expansion required in TaSIL, and compare standard Behavior Cloning, DART, and DAgger with TaSIL-loss-augmented variants. In all cases, we show significant improvement over baselines across a variety of MuJoCo tasks.
    Bayesian Active Learning with Fully Bayesian Gaussian Processes. (arXiv:2205.10186v3 [cs.LG] UPDATED)
    The bias-variance trade-off is a well-known problem in machine learning that only gets more pronounced the less available data there is. In active learning, where labeled data is scarce or difficult to obtain, neglecting this trade-off can cause inefficient and non-optimal querying, leading to unnecessary data labeling. In this paper, we focus on active learning with Gaussian Processes (GPs). For the GP, the bias-variance trade-off is made by optimization of the two hyperparameters: the length scale and noise-term. Considering that the optimal mode of the joint posterior of the hyperparameters is equivalent to the optimal bias-variance trade-off, we approximate this joint posterior and utilize it to design two new acquisition functions. The first one is a Bayesian variant of Query-by-Committee (B-QBC), and the second is an extension that explicitly minimizes the predictive variance through a Query by Mixture of Gaussian Processes (QB-MGP) formulation. Across six simulators, we empirically show that B-QBC, on average, achieves the best marginal likelihood, whereas QB-MGP achieves the best predictive performance. We show that incorporating the bias-variance trade-off in the acquisition functions mitigates unnecessary and expensive data labeling.
    Weisfeiler and Leman Go Walking: Random Walk Kernels Revisited. (arXiv:2205.10914v3 [cs.LG] UPDATED)
    Random walk kernels have been introduced in seminal work on graph learning and were later largely superseded by kernels based on the Weisfeiler-Leman test for graph isomorphism. We give a unified view on both classes of graph kernels. We study walk-based node refinement methods and formally relate them to several widely-used techniques, including Morgan's algorithm for molecule canonization and the Weisfeiler-Leman test. We define corresponding walk-based kernels on nodes that allow fine-grained parameterized neighborhood comparison, reach Weisfeiler-Leman expressiveness, and are computed using the kernel trick. From this we show that classical random walk kernels with only minor modifications regarding definition and computation are as expressive as the widely-used Weisfeiler-Leman subtree kernel but support non-strict neighborhood comparison. We verify experimentally that walk-based kernels reach or even surpass the accuracy of Weisfeiler-Leman kernels in real-world classification tasks.
    Joint Entropy Search for Maximally-Informed Bayesian Optimization. (arXiv:2206.04771v5 [cs.LG] UPDATED)
    Information-theoretic Bayesian optimization techniques have become popular for optimizing expensive-to-evaluate black-box functions due to their non-myopic qualities. Entropy Search and Predictive Entropy Search both consider the entropy over the optimum in the input space, while the recent Max-value Entropy Search considers the entropy over the optimal value in the output space. We propose Joint Entropy Search (JES), a novel information-theoretic acquisition function that considers an entirely new quantity, namely the entropy over the joint optimal probability density over both input and output space. To incorporate this information, we consider the reduction in entropy from conditioning on fantasized optimal input/output pairs. The resulting approach primarily relies on standard GP machinery and removes complex approximations typically associated with information-theoretic methods. With minimal computational overhead, JES shows superior decision-making, and yields state-of-the-art performance for information-theoretic approaches across a wide suite of tasks. As a light-weight approach with superior results, JES provides a new go-to acquisition function for Bayesian optimization.
    Neural Network Architecture Beyond Width and Depth. (arXiv:2205.09459v4 [cs.LG] UPDATED)
    This paper proposes a new neural network architecture by introducing an additional dimension called height beyond width and depth. Neural network architectures with height, width, and depth as hyper-parameters are called three-dimensional architectures. It is shown that neural networks with three-dimensional architectures are significantly more expressive than the ones with two-dimensional architectures (those with only width and depth as hyper-parameters), e.g., standard fully connected networks. The new network architecture is constructed recursively via a nested structure, and hence we call a network with the new architecture nested network (NestNet). A NestNet of height $s$ is built with each hidden neuron activated by a NestNet of height $\le s-1$. When $s=1$, a NestNet degenerates to a standard network with a two-dimensional architecture. It is proved by construction that height-$s$ ReLU NestNets with $\mathcal{O}(n)$ parameters can approximate $1$-Lipschitz continuous functions on $[0,1]^d$ with an error $\mathcal{O}(n^{-(s+1)/d})$, while the optimal approximation error of standard ReLU networks with $\mathcal{O}(n)$ parameters is $\mathcal{O}(n^{-2/d})$. Furthermore, such a result is extended to generic continuous functions on $[0,1]^d$ with the approximation error characterized by the modulus of continuity. Finally, we use numerical experimentation to show the advantages of the super-approximation power of ReLU NestNets.
    TransBoost: Improving the Best ImageNet Performance using Deep Transduction. (arXiv:2205.13331v4 [cs.CV] UPDATED)
    This paper deals with deep transductive learning, and proposes TransBoost as a procedure for fine-tuning any deep neural model to improve its performance on any (unlabeled) test set provided at training time. TransBoost is inspired by a large margin principle and is efficient and simple to use. Our method significantly improves the ImageNet classification performance on a wide range of architectures, such as ResNets, MobileNetV3-L, EfficientNetB0, ViT-S, and ConvNext-T, leading to state-of-the-art transductive performance. Additionally we show that TransBoost is effective on a wide variety of image classification datasets. The implementation of TransBoost is provided at: https://github.com/omerb01/TransBoost .
    Does Self-supervised Learning Really Improve Reinforcement Learning from Pixels?. (arXiv:2206.05266v4 [cs.LG] UPDATED)
    We investigate whether self-supervised learning (SSL) can improve online reinforcement learning (RL) from pixels. We extend the contrastive reinforcement learning framework (e.g., CURL) that jointly optimizes SSL and RL losses and conduct an extensive amount of experiments with various self-supervised losses. Our observations suggest that the existing SSL framework for RL fails to bring meaningful improvement over the baselines only taking advantage of image augmentation when the same amount of data and augmentation is used. We further perform evolutionary searches to find the optimal combination of multiple self-supervised losses for RL, but find that even such a loss combination fails to meaningfully outperform the methods that only utilize carefully designed image augmentations. After evaluating these approaches together in multiple different environments including a real-world robot environment, we confirm that no single self-supervised loss or image augmentation method can dominate all environments and that the current framework for joint optimization of SSL and RL is limited. Finally, we conduct the ablation study on multiple factors and demonstrate the properties of representations learned with different approaches.
    Fast Bayesian Coresets via Subsampling and Quasi-Newton Refinement. (arXiv:2203.09675v3 [stat.ML] UPDATED)
    Bayesian coresets approximate a posterior distribution by building a small weighted subset of the data points. Any inference procedure that is too computationally expensive to be run on the full posterior can instead be run inexpensively on the coreset, with results that approximate those on the full data. However, current approaches are limited by either a significant run-time or the need for the user to specify a low-cost approximation to the full posterior. We propose a Bayesian coreset construction algorithm that first selects a uniformly random subset of data, and then optimizes the weights using a novel quasi-Newton method. Our algorithm is a simple to implement, black-box method, that does not require the user to specify a low-cost posterior approximation. It is the first to come with a general high-probability bound on the KL divergence of the output coreset posterior. Experiments demonstrate that our method provides significant improvements in coreset quality against alternatives with comparable construction times, with far less storage cost and user input required.
    Curriculum Learning for Goal-Oriented Semantic Communications with a Common Language. (arXiv:2204.10429v2 [cs.NI] UPDATED)
    Goal-oriented semantic communication will be a pillar of next-generation wireless networks. Despite significant recent efforts in this area, most prior works are focused on specific data types (e.g., image or audio), and they ignore the goal and effectiveness aspects of semantic transmissions. In contrast, in this paper, a holistic goal-oriented semantic communication framework is proposed to enable a speaker and a listener to cooperatively execute a set of sequential tasks in a dynamic environment. A common language based on a hierarchical belief set is proposed to enable semantic communications between speaker and listener. The speaker, acting as an observer of the environment, utilizes the beliefs to transmit an initial description of its observation (called event) to the listener. The listener is then able to infer on the transmitted description and complete it by adding related beliefs to the transmitted beliefs of the speaker. As such, the listener reconstructs the observed event based on the completed description, and it then takes appropriate action in the environment based on the reconstructed event. An optimization problem is defined to determine the perfect and abstract description of the events while minimizing the transmission and inference costs with constraints on the task execution time and belief efficiency. Then, a novel bottom-up curriculum learning (CL) framework based on reinforcement learning is proposed to solve the optimization problem and enable the speaker and listener to gradually identify the structure of the belief set and the perfect and abstract description of the events. Simulation results show that the proposed CL method outperforms traditional RL in terms of convergence time, task execution cost and time, reliability, and belief efficiency.
    Learning Neural Acoustic Fields. (arXiv:2204.00628v2 [cs.SD] UPDATED)
    Our environment is filled with rich and dynamic acoustic information. When we walk into a cathedral, the reverberations as much as appearance inform us of the sanctuary's wide open space. Similarly, as an object moves around us, we expect the sound emitted to also exhibit this movement. While recent advances in learned implicit functions have led to increasingly higher quality representations of the visual world, there have not been commensurate advances in learning spatial auditory representations. To address this gap, we introduce Neural Acoustic Fields (NAFs), an implicit representation that captures how sounds propagate in a physical scene. By modeling acoustic propagation in a scene as a linear time-invariant system, NAFs learn to continuously map all emitter and listener location pairs to a neural impulse response function that can then be applied to arbitrary sounds. We demonstrate that the continuous nature of NAFs enables us to render spatial acoustics for a listener at an arbitrary location, and can predict sound propagation at novel locations. We further show that the representation learned by NAFs can help improve visual learning with sparse views. Finally, we show that a representation informative of scene structure emerges during the learning of NAFs.
    Multi-sensor large-scale dataset for multi-view 3D reconstruction. (arXiv:2203.06111v2 [cs.CV] UPDATED)
    We present a new multi-sensor dataset for multi-view 3D surface reconstruction. It includes registered RGB and depth data from sensors of different resolutions and modalities: smartphones, Intel RealSense, Microsoft Kinect, industrial cameras, and structured-light scanner. The data for each scene is obtained under a large number of lighting conditions, and the scenes are selected to emphasize a diverse set of material properties challenging for existing algorithms. Overall, we provide around 1.4 million images of 107 different scenes acquired at 14 lighting conditions from 100 viewing directions. We expect our dataset will be useful for evaluation and training of 3D reconstruction algorithms of different types and for other related tasks.
    Off-Policy Evaluation with Online Adaptation for Robot Exploration in Challenging Environments. (arXiv:2204.03140v2 [cs.RO] UPDATED)
    Autonomous exploration has many important applications. However, classic information gain-based or frontier-based exploration only relies on the robot current state to determine the immediate exploration goal, which lacks the capability of predicting the value of future states and thus leads to inefficient exploration decisions. This paper presents a method to learn how "good" states are, measured by the state value function, to provide a guidance for robot exploration in real-world challenging environments. We formulate our work as a off-policy evaluation (OPE) problem for robot exploration (OPERE). It consists of offline Monte-Carlo training on real-world data and performs Temporal Difference (TD) online adaptation to optimize the trained value estimator. We also design an intrinsic reward function based on sensor information coverage to enable the robot to gain more information with sparse extrinsic rewards. Results demonstrate that our method enables the robot to predict the value of future states so as to better guide robot exploration. The proposed algorithm achieves better prediction performance compared with other state-of-the-art OPE methods. To the best of our knowledge, this work for the first time demonstrates value function prediction on real-world dataset for robot exploration in challenging subterranean and urban environments. More details and demo videos can be found at https://jeffreyyh.github.io/opere/.
    coVariance Neural Networks. (arXiv:2205.15856v4 [cs.LG] UPDATED)
    Graph neural networks (GNN) are an effective framework that exploit inter-relationships within graph-structured data for learning. Principal component analysis (PCA) involves the projection of data on the eigenspace of the covariance matrix and draws similarities with the graph convolutional filters in GNNs. Motivated by this observation, we study a GNN architecture, called coVariance neural network (VNN), that operates on sample covariance matrices as graphs. We theoretically establish the stability of VNNs to perturbations in the covariance matrix, thus, implying an advantage over standard PCA-based data analysis approaches that are prone to instability due to principal components associated with close eigenvalues. Our experiments on real-world datasets validate our theoretical results and show that VNN performance is indeed more stable than PCA-based statistical approaches. Moreover, our experiments on multi-resolution datasets also demonstrate that VNNs are amenable to transferability of performance over covariance matrices of different dimensions; a feature that is infeasible for PCA-based approaches.
    Recipes for when Physics Fails: Recovering Robust Learning of Physics Informed Neural Networks. (arXiv:2110.13330v2 [cs.LG] UPDATED)
    Physics-informed Neural Networks (PINNs) have been shown to be effective in solving partial differential equations by capturing the physics induced constraints as a part of the training loss function. This paper shows that a PINN can be sensitive to errors in training data and overfit itself in dynamically propagating these errors over the domain of the solution of the PDE. It also shows how physical regularizations based on continuity criteria and conservation laws fail to address this issue and rather introduce problems of their own causing the deep network to converge to a physics-obeying local minimum instead of the global minimum. We introduce Gaussian Process (GP) based smoothing that recovers the performance of a PINN and promises a robust architecture against noise/errors in measurements. Additionally, we illustrate an inexpensive method of quantifying the evolution of uncertainty based on the variance estimation of GPs on boundary data. Robust PINN performance is also shown to be achievable by choice of sparse sets of inducing points based on sparsely induced GPs. We demonstrate the performance of our proposed methods and compare the results from existing benchmark models in literature for time-dependent Schr\"odinger and Burgers' equations.
    Adaptive Composite Online Optimization: Predictions in Static and Dynamic Environments. (arXiv:2205.00446v2 [math.OC] UPDATED)
    In the past few years, Online Convex Optimization (OCO) has received notable attention in the control literature thanks to its flexible real-time nature and powerful performance guarantees. In this paper, we propose new step-size rules and OCO algorithms that simultaneously exploit gradient predictions, function predictions and dynamics, features particularly pertinent to control applications. The proposed algorithms enjoy static and dynamic regret bounds in terms of the dynamics of the reference action sequence, gradient prediction error, and function prediction error, which are generalizations of known regularity measures from the literature. We present results for both convex and strongly convex costs. We validate the performance of the proposed algorithms in a trajectory tracking case study, as well as portfolio optimization using real-world datasets.
    Toward Explainable AI for Regression Models. (arXiv:2112.11407v2 [cs.LG] UPDATED)
    In addition to the impressive predictive power of machine learning (ML) models, more recently, explanation methods have emerged that enable an interpretation of complex non-linear learning models such as deep neural networks. Gaining a better understanding is especially important e.g. for safety-critical ML applications or medical diagnostics etc. While such Explainable AI (XAI) techniques have reached significant popularity for classifiers, so far little attention has been devoted to XAI for regression models (XAIR). In this review, we clarify the fundamental conceptual differences of XAI for regression and classification tasks, establish novel theoretical insights and analysis for XAIR, provide demonstrations of XAIR on genuine practical regression problems, and finally discuss the challenges remaining for the field.
    Generalization Error Bounds for Multiclass Sparse Linear Classifiers. (arXiv:2204.06264v2 [math.ST] UPDATED)
    We consider high-dimensional multiclass classification by sparse multinomial logistic regression. Unlike binary classification, in the multiclass setup one can think about an entire spectrum of possible notions of sparsity associated with different structural assumptions on the regression coefficients matrix. We propose a computationally feasible feature selection procedure based on penalized maximum likelihood with convex penalties capturing a specific type of sparsity at hand. In particular, we consider global sparsity, double row-wise sparsity, and low-rank sparsity, and show that with the properly chosen tuning parameters the derived plug-in classifiers attain the minimax generalization error bounds (in terms of misclassification excess risk) within the corresponding classes of multiclass sparse linear classifiers. The developed approach is general and can be adapted to other types of sparsity as well.
    Smoothed Online Combinatorial Optimization Using Imperfect Predictions. (arXiv:2204.10979v2 [cs.LG] UPDATED)
    Smoothed online combinatorial optimization considers a learner who repeatedly chooses a combinatorial decision to minimize an unknown changing cost function with a penalty on switching decisions in consecutive rounds. We study smoothed online combinatorial optimization problems when an imperfect predictive model is available, where the model can forecast the future cost functions with uncertainty. We show that using predictions to plan for a finite time horizon leads to regret dependent on the total predictive uncertainty and an additional switching cost. This observation suggests choosing a suitable planning window to balance between uncertainty and switching cost, which leads to an online algorithm with guarantees on the upper and lower bounds of the cumulative regret. Empirically, our algorithm shows a significant improvement in cumulative regret compared to other baselines in synthetic online distributed streaming problems.
    3D-C2FT: Coarse-to-fine Transformer for Multi-view 3D Reconstruction. (arXiv:2205.14575v2 [cs.CV] UPDATED)
    Recently, the transformer model has been successfully employed for the multi-view 3D reconstruction problem. However, challenges remain on designing an attention mechanism to explore the multiview features and exploit their relations for reinforcing the encoding-decoding modules. This paper proposes a new model, namely 3D coarse-to-fine transformer (3D-C2FT), by introducing a novel coarse-to-fine(C2F) attention mechanism for encoding multi-view features and rectifying defective 3D objects. C2F attention mechanism enables the model to learn multi-view information flow and synthesize 3D surface correction in a coarse to fine-grained manner. The proposed model is evaluated by ShapeNet and Multi-view Real-life datasets. Experimental results show that 3D-C2FT achieves notable results and outperforms several competing models on these datasets.
    Compositional Visual Generation with Composable Diffusion Models. (arXiv:2206.01714v6 [cs.CV] UPDATED)
    Large text-guided diffusion models, such as DALLE-2, are able to generate stunning photorealistic images given natural language descriptions. While such models are highly flexible, they struggle to understand the composition of certain concepts, such as confusing the attributes of different objects or relations between objects. In this paper, we propose an alternative structured approach for compositional generation using diffusion models. An image is generated by composing a set of diffusion models, with each of them modeling a certain component of the image. To do this, we interpret diffusion models as energy-based models in which the data distributions defined by the energy functions may be explicitly combined. The proposed method can generate scenes at test time that are substantially more complex than those seen in training, composing sentence descriptions, object relations, human facial attributes, and even generalizing to new combinations that are rarely seen in the real world. We further illustrate how our approach may be used to compose pre-trained text-guided diffusion models and generate photorealistic images containing all the details described in the input descriptions, including the binding of certain object attributes that have been shown difficult for DALLE-2. These results point to the effectiveness of the proposed method in promoting structured generalization for visual generation. Project page: https://energy-based-model.github.io/Compositional-Visual-Generation-with-Composable-Diffusion-Models/
    Scalable Decision-Focused Learning in Restless Multi-Armed Bandits with Application to Maternal and Child Health. (arXiv:2202.00916v3 [cs.LG] UPDATED)
    This paper studies restless multi-armed bandit (RMAB) problems with unknown arm transition dynamics but with known correlated arm features. The goal is to learn a model to predict transition dynamics given features, where the Whittle index policy solves the RMAB problems using predicted transitions. However, prior works often learn the model by maximizing the predictive accuracy instead of final RMAB solution quality, causing a mismatch between training and evaluation objectives. To address this shortcoming, we propose a novel approach for decision-focused learning in RMAB that directly trains the predictive model to maximize the Whittle index solution quality. We present three key contributions: (i) we establish differentiability of the Whittle index policy to support decision-focused learning; (ii) we significantly improve the scalability of decision-focused learning approaches in sequential problems, specifically RMAB problems; (iii) we apply our algorithm to a previously collected dataset of maternal and child health to demonstrate its performance. Indeed, our algorithm is the first for decision-focused learning in RMAB that scales to real-world problem sizes.
    Handling Bias in Toxic Speech Detection: A Survey. (arXiv:2202.00126v3 [cs.SI] UPDATED)
    Detecting online toxicity has always been a challenge due to its inherent subjectivity. Factors such as the context, geography, socio-political climate, and background of the producers and consumers of the posts play a crucial role in determining if the content can be flagged as toxic. Adoption of automated toxicity detection models in production can thus lead to a sidelining of the various groups they aim to help in the first place. It has piqued researchers' interest in examining unintended biases and their mitigation. Due to the nascent and multi-faceted nature of the work, complete literature is chaotic in its terminologies, techniques, and findings. In this paper, we put together a systematic study of the limitations and challenges of existing methods for mitigating bias in toxicity detection. We look closely at proposed methods for evaluating and mitigating bias in toxic speech detection. To examine the limitations of existing methods, we also conduct a case study to introduce the concept of bias shift due to knowledge-based bias mitigation. The survey concludes with an overview of the critical challenges, research gaps, and future directions. While reducing toxicity on online platforms continues to be an active area of research, a systematic study of various biases and their mitigation strategies will help the research community produce robust and fair models.
    A Ranking Game for Imitation Learning. (arXiv:2202.03481v3 [cs.LG] UPDATED)
    We propose a new framework for imitation learning -- treating imitation as a two-player ranking-based game between a policy and a reward. In this game, the reward agent learns to satisfy pairwise performance rankings between behaviors, while the policy agent learns to maximize this reward. In imitation learning, near-optimal expert data can be difficult to obtain, and even in the limit of infinite data cannot imply a total ordering over trajectories as preferences can. On the other hand, learning from preferences alone is challenging as a large number of preferences are required to infer a high-dimensional reward function, though preference data is typically much easier to collect than expert demonstrations. The classical inverse reinforcement learning (IRL) formulation learns from expert demonstrations but provides no mechanism to incorporate learning from offline preferences and vice versa. We instantiate the proposed ranking-game framework with a novel ranking loss giving an algorithm that can simultaneously learn from expert demonstrations and preferences, gaining the advantages of both modalities. Our experiments show that the proposed method achieves state-of-the-art sample efficiency and can solve previously unsolvable tasks in the Learning from Observation (LfO) setting. Project video and code can be found at https://hari-sikchi.github.io/rank-game/
    Dynamic Combination of Heterogeneous Models for Hierarchical Time Series. (arXiv:2112.11669v2 [cs.LG] UPDATED)
    We introduce a framework to dynamically combine heterogeneous models called \texttt{DYCHEM}, which forecasts a set of time series that are related through an aggregation hierarchy. Different types of forecasting models can be employed as individual ``experts'' so that each model is tailored to the nature of the corresponding time series. \texttt{DYCHEM} learns hierarchical structures during the training stage to help generalize better across all the time series being modeled and also mitigates coherency issues that arise due to constraints imposed by the hierarchy. To improve the reliability of forecasts, we construct quantile estimations based on the point forecasts obtained from combined heterogeneous models. The resulting quantile forecasts are coherent and independent of the choice of forecasting models. We conduct a comprehensive evaluation of both point and quantile forecasts for hierarchical time series (HTS), including public data and user records from a large financial software company. In general, our method is robust, adaptive to datasets with different properties, and highly configurable and efficient for large-scale forecasting pipelines.
    $A^{3}D$: A Platform of Searching for Robust Neural Architectures and Efficient Adversarial Attacks. (arXiv:2203.03128v2 [cs.LG] UPDATED)
    The robustness of deep neural networks (DNN) models has attracted increasing attention due to the urgent need for security in many applications. Numerous existing open-sourced tools or platforms are developed to evaluate the robustness of DNN models by ensembling the majority of adversarial attack or defense algorithms. Unfortunately, current platforms do not possess the ability to optimize the architectures of DNN models or the configuration of adversarial attacks to further enhance the robustness of models or the performance of adversarial attacks. To alleviate these problems, in this paper, we first propose a novel platform called auto adversarial attack and defense ($A^{3}D$), which can help search for robust neural network architectures and efficient adversarial attacks. In $A^{3}D$, we employ multiple neural architecture search methods, which consider different robustness evaluation metrics, including four types of noises: adversarial noise, natural noise, system noise, and quantified metrics, resulting in finding robust architectures. Besides, we propose a mathematical model for auto adversarial attack, and provide multiple optimization algorithms to search for efficient adversarial attacks. In addition, we combine auto adversarial attack and defense together to form a unified framework. Among auto adversarial defense, the searched efficient attack can be used as the new robustness evaluation to further enhance the robustness. In auto adversarial attack, the searched robust architectures can be utilized as the threat model to help find stronger adversarial attacks. Experiments on CIFAR10, CIFAR100, and ImageNet datasets demonstrate the feasibility and effectiveness of the proposed platform, which can also provide a benchmark and toolkit for researchers in the application of automated machine learning in evaluating and improving the DNN model robustnesses.
    One-Step Abductive Multi-Target Learning with Diverse Noisy Samples and Its Application to Tumour Segmentation for Breast Cancer. (arXiv:2110.10325v8 [cs.LG] UPDATED)
    Recent studies have demonstrated the effectiveness of the combination of machine learning and logical reasoning, including data-driven logical reasoning, knowledge driven machine learning and abductive learning, in inventing advanced artificial intelligence technologies. One-step abductive multi-target learning (OSAMTL), an approach inspired by abductive learning, via simply combining machine learning and logical reasoning in a one-step balanced way, has as well shown its effectiveness in handling complex noisy labels of a single noisy sample in medical histopathology whole slide image analysis (MHWSIA). However, OSAMTL is not suitable for the situation where diverse noisy samples (DiNS) are provided for a learning task. In this paper, giving definition of DiNS, we propose one-step abductive multi-target learning with DiNS (OSAMTL-DiNS) to expand the original OSAMTL to handle complex noisy labels of DiNS. Applying OSAMTL-DiNS to tumour segmentation for breast cancer in MHWSIA, we show that OSAMTL-DiNS is able to enable various state-of-the-art approaches for learning from noisy labels to achieve more rational predictions.
    Gap Minimization for Knowledge Sharing and Transfer. (arXiv:2201.11231v2 [cs.LG] UPDATED)
    Learning from multiple related tasks by knowledge sharing and transfer has become increasingly relevant over the last two decades. In order to successfully transfer information from one task to another, it is critical to understand the similarities and differences between the domains. In this paper, we introduce the notion of \emph{performance gap}, an intuitive and novel measure of the distance between learning tasks. Unlike existing measures which are used as tools to bound the difference of expected risks between tasks (e.g., $\mathcal{H}$-divergence or discrepancy distance), we theoretically show that the performance gap can be viewed as a data- and algorithm-dependent regularizer, which controls the model complexity and leads to finer guarantees. More importantly, it also provides new insights and motivates a novel principle for designing strategies for knowledge sharing and transfer: gap minimization. We instantiate this principle with two algorithms: 1. gapBoost, a novel and principled boosting algorithm that explicitly minimizes the performance gap between source and target domains for transfer learning; and 2. gapMTNN, a representation learning algorithm that reformulates gap minimization as semantic conditional matching for multitask learning. Our extensive evaluation on both transfer learning and multitask learning benchmark data sets shows that our methods outperform existing baselines.
    Forecasting Market Changes using Variational Inference. (arXiv:2205.00605v2 [q-fin.ST] UPDATED)
    Though various approaches have been considered, forecasting near-term market changes of equities and similar market data remains quite difficult. In this paper we introduce an approach to forecast near-term market changes for equity indices as well as portfolios using variational inference (VI). VI is a machine learning approach which uses optimization techniques to estimate complex probability densities. In the proposed approach, clusters of explanatory variables are identified and market changes are forecast based on cluster-specific linear regression. Apart from the expected value of changes, the proposed approach can also be used to obtain the distribution of possible outcomes. Another advantage of the proposed approach is the clear model interpretation, as clusters of explanatory variables (or market regimes) are identified for which the future changes follow similar relationships. Knowledge about such clusters can provide useful insights about portfolio performance and identify the relative importance of variables in different market regimes. An illustrative example of predicting one-day S\&P change is considered and it is shown that even with as few as three explanatory variables, the proposed approach provides useful predictions.
    Analysis of autocorrelation times in Neural Markov Chain Monte Carlo simulations. (arXiv:2111.10189v3 [cond-mat.stat-mech] UPDATED)
    We provide a deepened study of autocorrelations in Neural Markov Chain Monte Carlo (NMCMC) simulations, a version of the traditional Metropolis algorithm which employs neural networks to provide independent proposals. We illustrate our ideas using the two-dimensional Ising model. We discuss several estimates of autocorrelation times in the context of NMCMC, some inspired by analytical results derived for the Metropolized Independent Sampler (MIS). We check their reliability by estimating them on a small system where analytical results can also be obtained. Based on the analytical results for MIS we propose a new loss function and study its impact on the autocorelation times. Although, this function's performance is a bit inferior to the traditional Kullback-Leibler divergence, it offers two training algorithms which in some situations may be beneficial. By studying a small, $4 \times 4$, system we gain access to the dynamics of the training process which we visualize using several observables. Furthermore, we quantitatively investigate the impact of imposing global discrete symmetries of the system in the neural network training process on the autocorrelation times. Eventually, we propose a scheme which incorporates partial heat-bath updates which considerably improves the quality of the training. The impact of the above enhancements is discussed for a $16 \times 16$ spin system. The summary of our findings may serve as a guidance to the implementation of Neural Markov Chain Monte Carlo simulations for more complicated models.
    Federated Learning with Heterogeneous Differential Privacy. (arXiv:2110.15252v2 [cs.LG] UPDATED)
    Federated learning (FL) takes a first step towards privacy-preserving machine learning by training models while keeping client data local. Models trained using FL may still leak private client information through model updates during training. Differential privacy (DP) may be employed on model updates to provide privacy guarantees within FL, typically at the cost of degraded performance of the final trained model. Both non-private FL and DP-FL can be solved using variants of the federated averaging (FedAvg) algorithm. In this work, we consider a heterogeneous DP setup where clients require varying degrees of privacy guarantees. First, we analyze the optimal solution to the federated linear regression problem with heterogeneous DP in a Bayesian setup. We find that unlike the non-private setup, where the optimal solution for homogeneous data amounts to a single global solution for all clients learned through FedAvg, the optimal solution for each client in this setup would be a personalized one even for homogeneous data. We also analyze the privacy-utility trade-off for this setup, where we characterize the gain obtained from heterogeneous privacy where some clients opt for less strict privacy guarantees. We propose a new algorithm for FL with heterogeneous DP, named FedHDP, which employs personalization and weighted averaging at the server using the privacy choices of clients, to achieve better performance on clients' local models. Through numerical experiments, we show that FedHDP provides up to $9.27\%$ performance gain compared to the baseline DP-FL for the considered datasets where $5\%$ of clients opt out of DP. Additionally, we show a gap in the average performance of local models between non-private and private clients of up to $3.49\%$, empirically illustrating that the baseline DP-FL might incur a large utility cost when not all clients require the stricter privacy guarantees.
    MANDERA: Malicious Node Detection in Federated Learning via Ranking. (arXiv:2110.11736v2 [cs.LG] UPDATED)
    Byzantine attacks hinder the deployment of federated learning algorithms. Although we know that the benign gradients and Byzantine attacked gradients are distributed differently, to detect the malicious gradients is challenging due to (1) the gradient is high-dimensional and each dimension has its unique distribution and (2) the benign gradients and the attacked gradients are always mixed (two-sample test methods cannot apply directly). To address the above, for the first time, we propose MANDERA which is theoretically guaranteed to efficiently detect all malicious gradients under Byzantine attacks with no prior knowledge or history about the number of attacked nodes. More specifically, we transfer the original updating gradient space into a ranking matrix. By such an operation, the scales of different dimensions of the gradients in the ranking space become identical. The high-dimensional benign gradients and the malicious gradients can be easily separated. The effectiveness of MANDERA is further confirmed by experimentation on four Byzantine attack implementations (Gaussian, Zero Gradient, Sign Flipping, Shifted Mean), comparing with state-of-the-art defenses. The experiments cover both IID and Non-IID datasets.
    A Deep Reinforcement Learning Approach for Online Parcel Assignment. (arXiv:2109.03467v2 [cs.LG] UPDATED)
    In this paper, we investigate the online parcel assignment (OPA) problem, in which each stochastically generated parcel needs to be assigned to a candidate route for delivery to minimize the total cost subject to certain business constraints. The OPA problem is challenging due to its stochastic nature: each parcel's candidate routes, which depends on the parcel's origin, destination, weight, etc., are unknown until its order is placed, and the total parcel volume is uncertain in advance. To tackle this challenge, we propose the PPO-OPA algorithm based on deep reinforcement learning that shows competitive performance. More specifically, we introduce a novel Markov Decision Process (MDP) framework to model the OPA problem, and develop a policy gradient algorithm that adopts attention networks for policy evaluation. By designing a dedicated reward function, our proposed algorithm can achieve a lower total cost with smaller violation of constraints, comparing to the traditional method which assigns parcels to candidate routes proportionally. In addition, the performances of our proposed algorithm and the Primal-Dual algorithm are comparable, while the later assumes a known total parcel volume in advance, which is unrealistic in practice.
    First-Order Algorithms for Nonlinear Generalized Nash Equilibrium Problems. (arXiv:2204.03132v2 [math.OC] UPDATED)
    We consider the problem of computing an equilibrium in a class of \textit{nonlinear generalized Nash equilibrium problems (NGNEPs)} in which the strategy sets for each player are defined by the equality and inequality constraints that may depend on the choices of rival players. While the asymptotic global convergence and local convergence rate of certain algorithms have been extensively investigated, the iteration complexity analysis is still in its infancy. This paper provides two first-order algorithms based on quadratic penalty method (QPM) and augmented Lagrangian method (ALM), respectively, with an accelerated mirror-prox algorithm as the solver in each inner loop. We show the nonasymptotic convergence rate for these algorithms. In particular, we establish the global convergence guarantee for solving monotone and strongly monotone NGNEPs and provide the complexity bounds expressed in terms of the number of gradient evaluations. Experimental results demonstrate the efficiency of our algorithms in practice.
    GraphTheta: A Distributed Graph Neural Network Learning System With Flexible Training Strategy. (arXiv:2104.10569v3 [cs.LG] UPDATED)
    Graph neural networks (GNNs) have been demonstrated as a powerful tool for analyzing non-Euclidean graph data. However, the lack of efficient distributed graph learning systems severely hinders applications of GNNs, especially when graphs are big and GNNs are relatively deep. Herein, we present GraphTheta, the first distributed and scalable graph learning system built upon vertex-centric distributed graph processing with neural network operators implemented as user-defined functions. This system supports multiple training strategies and enables efficient and scalable big-graph learning on distributed (virtual) machines with low memory. To facilitate graph convolutions, GraphTheta puts forward a new graph learning abstraction named NN-TGAR to bridge the gap between graph processing and graph deep learning. A distributed graph engine is proposed to conduct the stochastic gradient descent optimization with a hybrid-parallel execution, and a new cluster-batched training strategy is supported. We evaluate GraphTheta using several datasets with network sizes ranging from small-, modest- to large-scale. Experimental results show that GraphTheta can scale well to 1,024 workers for training an in-house developed GNN on an industry-scale Alipay dataset of 1.4 billion nodes and 4.1 billion attributed edges, with a cluster of CPU virtual machines (dockers) of small memory each (5$\sim$12GB). Moreover, GraphTheta can outperform DistDGL by up to $2.02\times$, with better scalability, and GraphLearn by up to $30.56\times$. As for model accuracy, GraphTheta is capable of learning as good GNNs as existing frameworks. To the best of our knowledge, this work presents the largest edge-attributed GNN learning task in the literature.
    Chebyshev-Cantelli PAC-Bayes-Bennett Inequality for the Weighted Majority Vote. (arXiv:2106.13624v2 [cs.LG] UPDATED)
    We present a new second-order oracle bound for the expected risk of a weighted majority vote. The bound is based on a novel parametric form of the Chebyshev- Cantelli inequality (a.k.a. one-sided Chebyshev's), which is amenable to efficient minimization. The new form resolves the optimization challenge faced by prior oracle bounds based on the Chebyshev-Cantelli inequality, the C-bounds [Germain et al., 2015], and, at the same time, it improves on the oracle bound based on second order Markov's inequality introduced by Masegosa et al. [2020]. We also derive a new concentration of measure inequality, which we name PAC-Bayes-Bennett, since it combines PAC-Bayesian bounding with Bennett's inequality. We use it for empirical estimation of the oracle bound. The PAC-Bayes-Bennett inequality improves on the PAC-Bayes-Bernstein inequality of Seldin et al. [2012]. We provide an empirical evaluation demonstrating that the new bounds can improve on the work of Masegosa et al. [2020]. Both the parametric form of the Chebyshev-Cantelli inequality and the PAC-Bayes-Bennett inequality may be of independent interest for the study of concentration of measure in other domains.
    The Prominence of Artificial Intelligence in COVID-19. (arXiv:2111.09537v2 [cs.LG] UPDATED)
    In December 2019, a novel virus called COVID-19 had caused an enormous number of causalities to date. The battle with the novel Coronavirus is baffling and horrifying after the Spanish Flu 2019. While the front-line doctors and medical researchers have made significant progress in controlling the spread of the highly contiguous virus, technology has also proved its significance in the battle. Moreover, Artificial Intelligence has been adopted in many medical applications to diagnose many diseases, even baffling experienced doctors. Therefore, this survey paper explores the methodologies proposed that can aid doctors and researchers in early and inexpensive methods of diagnosis of the disease. Most developing countries have difficulties carrying out tests using the conventional manner, but a significant way can be adopted with Machine and Deep Learning. On the other hand, the access to different types of medical images has motivated the researchers. As a result, a mammoth number of techniques are proposed. This paper first details the background knowledge of the conventional methods in the Artificial Intelligence domain. Following that, we gather the commonly used datasets and their use cases to date. In addition, we also show the percentage of researchers adopting Machine Learning over Deep Learning. Thus we provide a thorough analysis of this scenario. Lastly, in the research challenges, we elaborate on the problems faced in COVID-19 research, and we address the issues with our understanding to build a bright and healthy environment.
    SITTA: Single Image Texture Translation for Data Augmentation. (arXiv:2106.13804v2 [cs.CV] UPDATED)
    Recent advances in data augmentation enable one to translate images by learning the mapping between a source domain and a target domain. Existing methods tend to learn the distributions by training a model on a variety of datasets, with results evaluated largely in a subjective manner. Relatively few works in this area, however, study the potential use of image synthesis methods for recognition tasks. In this paper, we propose and explore the problem of image translation for data augmentation. We first propose a lightweight yet efficient model for translating texture to augment images based on a single input of source texture, allowing for fast training and testing, referred to as Single Image Texture Translation for data Augmentation (SITTA). Then we explore the use of augmented data in long-tailed and few-shot image classification tasks. We find the proposed augmentation method and workflow is capable of translating the texture of input data into a target domain, leading to consistently improved image recognition performance. Finally, we examine how SITTA and related image translation methods can provide a basis for a data-efficient, "augmentation engineering" approach to model training. Codes are available at https://github.com/Boyiliee/SITTA.
    Pruning Edges and Gradients to Learn Hypergraphs from Larger Sets. (arXiv:2106.13919v2 [cs.LG] UPDATED)
    This paper aims for set-to-hypergraph prediction, where the goal is to infer the set of relations for a given set of entities. This is a common abstraction for applications in particle physics, biological systems, and combinatorial optimization. We address two common scaling problems encountered in set-to-hypergraph tasks that limit the size of the input set: the exponentially growing number of hyperedges and the run-time complexity, both leading to higher memory requirements. We make three contributions. First, we propose to predict and supervise the \emph{positive} edges only, which changes the asymptotic memory scaling from exponential to linear. Second, we introduce a training method that encourages iterative refinement of the predicted hypergraph, which allows us to skip iterations in the backward pass for improved efficiency and constant memory usage. Third, we combine both contributions in a single set-to-hypergraph model that enables us to address problems with larger input set sizes. We provide ablations for our main technical contributions and show that our model outperforms prior state-of-the-art, especially for larger sets.
    Genetic algorithm for feature selection of EEG heterogeneous data. (arXiv:2103.07117v2 [cs.NE] UPDATED)
    The electroencephalographic (EEG) signals provide highly informative data on brain activities and functions. However, their heterogeneity and high dimensionality may represent an obstacle for their interpretation. The introduction of a priori knowledge seems the best option to mitigate high dimensionality problems, but could lose some information and patterns present in the data, while data heterogeneity remains an open issue that often makes generalization difficult. In this study, we propose a genetic algorithm (GA) for feature selection that can be used with a supervised or unsupervised approach. Our proposal considers three different fitness functions without relying on expert knowledge. Starting from two publicly available datasets on cognitive workload and motor movement/imagery, the EEG signals are processed, normalized and their features computed in the time, frequency and time-frequency domains. The feature vector selection is performed by applying our GA proposal and compared with two benchmarking techniques. The results show that different combinations of our proposal achieve better results in respect to the benchmark in terms of overall performance and feature reduction. Moreover, the proposed GA, based on a novel fitness function here presented, outperforms the benchmark when the two different datasets considered are merged together, showing the effectiveness of our proposal on heterogeneous data.
    Relay Variational Inference: A Method for Accelerated Encoderless VI. (arXiv:2110.13422v2 [cs.LG] UPDATED)
    Variational Inference (VI) offers a method for approximating intractable likelihoods. In neural VI, inference of approximate posteriors is commonly done using an encoder. Alternatively, encoderless VI offers a framework for learning generative models from data without encountering suboptimalities caused by amortization via an encoder (e.g. in presence of missing or uncertain data). However, in absence of an encoder, such methods often suffer in convergence due to the slow nature of gradient steps required to learn the approximate posterior parameters. In this paper, we introduce Relay VI (RVI), a framework that dramatically improves both the convergence and performance of encoderless VI. In our experiments over multiple datasets, we study the effectiveness of RVI in terms of convergence speed, loss, representation power and missing data imputation. We find RVI to be a unique tool, often superior in both performance and convergence speed to previously proposed encoderless as well as amortized VI models (e.g. VAE).
    Labels, Information, and Computation: Efficient Learning Using Sufficient Labels. (arXiv:2104.09015v3 [cs.LG] UPDATED)
    In supervised learning, obtaining a large set of fully-labeled training data is expensive. We show that we do not always need full label information on every single training example to train a competent classifier. Specifically, inspired by the principle of sufficiency in statistics, we present a statistic (a summary) of the fully-labeled training set that captures almost all the relevant information for classification but at the same time is easier to obtain directly. We call this statistic "sufficiently-labeled data" and prove its sufficiency and efficiency for finding the optimal hidden representations, on which competent classifier heads can be trained using as few as a single randomly-chosen fully-labeled example per class. Sufficiently-labeled data can be obtained from annotators directly without collecting the fully-labeled data first. And we prove that it is easier to directly obtain sufficiently-labeled data than obtaining fully-labeled data. Furthermore, sufficiently-labeled data is naturally more secure since it stores relative, instead of absolute, information. Extensive experimental results are provided to support our theory.
    Random Planted Forest: a directly interpretable tree ensemble. (arXiv:2012.14563v2 [stat.ML] UPDATED)
    We introduce a novel interpretable, tree based algorithm for prediction in a regression setting in which each tree in a classical random forest is replaced by a family of planted trees that grow simultaneously. The motivation for our algorithm is to estimate the unknown regression function from a functional decomposition perspective, where each tree corresponds to a function within that decomposition. The maximal order of approximation in the decomposition can be specified or left unlimited. If a first order approximation is chosen, the result is an additive model. In the other extreme case, if the order of approximation is not limited, the resulting model places no restrictions on the form of the regression function. In a simulation study we find encouraging prediction and visualisation properties of our random planted forest method. We also develop theory for an idealised version of random planted forests in cases where the maximal order of approximation is low. We show that if the order is smaller than three, the idealised version achieves asymptotically optimal convergence rates up to a logarithmic factor. ode is available on https://github.com/PlantedML/randomPlantedForest
    Post-training Quantization for Neural Networks with Provable Guarantees. (arXiv:2201.11113v3 [cs.LG] UPDATED)
    While neural networks have been remarkably successful in a wide array of applications, implementing them in resource-constrained hardware remains an area of intense research. By replacing the weights of a neural network with quantized (e.g., 4-bit, or binary) counterparts, massive savings in computation cost, memory, and power consumption are attained. To that end, we generalize a post-training neural-network quantization method, GPFQ, that is based on a greedy path-following mechanism. Among other things, we propose modifications to promote sparsity of the weights, and rigorously analyze the associated error. Additionally, our error analysis expands the results of previous work on GPFQ to handle general quantization alphabets, showing that for quantizing a single-layer network, the relative square error essentially decays linearly in the number of weights -- i.e., level of over-parametrization. Our result holds across a range of input distributions and for both fully-connected and convolutional architectures thereby also extending previous results. To empirically evaluate the method, we quantize several common architectures with few bits per weight, and test them on ImageNet, showing only minor loss of accuracy compared to unquantized models. We also demonstrate that standard modifications, such as bias correction and mixed precision quantization, further improve accuracy.
    When saliency goes off on a tangent: Interpreting Deep Neural Networks with nonlinear saliency maps. (arXiv:2110.06639v3 [cs.LG] UPDATED)
    A fundamental bottleneck in utilising complex machine learning systems for critical applications has been not knowing why they do and what they do, thus preventing the development of any crucial safety protocols. To date, no method exist that can provide full insight into the granularity of the neural network's decision process. In the past, saliency maps were an early attempt at resolving this problem through sensitivity calculations, whereby dimensions of a data point are selected based on how sensitive the output of the system is to them. However, the success of saliency maps has been at best limited, mainly due to the fact that they interpret the underlying learning system through a linear approximation. We present a novel class of methods for generating nonlinear saliency maps which fully account for the nonlinearity of the underlying learning system. While agreeing with linear saliency maps on simple problems where linear saliency maps are correct, they clearly identify more specific drivers of classification on complex examples where nonlinearities are more pronounced. This new class of methods significantly aids interpretability of deep neural networks and related machine learning systems. Crucially, they provide a starting point for their more broad use in serious applications, where 'why' is equally important as 'what'.
    On the Sample Complexity of Stability Constrained Imitation Learning. (arXiv:2102.09161v3 [cs.LG] UPDATED)
    We study the following question in the context of imitation learning for continuous control: how are the underlying stability properties of an expert policy reflected in the sample-complexity of an imitation learning task? We provide the first results showing that a surprisingly granular connection can be made between the underlying expert system's incremental gain stability, a novel measure of robust convergence between pairs of system trajectories, and the dependency on the task horizon $T$ of the resulting generalization bounds. In particular, we propose and analyze incremental gain stability constrained versions of behavior cloning and a DAgger-like algorithm, and show that the resulting sample-complexity bounds naturally reflect the underlying stability properties of the expert system. As a special case, we delineate a class of systems for which the number of trajectories needed to achieve $\varepsilon$-suboptimality is sublinear in the task horizon $T$, and do so without requiring (strong) convexity of the loss function in the policy parameters. Finally, we conduct numerical experiments demonstrating the validity of our insights on both a simple nonlinear system for which the underlying stability properties can be easily tuned, and on a high-dimensional quadrupedal robotic simulation.
    Temporal-Logic-Based Reward Shaping for Continuing Reinforcement Learning Tasks. (arXiv:2007.01498v2 [cs.AI] UPDATED)
    In continuing tasks, average-reward reinforcement learning may be a more appropriate problem formulation than the more common discounted reward formulation. As usual, learning an optimal policy in this setting typically requires a large amount of training experiences. Reward shaping is a common approach for incorporating domain knowledge into reinforcement learning in order to speed up convergence to an optimal policy. However, to the best of our knowledge, the theoretical properties of reward shaping have thus far only been established in the discounted setting. This paper presents the first reward shaping framework for average-reward learning and proves that, under standard assumptions, the optimal policy under the original reward function can be recovered. In order to avoid the need for manual construction of the shaping function, we introduce a method for utilizing domain knowledge expressed as a temporal logic formula. The formula is automatically translated to a shaping function that provides additional reward throughout the learning process. We evaluate the proposed method on three continuing tasks. In all cases, shaping speeds up the average-reward learning rate without any reduction in the performance of the learned policy compared to relevant baselines.
    Sustainable Recreational Fishing Using a Novel Electrical Muscle Stimulation (EMS) Lure and Ensemble Network Algorithm to Maximize Catch and Release Survivability. (arXiv:2006.10125v2 [cs.CV] UPDATED)
    With 200-700 million anglers in the world, sportfishing is nearly five times more common than commercial trawling. Worldwide, hundreds of thousands of jobs are linked to the sportfishing industry, which generates billions of dollars for water-side communities and fisheries conservatories alike. However, the sheer popularity of recreational fishing poses threats to aquatic biodiversity that are hard to regulate. For example, as much as 25% of overfished populations can be traced to anglers. This alarming statistic is explained by the average catch and release mortality rate of 43%, which primarily results from hook-related injuries and careless out-of-water handling. The provisional-patented design proposed in this paper addresses both these problems separately First, a novel, electrical muscle stimulation based fishing lure is proposed as a harmless and low cost alternative to sharp hooks. Early prototypes show a constant electrical current of 90 mA applied through a 200g European perch's jaw can support a reeling tension of 2N - safely within the necessary ranges. Second, a fish-eye camera bob is designed to wirelessly relay underwater footage to a smartphone app, where an ensemble convolutional neural network automatically classifies the fish's species, estimates its length, and cross references with local and state fishing regulations (ie. minimum size, maximum bag limit, and catch season). This capability reduces overfishing by helping anglers avoid accidentally violating guidelines and eliminates the need to reel the fish in and expose it to negligent handling. IN conjunction, this cheap, lightweight, yet high-tech invention is a paradigm shift in preserving a world favorite pastime; while at the same time making recreational fishing more sustainable.
    Universal Prediction Band via Semi-Definite Programming. (arXiv:2103.17203v3 [stat.ML] UPDATED)
    We propose a computationally efficient method to construct nonparametric, heteroscedastic prediction bands for uncertainty quantification, with or without any user-specified predictive model. Our approach provides an alternative to the now-standard conformal prediction for uncertainty quantification, with novel theoretical insights and computational advantages. The data-adaptive prediction band is universally applicable with minimal distributional assumptions, has strong non-asymptotic coverage properties, and is easy to implement using standard convex programs. Our approach can be viewed as a novel variance interpolation with confidence and further leverages techniques from semi-definite programming and sum-of-squares optimization. Theoretical and numerical performances for the proposed approach for uncertainty quantification are analyzed.
    Decentralized Exploration in Multi-Armed Bandits -- Extended version. (arXiv:1811.07763v6 [cs.LG] UPDATED)
    We consider the decentralized exploration problem: a set of players collaborate to identify the best arm by asynchronously interacting with the same stochastic environment. The objective is to insure privacy in the best arm identification problem between asynchronous, collaborative, and thrifty players. In the context of a digital service, we advocate that this decentralized approach allows a good balance between the interests of users and those of service providers: the providers optimize their services, while protecting the privacy of the users and saving resources. We define the privacy level as the amount of information an adversary could infer by intercepting the messages concerning a single user. We provide a generic algorithm Decentralized Elimination, which uses any best arm identification algorithm as a subroutine. We prove that this algorithm insures privacy, with a low communication cost, and that in comparison to the lower bound of the best arm identification problem, its sample complexity suffers from a penalty depending on the inverse of the probability of the most frequent players. Then, thanks to the genericity of the approach, we extend the proposed algorithm to the non-stationary bandits. Finally, experiments illustrate and complete the analysis.
    Necessary and Sufficient Conditions for Inverse Reinforcement Learning of Bayesian Stopping Time Problems. (arXiv:2007.03481v5 [cs.LG] UPDATED)
    This paper presents an inverse reinforcement learning~(IRL) framework for Bayesian stopping time problems. By observing the actions of a Bayesian decision maker, we provide a necessary and sufficient condition to identify if these actions are consistent with optimizing a cost function. In a Bayesian (partially observed) setting, the inverse learner can at best identify optimality wrt the observed actions. Our IRL algorithm identifies optimality and then constructs set valued estimates of the cost function. To achieve this IRL objective, we use novel ideas from Bayesian revealed preferences stemming from microeconomics. We illustrate the proposed IRL scheme using two important examples of stopping time problems, namely, sequential hypothesis testing and Bayesian search, and also on a real-world YouTube dataset. Finally, for finite datasets, we propose an IRL detection algorithm and give finite sample bounds on its error probabilities.
    Approximation Theory of Tree Tensor Networks: Tensorized Univariate Functions -- Part I. (arXiv:2007.00118v4 [math.FA] UPDATED)
    We study the approximation of functions by tensor networks (TNs). We show that Lebesgue $L^p$-spaces in one dimension can be identified with tensor product spaces of arbitrary order through tensorization. We use this tensor product structure to define subsets of $L^p$ of rank-structured functions of finite representation complexity. These subsets are then used to define different approximation classes of tensor networks, associated with different measures of complexity. These approximation classes are shown to be quasi-normed linear spaces. We study some elementary properties and relationships of said spaces. In part II of this work, we will show that classical smoothness (Besov) spaces are continuously embedded into these approximation classes. We will also show that functions in these approximation classes do not possess any Besov smoothness, unless one restricts the depth of the tensor networks. The results of this work are both an analysis of the approximation spaces of TNs and a study of the expressivity of a particular type of neural networks (NN) -- namely feed-forward sum-product networks with sparse architecture. The input variables of this network result from the tensorization step, interpreted as a particular featuring step which can also be implemented with a neural network with a specific architecture. We point out interesting parallels to recent results on the expressivity of rectified linear unit (ReLU) networks -- currently one of the most popular type of NNs.
    Approximation Theory of Tree Tensor Networks: Tensorized Univariate Functions -- Part II. (arXiv:2007.00128v4 [math.FA] UPDATED)
    We study the approximation by tensor networks (TNs) of functions from classical smoothness classes. The considered approximation tool combines a tensorization of functions in $L^p([0,1))$, which allows to identify a univariate function with a multivariate function (or tensor), and the use of tree tensor networks (the tensor train format) for exploiting low-rank structures of multivariate functions. The resulting tool can be interpreted as a feed-forward neural network, with first layers implementing the tensorization, interpreted as a particular featuring step, followed by a sum-product network with sparse architecture. In part I of this work, we presented several approximation classes associated with different measures of complexity of tensor networks and studied their properties. In this work (part II), we show how classical approximation tools, such as polynomials or splines (with fixed or free knots), can be encoded as a tensor network with controlled complexity. We use this to derive direct (Jackson) inequalities for the approximation spaces of tensor networks. This is then utilized to show that Besov spaces are continuously embedded into these approximation spaces. In other words, we show that arbitrary Besov functions can be approximated with optimal or near to optimal rate. We also show that an arbitrary function in the approximation class possesses no Besov smoothness, unless one limits the depth of the tensor network.
    Nonlinear Independent Component Analysis for Discrete-Time and Continuous-Time Signals. (arXiv:2102.02876v3 [stat.ML] UPDATED)
    We study the classical problem of recovering a multidimensional source signal from observations of nonlinear mixtures of this signal. We show that this recovery is possible (up to a permutation and monotone scaling of the source's original component signals) if the mixture is due to a sufficiently differentiable and invertible but otherwise arbitrarily nonlinear function and the component signals of the source are statistically independent with 'non-degenerate' second-order statistics. The latter assumption requires the source signal to meet one of three regularity conditions which essentially ensure that the source is sufficiently far away from the non-recoverable extremes of being deterministic or constant in time. These assumptions, which cover many popular time series models and stochastic processes, allow us to reformulate the initial problem of nonlinear blind source separation as a simple-to-state problem of optimisation-based function approximation. We propose to solve this approximation problem by minimizing a novel type of objective function that efficiently quantifies the mutual statistical dependence between multiple stochastic processes via cumulant-like statistics. This yields a scalable and direct new method for nonlinear Independent Component Analysis with widely applicable theoretical guarantees and for which our experiments indicate good performance.
    Robust Max Entrywise Error Bounds for Tensor Estimation from Sparse Observations via Similarity Based Collaborative Filtering. (arXiv:1908.01241v4 [cs.LG] UPDATED)
    Consider the task of estimating a 3-order $n \times n \times n$ tensor from noisy observations of randomly chosen entries in the sparse regime. We introduce a similarity based collaborative filtering algorithm for estimating a tensor from sparse observations and argue that it achieves sample complexity that nearly matches the conjectured computationally efficient lower bound on the sample complexity for the setting of low-rank tensors. Our algorithm uses the matrix obtained from the flattened tensor to compute similarity, and estimates the tensor entries using a nearest neighbor estimator. We prove that the algorithm recovers a finite rank tensor with maximum entry-wise error (MEE) and mean-squared-error (MSE) decaying to $0$ as long as each entry is observed independently with probability $p = \Omega(n^{-3/2 + \kappa})$ for any arbitrarily small $\kappa > 0$. More generally, we establish robustness of the estimator, showing that when arbitrary noise bounded by $\varepsilon \geq 0$ is added to each observation, the estimation error with respect to MEE and MSE degrades by $\text{poly}(\varepsilon)$. Consequently, even if the tensor may not have finite rank but can be approximated within $\varepsilon \geq 0$ by a finite rank tensor, then the estimation error converges to $\text{poly}(\varepsilon)$. Our analysis sheds insight into the conjectured sample complexity lower bound, showing that it matches the connectivity threshold of the graph used by our algorithm for estimating similarity between coordinates.
    Efficiently Breaking the Curse of Horizon in Off-Policy Evaluation with Double Reinforcement Learning. (arXiv:1909.05850v6 [stat.ML] UPDATED)
    Off-policy evaluation (OPE) in reinforcement learning is notoriously difficult in long- and infinite-horizon settings due to diminishing overlap between behavior and target policies. In this paper, we study the role of Markovian and time-invariant structure in efficient OPE. We first derive the efficiency bounds for OPE when one assumes each of these structures. This precisely characterizes the curse of horizon: in time-variant processes, OPE is only feasible in the near-on-policy setting, where behavior and target policies are sufficiently similar. But, in time-invariant Markov decision processes, our bounds show that truly-off-policy evaluation is feasible, even with only just one dependent trajectory, and provide the limits of how well we could hope to do. We develop a new estimator based on Double Reinforcement Learning (DRL) that leverages this structure for OPE using the efficient influence function we derive. Our DRL estimator simultaneously uses estimated stationary density ratios and $q$-functions and remains efficient when both are estimated at slow, nonparametric rates and remains consistent when either is estimated consistently. We investigate these properties and the performance benefits of leveraging the problem structure for more efficient OPE.
    Q-EEGNet: an Energy-Efficient 8-bit Quantized Parallel EEGNet Implementation for Edge Motor-Imagery Brain--Machine Interfaces. (arXiv:2004.11690v3 [eess.SP] UPDATED)
    Motor-Imagery Brain--Machine Interfaces (MI-BMIs)promise direct and accessible communication between human brains and machines by analyzing brain activities recorded with Electroencephalography (EEG). Latency, reliability, and privacy constraints make it unsuitable to offload the computation to the cloud. Practical use cases demand a wearable, battery-operated device with low average power consumption for long-term use. Recently, sophisticated algorithms, in particular deep learning models, have emerged for classifying EEG signals. While reaching outstanding accuracy, these models often exceed the limitations of edge devices due to their memory and computational requirements. In this paper, we demonstrate algorithmic and implementation optimizations for EEGNET, a compact Convolutional Neural Network (CNN) suitable for many BMI paradigms. We quantize weights and activations to 8-bit fixed-point with a negligible accuracy loss of 0.4% on 4-class MI, and present an energy-efficient hardware-aware implementation on the Mr.Wolf parallel ultra-low power (PULP) System-on-Chip (SoC) by utilizing its custom RISC-V ISA extensions and 8-core compute cluster. With our proposed optimization steps, we can obtain an overall speedup of 64x and a reduction of up to 85% in memory footprint with respect to a single-core layer-wise baseline implementation. Our implementation takes only 5.82 ms and consumes 0.627 mJ per inference. With 21.0GMAC/s/W, it is 256x more energy-efficient than an EEGNET implementation on an ARM Cortex-M7 (0.082GMAC/s/W).
    Approximation Theory of Tree Tensor Networks: Tensorized Multivariate Functions. (arXiv:2101.11932v2 [math.FA] UPDATED)
    We study the approximation of multivariate functions with tensor networks (TNs). The main conclusion of this work is an answer to the following two questions: "What are the approximation capabilities of TNs?" and "What is an appropriate model class of functions that can be approximated with TNs?" To answer the former: we show that TNs can (near to) optimally replicate $h$-uniform and $h$-adaptive approximation, for any smoothness order of the target function. Tensor networks thus exhibit universal expressivity w.r.t. isotropic, anisotropic and mixed smoothness spaces that is comparable with more general neural networks families such as deep rectified linear unit (ReLU) networks. Put differently, TNs have the capacity to (near to) optimally approximate many function classes -- without being adapted to the particular class in question. To answer the latter: as a candidate model class we consider approximation classes of TNs and show that these are (quasi-)Banach spaces, that many types of classical smoothness spaces are continuously embedded into said approximation classes and that TN approximation classes are themselves not embedded in any classical smoothness space.
    Taming neural networks with TUSLA: Non-convex learning via adaptive stochastic gradient Langevin algorithms. (arXiv:2006.14514v4 [cs.LG] UPDATED)
    Artificial neural networks (ANNs) are typically highly nonlinear systems which are finely tuned via the optimization of their associated, non-convex loss functions. In many cases, the gradient of any such loss function has superlinear growth, making the use of the widely-accepted (stochastic) gradient descent methods, which are based on Euler numerical schemes, problematic. We offer a new learning algorithm based on an appropriately constructed variant of the popular stochastic gradient Langevin dynamics (SGLD), which is called tamed unadjusted stochastic Langevin algorithm (TUSLA). We also provide a nonasymptotic analysis of the new algorithm's convergence properties in the context of non-convex learning problems with the use of ANNs. Thus, we provide finite-time guarantees for TUSLA to find approximate minimizers of both empirical and population risks. The roots of the TUSLA algorithm are based on the taming technology for diffusion processes with superlinear coefficients as developed in \citet{tamed-euler, SabanisAoAP} and for MCMC algorithms in \citet{tula}. Numerical experiments are presented which confirm the theoretical findings and illustrate the need for the use of the new algorithm in comparison to vanilla SGLD within the framework of ANNs.
    Learned Lifted Linearization Applied to Unstable Dynamic Systems Enabled by Koopman Direct Encoding. (arXiv:2210.13602v2 [cs.LG] UPDATED)
    This paper presents a Koopman lifting linearization method that is applicable to nonlinear dynamical systems having both stable and unstable regions. It is known that DMD and other standard data-driven methods face a fundamental difficulty in constructing a Koopman model when applied to unstable systems. Here we solve the problem by incorporating knowledge about a nonlinear state equation with a learning method for finding an effective set of observables. In a lifted space, stable and unstable regions are separated into independent subspaces. Based on this property, we propose to find effective observables through neural net training where training data are separated into stable and unstable trajectories. The resultant learned observables are used for constructing a linear state transition matrix using method known as Direct Encoding, which transforms the nonlinear state equation to a state transition matrix through inner product computations with the observables. The proposed method shows a dramatic improvement over existing DMD and data-driven methods.
    Identifying Time Lag in Dynamical Systems with Copula Entropy based Transfer Entropy. (arXiv:2301.06037v1 [cs.LG])
    Time lag between variables is a key characteristics of dynamical systems in different fields and identifying such time lag is a central problem in complex systems with many applications. Transfer Entropy (TE) was proposed as a tool for time lag identification recently. Unfortunately, estimating TE has been a notoriously difficult problem. Copula Entropy (CE) is a measure of statistical independence and it was proved that TE can be represented with only CE. Therefore, a non-parametric estimator of TE based on CE was proposed according to such representation recently. In this paper we propose to use the CE-based estimator of TE to identify time lag in dynamical systems. Both simulated and real data are used to verify the effectiveness of the proposed method in the experiments. Experimental results show that the proposed method can identify the time lags in the three simulated systems. The real data experiment with the data on power consumption of the Tetouan city also demonstrates that our method can identify the pattern of time lags through the estimated TE from the weather factors to the power consumption of the city.
    Pluto's Surface Mapping using Unsupervised Learning from Near-Infrared Observations of LEISA/Ralph. (arXiv:2301.06027v1 [astro-ph.EP])
    We map the surface of Pluto using an unsupervised machine learning technique using the near-infrared observations of the LEISA/Ralph instrument onboard NASA's New Horizons spacecraft. The principal component reduced Gaussian mixture model was implemented to investigate the geographic distribution of the surface units across the dwarf planet. We also present the likelihood of each surface unit at the image pixel level. Average I/F spectra of each unit were analyzed -- in terms of the position and strengths of absorption bands of abundant volatiles such as N${}_{2}$, CH${}_{4}$, and CO and nonvolatile H${}_{2}$O -- to connect the unit to surface composition, geology, and geographic location. The distribution of surface units shows a latitudinal pattern with distinct surface compositions of volatiles -- consistent with the existing literature. However, previous mapping efforts were based primarily on compositional analysis using spectral indices (indicators) or implementation of complex radiative transfer models, which need (prior) expert knowledge, label data, or optical constants of representative endmembers. We prove that an application of unsupervised learning in this instance renders a satisfactory result in mapping the spatial distribution of ice compositions without any prior information or label data. Thus, such an application is specifically advantageous for a planetary surface mapping when label data are poorly constrained or completely unknown, because an understanding of surface material distribution is vital for volatile transport modeling at the planetary scale. We emphasize that the unsupervised learning used in this study has wide applicability and can be expanded to other planetary bodies of the Solar System for mapping surface material distribution.
    Self-recovery of memory via generative replay. (arXiv:2301.06030v1 [cs.NE])
    A remarkable capacity of the brain is its ability to autonomously reorganize memories during offline periods. Memory replay, a mechanism hypothesized to underlie biological offline learning, has inspired offline methods for reducing forgetting in artificial neural networks in continual learning settings. A memory-efficient and neurally-plausible method is generative replay, which achieves state of the art performance on continual learning benchmarks. However, unlike the brain, standard generative replay does not self-reorganize memories when trained offline on its own replay samples. We propose a novel architecture that augments generative replay with an adaptive, brain-like capacity to autonomously recover memories. We demonstrate this capacity of the architecture across several continual learning tasks and environments.
    Semantic and Effective Communication for Remote Control Tasks with Dynamic Feature Compression. (arXiv:2301.05901v1 [cs.LG])
    The coordination of robotic swarms and the remote wireless control of industrial systems are among the major use cases for 5G and beyond systems: in these cases, the massive amounts of sensory information that needs to be shared over the wireless medium can overload even high-capacity connections. Consequently, solving the effective communication problem by optimizing the transmission strategy to discard irrelevant information can provide a significant advantage, but is often a very complex task. In this work, we consider a prototypal system in which an observer must communicate its sensory data to an actor controlling a task (e.g., a mobile robot in a factory). We then model it as a remote Partially Observable Markov Decision Process (POMDP), considering the effect of adopting semantic and effective communication-oriented solutions on the overall system performance. We split the communication problem by considering an ensemble Vector Quantized Variational Autoencoder (VQ-VAE) encoding, and train a Deep Reinforcement Learning (DRL) agent to dynamically adapt the quantization level, considering both the current state of the environment and the memory of past messages. We tested the proposed approach on the well-known CartPole reference control problem, obtaining a significant performance increase over traditional approaches
    An Accurate EEGNet-based Motor-Imagery Brain-Computer Interface for Low-Power Edge Computing. (arXiv:2004.00077v3 [eess.SP] UPDATED)
    This paper presents an accurate and robust embedded motor-imagery brain-computer interface (MI-BCI). The proposed novel model, based on EEGNet, matches the requirements of memory footprint and computational resources of low-power microcontroller units (MCUs), such as the ARM Cortex-M family. Furthermore, the paper presents a set of methods, including temporal downsampling, channel selection, and narrowing of the classification window, to further scale down the model to relax memory requirements with negligible accuracy degradation. Experimental results on the Physionet EEG Motor Movement/Imagery Dataset show that standard EEGNet achieves 82.43%, 75.07%, and 65.07% classification accuracy on 2-, 3-, and 4-class MI tasks in global validation, outperforming the state-of-the-art (SoA) convolutional neural network (CNN) by 2.05%, 5.25%, and 5.48%. Our novel method further scales down the standard EEGNet at a negligible accuracy loss of 0.31% with 7.6x memory footprint reduction and a small accuracy loss of 2.51% with 15x reduction. The scaled models are deployed on a commercial Cortex-M4F MCU taking 101ms and consuming 4.28mJ per inference for operating the smallest model, and on a Cortex-M7 with 44ms and 18.1mJ per inference for the medium-sized model, enabling a fully autonomous, wearable, and accurate low-power BCI.
    Micro and Macro Level Graph Modeling for Graph Variational Auto-Encoders. (arXiv:2210.16844v2 [cs.LG] UPDATED)
    Generative models for graph data are an important research topic in machine learning. Graph data comprise two levels that are typically analyzed separately: node-level properties such as the existence of a link between a pair of nodes, and global aggregate graph-level statistics, such as motif counts. This paper proposes a new multi-level framework that jointly models node-level properties and graph-level statistics, as mutually reinforcing sources of information. We introduce a new micro-macro training objective for graph generation that combines node-level and graph-level losses. We utilize the micro-macro objective to improve graph generation with a GraphVAE, a well-established model based on graph-level latent variables, that provides fast training and generation time for medium-sized graphs. Our experiments show that adding micro-macro modeling to the GraphVAE model improves graph quality scores up to 2 orders of magnitude on five benchmark datasets, while maintaining the GraphVAE generation speed advantage.
    EvoAAA: An evolutionary methodology for automated \neural autoencoder architecture search. (arXiv:2301.06047v1 [cs.NE])
    Machine learning models work better when curated features are provided to them. Feature engineering methods have been usually used as a preprocessing step to obtain or build a proper feature set. In late years, autoencoders (a specific type of symmetrical neural network) have been widely used to perform representation learning, proving their competitiveness against classical feature engineering algorithms. The main obstacle in the use of autoencoders is finding a good architecture, a process that most experts confront manually. An automated autoencoder architecture search procedure, based on evolutionary methods, is proposed in this paper. The methodology is tested against nine heterogeneous data sets. The obtained results show the ability of this approach to find better architectures, able to concentrate most of the useful information in a minimized coding, in a reduced time.
    Margin Optimal Classification Trees. (arXiv:2210.10567v4 [math.OC] UPDATED)
    In recent years there has been growing attention to interpretable machine learning models which can give explanatory insights on their behavior. Thanks to their interpretability, decision trees have been intensively studied for classification tasks, and due to the remarkable advances in mixed-integer programming (MIP), various approaches have been proposed to formulate the problem of training an Optimal Classification Tree (OCT) as a MIP model. We present a novel mixed-integer quadratic formulation for the OCT problem, which exploits the generalization capabilities of Support Vector Machines for binary classification. Our model, denoted as Margin Optimal Classification Tree (MARGOT), encompasses the use of maximum margin multivariate hyperplanes nested in a binary tree structure. To enhance the interpretability of our approach, we analyse two alternative versions of MARGOT, which include feature selection constraints inducing local sparsity of the hyperplanes. First, MARGOT has been tested on non-linearly separable synthetic datasets in 2-dimensional feature space to provide a graphical representation of the maximum margin approach. Finally, the proposed models have been tested on benchmark datasets from the UCI repository. The MARGOT formulation turns out to be easier to solve than other OCT approaches, and the generated tree better generalizes on new observations. The two interpretable versions are effective in selecting the most relevant features and maintaining good prediction quality.
    A Review on the effectiveness of Dimensional Reduction with Computational Forensics: An Application on Malware Analysis. (arXiv:2301.06031v1 [cs.CR])
    The Android operating system is pervasively adopted as the operating system platform of choice for smart devices. However, the strong adoption has also resulted in exponential growth in the number of Android based malicious software or malware. To deal with such cyber threats as part of cyber investigation and digital forensics, computational techniques in the form of machine learning algorithms are applied for such malware identification, detection and forensics analysis. However, such Computational Forensics modelling techniques are constrained the volume, velocity, variety and veracity of the malware landscape. This in turn would affect its identification and detection effectiveness. Such consequence would inherently induce the question of sustainability with such solution approach. One approach to optimise effectiveness is to apply dimensional reduction techniques like Principal Component Analysis with the intent to enhance algorithmic performance. In this paper, we evaluate the effectiveness of the application of Principle Component Analysis on Computational Forensics task of detecting Android based malware. We applied our research hypothesis to three different datasets with different machine learning algorithms. Our research result showed that the dimensionally reduced dataset would result in a measure of degradation in accuracy performance.
    On the role of Model Uncertainties in Bayesian Optimization. (arXiv:2301.05983v1 [stat.ML])
    Bayesian optimization (BO) is a popular method for black-box optimization, which relies on uncertainty as part of its decision-making process when deciding which experiment to perform next. However, not much work has addressed the effect of uncertainty on the performance of the BO algorithm and to what extent calibrated uncertainties improve the ability to find the global optimum. In this work, we provide an extensive study of the relationship between the BO performance (regret) and uncertainty calibration for popular surrogate models and compare them across both synthetic and real-world experiments. Our results confirm that Gaussian Processes are strong surrogate models and that they tend to outperform other popular models. Our results further show a positive association between calibration error and regret, but interestingly, this association disappears when we control for the type of model in the analysis. We also studied the effect of re-calibration and demonstrate that it generally does not lead to improved regret. Finally, we provide theoretical justification for why uncertainty calibration might be difficult to combine with BO due to the small sample sizes commonly used.
    Sinkhorn Divergences for Unbalanced Optimal Transport. (arXiv:1910.12958v3 [math.OC] UPDATED)
    Optimal transport induces the Earth Mover's (Wasserstein) distance between probability distributions, a geometric divergence that is relevant to a wide range of problems. Over the last decade, two relaxations of optimal transport have been studied in depth: unbalanced transport, which is robust to the presence of outliers and can be used when distributions don't have the same total mass; entropy-regularized transport, which is robust to sampling noise and lends itself to fast computations using the Sinkhorn algorithm. This paper combines both lines of work to put robust optimal transport on solid ground. Our main contribution is a generalization of the Sinkhorn algorithm to unbalanced transport: our method alternates between the standard Sinkhorn updates and the pointwise application of a contractive function. This implies that entropic transport solvers on grid images, point clouds and sampled distributions can all be modified easily to support unbalanced transport, with a proof of linear convergence that holds in all settings. We then show how to use this method to define pseudo-distances on the full space of positive measures that satisfy key geometric axioms: (unbalanced) Sinkhorn divergences are differentiable, positive, definite, convex, statistically robust and avoid any "entropic bias" towards a shrinkage of the measures' supports.
    On Pseudo-Labeling for Class-Mismatch Semi-Supervised Learning. (arXiv:2301.06010v1 [cs.LG])
    When there are unlabeled Out-Of-Distribution (OOD) data from other classes, Semi-Supervised Learning (SSL) methods suffer from severe performance degradation and even get worse than merely training on labeled data. In this paper, we empirically analyze Pseudo-Labeling (PL) in class-mismatched SSL. PL is a simple and representative SSL method that transforms SSL problems into supervised learning by creating pseudo-labels for unlabeled data according to the model's prediction. We aim to answer two main questions: (1) How do OOD data influence PL? (2) What is the proper usage of OOD data with PL? First, we show that the major problem of PL is imbalanced pseudo-labels on OOD data. Second, we find that OOD data can help classify In-Distribution (ID) data given their OOD ground truth labels. Based on the findings, we propose to improve PL in class-mismatched SSL with two components -- Re-balanced Pseudo-Labeling (RPL) and Semantic Exploration Clustering (SEC). RPL re-balances pseudo-labels of high-confidence data, which simultaneously filters out OOD data and addresses the imbalance problem. SEC uses balanced clustering on low-confidence data to create pseudo-labels on extra classes, simulating the process of training with ground truth. Experiments show that our method achieves steady improvement over supervised baseline and state-of-the-art performance under all class mismatch ratios on different benchmarks.
    Interpretable and Scalable Graphical Models for Complex Spatio-temporal Processes. (arXiv:2301.06021v1 [cs.LG])
    This thesis focuses on data that has complex spatio-temporal structure and on probabilistic graphical models that learn the structure in an interpretable and scalable manner. We target two research areas of interest: Gaussian graphical models for tensor-variate data and summarization of complex time-varying texts using topic models. This work advances the state-of-the-art in several directions. First, it introduces a new class of tensor-variate Gaussian graphical models via the Sylvester tensor equation. Second, it develops an optimization technique based on a fast-converging proximal alternating linearized minimization method, which scales tensor-variate Gaussian graphical model estimations to modern big-data settings. Third, it connects Kronecker-structured (inverse) covariance models with spatio-temporal partial differential equations (PDEs) and introduces a new framework for ensemble Kalman filtering that is capable of tracking chaotic physical systems. Fourth, it proposes a modular and interpretable framework for unsupervised and weakly-supervised probabilistic topic modeling of time-varying data that combines generative statistical models with computational geometric methods. Throughout, practical applications of the methodology are considered using real datasets. This includes brain-connectivity analysis using EEG data, space weather forecasting using solar imaging data, longitudinal analysis of public opinions using Twitter data, and mining of mental health related issues using TalkLife data. We show in each case that the graphical modeling framework introduced here leads to improved interpretability, accuracy, and scalability.
    Data-Efficient Pipeline for Offline Reinforcement Learning with Limited Data. (arXiv:2210.08642v2 [cs.LG] UPDATED)
    Offline reinforcement learning (RL) can be used to improve future performance by leveraging historical data. There exist many different algorithms for offline RL, and it is well recognized that these algorithms, and their hyperparameter settings, can lead to decision policies with substantially differing performance. This prompts the need for pipelines that allow practitioners to systematically perform algorithm-hyperparameter selection for their setting. Critically, in most real-world settings, this pipeline must only involve the use of historical data. Inspired by statistical model selection methods for supervised learning, we introduce a task- and method-agnostic pipeline for automatically training, comparing, selecting, and deploying the best policy when the provided dataset is limited in size. In particular, our work highlights the importance of performing multiple data splits to produce more reliable algorithm-hyperparameter selection. While this is a common approach in supervised learning, to our knowledge, this has not been discussed in detail in the offline RL setting. We show it can have substantial impacts when the dataset is small. Compared to alternate approaches, our proposed pipeline outputs higher-performing deployed policies from a broad range of offline policy learning algorithms and across various simulation domains in healthcare, education, and robotics. This work contributes toward the development of a general-purpose meta-algorithm for automatic algorithm-hyperparameter selection for offline RL.
    Static, dynamic and stability analysis of multi-dimensional functional graded plate with variable thickness using deep neural network. (arXiv:2301.05900v1 [cs.LG])
    The goal of this paper is to analyze and predict the central deflection, natural frequency, and critical buckling load of the multi-directional functionally graded (FG) plate with variable thickness resting on an elastic Winkler foundation. First, the mathematical models of the static and eigenproblems are formulated in great detail. The FG material properties are assumed to vary smoothly and continuously throughout three directions of the plate according to a Mori-Tanaka micromechanics model distribution of volume fraction of constituents. Then, finite element analysis (FEA) with mixed interpolation of tensorial components of 4-nodes (MITC4) is implemented in order to eliminate theoretically a shear locking phenomenon existing. Next, influences of the variable thickness functions (uniform, non-uniform linear, and non-uniform non-linear), material properties, length-to-thickness ratio, boundary conditions, and elastic parameters on the plate response are investigated and discussed in detail through several numerical examples. Finally, a deep neural network (DNN) technique using batch normalization (BN) is learned to predict the non-dimensional values of multi-directional FG plates. The DNN model also shows that it is a powerful technique capable of handling an extensive database and different vital parameters in engineering applications.
    Deep-Reinforcement-Learning-based Path Planning for Industrial Robots using Distance Sensors as Observation. (arXiv:2301.05980v1 [cs.RO])
    Industrial robots are widely used in various manufacturing environments due to their efficiency in doing repetitive tasks such as assembly or welding. A common problem for these applications is to reach a destination without colliding with obstacles or other robot arms. Commonly used sampling-based path planning approaches such as RRT require long computation times, especially in complex environments. Furthermore, the environment in which they are employed needs to be known beforehand. When utilizing the approaches in new environments, a tedious engineering effort in setting hyperparameters needs to be conducted, which is time- and cost-intensive. On the other hand, Deep Reinforcement Learning has shown remarkable results in dealing with unknown environments, generalizing new problem instances, and solving motion planning problems efficiently. On that account, this paper proposes a Deep-Reinforcement-Learning-based motion planner for robotic manipulators. We evaluated our model against state-of-the-art sampling-based planners in several experiments. The results show the superiority of our planner in terms of path length and execution time.
    Recent advances in artificial intelligence for retrosynthesis. (arXiv:2301.05864v1 [cs.LG])
    Retrosynthesis is the cornerstone of organic chemistry, providing chemists in material and drug manufacturing access to poorly available and brand-new molecules. Conventional rule-based or expert-based computer-aided synthesis has obvious limitations, such as high labor costs and limited search space. In recent years, dramatic breakthroughs driven by artificial intelligence have revolutionized retrosynthesis. Here we aim to present a comprehensive review of recent advances in AI-based retrosynthesis. For single-step and multi-step retrosynthesis both, we first list their goal and provide a thorough taxonomy of existing methods. Afterwards, we analyze these methods in terms of their mechanism and performance, and introduce popular evaluation metrics for them, in which we also provide a detailed comparison among representative methods on several public datasets. In the next part we introduce popular databases and established platforms for retrosynthesis. Finally, this review concludes with a discussion about promising research directions in this field.
    Drug Synergistic Combinations Predictions via Large-Scale Pre-Training and Graph Structure Learning. (arXiv:2301.05931v1 [cs.LG])
    Drug combination therapy is a well-established strategy for disease treatment with better effectiveness and less safety degradation. However, identifying novel drug combinations through wet-lab experiments is resource intensive due to the vast combinatorial search space. Recently, computational approaches, specifically deep learning models have emerged as an efficient way to discover synergistic combinations. While previous methods reported fair performance, their models usually do not take advantage of multi-modal data and they are unable to handle new drugs or cell lines. In this study, we collected data from various datasets covering various drug-related aspects. Then, we take advantage of large-scale pre-training models to generate informative representations and features for drugs, proteins, and diseases. Based on that, a message-passing graph is built on top to propagate information together with graph structure learning flexibility. This is first introduced in the biological networks and enables us to generate pseudo-relations in the graph. Our framework achieves state-of-the-art results in comparison with other deep learning-based methods on synergistic prediction benchmark datasets. We are also capable of inferencing new drug combination data in a test on an independent set released by AstraZeneca, where 10% of improvement over previous methods is observed. In addition, we're robust against unseen drugs and surpass almost 15% AU ROC compared to the second-best model. We believe our framework contributes to both the future wet-lab discovery of novel drugs and the building of promising guidance for precise combination medicine.
    Desbordante: from benchmarking suite to high-performance science-intensive data profiler (preprint). (arXiv:2301.05965v1 [cs.DB])
    Pioneering data profiling systems such as Metanome and OpenClean brought public attention to science-intensive data profiling. This type of profiling aims to extract complex patterns (primitives) such as functional dependencies, data constraints, association rules, and others. However, these tools are research prototypes rather than production-ready systems. The following work presents Desbordante - a high-performance science-intensive data profiler with open source code. Unlike similar systems, it is built with emphasis on industrial application in a multi-user environment. It is efficient, resilient to crashes, and scalable. Its efficiency is ensured by implementing discovery algorithms in C++, resilience is achieved by extensive use of containerization, and scalability is based on replication of containers. Desbordante aims to open industrial-grade primitive discovery to a broader public, focusing on domain experts who are not IT professionals. Aside from the discovery of various primitives, Desbordante offers primitive validation, which not only reports whether a given instance of primitive holds or not, but also points out what prevents it from holding via the use of special screens. Next, Desbordante supports pipelines - ready-to-use functionality implemented using the discovered primitives, for example, typo detection. We provide built-in pipelines, and the users can construct their own via provided Python bindings. Unlike other profilers, Desbordante works not only with tabular data, but with graph and transactional data as well. In this paper, we present Desbordante, the vision behind it and its use-cases. To provide a more in-depth perspective, we discuss its current state, architecture, and design decisions it is built on. Additionally, we outline our future plans.
    World Models and Predictive Coding for Cognitive and Developmental Robotics: Frontiers and Challenges. (arXiv:2301.05832v1 [cs.RO])
    Creating autonomous robots that can actively explore the environment, acquire knowledge and learn skills continuously is the ultimate achievement envisioned in cognitive and developmental robotics. Their learning processes should be based on interactions with their physical and social world in the manner of human learning and cognitive development. Based on this context, in this paper, we focus on the two concepts of world models and predictive coding. Recently, world models have attracted renewed attention as a topic of considerable interest in artificial intelligence. Cognitive systems learn world models to better predict future sensory observations and optimize their policies, i.e., controllers. Alternatively, in neuroscience, predictive coding proposes that the brain continuously predicts its inputs and adapts to model its own dynamics and control behavior in its environment. Both ideas may be considered as underpinning the cognitive development of robots and humans capable of continual or lifelong learning. Although many studies have been conducted on predictive coding in cognitive robotics and neurorobotics, the relationship between world model-based approaches in AI and predictive coding in robotics has rarely been discussed. Therefore, in this paper, we clarify the definitions, relationships, and status of current research on these topics, as well as missing pieces of world models and predictive coding in conjunction with crucially related concepts such as the free-energy principle and active inference in the context of cognitive and developmental robotics. Furthermore, we outline the frontiers and challenges involved in world models and predictive coding toward the further integration of AI and robotics, as well as the creation of robots with real cognitive and developmental capabilities in the future.
    Hand Gesture Recognition through Reflected Infrared Light Wave Signals. (arXiv:2301.05955v1 [eess.SP])
    In this study, we present a wireless (non-contact) gesture recognition method using only incoherent light wave signals reflected from a human subject. In comparison to existing radar, light shadow, sound and camera-based sensing systems, this technology uses a low-cost ubiquitous light source (e.g., infrared LED) to send light towards the subject's hand performing gestures and the reflected light is collected by a light sensor (e.g., photodetector). This light wave sensing system recognizes different gestures from the variations of the received light intensity within a 20-35cm range. The hand gesture recognition results demonstrate up to 96% accuracy on average. The developed system can be utilized in numerous Human-computer Interaction (HCI) applications as a low-cost and non-contact gesture recognition technology.
    Risk-Averse Reinforcement Learning via Dynamic Time-Consistent Risk Measures. (arXiv:2301.05981v1 [cs.LG])
    Traditional reinforcement learning (RL) aims to maximize the expected total reward, while the risk of uncertain outcomes needs to be controlled to ensure reliable performance in a risk-averse setting. In this paper, we consider the problem of maximizing dynamic risk of a sequence of rewards in infinite-horizon Markov Decision Processes (MDPs). We adapt the Expected Conditional Risk Measures (ECRMs) to the infinite-horizon risk-averse MDP and prove its time consistency. Using a convex combination of expectation and conditional value-at-risk (CVaR) as a special one-step conditional risk measure, we reformulate the risk-averse MDP as a risk-neutral counterpart with augmented action space and manipulation on the immediate rewards. We further prove that the related Bellman operator is a contraction mapping, which guarantees the convergence of any value-based RL algorithms. Accordingly, we develop a risk-averse deep Q-learning framework, and our numerical studies based on two simple MDPs show that the risk-averse setting can reduce the variance and enhance robustness of the results.
    Generalized Invariant Matching Property via LASSO. (arXiv:2301.05975v1 [stat.ME])
    Learning under distribution shifts is a challenging task. One principled approach is to exploit the invariance principle via the structural causal models. However, the invariance principle is violated when the response is intervened, making it a difficult setting. In a recent work, the invariant matching property has been developed to shed light on this scenario and shows promising performance. In this work, we generalize the invariant matching property by formulating a high-dimensional problem with intrinsic sparsity. We propose a more robust and computation-efficient algorithm by leveraging a variant of Lasso, improving upon the existing algorithms.
    Adaptive Neural Networks Using Residual Fitting. (arXiv:2301.05744v1 [cs.LG])
    Current methods for estimating the required neural-network size for a given problem class have focused on methods that can be computationally intensive, such as neural-architecture search and pruning. In contrast, methods that add capacity to neural networks as needed may provide similar results to architecture search and pruning, but do not require as much computation to find an appropriate network size. Here, we present a network-growth method that searches for explainable error in the network's residuals and grows the network if sufficient error is detected. We demonstrate this method using examples from classification, imitation learning, and reinforcement learning. Within these tasks, the growing network can often achieve better performance than small networks that do not grow, and similar performance to networks that begin much larger.
    Day-Ahead PV Power Forecasting Based on MSTL-TFT. (arXiv:2301.05911v1 [cs.LG])
    Energy demand is increasing dramatically as global urbanization progresses.Solar energy is a clean energy source with low production and maintenance costs.Accurately predicted PV generation is of great importance for grid integration.Recent day-ahead PV forecasting studies mainly include generation data decomposition, additional meteorological and equipment features, improvement and integration of ANN-based models.We proposed a MSTL-TFT method for day-ahead PV forecasting. The results are better than any of the other studies we have surveyed on day-ahead DKASC PV forecasting.
    Lung airway geometry as an early predictor of autism: A preliminary machine learning-based study. (arXiv:2301.05777v1 [cs.LG])
    The goal of this study is to assess the feasibility of airway geometry as a biomarker for ASD. Chest CT images of children with a documented diagnosis of ASD as well as healthy controls were identified retrospectively. 54 scans were obtained for analysis, including 31 ASD cases and 23 age and sex-matched controls. A feature selection and classification procedure using principal component analysis (PCA) and support vector machine (SVM) achieved a peak cross validation accuracy of nearly 89% using a feature set of 8 airway branching angles. Sensitivity was 94%, but specificity was only 78%. The results suggest a measurable difference in airway branchpoint angles between children with ASD and the control population.
    CrysGNN : Distilling pre-trained knowledge to enhance property prediction for crystalline materials. (arXiv:2301.05852v1 [cs.LG])
    In recent years, graph neural network (GNN) based approaches have emerged as a powerful technique to encode complex topological structure of crystal materials in an enriched representation space. These models are often supervised in nature and using the property-specific training data, learn relationship between crystal structure and different properties like formation energy, bandgap, bulk modulus, etc. Most of these methods require a huge amount of property-tagged data to train the system which may not be available for different properties. However, there is an availability of a huge amount of crystal data with its chemical composition and structural bonds. To leverage these untapped data, this paper presents CrysGNN, a new pre-trained GNN framework for crystalline materials, which captures both node and graph level structural information of crystal graphs using a huge amount of unlabelled material data. Further, we extract distilled knowledge from CrysGNN and inject into different state of the art property predictors to enhance their property prediction accuracy. We conduct extensive experiments to show that with distilled knowledge from the pre-trained model, all the SOTA algorithms are able to outperform their own vanilla version with good margins. We also observe that the distillation process provides a significant improvement over the conventional approach of finetuning the pre-trained model. We have released the pre-trained model along with the large dataset of 800K crystal graph which we carefully curated; so that the pretrained model can be plugged into any existing and upcoming models to enhance their prediction accuracy.
    Discovery of 2D materials using Transformer Network based Generative Design. (arXiv:2301.05824v1 [cond-mat.mtrl-sci])
    Two-dimensional (2D) materials have wide applications in superconductors, quantum, and topological materials. However, their rational design is not well established, and currently less than 6,000 experimentally synthesized 2D materials have been reported. Recently, deep learning, data-mining, and density functional theory (DFT)-based high-throughput calculations are widely performed to discover potential new materials for diverse applications. Here we propose a generative material design pipeline, namely material transformer generator(MTG), for large-scale discovery of hypothetical 2D materials. We train two 2D materials composition generators using self-learning neural language models based on Transformers with and without transfer learning. The models are then used to generate a large number of candidate 2D compositions, which are fed to known 2D materials templates for crystal structure prediction. Next, we performed DFT computations to study their thermodynamic stability based on energy-above-hull and formation energy. We report four new DFT-verified stable 2D materials with zero e-above-hull energies, including NiCl$_4$, IrSBr, CuBr$_3$, and CoBrCl. Our work thus demonstrates the potential of our MTG generative materials design pipeline in the discovery of novel 2D materials and other functional materials.
    A Survey of Self-Supervised Learning from Multiple Perspectives: Algorithms, Theory, Applications and Future Trends. (arXiv:2301.05712v1 [cs.LG])
    Deep supervised learning algorithms generally require large numbers of labeled examples to attain satisfactory performance. To avoid the expensive cost incurred by collecting and labeling too many examples, as a subset of unsupervised learning, self-supervised learning (SSL) was proposed to learn good features from many unlabeled examples without any human-annotated labels. SSL has recently become a hot research topic, and many related algorithms have been proposed. However, few comprehensive studies have explained the connections among different SSL variants and how they have evolved. In this paper, we attempt to provide a review of the various SSL methods from the perspectives of algorithms, theory, applications, three main trends, and open questions. First, the motivations of most SSL algorithms are introduced in detail, and their commonalities and differences are compared. Second, the theoretical issues associated with SSL are investigated. Third, typical applications of SSL in areas such as image processing and computer vision (CV), as well as natural language processing (NLP), are discussed. Finally, the three main trends of SSL and the open research questions are discussed. A collection of useful materials is available at https://github.com/guijiejie/SSL.
    Survey of Knowledge Distillation in Federated Edge Learning. (arXiv:2301.05849v1 [cs.LG])
    The increasing demand for intelligent services and privacy protection of mobile and Internet of Things (IoT) devices motivates the wide application of Federated Edge Learning (FEL), in which devices collaboratively train on-device Machine Learning (ML) models without sharing their private data. \textcolor{black}{Limited by device hardware, diverse user behaviors and network infrastructure, the algorithm design of FEL faces challenges related to resources, personalization and network environments}, and Knowledge Distillation (KD) has been leveraged as an important technique to tackle the above challenges in FEL. In this paper, we investigate the works that KD applies to FEL, discuss the limitations and open problems of existing KD-based FEL approaches, and provide guidance for their real deployment.
    First Three Years of the International Verification of Neural Networks Competition (VNN-COMP). (arXiv:2301.05815v1 [cs.LG])
    This paper presents a summary and meta-analysis of the first three iterations of the annual International Verification of Neural Networks Competition (VNN-COMP) held in 2020, 2021, and 2022. In the VNN-COMP, participants submit software tools that analyze whether given neural networks satisfy specifications describing their input-output behavior. These neural networks and specifications cover a variety of problem classes and tasks, corresponding to safety and robustness properties in image classification, neural control, reinforcement learning, and autonomous systems. We summarize the key processes, rules, and results, present trends observed over the last three years, and provide an outlook into possible future developments.
    A Rigorous Uncertainty-Aware Quantification Framework Is Essential for Reproducible and Replicable Machine Learning Workflows. (arXiv:2301.05763v1 [cs.LG])
    The ability to replicate predictions by machine learning (ML) or artificial intelligence (AI) models and results in scientific workflows that incorporate such ML/AI predictions is driven by numerous factors. An uncertainty-aware metric that can quantitatively assess the reproducibility of quantities of interest (QoI) would contribute to the trustworthiness of results obtained from scientific workflows involving ML/AI models. In this article, we discuss how uncertainty quantification (UQ) in a Bayesian paradigm can provide a general and rigorous framework for quantifying reproducibility for complex scientific workflows. Such as framework has the potential to fill a critical gap that currently exists in ML/AI for scientific workflows, as it will enable researchers to determine the impact of ML/AI model prediction variability on the predictive outcomes of ML/AI-powered workflows. We expect that the envisioned framework will contribute to the design of more reproducible and trustworthy workflows for diverse scientific applications, and ultimately, accelerate scientific discoveries.
    CEDAS: A Compressed Decentralized Stochastic Gradient Method with Improved Convergence. (arXiv:2301.05872v1 [math.OC])
    In this paper, we consider solving the distributed optimization problem over a multi-agent network under the communication restricted setting. We study a compressed decentralized stochastic gradient method, termed ``compressed exact diffusion with adaptive stepsizes (CEDAS)", and show the method asymptotically achieves comparable convergence rate as centralized SGD for both smooth strongly convex objective functions and smooth nonconvex objective functions under unbiased compression operators. In particular, to our knowledge, CEDAS enjoys so far the shortest transient time (with respect to the graph specifics) for achieving the convergence rate of centralized SGD, which behaves as $\mathcal{O}(nC^3/(1-\lambda_2)^{2})$ under smooth strongly convex objective functions, and $\mathcal{O}(n^3C^6/(1-\lambda_2)^4)$ under smooth nonconvex objective functions, where $(1-\lambda_2)$ denotes the spectral gap of the mixing matrix, and $C>0$ is the compression-related parameter. Numerical experiments further demonstrate the effectiveness of the proposed algorithm.
    Poisoning Attacks and Defenses in Federated Learning: A Survey. (arXiv:2301.05795v1 [cs.CR])
    Federated learning (FL) enables the training of models among distributed clients without compromising the privacy of training datasets, while the invisibility of clients datasets and the training process poses a variety of security threats. This survey provides the taxonomy of poisoning attacks and experimental evaluation to discuss the need for robust FL.
    GAR: Generalized Autoregression for Multi-Fidelity Fusion. (arXiv:2301.05729v1 [stat.ML])
    In many scientific research and engineering applications where repeated simulations of complex systems are conducted, a surrogate is commonly adopted to quickly estimate the whole system. To reduce the expensive cost of generating training examples, it has become a promising approach to combine the results of low-fidelity (fast but inaccurate) and high-fidelity (slow but accurate) simulations. Despite the fast developments of multi-fidelity fusion techniques, most existing methods require particular data structures and do not scale well to high-dimensional output. To resolve these issues, we generalize the classic autoregression (AR), which is wildly used due to its simplicity, robustness, accuracy, and tractability, and propose generalized autoregression (GAR) using tensor formulation and latent features. GAR can deal with arbitrary dimensional outputs and arbitrary multifidelity data structure to satisfy the demand of multi-fidelity fusion for complex problems; it admits a fully tractable likelihood and posterior requiring no approximate inference and scales well to high-dimensional problems. Furthermore, we prove the autokrigeability theorem based on GAR in the multi-fidelity case and develop CIGAR, a simplified GAR with the exact predictive mean accuracy with computation reduction by a factor of d 3, where d is the dimensionality of the output. The empirical assessment includes many canonical PDEs and real scientific examples and demonstrates that the proposed method consistently outperforms the SOTA methods with a large margin (up to 6x improvement in RMSE) with only a couple high-fidelity training samples.
    A Comprehensive Survey of Graph-level Learning. (arXiv:2301.05860v1 [cs.LG])
    Graphs have a superior ability to represent relational data, like chemical compounds, proteins, and social networks. Hence, graph-level learning, which takes a set of graphs as input, has been applied to many tasks including comparison, regression, classification, and more. Traditional approaches to learning a set of graphs tend to rely on hand-crafted features, such as substructures. But while these methods benefit from good interpretability, they often suffer from computational bottlenecks as they cannot skirt the graph isomorphism problem. Conversely, deep learning has helped graph-level learning adapt to the growing scale of graphs by extracting features automatically and decoding graphs into low-dimensional representations. As a result, these deep graph learning methods have been responsible for many successes. Yet, there is no comprehensive survey that reviews graph-level learning starting with traditional learning and moving through to the deep learning approaches. This article fills this gap and frames the representative algorithms into a systematic taxonomy covering traditional learning, graph-level deep neural networks, graph-level graph neural networks, and graph pooling. To ensure a thoroughly comprehensive survey, the evolutions, interactions, and communications between methods from four different branches of development are also examined. This is followed by a brief review of the benchmark data sets, evaluation metrics, and common downstream applications. The survey concludes with 13 future directions of necessary research that will help to overcome the challenges facing this booming field.
    Local Model Explanations and Uncertainty Without Model Access. (arXiv:2301.05761v1 [cs.LG])
    We present a model-agnostic algorithm for generating post-hoc explanations and uncertainty intervals for a machine learning model when only a sample of inputs and outputs from the model is available, rather than direct access to the model itself. This situation may arise when model evaluations are expensive; when privacy, security and bandwidth constraints are imposed; or when there is a need for real-time, on-device explanations. Our algorithm constructs explanations using local polynomial regression and quantifies the uncertainty of the explanations using a bootstrapping approach. Through a simulation study, we show that the uncertainty intervals generated by our algorithm exhibit a favorable trade-off between interval width and coverage probability compared to the naive confidence intervals from classical regression analysis. We further demonstrate the capabilities of our method by applying it to black-box models trained on two real datasets.
    Who Should I Trust: AI or Myself? Leveraging Human and AI Correctness Likelihood to Promote Appropriate Trust in AI-Assisted Decision-Making. (arXiv:2301.05809v1 [cs.HC])
    In AI-assisted decision-making, it is critical for human decision-makers to know when to trust AI and when to trust themselves. However, prior studies calibrated human trust only based on AI confidence indicating AI's correctness likelihood (CL) but ignored humans' CL, hindering optimal team decision-making. To mitigate this gap, we proposed to promote humans' appropriate trust based on the CL of both sides at a task-instance level. We first modeled humans' CL by approximating their decision-making models and computing their potential performance in similar instances. We demonstrated the feasibility and effectiveness of our model via two preliminary studies. Then, we proposed three CL exploitation strategies to calibrate users' trust explicitly/implicitly in the AI-assisted decision-making process. Results from a between-subjects experiment (N=293) showed that our CL exploitation strategies promoted more appropriate human trust in AI, compared with only using AI confidence. We further provided practical implications for more human-compatible AI-assisted decision-making.
    Functional Neural Networks: Shift invariant models for functional data with applications to EEG classification. (arXiv:2301.05869v1 [cs.LG])
    It is desirable for statistical models to detect signals of interest independently of their position. If the data is generated by some smooth process, this additional structure should be taken into account. We introduce a new class of neural networks that are shift invariant and preserve smoothness of the data: functional neural networks (FNNs). For this, we use methods from functional data analysis (FDA) to extend multi-layer perceptrons and convolutional neural networks to functional data. We propose different model architectures, show that the models outperform a benchmark model from FDA in terms of accuracy and successfully use FNNs to classify electroencephalography (EEG) data.
    Insights Into Deep Non-linear Filters for Improved Multi-channel Speech Enhancement. (arXiv:2206.13310v3 [eess.AS] UPDATED)
    The key advantage of using multiple microphones for speech enhancement is that spatial filtering can be used to complement the tempo-spectral processing. In a traditional setting, linear spatial filtering (beamforming) and single-channel post-filtering are commonly performed separately. In contrast, there is a trend towards employing deep neural networks (DNNs) to learn a joint spatial and tempo-spectral non-linear filter, which means that the restriction of a linear processing model and that of a separate processing of spatial and tempo-spectral information can potentially be overcome. However, the internal mechanisms that lead to good performance of such data-driven filters for multi-channel speech enhancement are not well understood. Therefore, in this work, we analyse the properties of a non-linear spatial filter realized by a DNN as well as its interdependency with temporal and spectral processing by carefully controlling the information sources (spatial, spectral, and temporal) available to the network. We confirm the superiority of a non-linear spatial processing model, which outperforms an oracle linear spatial filter in a challenging speaker extraction scenario for a low number of microphones by 0.24 POLQA score. Our analyses reveal that in particular spectral information should be processed jointly with spatial information as this increases the spatial selectivity of the filter. Our systematic evaluation then leads to a simple network architecture, that outperforms state-of-the-art network architectures on a speaker extraction task by 0.22 POLQA score and by 0.32 POLQA score on the CHiME3 data.
    Disentangling representations in Restricted Boltzmann Machines without adversaries. (arXiv:2206.11600v3 [cs.LG] UPDATED)
    A goal of unsupervised machine learning is to build representations of complex high-dimensional data, with simple relations to their properties. Such disentangled representations make easier to interpret the significant latent factors of variation in the data, as well as to generate new data with desirable features. Methods for disentangling representations often rely on an adversarial scheme, in which representations are tuned to avoid discriminators from being able to reconstruct information about the data properties (labels). Unfortunately adversarial training is generally difficult to implement in practice. Here we propose a simple, effective way of disentangling representations without any need to train adversarial discriminators, and apply our approach to Restricted Boltzmann Machines (RBM), one of the simplest representation-based generative models. Our approach relies on the introduction of adequate constraints on the weights during training, which allows us to concentrate information about labels on a small subset of latent variables. The effectiveness of the approach is illustrated with four examples: the CelebA dataset of facial images, the two-dimensional Ising model, the MNIST dataset of handwritten digits, and the taxonomy of protein families. In addition, we show how our framework allows for analytically computing the cost, in terms of log-likelihood of the data, associated to the disentanglement of their representations.
    Current Trends in Deep Learning for Earth Observation: An Open-source Benchmark Arena for Image Classification. (arXiv:2207.07189v2 [cs.CV] UPDATED)
    We present AiTLAS: Benchmark Arena -- an open-source benchmark suite for evaluating state-of-the-art deep learning approaches for image classification in Earth Observation (EO). To this end, we present a comprehensive comparative analysis of more than 500 models derived from ten different state-of-the-art architectures and compare them to a variety of multi-class and multi-label classification tasks from 22 datasets with different sizes and properties. In addition to models trained entirely on these datasets, we benchmark models trained in the context of transfer learning, leveraging pre-trained model variants, as it is typically performed in practice. All presented approaches are general and can be easily extended to many other remote sensing image classification tasks not considered in this study. To ensure reproducibility and facilitate better usability and further developments, all of the experimental resources including the trained models, model configurations, and processing details of the datasets (with their corresponding splits used for training and evaluating the models) are publicly available on the repository: https://github.com/biasvariancelabs/aitlas-arena
    Understanding the Generalization Benefit of Normalization Layers: Sharpness Reduction. (arXiv:2206.07085v3 [cs.LG] UPDATED)
    Normalization layers (e.g., Batch Normalization, Layer Normalization) were introduced to help with optimization difficulties in very deep nets, but they clearly also help generalization, even in not-so-deep nets. Motivated by the long-held belief that flatter minima lead to better generalization, this paper gives mathematical analysis and supporting experiments suggesting that normalization (together with accompanying weight-decay) encourages GD to reduce the sharpness of loss surface. Here "sharpness" is carefully defined given that the loss is scale-invariant, a known consequence of normalization. Specifically, for a fairly broad class of neural nets with normalization, our theory explains how GD with a finite learning rate enters the so-called Edge of Stability (EoS) regime, and characterizes the trajectory of GD in this regime via a continuous sharpness-reduction flow.
    Evaluating Robustness to Dataset Shift via Parametric Robustness Sets. (arXiv:2205.15947v4 [cs.LG] UPDATED)
    We give a method for proactively identifying small, plausible shifts in distribution which lead to large differences in model performance. These shifts are defined via parametric changes in the causal mechanisms of observed variables, where constraints on parameters yield a "robustness set" of plausible distributions and a corresponding worst-case loss over the set. While the loss under an individual parametric shift can be estimated via reweighting techniques such as importance sampling, the resulting worst-case optimization problem is non-convex, and the estimate may suffer from large variance. For small shifts, however, we can construct a local second-order approximation to the loss under shift and cast the problem of finding a worst-case shift as a particular non-convex quadratic optimization problem, for which efficient algorithms are available. We demonstrate that this second-order approximation can be estimated directly for shifts in conditional exponential family models, and we bound the approximation error. We apply our approach to a computer vision task (classifying gender from images), revealing sensitivity to shifts in non-causal attributes.
    Counterfactual Fairness with Partially Known Causal Graph. (arXiv:2205.13972v3 [cs.LG] UPDATED)
    Fair machine learning aims to avoid treating individuals or sub-populations unfavourably based on \textit{sensitive attributes}, such as gender and race. Those methods in fair machine learning that are built on causal inference ascertain discrimination and bias through causal effects. Though causality-based fair learning is attracting increasing attention, current methods assume the true causal graph is fully known. This paper proposes a general method to achieve the notion of counterfactual fairness when the true causal graph is unknown. To be able to select features that lead to counterfactual fairness, we derive the conditions and algorithms to identify ancestral relations between variables on a \textit{Partially Directed Acyclic Graph (PDAG)}, specifically, a class of causal DAGs that can be learned from observational data combined with domain knowledge. Interestingly, we find that counterfactual fairness can be achieved as if the true causal graph were fully known, when specific background knowledge is provided: the sensitive attributes do not have ancestors in the causal graph. Results on both simulated and real-world datasets demonstrate the effectiveness of our method.
    Random Fully Connected Neural Networks as Perturbatively Solvable Hierarchies. (arXiv:2204.01058v2 [math.PR] UPDATED)
    This article considers fully connected neural networks with Gaussian random weights and biases as well as $L$ hidden layers, each of width proportional to a large parameter $n$. For polynomially bounded non-linearities we give sharp estimates in powers of $1/n$ for the joint cumulants of the network output and its derivatives. Moreover, we show that network cumulants form a perturbatively solvable hierarchy in powers of $1/n$ in that $k$-th order cumulants in one layer have recursions that depend to leading order in $1/n$ only on $j$-th order cumulants at the previous layer with $j\leq k$. By solving a variety of such recursions, however, we find that the depth-to-width ratio $L/n$ plays the role of an effective network depth, controlling both the scale of fluctuations at individual neurons and the size of inter-neuron correlations. Thus, while the cumulant recursions we derive form a hierarchy in powers of $1/n$, contributions of order $1/n^k$ often grow like $L^k$ and are hence non-negligible at positive $L/n$. We use this to study a somewhat simplified version of the exploding and vanishing gradient problem, proving that this particular variant occurs if and only if $L/n$ is large. Several key ideas in this article were first developed at a physics level of rigor in a recent monograph of Daniel A. Roberts, Sho Yaida, and the author. This article not only makes these ideas mathematically precise but also significantly extends them, opening the way to obtaining corrections to all orders in $1/n$.
    Policy Gradients using Variational Quantum Circuits. (arXiv:2203.10591v3 [quant-ph] UPDATED)
    Variational Quantum Circuits are being used as versatile Quantum Machine Learning models. Some empirical results exhibit an advantage in supervised and generative learning tasks. However, when applied to Reinforcement Learning, less is known. In this work, we considered a Variational Quantum Circuit composed of a low-depth hardware-efficient ansatz as the parameterized policy of a Reinforcement Learning agent. We show that an $\epsilon$-approximation of the policy gradient can be obtained using a logarithmic number of samples concerning the total number of parameters. We empirically verify that such quantum models behave similarly or even outperform typical classical neural networks used in standard benchmarking environments and in quantum control, using only a fraction of the parameters. Moreover, we study the Barren Plateau phenomenon in quantum policy gradients using the Fisher Information Matrix spectrum.
    Sharing to learn and learning to share -- Fitting together Meta-Learning, Multi-Task Learning, and Transfer Learning: A meta review. (arXiv:2111.12146v5 [cs.LG] UPDATED)
    Integrating knowledge across different domains is an essential feature of human learning. Learning paradigms such as transfer learning, meta learning, and multi-task learning reflect the human learning process by exploiting the prior knowledge for new tasks, encouraging faster learning and good generalization for new tasks. This article gives a detailed view of these learning paradigms and their comparative analysis. The weakness of one learning algorithm turns out to be a strength of another, and thus merging them is a prevalent trait in the literature. There are numerous research papers that focus on each of these learning paradigms separately and provide a comprehensive overview of them. However, this article provides a review of research studies that combine (two of) these learning algorithms. This survey describes how these techniques are combined to solve problems in many different fields of study, including computer vision, natural language processing, hyperspectral imaging, and many more, in supervised setting only. As a result, the global generic learning network an amalgamation of meta learning, transfer learning, and multi-task learning is introduced here, along with some open research questions and future research directions in the multi-task setting.
    Privatized Graph Federated Learning. (arXiv:2203.07105v2 [cs.LG] UPDATED)
    Federated learning is a semi-distributed algorithm, where a server communicates with multiple dispersed clients to learn a global model. The federated architecture is not robust and is sensitive to communication and computational overloads due to its one-master multi-client structure. It can also be subject to privacy attacks targeting personal information on the communication links. In this work, we introduce graph federated learning (GFL), which consists of multiple federated units connected by a graph. We then show how graph homomorphic perturbations can be used to ensure the algorithm is differentially private. We conduct both convergence and privacy theoretical analyses and illustrate performance by means of computer simulations.
    Learning Partial Equivariances from Data. (arXiv:2110.10211v3 [cs.CV] UPDATED)
    Group Convolutional Neural Networks (G-CNNs) constrain learned features to respect the symmetries in the selected group, and lead to better generalization when these symmetries appear in the data. If this is not the case, however, equivariance leads to overly constrained models and worse performance. Frequently, transformations occurring in data can be better represented by a subset of a group than by a group as a whole, e.g., rotations in $[-90^{\circ}, 90^{\circ}]$. In such cases, a model that respects equivariance $\textit{partially}$ is better suited to represent the data. In addition, relevant transformations may differ for low and high-level features. For instance, full rotation equivariance is useful to describe edge orientations in a face, but partial rotation equivariance is better suited to describe face poses relative to the camera. In other words, the optimal level of equivariance may differ per layer. In this work, we introduce $\textit{Partial G-CNNs}$: G-CNNs able to learn layer-wise levels of partial and full equivariance to discrete, continuous groups and combinations thereof as part of training. Partial G-CNNs retain full equivariance when beneficial, e.g., for rotated MNIST, but adjust it whenever it becomes harmful, e.g., for classification of 6 / 9 digits or natural images. We empirically show that partial G-CNNs pair G-CNNs when full equivariance is advantageous, and outperform them otherwise.
    Variational Actor-Critic Algorithms. (arXiv:2108.01215v4 [cs.LG] UPDATED)
    We introduce a class of variational actor-critic algorithms based on a variational formulation over both the value function and the policy. The objective function of the variational formulation consists of two parts: one for maximizing the value function and the other for minimizing the Bellman residual. Besides the vanilla gradient descent with both the value function and the policy updates, we propose two variants, the clipping method and the flipping method, in order to speed up the convergence. We also prove that, when the prefactor of the Bellman residual is sufficiently large, the fixed point of the algorithm is close to the optimal policy.
    Smart Choices and the Selection Monad. (arXiv:2007.08926v7 [cs.LO] UPDATED)
    Describing systems in terms of choices and their resulting costs and rewards offers the promise of freeing algorithm designers and programmers from specifying how those choices should be made; in implementations, the choices can be realized by optimization techniques and, increasingly, by machine-learning methods. We study this approach from a programming-language perspective. We define two small languages that support decision-making abstractions: one with choices and rewards, and the other additionally with probabilities. We give both operational and denotational semantics. In the case of the second language we consider three denotational semantics, with varying degrees of correlation between possible program values and expected rewards. The operational semantics combine the usual semantics of standard constructs with optimization over spaces of possible execution strategies. The denotational semantics, which are compositional, rely on the selection monad, to handle choice, augmented with an auxiliary monad to handle other effects, such as rewards or probability. We establish adequacy theorems that the two semantics coincide in all cases. We also prove full abstraction at base types, with varying notions of observation in the probabilistic case corresponding to the various degrees of correlation. We present axioms for choice combined with rewards and probability, establishing completeness at base types for the case of rewards without probability.
    Elastic Similarity and Distance Measures for Multivariate Time Series. (arXiv:2102.10231v2 [cs.LG] UPDATED)
    This paper contributes multivariate versions of seven commonly used elastic similarity and distance measures for time series data analytics. Elastic similarity and distance measures are a class of similarity measures that can compensate for misalignments in the time axis of time series data. We adapt two existing strategies used in a multivariate version of the well-known Dynamic Time Warping (DTW), namely, Independent and Dependent DTW, to these seven measures. While these measures can be applied to various time series analysis tasks, we demonstrate their utility on multivariate time series classification using the nearest neighbor classifier. On 23 well-known datasets, we demonstrate that each of the measures but one achieves the highest accuracy relative to others on at least one dataset, supporting the value of developing a suite of multivariate similarity and distance measures. We also demonstrate that there are datasets for which either the dependent versions of all measures are more accurate than their independent counterparts or vice versa. In addition, we also construct a nearest neighbor-based ensemble of the measures and show that it is competitive to other state-of-the-art single-strategy multivariate time series classifiers.
    SRECG: ECG Signal Super-resolution Framework for Portable/Wearable Devices in Cardiac Arrhythmias Classification. (arXiv:2012.03803v2 [eess.SP] UPDATED)
    A combination of cloud-based deep learning (DL) algorithms with portable/wearable (P/W) devices has been developed as a smart heath care system to support automatic cardiac arrhythmias (CAs) classification using electrocardiography (ECG). However, long-term and continuous ECG monitoring is challenging because of limitations of batteries and transmission bandwidth of P/W devices while incorporated with consumer electronics (CE). A feasible approach to address this challenge is to decrease sampling rates. However, low sampling rates lead to low-resolution signals that hinder the CAs classification performance. In this study, we propose a DL-based ECG signal super-resolution framework (called SRECG) to enhance low-resolution ECG signals by jointly considering the accuracies when applied to the DL-based high-resolution multiclass classifier (HMC) of CAs. In our experiments, we downsampled the ECG signals from the CPSC2018 dataset and evaluated their HMC accuracies with and without the SRECG. Experimental results show that SRECG can well improve the HMC accuracies as compared to traditional interpolation methods. Moreover, approximately half of the CAs classification accuracies of HMC were maintained within the enhanced ECG signals by SRECG. The promising results confirm that SRECG can be suitably used to enhance low-resolution ECG signals from P/W devices with CE to improve their cloud-based HMC performances.
    Dynamically Mitigating Data Discrepancy with Balanced Focal Loss for Replay Attack Detection. (arXiv:2006.14563v2 [cs.CV] UPDATED)
    It becomes urgent to design effective anti-spoofing algorithms for vulnerable automatic speaker verification systems due to the advancement of high-quality playback devices. Current studies mainly treat anti-spoofing as a binary classification problem between bonafide and spoofed utterances, while lack of indistinguishable samples makes it difficult to train a robust spoofing detector. In this paper, we argue that for anti-spoofing, it needs more attention for indistinguishable samples over easily-classified ones in the modeling process, to make correct discrimination a top priority. Therefore, to mitigate the data discrepancy between training and inference, we propose D3M, to leverage a balanced focal loss function as the training objective to dynamically scale the loss based on the traits of the sample itself. Besides, in the experiments, we select three kinds of features that contain both magnitude-based and phase-based information to form complementary and informative features. Experimental results on the ASVspoof2019 dataset demonstrate the superiority of the proposed methods by comparison between our systems and top-performing ones. Systems trained with the balanced focal loss perform significantly better than conventional cross-entropy loss. With complementary features, our fusion system with only three kinds of features outperforms other systems containing five or more complex single models by 22.5% for min-tDCF and 7% for EER, achieving a min-tDCF and an EER of 0.0124 and 0.55% respectively. Furthermore, we present and discuss the evaluation results on real replay data apart from the simulated ASVspoof2019 data, indicating that research for anti-spoofing still has a long way to go. Source code, analysis data, and other details are publicly available at $\href{https://github.com/asvspoof/D3M}{\text{https://github.com/asvspoof/D3M}}$.
    Efficient anomaly detection method for rooftop PV systems using big data and permutation entropy. (arXiv:2301.06035v1 [cs.LG])
    The number of rooftop photovoltaic (PV) systems has significantly increased in recent years around the globe, including in Australia. This trend is anticipated to continue in the next few years. Given their high share of generation in power systems, detecting malfunctions and abnormalities in rooftop PV systems is essential for ensuring their high efficiency and safety. In this paper, we present a novel anomaly detection method for a large number of rooftop PV systems installed in a region using big data and a time series complexity measure called weighted permutation entropy (WPE). This efficient method only uses the historical PV generation data in a given region to identify anomalous PV systems and requires no new sensor or smart device. Using a real-world PV generation dataset, we discuss how the hyperparameters of WPE should be tuned for the purpose. The proposed PV anomaly detection method is then tested on rooftop PV generation data from over 100 South Australian households. The results demonstrate that anomalous systems detected by our method have indeed encountered problems and require a close inspection. The detection and resolution of potential faults would result in better rooftop PV systems, longer lifetimes, and higher returns on investment.
    Hawk: An Industrial-strength Multi-label Document Classifier. (arXiv:2301.06057v1 [cs.CL])
    There are a plethora of methods and algorithms that solve the classical multi-label document classification. However, when it comes to deployment and usage in an industry setting, most, if not all the contemporary approaches fail to address some of the vital aspects or requirements of an ideal solution: i. ability to operate on variable-length texts and rambling documents. ii. catastrophic forgetting problem. iii. modularity when it comes to online learning and updating the model. iv. ability to spotlight relevant text while producing the prediction, i.e. visualizing the predictions. v. ability to operate on imbalanced or skewed datasets. vi. scalability. The paper describes the significance of these problems in detail and proposes a unique neural network architecture that addresses the above problems. The proposed architecture views documents as a sequence of sentences and leverages sentence-level embeddings for input representation. A hydranet-like architecture is designed to have granular control over and improve the modularity, coupled with a weighted loss driving task-specific heads. In particular, two specific mechanisms are compared: Bi-LSTM and Transformer-based. The architecture is benchmarked on some of the popular benchmarking datasets such as Web of Science - 5763, Web of Science - 11967, BBC Sports, and BBC News datasets. The experimental results reveal that the proposed model outperforms the existing methods by a substantial margin. The ablation study includes comparisons of the impact of the attention mechanism and the application of weighted loss functions to train the task-specific heads in the hydranet.
    Deep Learning Provides Rapid Screen for Breast Cancer Metastasis with Sentinel Lymph Nodes. (arXiv:2301.05938v1 [cs.CV])
    Deep learning has been shown to be useful to detect breast cancer metastases by analyzing whole slide images of sentinel lymph nodes. However, it requires extensive scanning and analysis of all the lymph nodes slides for each case. Our deep learning study focuses on breast cancer screening with only a small set of image patches from any sentinel lymph node, positive or negative for metastasis, to detect changes in tumor environment and not in the tumor itself. We design a convolutional neural network in the Python language to build a diagnostic model for this purpose. The excellent results from this preliminary study provided a proof of concept for incorporating automated metastatic screen into the digital pathology workflow to augment the pathologists' productivity. Our approach is unique since it provides a very rapid screen rather than an exhaustive search for tumor in all fields of all sentinel lymph nodes.
    Efficient and Accurate Estimation of Lipschitz Constants for Deep Neural Networks. (arXiv:1906.04893v2 [cs.LG] UPDATED)
    Tight estimation of the Lipschitz constant for deep neural networks (DNNs) is useful in many applications ranging from robustness certification of classifiers to stability analysis of closed-loop systems with reinforcement learning controllers. Existing methods in the literature for estimating the Lipschitz constant suffer from either lack of accuracy or poor scalability. In this paper, we present a convex optimization framework to compute guaranteed upper bounds on the Lipschitz constant of DNNs both accurately and efficiently. Our main idea is to interpret activation functions as gradients of convex potential functions. Hence, they satisfy certain properties that can be described by quadratic constraints. This particular description allows us to pose the Lipschitz constant estimation problem as a semidefinite program (SDP). The resulting SDP can be adapted to increase either the estimation accuracy (by capturing the interaction between activation functions of different layers) or scalability (by decomposition and parallel implementation). We illustrate the utility of our approach with a variety of experiments on randomly generated networks and on classifiers trained on the MNIST and Iris datasets. In particular, we experimentally demonstrate that our Lipschitz bounds are the most accurate compared to those in the literature. We also study the impact of adversarial training methods on the Lipschitz bounds of the resulting classifiers and show that our bounds can be used to efficiently provide robustness guarantees.
    Compress Then Test: Powerful Kernel Testing in Near-linear Time. (arXiv:2301.05974v1 [stat.ML])
    Kernel two-sample testing provides a powerful framework for distinguishing any pair of distributions based on $n$ sample points. However, existing kernel tests either run in $n^2$ time or sacrifice undue power to improve runtime. To address these shortcomings, we introduce Compress Then Test (CTT), a new framework for high-powered kernel testing based on sample compression. CTT cheaply approximates an expensive test by compressing each $n$ point sample into a small but provably high-fidelity coreset. For standard kernels and subexponential distributions, CTT inherits the statistical behavior of a quadratic-time test -- recovering the same optimal detection boundary -- while running in near-linear time. We couple these advances with cheaper permutation testing, justified by new power analyses; improved time-vs.-quality guarantees for low-rank approximation; and a fast aggregation procedure for identifying especially discriminating kernels. In our experiments with real and simulated data, CTT and its extensions provide 20--200x speed-ups over state-of-the-art approximate MMD tests with no loss of power.
    A data science and machine learning approach to continuous analysis of Shakespeare's plays. (arXiv:2301.06024v1 [cs.CL])
    The availability of quantitative methods that can analyze text has provided new ways of examining literature in a manner that was not available in the pre-information era. Here we apply comprehensive machine learning analysis to the work of William Shakespeare. The analysis shows clear change in style of writing over time, with the most significant changes in the sentence length, frequency of adjectives and adverbs, and the sentiments expressed in the text. Applying machine learning to make a stylometric prediction of the year of the play shows a Pearson correlation of 0.71 between the actual and predicted year, indicating that Shakespeare's writing style as reflected by the quantitative measurements changed over time. Additionally, it shows that the stylometrics of some of the plays is more similar to plays written either before or after the year they were written. For instance, Romeo and Juliet is dated 1596, but is more similar in stylometrics to plays written by Shakespeare after 1600. The source code for the analysis is available for free download.
    Transferring Fairness under Distribution Shifts via Fair Consistency Regularization. (arXiv:2206.12796v3 [cs.LG] UPDATED)
    The increasing reliance on ML models in high-stakes tasks has raised a major concern on fairness violations. Although there has been a surge of work that improves algorithmic fairness, most of them are under the assumption of an identical training and test distribution. In many real-world applications, however, such an assumption is often violated as previously trained fair models are often deployed in a different environment, and the fairness of such models has been observed to collapse. In this paper, we study how to transfer model fairness under distribution shifts, a widespread issue in practice. We conduct a fine-grained analysis of how the fair model is affected under different types of distribution shifts and find that domain shifts are more challenging than subpopulation shifts. Inspired by the success of self-training in transferring accuracy under domain shifts, we derive a sufficient condition for transferring group fairness. Guided by it, we propose a practical algorithm with a fair consistency regularization as the key component. A synthetic dataset benchmark, which covers all types of distribution shifts, is deployed for experimental verification of the theoretical findings. Experiments on synthetic and real datasets including image and tabular data demonstrate that our approach effectively transfers fairness and accuracy under various distribution shifts.
    Evaluating the Spectral Bias of Coordinate Based MLPs. (arXiv:2301.05816v1 [cs.LG])
    In recent years, representations given by fully connected neural networks have shown to represent scenes, objects, and other measurements well in dense low-dimensional settings. For these models, termed coordinate based MLPs, sinusoidal encodings are necessary in allowing for convergence to the high frequency components of the target function. This requirement is a result of their severe spectral bias when using dense, low dimensional coordinate based inputs. Previous work explained this phenomena using Neural Tangent Kernel (NTK) and Fourier analysis. While these methods provide insight towards this large spectral bias and the benefits of positional encoding, the properties of ReLU networks that induce this behavior are not fully determined. Analyzing spectral bias directly through the computations of ReLU networks would expose their limitations in dense settings, while providing a clearer explanation as to how this behavior emerges during the learning process. In this paper, we systematically analyze the spectral bias of a coordinate based MLP through its activation regions and gradient descent dynamics. This allows us to relate the network's expressive capacity to the speed at which gradient descent converges for components of varying frequency, and how the density of the data further restricts the model.
    What Can Transformers Learn In-Context? A Case Study of Simple Function Classes. (arXiv:2208.01066v2 [cs.CL] UPDATED)
    In-context learning refers to the ability of a model to condition on a prompt sequence consisting of in-context examples (input-output pairs corresponding to some task) along with a new query input, and generate the corresponding output. Crucially, in-context learning happens only at inference time without any parameter updates to the model. While large language models such as GPT-3 exhibit some ability to perform in-context learning, it is unclear what the relationship is between tasks on which this succeeds and what is present in the training data. To make progress towards understanding in-context learning, we consider the well-defined problem of training a model to in-context learn a function class (e.g., linear functions): that is, given data derived from some functions in the class, can we train a model to in-context learn "most" functions from this class? We show empirically that standard Transformers can be trained from scratch to perform in-context learning of linear functions -- that is, the trained model is able to learn unseen linear functions from in-context examples with performance comparable to the optimal least squares estimator. In fact, in-context learning is possible even under two forms of distribution shift: (i) between the training data of the model and inference-time prompts, and (ii) between the in-context examples and the query input during inference. We also show that we can train Transformers to in-context learn more complex function classes -- namely sparse linear functions, two-layer neural networks, and decision trees -- with performance that matches or exceeds task-specific learning algorithms. Our code and models are available at https://github.com/dtsip/in-context-learning .
    Reinforcement learning on graphs: A survey. (arXiv:2204.06127v4 [cs.LG] UPDATED)
    Graph mining tasks arise from many different application domains, ranging from social networks, transportation to E-commerce, etc., which have been receiving great attention from the theoretical and algorithmic design communities in recent years, and there has been some pioneering work employing the research-rich Reinforcement Learning (RL) techniques to address graph data mining tasks. However, these graph mining methods and RL models are dispersed in different research areas, which makes it hard to compare them. In this survey, we provide a comprehensive overview of RL and graph mining methods and generalize these methods to Graph Reinforcement Learning (GRL) as a unified formulation. We further discuss the applications of GRL methods across various domains and summarize the method descriptions, open-source codes, and benchmark datasets of GRL methods. Furthermore, we propose important directions and challenges to be solved in the future. As far as we know, this is the latest work on a comprehensive survey of GRL, this work provides a global view and a learning resource for scholars. In addition, we create an online open-source for both interested scholars who want to enter this rapidly developing domain and experts who would like to compare GRL methods.
    K-Deep Simplex: Deep Manifold Learning via Local Dictionaries. (arXiv:2012.02134v3 [cs.LG] UPDATED)
    We propose K-Deep Simplex (KDS) which, given a set of data points, learns a dictionary comprising synthetic landmarks, along with representation coefficients supported on a simplex. KDS integrates manifold learning and sparse coding/dictionary learning: reconstruction term, as in classical dictionary learning, and a novel local weighted $\ell_1$ penalty that encourages each data point to represent itself as a convex combination of nearby landmarks. We solve the proposed optimization program using alternating minimization and design an efficient, interpretable autoencoder using algorithm enrolling. We theoretically analyze the proposed program by relating the weighted $\ell_1$ penalty in KDS to a weighted $\ell_0$ program. Assuming that the data are generated from a Delaunay triangulation, we prove the equivalence of the weighted $\ell_1$ and weighted $\ell_0$ programs. If the representation coefficients are given, we prove that the resulting dictionary is unique. Further, we show that low-dimensional representations can be efficiently obtained from the covariance of the coefficient matrix. We apply KDS to the unsupervised clustering problem and prove theoretical performance guarantees. Experiments show that the algorithm is highly efficient and performs competitively on synthetic and real data sets.
    Salient Sign Detection In Safe Autonomous Driving: AI Which Reasons Over Full Visual Context. (arXiv:2301.05804v1 [cs.CV])
    Detecting road traffic signs and accurately determining how they can affect the driver's future actions is a critical task for safe autonomous driving systems. However, various traffic signs in a driving scene have an unequal impact on the driver's decisions, making detecting the salient traffic signs a more important task. Our research addresses this issue, constructing a traffic sign detection model which emphasizes performance on salient signs, or signs that influence the decisions of a driver. We define a traffic sign salience property and use it to construct the LAVA Salient Signs Dataset, the first traffic sign dataset that includes an annotated salience property. Next, we use a custom salience loss function, Salience-Sensitive Focal Loss, to train a Deformable DETR object detection model in order to emphasize stronger performance on salient signs. Results show that a model trained with Salience-Sensitive Focal Loss outperforms a model trained without, with regards to recall of both salient signs and all signs combined. Further, the performance margin on salient signs compared to all signs is largest for the model trained with Salience-Sensitive Focal Loss.  ( 2 min )
    MLOps: A Primer for Policymakers on a New Frontier in Machine Learning. (arXiv:2301.05775v1 [cs.LG])
    This chapter is written with the Data Scientist or MLOps professional in mind but can be used as a resource for policy makers, reformists, AI Ethicists, sociologists, and others interested in finding methods that help reduce bias in algorithms. I will take a deployment centered approach with the assumption that the professionals reading this work have already read the amazing work on the implications of algorithms on historically marginalized groups by Gebru, Buolamwini, Benjamin and Shane to name a few. If you have not read those works, I refer you to the "Important Reading for Ethical Model Building" list at the end of this paper as it will help give you a framework on how to think about Machine Learning models more holistically taking into account their effect on marginalized people. In the Introduction to this chapter, I root the significance of their work in real world examples of what happens when models are deployed without transparent data collected for the training process and are deployed without the practitioners paying special attention to what happens to models that adapt to exploit gaps between their training environment and the real world. The rest of this chapter builds on the work of the aforementioned researchers and discusses the reality of models performing post production and details ways ML practitioners can identify bias using tools during the MLOps lifecycle to mitigate bias that may be introduced to models in the real world.  ( 2 min )
    ML Approach for Power Consumption Prediction in Virtualized Base Stations. (arXiv:2301.05764v1 [cs.LG])
    The flexibility introduced with the Open Radio Access Network (O-RAN) architecture allows us to think beyond static configurations in all parts of the network. This paper addresses the issue related to predicting the power consumption of different radio schedulers, and the potential offered by O-RAN to collect data, train models, and deploy policies to control the power consumption. We propose a black-box (Neural Network) model to learn the power consumption function. We compare our approach with a known hand-crafted solution based on domain knowledge. Our solution reaches similar performance without any previous knowledge of the application and provides more flexibility in scenarios where the system behavior is not well understood or the domain knowledge is not available.  ( 2 min )
    FedSSC: Shared Supervised-Contrastive Federated Learning. (arXiv:2301.05797v1 [cs.LG])
    Federated learning is widely used to perform decentralized training of a global model on multiple devices while preserving the data privacy of each device. However, it suffers from heterogeneous local data on each training device which increases the difficulty to reach the same level of accuracy as the centralized training. Supervised Contrastive Learning which outperform cross-entropy tries to minimizes the difference between feature space of points belongs to the same class and pushes away points from different classes. We propose Supervised Contrastive Federated Learning in which devices can share the learned class-wise feature spaces with each other and add the supervised-contrastive learning loss as a regularization term to foster the feature space learning. The loss tries to minimize the cosine similarity distance between the feature map and the averaged feature map from another device in the same class and maximizes the distance between the feature map and that in a different class. This new regularization term when added on top of the moon regularization term is found to outperform the other state-of-the-art regularization terms in solving the heterogeneous data distribution problem.  ( 2 min )
    A domain-decomposed VAE method for Bayesian inverse problems. (arXiv:2301.05708v1 [stat.ML])
    Bayesian inverse problems are often computationally challenging when the forward model is governed by complex partial differential equations (PDEs). This is typically caused by expensive forward model evaluations and high-dimensional parameterization of priors. This paper proposes a domain-decomposed variational auto-encoder Markov chain Monte Carlo (DD-VAE-MCMC) method to tackle these challenges simultaneously. Through partitioning the global physical domain into small subdomains, the proposed method first constructs local deterministic generative models based on local historical data, which provide efficient local prior representations. Gaussian process models with active learning address the domain decomposition interface conditions. Then inversions are conducted on each subdomain independently in parallel and in low-dimensional latent parameter spaces. The local inference solutions are post-processed through the Poisson image blending procedure to result in an efficient global inference result. Numerical examples are provided to demonstrate the performance of the proposed method.  ( 2 min )
    Eco-PiNN: A Physics-informed Neural Network for Eco-toll Estimation. (arXiv:2301.05739v1 [cs.LG])
    The eco-toll estimation problem quantifies the expected environmental cost (e.g., energy consumption, exhaust emissions) for a vehicle to travel along a path. This problem is important for societal applications such as eco-routing, which aims to find paths with the lowest exhaust emissions or energy need. The challenges of this problem are three-fold: (1) the dependence of a vehicle's eco-toll on its physical parameters; (2) the lack of access to data with eco-toll information; and (3) the influence of contextual information (i.e. the connections of adjacent segments in the path) on the eco-toll of road segments. Prior work on eco-toll estimation has mostly relied on pure data-driven approaches and has high estimation errors given the limited training data. To address these limitations, we propose a novel Eco-toll estimation Physics-informed Neural Network framework (Eco-PiNN) using three novel ideas, namely, (1) a physics-informed decoder that integrates the physical laws of the vehicle engine into the network, (2) an attention-based contextual information encoder, and (3) a physics-informed regularization to reduce overfitting. Experiments on real-world heavy-duty truck data show that the proposed method can greatly improve the accuracy of eco-toll estimation compared with state-of-the-art methods.  ( 2 min )
    Diatom-inspired architected materials using language-based deep learning: Perception, transformation and manufacturing. (arXiv:2301.05875v1 [cond-mat.mtrl-sci])
    Learning from nature has been a quest of humanity for millennia. While this has taken the form of humans assessing natural designs such as bones, butterfly wings, or spider webs, we can now achieve generating designs using advanced computational algorithms. In this paper we report novel biologically inspired designs of diatom structures, enabled using transformer neural networks, using natural language models to learn, process and transfer insights across manifestations. We illustrate a series of novel diatom-based designs and also report a manufactured specimen, created using additive manufacturing. The method applied here could be expanded to focus on other biological design cues, implement a systematic optimization to meet certain design targets, and include a hybrid set of material design sets.  ( 2 min )
    Artificial Benchmark for Community Detection with Outliers (ABCD+o). (arXiv:2301.05749v1 [cs.SI])
    The Artificial Benchmark for Community Detection graph (ABCD) is a random graph model with community structure and power-law distribution for both degrees and community sizes. The model generates graphs with similar properties as the well-known LFR one, and its main parameter $\xi$ can be tuned to mimic its counterpart in the LFR model, the mixing parameter $\mu$. In this paper, we extend the ABCD model to include potential outliers. We perform some exploratory experiments on both the new ABCD+o model as well as a real-world network to show that outliers possess some desired, distinguishable properties.  ( 2 min )
    Efficient Activation Function Optimization through Surrogate Modeling. (arXiv:2301.05785v1 [cs.LG])
    Carefully designed activation functions can improve the performance of neural networks in many machine learning tasks. However, it is difficult for humans to construct optimal activation functions, and current activation function search algorithms are prohibitively expensive. This paper aims to improve the state of the art through three steps: First, the benchmark datasets Act-Bench-CNN, Act-Bench-ResNet, and Act-Bench-ViT were created by training convolutional, residual, and vision transformer architectures from scratch with 2,913 systematically generated activation functions. Second, a characterization of the benchmark space was developed, leading to a new surrogate-based method for optimization. More specifically, the spectrum of the Fisher information matrix associated with the model's predictive distribution at initialization and the activation function's output distribution were found to be highly predictive of performance. Third, the surrogate was used to discover improved activation functions in CIFAR-100 and ImageNet tasks. Each of these steps is a contribution in its own right; together they serve as a practical and theoretical foundation for further research on activation function optimization. Code is available at https://github.com/cognizant-ai-labs/aquasurf, and the benchmark datasets are at https://github.com/cognizant-ai-labs/act-bench.  ( 2 min )
    Fairness and Sequential Decision Making: Limits, Lessons, and Opportunities. (arXiv:2301.05753v1 [cs.CY])
    As automated decision making and decision assistance systems become common in everyday life, research on the prevention or mitigation of potential harms that arise from decisions made by these systems has proliferated. However, various research communities have independently conceptualized these harms, envisioned potential applications, and proposed interventions. The result is a somewhat fractured landscape of literature focused generally on ensuring decision-making algorithms "do the right thing". In this paper, we compare and discuss work across two major subsets of this literature: algorithmic fairness, which focuses primarily on predictive systems, and ethical decision making, which focuses primarily on sequential decision making and planning. We explore how each of these settings has articulated its normative concerns, the viability of different techniques for these different settings, and how ideas from each setting may have utility for the other.  ( 2 min )
  • Open

    Joint Entropy Search for Maximally-Informed Bayesian Optimization. (arXiv:2206.04771v5 [cs.LG] UPDATED)
    Information-theoretic Bayesian optimization techniques have become popular for optimizing expensive-to-evaluate black-box functions due to their non-myopic qualities. Entropy Search and Predictive Entropy Search both consider the entropy over the optimum in the input space, while the recent Max-value Entropy Search considers the entropy over the optimal value in the output space. We propose Joint Entropy Search (JES), a novel information-theoretic acquisition function that considers an entirely new quantity, namely the entropy over the joint optimal probability density over both input and output space. To incorporate this information, we consider the reduction in entropy from conditioning on fantasized optimal input/output pairs. The resulting approach primarily relies on standard GP machinery and removes complex approximations typically associated with information-theoretic methods. With minimal computational overhead, JES shows superior decision-making, and yields state-of-the-art performance for information-theoretic approaches across a wide suite of tasks. As a light-weight approach with superior results, JES provides a new go-to acquisition function for Bayesian optimization.  ( 2 min )
    Unbalanced Optimal Transport, from Theory to Numerics. (arXiv:2211.08775v2 [stat.ML] UPDATED)
    Optimal Transport (OT) has recently emerged as a central tool in data sciences to compare in a geometrically faithful way point clouds and more generally probability distributions. The wide adoption of OT into existing data analysis and machine learning pipelines is however plagued by several shortcomings. This includes its lack of robustness to outliers, its high computational costs, the need for a large number of samples in high dimension and the difficulty to handle data in distinct spaces. In this review, we detail several recently proposed approaches to mitigate these issues. We insist in particular on unbalanced OT, which compares arbitrary positive measures, not restricted to probability distributions (i.e. their total mass can vary). This generalization of OT makes it robust to outliers and missing data. The second workhorse of modern computational OT is entropic regularization, which leads to scalable algorithms while lowering the sample complexity in high dimension. The last point presented in this review is the Gromov-Wasserstein (GW) distance, which extends OT to cope with distributions belonging to different metric spaces. The main motivation for this review is to explain how unbalanced OT, entropic regularization and GW can work hand-in-hand to turn OT into efficient geometric loss functions for data sciences.  ( 2 min )
    Primal Dual Alternating Proximal Gradient Algorithms for Nonsmooth Nonconvex Minimax Problems with Coupled Linear Constraints. (arXiv:2212.04672v2 [math.OC] UPDATED)
    Nonconvex minimax problems have attracted wide attention in machine learning, signal processing and many other fields in recent years. In this paper, we propose a primal dual alternating proximal gradient (PDAPG) algorithm and a primal dual proximal gradient (PDPG-L) algorithm for solving nonsmooth nonconvex-(strongly) concave and nonconvex-linear minimax problems with coupled linear constraints, respectively. The iteration complexity of the two algorithms are proved to be $\mathcal{O}\left( \varepsilon ^{-2} \right)$ (resp. $\mathcal{O}\left( \varepsilon ^{-4} \right)$) under nonconvex-strongly concave (resp. nonconvex-concave) setting and $\mathcal{O}\left( \varepsilon ^{-3} \right)$ under nonconvex-linear setting to reach an $\varepsilon$-stationary point, respectively. To our knowledge, they are the first two algorithms with iteration complexity guarantee for solving the nonconvex minimax problems with coupled linear constraints.  ( 2 min )
    Data-Efficient Pipeline for Offline Reinforcement Learning with Limited Data. (arXiv:2210.08642v2 [cs.LG] UPDATED)
    Offline reinforcement learning (RL) can be used to improve future performance by leveraging historical data. There exist many different algorithms for offline RL, and it is well recognized that these algorithms, and their hyperparameter settings, can lead to decision policies with substantially differing performance. This prompts the need for pipelines that allow practitioners to systematically perform algorithm-hyperparameter selection for their setting. Critically, in most real-world settings, this pipeline must only involve the use of historical data. Inspired by statistical model selection methods for supervised learning, we introduce a task- and method-agnostic pipeline for automatically training, comparing, selecting, and deploying the best policy when the provided dataset is limited in size. In particular, our work highlights the importance of performing multiple data splits to produce more reliable algorithm-hyperparameter selection. While this is a common approach in supervised learning, to our knowledge, this has not been discussed in detail in the offline RL setting. We show it can have substantial impacts when the dataset is small. Compared to alternate approaches, our proposed pipeline outputs higher-performing deployed policies from a broad range of offline policy learning algorithms and across various simulation domains in healthcare, education, and robotics. This work contributes toward the development of a general-purpose meta-algorithm for automatic algorithm-hyperparameter selection for offline RL.  ( 2 min )
    Linear Convergence of ISTA and FISTA. (arXiv:2212.06319v2 [math.OC] UPDATED)
    In this paper, we revisit the class of iterative shrinkage-thresholding algorithms (ISTA) for solving the linear inverse problem with sparse representation, which arises in signal and image processing. It is shown in the numerical experiment to deblur an image that the convergence behavior in the logarithmic-scale ordinate tends to be linear instead of logarithmic, approximating to be flat. Making meticulous observations, we find that the previous assumption for the smooth part to be convex weakens the least-square model. Specifically, assuming the smooth part to be strongly convex is more reasonable for the least-square model, even though the image matrix is probably ill-conditioned. Furthermore, we improve the pivotal inequality tighter for composite optimization with the smooth part to be strongly convex instead of general convex, which is first found in [Li et al., 2022]. Based on this pivotal inequality, we generalize the linear convergence to composite optimization in both the objective value and the squared proximal subgradient norm. Meanwhile, we set a simple ill-conditioned matrix which is easy to compute the singular values instead of the original blur matrix. The new numerical experiment shows the proximal generalization of Nesterov's accelerated gradient descent (NAG) for the strongly convex function has a faster linear convergence rate than ISTA. Based on the tighter pivotal inequality, we also generalize the faster linear convergence rate to composite optimization, in both the objective value and the squared proximal subgradient norm, by taking advantage of the well-constructed Lyapunov function with a slight modification and the phase-space representation based on the high-resolution differential equation framework from the implicit-velocity scheme.  ( 2 min )
    Recurrent Convolutional Neural Networks Learn Succinct Learning Algorithms. (arXiv:2209.00735v2 [cs.LG] UPDATED)
    Neural networks (NNs) struggle to efficiently solve certain problems, such as learning parities, even when there are simple learning algorithms for those problems. Can NNs discover learning algorithms on their own? We exhibit a NN architecture that, in polynomial time, learns as well as any efficient learning algorithm describable by a constant-sized program. For example, on parity problems, the NN learns as well as Gaussian elimination, an efficient algorithm that can be succinctly described. Our architecture combines both recurrent weight sharing between layers and convolutional weight sharing to reduce the number of parameters down to a constant, even though the network itself may have trillions of nodes. While in practice the constants in our analysis are too large to be directly meaningful, our work suggests that the synergy of Recurrent and Convolutional NNs (RCNNs) may be more natural and powerful than either alone, particularly for concisely parameterizing discrete algorithms.  ( 2 min )
    Failure-informed adaptive sampling for PINNs. (arXiv:2210.00279v3 [math.NA] UPDATED)
    Physics-informed neural networks (PINNs) have emerged as an effective technique for solving PDEs in a wide range of domains. It is noticed, however, the performance of PINNs can vary dramatically with different sampling procedures. For instance, a fixed set of (prior chosen) training points may fail to capture the effective solution region (especially for problems with singularities). To overcome this issue, we present in this work an adaptive strategy, termed the failure-informed PINNs (FI-PINNs), which is inspired by the viewpoint of reliability analysis. The key idea is to define an effective failure probability based on the residual, and then, with the aim of placing more samples in the failure region, the FI-PINNs employs a failure-informed enrichment technique to adaptively add new collocation points to the training set, such that the numerical accuracy is dramatically improved. In short, similar as adaptive finite element methods, the proposed FI-PINNs adopts the failure probability as the posterior error indicator to generate new training points. We prove rigorous error bounds of FI-PINNs and illustrate its performance through several problems.  ( 2 min )
    Nonlinear Independent Component Analysis for Discrete-Time and Continuous-Time Signals. (arXiv:2102.02876v3 [stat.ML] UPDATED)
    We study the classical problem of recovering a multidimensional source signal from observations of nonlinear mixtures of this signal. We show that this recovery is possible (up to a permutation and monotone scaling of the source's original component signals) if the mixture is due to a sufficiently differentiable and invertible but otherwise arbitrarily nonlinear function and the component signals of the source are statistically independent with 'non-degenerate' second-order statistics. The latter assumption requires the source signal to meet one of three regularity conditions which essentially ensure that the source is sufficiently far away from the non-recoverable extremes of being deterministic or constant in time. These assumptions, which cover many popular time series models and stochastic processes, allow us to reformulate the initial problem of nonlinear blind source separation as a simple-to-state problem of optimisation-based function approximation. We propose to solve this approximation problem by minimizing a novel type of objective function that efficiently quantifies the mutual statistical dependence between multiple stochastic processes via cumulant-like statistics. This yields a scalable and direct new method for nonlinear Independent Component Analysis with widely applicable theoretical guarantees and for which our experiments indicate good performance.  ( 2 min )
    Chebyshev-Cantelli PAC-Bayes-Bennett Inequality for the Weighted Majority Vote. (arXiv:2106.13624v2 [cs.LG] UPDATED)
    We present a new second-order oracle bound for the expected risk of a weighted majority vote. The bound is based on a novel parametric form of the Chebyshev- Cantelli inequality (a.k.a. one-sided Chebyshev's), which is amenable to efficient minimization. The new form resolves the optimization challenge faced by prior oracle bounds based on the Chebyshev-Cantelli inequality, the C-bounds [Germain et al., 2015], and, at the same time, it improves on the oracle bound based on second order Markov's inequality introduced by Masegosa et al. [2020]. We also derive a new concentration of measure inequality, which we name PAC-Bayes-Bennett, since it combines PAC-Bayesian bounding with Bennett's inequality. We use it for empirical estimation of the oracle bound. The PAC-Bayes-Bennett inequality improves on the PAC-Bayes-Bernstein inequality of Seldin et al. [2012]. We provide an empirical evaluation demonstrating that the new bounds can improve on the work of Masegosa et al. [2020]. Both the parametric form of the Chebyshev-Cantelli inequality and the PAC-Bayes-Bennett inequality may be of independent interest for the study of concentration of measure in other domains.  ( 2 min )
    Universal Prediction Band via Semi-Definite Programming. (arXiv:2103.17203v3 [stat.ML] UPDATED)
    We propose a computationally efficient method to construct nonparametric, heteroscedastic prediction bands for uncertainty quantification, with or without any user-specified predictive model. Our approach provides an alternative to the now-standard conformal prediction for uncertainty quantification, with novel theoretical insights and computational advantages. The data-adaptive prediction band is universally applicable with minimal distributional assumptions, has strong non-asymptotic coverage properties, and is easy to implement using standard convex programs. Our approach can be viewed as a novel variance interpolation with confidence and further leverages techniques from semi-definite programming and sum-of-squares optimization. Theoretical and numerical performances for the proposed approach for uncertainty quantification are analyzed.  ( 2 min )
    Dynamically Mitigating Data Discrepancy with Balanced Focal Loss for Replay Attack Detection. (arXiv:2006.14563v2 [cs.CV] UPDATED)
    It becomes urgent to design effective anti-spoofing algorithms for vulnerable automatic speaker verification systems due to the advancement of high-quality playback devices. Current studies mainly treat anti-spoofing as a binary classification problem between bonafide and spoofed utterances, while lack of indistinguishable samples makes it difficult to train a robust spoofing detector. In this paper, we argue that for anti-spoofing, it needs more attention for indistinguishable samples over easily-classified ones in the modeling process, to make correct discrimination a top priority. Therefore, to mitigate the data discrepancy between training and inference, we propose D3M, to leverage a balanced focal loss function as the training objective to dynamically scale the loss based on the traits of the sample itself. Besides, in the experiments, we select three kinds of features that contain both magnitude-based and phase-based information to form complementary and informative features. Experimental results on the ASVspoof2019 dataset demonstrate the superiority of the proposed methods by comparison between our systems and top-performing ones. Systems trained with the balanced focal loss perform significantly better than conventional cross-entropy loss. With complementary features, our fusion system with only three kinds of features outperforms other systems containing five or more complex single models by 22.5% for min-tDCF and 7% for EER, achieving a min-tDCF and an EER of 0.0124 and 0.55% respectively. Furthermore, we present and discuss the evaluation results on real replay data apart from the simulated ASVspoof2019 data, indicating that research for anti-spoofing still has a long way to go. Source code, analysis data, and other details are publicly available at $\href{https://github.com/asvspoof/D3M}{\text{https://github.com/asvspoof/D3M}}$.  ( 3 min )
    Elastic Similarity and Distance Measures for Multivariate Time Series. (arXiv:2102.10231v2 [cs.LG] UPDATED)
    This paper contributes multivariate versions of seven commonly used elastic similarity and distance measures for time series data analytics. Elastic similarity and distance measures are a class of similarity measures that can compensate for misalignments in the time axis of time series data. We adapt two existing strategies used in a multivariate version of the well-known Dynamic Time Warping (DTW), namely, Independent and Dependent DTW, to these seven measures. While these measures can be applied to various time series analysis tasks, we demonstrate their utility on multivariate time series classification using the nearest neighbor classifier. On 23 well-known datasets, we demonstrate that each of the measures but one achieves the highest accuracy relative to others on at least one dataset, supporting the value of developing a suite of multivariate similarity and distance measures. We also demonstrate that there are datasets for which either the dependent versions of all measures are more accurate than their independent counterparts or vice versa. In addition, we also construct a nearest neighbor-based ensemble of the measures and show that it is competitive to other state-of-the-art single-strategy multivariate time series classifiers.  ( 2 min )
    Learning Probabilistic Models from Generator Latent Spaces with Hat EBM. (arXiv:2210.16486v2 [cs.CV] UPDATED)
    This work proposes a method for using any generator network as the foundation of an Energy-Based Model (EBM). Our formulation posits that observed images are the sum of unobserved latent variables passed through the generator network and a residual random variable that spans the gap between the generator output and the image manifold. One can then define an EBM that includes the generator as part of its forward pass, which we call the Hat EBM. The model can be trained without inferring the latent variables of the observed data or calculating the generator Jacobian determinant. This enables explicit probabilistic modeling of the output distribution of any type of generator network. Experiments show strong performance of the proposed method on (1) unconditional ImageNet synthesis at 128x128 resolution, (2) refining the output of existing generators, and (3) learning EBMs that incorporate non-probabilistic generators. Code and pretrained models to reproduce our results are available at https://github.com/point0bar1/hat-ebm.  ( 2 min )
    Improved Algorithms for Neural Active Learning. (arXiv:2210.00423v3 [cs.LG] UPDATED)
    We improve the theoretical and empirical performance of neural-network(NN)-based active learning algorithms for the non-parametric streaming setting. In particular, we introduce two regret metrics by minimizing the population loss that are more suitable in active learning than the one used in state-of-the-art (SOTA) related work. Then, the proposed algorithm leverages the powerful representation of NNs for both exploitation and exploration, has the query decision-maker tailored for $k$-class classification problems with the performance guarantee, utilizes the full feedback, and updates parameters in a more practical and efficient manner. These careful designs lead to an instance-dependent regret upper bound, roughly improving by a multiplicative factor $O(\log T)$ and removing the curse of input dimensionality. Furthermore, we show that the algorithm can achieve the same performance as the Bayes-optimal classifier in the long run under the hard-margin setting in classification problems. In the end, we use extensive experiments to evaluate the proposed algorithm and SOTA baselines, to show the improved empirical performance.  ( 2 min )
    Generic Error Bounds for the Generalized Lasso with Sub-Exponential Data. (arXiv:2004.05361v3 [math.ST] UPDATED)
    This work performs a non-asymptotic analysis of the generalized Lasso under the assumption of sub-exponential data. Our main results continue recent research on the benchmark case of (sub-)Gaussian sample distributions and thereby explore what conclusions are still valid when going beyond. While many statistical features remain unaffected (e.g., consistency and error decay rates), the key difference becomes manifested in how the complexity of the hypothesis set is measured. It turns out that the estimation error can be controlled by means of two complexity parameters that arise naturally from a generic-chaining-based proof strategy. The output model can be non-realizable, while the only requirement for the input vector is a generic concentration inequality of Bernstein-type, which can be implemented for a variety of sub-exponential distributions. This abstract approach allows us to reproduce, unify, and extend previously known guarantees for the generalized Lasso. In particular, we present applications to semi-parametric output models and phase retrieval via the lifted Lasso. Moreover, our findings are discussed in the context of sparse recovery and high-dimensional estimation problems.  ( 2 min )
    Black-box Coreset Variational Inference. (arXiv:2211.02377v2 [stat.ML] UPDATED)
    Recent advances in coreset methods have shown that a selection of representative datapoints can replace massive volumes of data for Bayesian inference, preserving the relevant statistical information and significantly accelerating subsequent downstream tasks. Existing variational coreset constructions rely on either selecting subsets of the observed datapoints, or jointly performing approximate inference and optimizing pseudodata in the observed space akin to inducing points methods in Gaussian Processes. So far, both approaches are limited by complexities in evaluating their objectives for general purpose models, and require generating samples from a typically intractable posterior over the coreset throughout inference and testing. In this work, we present a black-box variational inference framework for coresets that overcomes these constraints and enables principled application of variational coresets to intractable models, such as Bayesian neural networks. We apply our techniques to supervised learning problems, and compare them with existing approaches in the literature for data summarization and inference.  ( 2 min )
    On the Exactness of Dantzig-Wolfe Relaxation for Rank Constrained Optimization Problems. (arXiv:2210.16191v2 [math.OC] UPDATED)
    In the rank-constrained optimization problem (RCOP), it minimizes a linear objective function over a prespecified closed rank-constrained domain set and $m$ generic two-sided linear matrix inequalities. Motivated by the Dantzig-Wolfe (DW) decomposition, a popular approach of solving many nonconvex optimization problems, we investigate the strength of DW relaxation (DWR) of the RCOP, which admits the same formulation as RCOP except replacing the domain set by its closed convex hull. Notably, our goal is to characterize conditions under which the DWR matches RCOP for any m two-sided linear matrix inequalities. From the primal perspective, we develop the first-known simultaneously necessary and sufficient conditions that achieve: (i) extreme point exactness -- all the extreme points of the DWR feasible set belong to that of the RCOP; (ii) convex hull exactness -- the DWR feasible set is identical to the closed convex hull of RCOP feasible set; and (iii) objective exactness -- the optimal values of the DWR and RCOP coincide. The proposed conditions unify, refine, and extend the existing exactness results in the quadratically constrained quadratic program (QCQP) and fair unsupervised learning. These conditions can be very useful to identify new results, including the extreme point exactness for a QCQP problem that admits an inhomogeneous objective function with two homogeneous two-sided quadratic constraints and the convex hull exactness for fair SVD.  ( 2 min )
    Outlier Robust and Sparse Estimation of Linear Regression Coefficients. (arXiv:2208.11592v2 [math.ST] UPDATED)
    We consider outlier-robust and sparse estimation of linear regression coefficients, when covariate vectors and noises are sampled, respectively, from an $\mathfrak{L}$-subGaussian distribution and a heavy-tailed distribution. Additionally, the covariate vectors and noises are contaminated by adversarial outliers. We deal with two cases: the covariance matrix of the covariates is known or unknown. Particularly, in the known case, our estimator can attain a nearly information theoretical optimal error bound, and our error bound is sharper than those of earlier studies dealing with similar situations. Our estimator analysis relies heavily on generic chaining to derive sharp error bounds.  ( 2 min )
    Training Scale-Invariant Neural Networks on the Sphere Can Happen in Three Regimes. (arXiv:2209.03695v3 [cs.LG] UPDATED)
    A fundamental property of deep learning normalization techniques, such as batch normalization, is making the pre-normalization parameters scale invariant. The intrinsic domain of such parameters is the unit sphere, and therefore their gradient optimization dynamics can be represented via spherical optimization with varying effective learning rate (ELR), which was studied previously. However, the varying ELR may obscure certain characteristics of the intrinsic loss landscape structure. In this work, we investigate the properties of training scale-invariant neural networks directly on the sphere using a fixed ELR. We discover three regimes of such training depending on the ELR value: convergence, chaotic equilibrium, and divergence. We study these regimes in detail both on a theoretical examination of a toy example and on a thorough empirical analysis of real scale-invariant deep learning models. Each regime has unique features and reflects specific properties of the intrinsic loss landscape, some of which have strong parallels with previous research on both regular and scale-invariant neural networks training. Finally, we demonstrate how the discovered regimes are reflected in conventional training of normalized networks and how they can be leveraged to achieve better optima.  ( 2 min )
    Relay Variational Inference: A Method for Accelerated Encoderless VI. (arXiv:2110.13422v2 [cs.LG] UPDATED)
    Variational Inference (VI) offers a method for approximating intractable likelihoods. In neural VI, inference of approximate posteriors is commonly done using an encoder. Alternatively, encoderless VI offers a framework for learning generative models from data without encountering suboptimalities caused by amortization via an encoder (e.g. in presence of missing or uncertain data). However, in absence of an encoder, such methods often suffer in convergence due to the slow nature of gradient steps required to learn the approximate posterior parameters. In this paper, we introduce Relay VI (RVI), a framework that dramatically improves both the convergence and performance of encoderless VI. In our experiments over multiple datasets, we study the effectiveness of RVI in terms of convergence speed, loss, representation power and missing data imputation. We find RVI to be a unique tool, often superior in both performance and convergence speed to previously proposed encoderless as well as amortized VI models (e.g. VAE).  ( 2 min )
    Spectrum of non-Hermitian deep-Hebbian neural networks. (arXiv:2208.11411v2 [q-bio.NC] UPDATED)
    Neural networks with recurrent asymmetric couplings are important to understand how episodic memories are encoded in the brain. Here, we integrate the experimental observation of wide synaptic integration window into our model of sequence retrieval in the continuous time dynamics. The model with non-normal neuron-interactions is theoretically studied by deriving a random matrix theory of the Jacobian matrix in neural dynamics. The spectra bears several distinct features, such as breaking rotational symmetry about the origin, and the emergence of nested voids within the spectrum boundary. The spectral density is thus highly non-uniformly distributed in the complex plane. The random matrix theory also predicts a transition to chaos. In particular, the edge of chaos provides computational benefits for the sequential retrieval of memories. Our work provides a systematic study of time-lagged correlations with arbitrary time delays, and thus can inspire future studies of a broad class of memory models, and even big data analysis of biological time series.  ( 2 min )
    Random Fully Connected Neural Networks as Perturbatively Solvable Hierarchies. (arXiv:2204.01058v2 [math.PR] UPDATED)
    This article considers fully connected neural networks with Gaussian random weights and biases as well as $L$ hidden layers, each of width proportional to a large parameter $n$. For polynomially bounded non-linearities we give sharp estimates in powers of $1/n$ for the joint cumulants of the network output and its derivatives. Moreover, we show that network cumulants form a perturbatively solvable hierarchy in powers of $1/n$ in that $k$-th order cumulants in one layer have recursions that depend to leading order in $1/n$ only on $j$-th order cumulants at the previous layer with $j\leq k$. By solving a variety of such recursions, however, we find that the depth-to-width ratio $L/n$ plays the role of an effective network depth, controlling both the scale of fluctuations at individual neurons and the size of inter-neuron correlations. Thus, while the cumulant recursions we derive form a hierarchy in powers of $1/n$, contributions of order $1/n^k$ often grow like $L^k$ and are hence non-negligible at positive $L/n$. We use this to study a somewhat simplified version of the exploding and vanishing gradient problem, proving that this particular variant occurs if and only if $L/n$ is large. Several key ideas in this article were first developed at a physics level of rigor in a recent monograph of Daniel A. Roberts, Sho Yaida, and the author. This article not only makes these ideas mathematically precise but also significantly extends them, opening the way to obtaining corrections to all orders in $1/n$.  ( 2 min )
    Hidden Progress in Deep Learning: SGD Learns Parities Near the Computational Limit. (arXiv:2207.08799v3 [cs.LG] UPDATED)
    There is mounting evidence of emergent phenomena in the capabilities of deep learning methods as we scale up datasets, model sizes, and training times. While there are some accounts of how these resources modulate statistical capacity, far less is known about their effect on the computational problem of model training. This work conducts such an exploration through the lens of learning a $k$-sparse parity of $n$ bits, a canonical discrete search problem which is statistically easy but computationally hard. Empirically, we find that a variety of neural networks successfully learn sparse parities, with discontinuous phase transitions in the training curves. On small instances, learning abruptly occurs at approximately $n^{O(k)}$ iterations; this nearly matches SQ lower bounds, despite the apparent lack of a sparse prior. Our theoretical analysis shows that these observations are not explained by a Langevin-like mechanism, whereby SGD "stumbles in the dark" until it finds the hidden set of features (a natural algorithm which also runs in $n^{O(k)}$ time). Instead, we show that SGD gradually amplifies the sparse solution via a Fourier gap in the population gradient, making continual progress that is invisible to loss and error metrics.  ( 2 min )
    Evaluating Robustness to Dataset Shift via Parametric Robustness Sets. (arXiv:2205.15947v4 [cs.LG] UPDATED)
    We give a method for proactively identifying small, plausible shifts in distribution which lead to large differences in model performance. These shifts are defined via parametric changes in the causal mechanisms of observed variables, where constraints on parameters yield a "robustness set" of plausible distributions and a corresponding worst-case loss over the set. While the loss under an individual parametric shift can be estimated via reweighting techniques such as importance sampling, the resulting worst-case optimization problem is non-convex, and the estimate may suffer from large variance. For small shifts, however, we can construct a local second-order approximation to the loss under shift and cast the problem of finding a worst-case shift as a particular non-convex quadratic optimization problem, for which efficient algorithms are available. We demonstrate that this second-order approximation can be estimated directly for shifts in conditional exponential family models, and we bound the approximation error. We apply our approach to a computer vision task (classifying gender from images), revealing sensitivity to shifts in non-causal attributes.  ( 2 min )
    When saliency goes off on a tangent: Interpreting Deep Neural Networks with nonlinear saliency maps. (arXiv:2110.06639v3 [cs.LG] UPDATED)
    A fundamental bottleneck in utilising complex machine learning systems for critical applications has been not knowing why they do and what they do, thus preventing the development of any crucial safety protocols. To date, no method exist that can provide full insight into the granularity of the neural network's decision process. In the past, saliency maps were an early attempt at resolving this problem through sensitivity calculations, whereby dimensions of a data point are selected based on how sensitive the output of the system is to them. However, the success of saliency maps has been at best limited, mainly due to the fact that they interpret the underlying learning system through a linear approximation. We present a novel class of methods for generating nonlinear saliency maps which fully account for the nonlinearity of the underlying learning system. While agreeing with linear saliency maps on simple problems where linear saliency maps are correct, they clearly identify more specific drivers of classification on complex examples where nonlinearities are more pronounced. This new class of methods significantly aids interpretability of deep neural networks and related machine learning systems. Crucially, they provide a starting point for their more broad use in serious applications, where 'why' is equally important as 'what'.  ( 2 min )
    AutoML Two-Sample Test. (arXiv:2206.08843v3 [cs.LG] UPDATED)
    Two-sample tests are important in statistics and machine learning, both as tools for scientific discovery as well as to detect distribution shifts. This led to the development of many sophisticated test procedures going beyond the standard supervised learning frameworks, whose usage can require specialized knowledge about two-sample testing. We use a simple test that takes the mean discrepancy of a witness function as the test statistic and prove that minimizing a squared loss leads to a witness with optimal testing power. This allows us to leverage recent advancements in AutoML. Without any user input about the problems at hand, and using the same method for all our experiments, our AutoML two-sample test achieves competitive performance on a diverse distribution shift benchmark as well as on challenging two-sample testing problems. We provide an implementation of the AutoML two-sample test in the Python package autotst.  ( 2 min )
    The Missing Invariance Principle Found -- the Reciprocal Twin of Invariant Risk Minimization. (arXiv:2205.14546v2 [cs.LG] UPDATED)
    Machine learning models often generalize poorly to out-of-distribution (OOD) data as a result of relying on features that are spuriously correlated with the label during training. Recently, the technique of Invariant Risk Minimization (IRM) was proposed to learn predictors that only use invariant features by conserving the feature-conditioned label expectation $\mathbb{E}_e[y|f(x)]$ across environments. However, more recent studies have demonstrated that IRM-v1, a practical version of IRM, can fail in various settings. Here, we identify a fundamental flaw of IRM formulation that causes the failure. We then introduce a complementary notion of invariance, MRI, based on conserving the label-conditioned feature expectation $\mathbb{E}_e[f(x)|y]$, which is free of this flaw. Further, we introduce a simplified, practical version of the MRI formulation called MRI-v1. We prove that for general linear problems, MRI-v1 guarantees invariant predictors given sufficient number of environments. We also empirically demonstrate that MRI-v1 strongly out-performs IRM-v1 and consistently achieves near-optimal OOD generalization in image-based nonlinear problems.  ( 2 min )
    RenyiCL: Contrastive Representation Learning with Skew Renyi Divergence. (arXiv:2208.06270v2 [stat.ML] UPDATED)
    Contrastive representation learning seeks to acquire useful representations by estimating the shared information between multiple views of data. Here, the choice of data augmentation is sensitive to the quality of learned representations: as harder the data augmentations are applied, the views share more task-relevant information, but also task-irrelevant one that can hinder the generalization capability of representation. Motivated by this, we present a new robust contrastive learning scheme, coined R\'enyiCL, which can effectively manage harder augmentations by utilizing R\'enyi divergence. Our method is built upon the variational lower bound of R\'enyi divergence, but a na\"ive usage of a variational method is impractical due to the large variance. To tackle this challenge, we propose a novel contrastive objective that conducts variational estimation of a skew R\'enyi divergence and provide a theoretical guarantee on how variational estimation of skew divergence leads to stable training. We show that R\'enyi contrastive learning objectives perform innate hard negative sampling and easy positive sampling simultaneously so that it can selectively learn useful features and ignore nuisance features. Through experiments on ImageNet, we show that R\'enyi contrastive learning with stronger augmentations outperforms other self-supervised methods without extra regularization or computational overhead. Moreover, we also validate our method on other domains such as graph and tabular, showing empirical gain over other contrastive methods.  ( 2 min )
    The Mechanism of Prediction Head in Non-contrastive Self-supervised Learning. (arXiv:2205.06226v3 [cs.LG] UPDATED)
    Recently the surprising discovery of the Bootstrap Your Own Latent (BYOL) method by Grill et al. shows the negative term in contrastive loss can be removed if we add the so-called prediction head to the network. This initiated the research of non-contrastive self-supervised learning. It is mysterious why even when there exist trivial collapsed global optimal solutions, neural networks trained by (stochastic) gradient descent can still learn competitive representations. This phenomenon is a typical example of implicit bias in deep learning and remains little understood. In this work, we present our empirical and theoretical discoveries on non-contrastive self-supervised learning. Empirically, we find that when the prediction head is initialized as an identity matrix with only its off-diagonal entries being trainable, the network can learn competitive representations even though the trivial optima still exist in the training objective. Theoretically, we present a framework to understand the behavior of the trainable, but identity-initialized prediction head. Under a simple setting, we characterized the substitution effect and acceleration effect of the prediction head. The substitution effect happens when learning the stronger features in some neurons can substitute for learning these features in other neurons through updating the prediction head. And the acceleration effect happens when the substituted features can accelerate the learning of other weaker features to prevent them from being ignored. These two effects enable the neural networks to learn all the features rather than focus only on learning the stronger features, which is likely the cause of the dimensional collapse phenomenon. To the best of our knowledge, this is also the first end-to-end optimization guarantee for non-contrastive methods using nonlinear neural networks with a trainable prediction head and normalization.  ( 3 min )
    coVariance Neural Networks. (arXiv:2205.15856v4 [cs.LG] UPDATED)
    Graph neural networks (GNN) are an effective framework that exploit inter-relationships within graph-structured data for learning. Principal component analysis (PCA) involves the projection of data on the eigenspace of the covariance matrix and draws similarities with the graph convolutional filters in GNNs. Motivated by this observation, we study a GNN architecture, called coVariance neural network (VNN), that operates on sample covariance matrices as graphs. We theoretically establish the stability of VNNs to perturbations in the covariance matrix, thus, implying an advantage over standard PCA-based data analysis approaches that are prone to instability due to principal components associated with close eigenvalues. Our experiments on real-world datasets validate our theoretical results and show that VNN performance is indeed more stable than PCA-based statistical approaches. Moreover, our experiments on multi-resolution datasets also demonstrate that VNNs are amenable to transferability of performance over covariance matrices of different dimensions; a feature that is infeasible for PCA-based approaches.  ( 2 min )
    Analysis of autocorrelation times in Neural Markov Chain Monte Carlo simulations. (arXiv:2111.10189v3 [cond-mat.stat-mech] UPDATED)
    We provide a deepened study of autocorrelations in Neural Markov Chain Monte Carlo (NMCMC) simulations, a version of the traditional Metropolis algorithm which employs neural networks to provide independent proposals. We illustrate our ideas using the two-dimensional Ising model. We discuss several estimates of autocorrelation times in the context of NMCMC, some inspired by analytical results derived for the Metropolized Independent Sampler (MIS). We check their reliability by estimating them on a small system where analytical results can also be obtained. Based on the analytical results for MIS we propose a new loss function and study its impact on the autocorelation times. Although, this function's performance is a bit inferior to the traditional Kullback-Leibler divergence, it offers two training algorithms which in some situations may be beneficial. By studying a small, $4 \times 4$, system we gain access to the dynamics of the training process which we visualize using several observables. Furthermore, we quantitatively investigate the impact of imposing global discrete symmetries of the system in the neural network training process on the autocorrelation times. Eventually, we propose a scheme which incorporates partial heat-bath updates which considerably improves the quality of the training. The impact of the above enhancements is discussed for a $16 \times 16$ spin system. The summary of our findings may serve as a guidance to the implementation of Neural Markov Chain Monte Carlo simulations for more complicated models.  ( 2 min )
    Split-kl and PAC-Bayes-split-kl Inequalities for Ternary Random Variables. (arXiv:2206.00706v2 [stat.ML] UPDATED)
    We present a new concentration of measure inequality for sums of independent bounded random variables, which we name a split-kl inequality. The inequality is particularly well-suited for ternary random variables, which naturally show up in a variety of problems, including analysis of excess losses in classification, analysis of weighted majority votes, and learning with abstention. We demonstrate that for ternary random variables the inequality is simultaneously competitive with the kl inequality, the Empirical Bernstein inequality, and the Unexpected Bernstein inequality, and in certain regimes outperforms all of them. It resolves an open question by Tolstikhin and Seldin [2013] and Mhammedi et al. [2019] on how to match simultaneously the combinatorial power of the kl inequality when the distribution happens to be close to binary and the power of Bernstein inequalities to exploit low variance when the probability mass is concentrated on the middle value. We also derive a PAC-Bayes-split-kl inequality and compare it with the PAC-Bayes-kl, PAC-Bayes-Empirical-Bennett, and PAC-Bayes-Unexpected-Bernstein inequalities in an analysis of excess losses and in an analysis of a weighted majority vote for several UCI datasets. Last but not least, our study provides the first direct comparison of the Empirical Bernstein and Unexpected Bernstein inequalities and their PAC-Bayes extensions.  ( 2 min )
    Geometry-Complete Perceptron Networks for 3D Molecular Graphs. (arXiv:2211.02504v2 [cs.LG] UPDATED)
    The field of geometric deep learning has had a profound impact on the development of innovative and powerful graph neural network architectures. Disciplines such as computer vision and computational biology have benefited significantly from such methodological advances, which has led to breakthroughs in scientific domains such as protein structure prediction and design. In this work, we introduce GCPNet, a new geometry-complete, SE(3)-equivariant graph neural network designed for 3D molecular graph representation learning. We demonstrate the state-of-the-art utility and expressiveness of our method on six independent datasets designed for three distinct geometric tasks: protein-ligand binding affinity prediction, protein structure ranking, and Newtonian many-body systems modeling. Our results suggest that GCPNet is a powerful, general method for capturing complex geometric and physical interactions within 3D molecular graphs for downstream prediction tasks. The source code, data, and instructions to train new models or reproduce our results are freely available at https://github.com/BioinfoMachineLearning/GCPNet.  ( 2 min )
    Recipes for when Physics Fails: Recovering Robust Learning of Physics Informed Neural Networks. (arXiv:2110.13330v2 [cs.LG] UPDATED)
    Physics-informed Neural Networks (PINNs) have been shown to be effective in solving partial differential equations by capturing the physics induced constraints as a part of the training loss function. This paper shows that a PINN can be sensitive to errors in training data and overfit itself in dynamically propagating these errors over the domain of the solution of the PDE. It also shows how physical regularizations based on continuity criteria and conservation laws fail to address this issue and rather introduce problems of their own causing the deep network to converge to a physics-obeying local minimum instead of the global minimum. We introduce Gaussian Process (GP) based smoothing that recovers the performance of a PINN and promises a robust architecture against noise/errors in measurements. Additionally, we illustrate an inexpensive method of quantifying the evolution of uncertainty based on the variance estimation of GPs on boundary data. Robust PINN performance is also shown to be achievable by choice of sparse sets of inducing points based on sparsely induced GPs. We demonstrate the performance of our proposed methods and compare the results from existing benchmark models in literature for time-dependent Schr\"odinger and Burgers' equations.  ( 2 min )
    Sinkhorn Divergences for Unbalanced Optimal Transport. (arXiv:1910.12958v3 [math.OC] UPDATED)
    Optimal transport induces the Earth Mover's (Wasserstein) distance between probability distributions, a geometric divergence that is relevant to a wide range of problems. Over the last decade, two relaxations of optimal transport have been studied in depth: unbalanced transport, which is robust to the presence of outliers and can be used when distributions don't have the same total mass; entropy-regularized transport, which is robust to sampling noise and lends itself to fast computations using the Sinkhorn algorithm. This paper combines both lines of work to put robust optimal transport on solid ground. Our main contribution is a generalization of the Sinkhorn algorithm to unbalanced transport: our method alternates between the standard Sinkhorn updates and the pointwise application of a contractive function. This implies that entropic transport solvers on grid images, point clouds and sampled distributions can all be modified easily to support unbalanced transport, with a proof of linear convergence that holds in all settings. We then show how to use this method to define pseudo-distances on the full space of positive measures that satisfy key geometric axioms: (unbalanced) Sinkhorn divergences are differentiable, positive, definite, convex, statistically robust and avoid any "entropic bias" towards a shrinkage of the measures' supports.  ( 2 min )
    Generalization Error Bounds for Multiclass Sparse Linear Classifiers. (arXiv:2204.06264v2 [math.ST] UPDATED)
    We consider high-dimensional multiclass classification by sparse multinomial logistic regression. Unlike binary classification, in the multiclass setup one can think about an entire spectrum of possible notions of sparsity associated with different structural assumptions on the regression coefficients matrix. We propose a computationally feasible feature selection procedure based on penalized maximum likelihood with convex penalties capturing a specific type of sparsity at hand. In particular, we consider global sparsity, double row-wise sparsity, and low-rank sparsity, and show that with the properly chosen tuning parameters the derived plug-in classifiers attain the minimax generalization error bounds (in terms of misclassification excess risk) within the corresponding classes of multiclass sparse linear classifiers. The developed approach is general and can be adapted to other types of sparsity as well.  ( 2 min )
    Post-training Quantization for Neural Networks with Provable Guarantees. (arXiv:2201.11113v3 [cs.LG] UPDATED)
    While neural networks have been remarkably successful in a wide array of applications, implementing them in resource-constrained hardware remains an area of intense research. By replacing the weights of a neural network with quantized (e.g., 4-bit, or binary) counterparts, massive savings in computation cost, memory, and power consumption are attained. To that end, we generalize a post-training neural-network quantization method, GPFQ, that is based on a greedy path-following mechanism. Among other things, we propose modifications to promote sparsity of the weights, and rigorously analyze the associated error. Additionally, our error analysis expands the results of previous work on GPFQ to handle general quantization alphabets, showing that for quantizing a single-layer network, the relative square error essentially decays linearly in the number of weights -- i.e., level of over-parametrization. Our result holds across a range of input distributions and for both fully-connected and convolutional architectures thereby also extending previous results. To empirically evaluate the method, we quantize several common architectures with few bits per weight, and test them on ImageNet, showing only minor loss of accuracy compared to unquantized models. We also demonstrate that standard modifications, such as bias correction and mixed precision quantization, further improve accuracy.  ( 2 min )
    Adapting to Online Label Shift with Provable Guarantees. (arXiv:2207.02121v3 [cs.LG] UPDATED)
    The standard supervised learning paradigm works effectively when training data shares the same distribution as the upcoming testing samples. However, this stationary assumption is often violated in real-world applications, especially when testing data appear in an online fashion. In this paper, we formulate and investigate the problem of \emph{online label shift} (OLaS): the learner trains an initial model from the labeled offline data and then deploys it to an unlabeled online environment where the underlying label distribution changes over time but the label-conditional density does not. The non-stationarity nature and the lack of supervision make the problem challenging to be tackled. To address the difficulty, we construct a new unbiased risk estimator that utilizes the unlabeled data, which exhibits many benign properties albeit with potential non-convexity. Building upon that, we propose novel online ensemble algorithms to deal with the non-stationarity of the environments. Our approach enjoys optimal \emph{dynamic regret}, indicating that the performance is competitive with a clairvoyant who knows the online environments in hindsight and then chooses the best decision for each round. The obtained dynamic regret bound scales with the intensity and pattern of label distribution shift, hence exhibiting the adaptivity in the OLaS problem. Extensive experiments are conducted to validate the effectiveness and support our theoretical findings.  ( 2 min )
    Minimax Optimal Online Imitation Learning via Replay Estimation. (arXiv:2205.15397v5 [cs.LG] UPDATED)
    Online imitation learning is the problem of how best to mimic expert demonstrations, given access to the environment or an accurate simulator. Prior work has shown that in the infinite sample regime, exact moment matching achieves value equivalence to the expert policy. However, in the finite sample regime, even if one has no optimization error, empirical variance can lead to a performance gap that scales with $H^2 / N$ for behavioral cloning and $H / \sqrt{N}$ for online moment matching, where $H$ is the horizon and $N$ is the size of the expert dataset. We introduce the technique of replay estimation to reduce this empirical variance: by repeatedly executing cached expert actions in a stochastic simulator, we compute a smoother expert visitation distribution estimate to match. In the presence of general function approximation, we prove a meta theorem reducing the performance gap of our approach to the parameter estimation error for offline classification (i.e. learning the expert policy). In the tabular setting or with linear function approximation, our meta theorem shows that the performance gap incurred by our approach achieves the optimal $\widetilde{O} \left( \min({H^{3/2}} / {N}, {H} / {\sqrt{N}} \right)$ dependency, under significantly weaker assumptions compared to prior work. We implement multiple instantiations of our approach on several continuous control tasks and find that we are able to significantly improve policy performance across a variety of dataset sizes.  ( 2 min )
    Weisfeiler and Leman Go Walking: Random Walk Kernels Revisited. (arXiv:2205.10914v3 [cs.LG] UPDATED)
    Random walk kernels have been introduced in seminal work on graph learning and were later largely superseded by kernels based on the Weisfeiler-Leman test for graph isomorphism. We give a unified view on both classes of graph kernels. We study walk-based node refinement methods and formally relate them to several widely-used techniques, including Morgan's algorithm for molecule canonization and the Weisfeiler-Leman test. We define corresponding walk-based kernels on nodes that allow fine-grained parameterized neighborhood comparison, reach Weisfeiler-Leman expressiveness, and are computed using the kernel trick. From this we show that classical random walk kernels with only minor modifications regarding definition and computation are as expressive as the widely-used Weisfeiler-Leman subtree kernel but support non-strict neighborhood comparison. We verify experimentally that walk-based kernels reach or even surpass the accuracy of Weisfeiler-Leman kernels in real-world classification tasks.  ( 2 min )
    Toward Explainable AI for Regression Models. (arXiv:2112.11407v2 [cs.LG] UPDATED)
    In addition to the impressive predictive power of machine learning (ML) models, more recently, explanation methods have emerged that enable an interpretation of complex non-linear learning models such as deep neural networks. Gaining a better understanding is especially important e.g. for safety-critical ML applications or medical diagnostics etc. While such Explainable AI (XAI) techniques have reached significant popularity for classifiers, so far little attention has been devoted to XAI for regression models (XAIR). In this review, we clarify the fundamental conceptual differences of XAI for regression and classification tasks, establish novel theoretical insights and analysis for XAIR, provide demonstrations of XAIR on genuine practical regression problems, and finally discuss the challenges remaining for the field.  ( 2 min )
    Neural Network Architecture Beyond Width and Depth. (arXiv:2205.09459v4 [cs.LG] UPDATED)
    This paper proposes a new neural network architecture by introducing an additional dimension called height beyond width and depth. Neural network architectures with height, width, and depth as hyper-parameters are called three-dimensional architectures. It is shown that neural networks with three-dimensional architectures are significantly more expressive than the ones with two-dimensional architectures (those with only width and depth as hyper-parameters), e.g., standard fully connected networks. The new network architecture is constructed recursively via a nested structure, and hence we call a network with the new architecture nested network (NestNet). A NestNet of height $s$ is built with each hidden neuron activated by a NestNet of height $\le s-1$. When $s=1$, a NestNet degenerates to a standard network with a two-dimensional architecture. It is proved by construction that height-$s$ ReLU NestNets with $\mathcal{O}(n)$ parameters can approximate $1$-Lipschitz continuous functions on $[0,1]^d$ with an error $\mathcal{O}(n^{-(s+1)/d})$, while the optimal approximation error of standard ReLU networks with $\mathcal{O}(n)$ parameters is $\mathcal{O}(n^{-2/d})$. Furthermore, such a result is extended to generic continuous functions on $[0,1]^d$ with the approximation error characterized by the modulus of continuity. Finally, we use numerical experimentation to show the advantages of the super-approximation power of ReLU NestNets.  ( 2 min )
    Fast Bayesian Coresets via Subsampling and Quasi-Newton Refinement. (arXiv:2203.09675v3 [stat.ML] UPDATED)
    Bayesian coresets approximate a posterior distribution by building a small weighted subset of the data points. Any inference procedure that is too computationally expensive to be run on the full posterior can instead be run inexpensively on the coreset, with results that approximate those on the full data. However, current approaches are limited by either a significant run-time or the need for the user to specify a low-cost approximation to the full posterior. We propose a Bayesian coreset construction algorithm that first selects a uniformly random subset of data, and then optimizes the weights using a novel quasi-Newton method. Our algorithm is a simple to implement, black-box method, that does not require the user to specify a low-cost posterior approximation. It is the first to come with a general high-probability bound on the KL divergence of the output coreset posterior. Experiments demonstrate that our method provides significant improvements in coreset quality against alternatives with comparable construction times, with far less storage cost and user input required.  ( 2 min )
    Forecasting Market Changes using Variational Inference. (arXiv:2205.00605v2 [q-fin.ST] UPDATED)
    Though various approaches have been considered, forecasting near-term market changes of equities and similar market data remains quite difficult. In this paper we introduce an approach to forecast near-term market changes for equity indices as well as portfolios using variational inference (VI). VI is a machine learning approach which uses optimization techniques to estimate complex probability densities. In the proposed approach, clusters of explanatory variables are identified and market changes are forecast based on cluster-specific linear regression. Apart from the expected value of changes, the proposed approach can also be used to obtain the distribution of possible outcomes. Another advantage of the proposed approach is the clear model interpretation, as clusters of explanatory variables (or market regimes) are identified for which the future changes follow similar relationships. Knowledge about such clusters can provide useful insights about portfolio performance and identify the relative importance of variables in different market regimes. An illustrative example of predicting one-day S\&P change is considered and it is shown that even with as few as three explanatory variables, the proposed approach provides useful predictions.  ( 2 min )
    Random Planted Forest: a directly interpretable tree ensemble. (arXiv:2012.14563v2 [stat.ML] UPDATED)
    We introduce a novel interpretable, tree based algorithm for prediction in a regression setting in which each tree in a classical random forest is replaced by a family of planted trees that grow simultaneously. The motivation for our algorithm is to estimate the unknown regression function from a functional decomposition perspective, where each tree corresponds to a function within that decomposition. The maximal order of approximation in the decomposition can be specified or left unlimited. If a first order approximation is chosen, the result is an additive model. In the other extreme case, if the order of approximation is not limited, the resulting model places no restrictions on the form of the regression function. In a simulation study we find encouraging prediction and visualisation properties of our random planted forest method. We also develop theory for an idealised version of random planted forests in cases where the maximal order of approximation is low. We show that if the order is smaller than three, the idealised version achieves asymptotically optimal convergence rates up to a logarithmic factor. ode is available on https://github.com/PlantedML/randomPlantedForest  ( 2 min )
    The Unbalanced Gromov Wasserstein Distance: Conic Formulation and Relaxation. (arXiv:2009.04266v3 [math.OC] UPDATED)
    Comparing metric measure spaces (i.e. a metric space endowed with aprobability distribution) is at the heart of many machine learning problems. The most popular distance between such metric measure spaces is theGromov-Wasserstein (GW) distance, which is the solution of a quadratic assignment problem. The GW distance is however limited to the comparison of metric measure spaces endowed with a probability distribution. To alleviate this issue, we introduce two Unbalanced Gromov-Wasserstein formulations: a distance and a more tractable upper-bounding relaxation.They both allow the comparison of metric spaces equipped with arbitrary positive measures up to isometries. The first formulation is a positive and definite divergence based on a relaxation of the mass conservation constraint using a novel type of quadratically-homogeneous divergence. This divergence works hand in hand with the entropic regularization approach which is popular to solve large scale optimal transport problems. We show that the underlying non-convex optimization problem can be efficiently tackled using a highly parallelizable and GPU-friendly iterative scheme. The second formulation is a distance between mm-spaces up to isometries based on a conic lifting. Lastly, we provide numerical experiments onsynthetic examples and domain adaptation data with a Positive-Unlabeled learning task to highlight the salient features of the unbalanced divergence and its potential applications in ML.  ( 2 min )
    A Concentration of Measure Framework to study convex problems and other implicit formulation problems in machine learning. (arXiv:2010.09877v2 [math.PR] UPDATED)
    This paper provides a framework to show the concentration of solutions $Y^*$ to convex minimizing problem where the objective function $\phi(X)(Y)$ depends on some random vector $X$ satisfying concentration of measure hypotheses. More precisely, the convex problem translates into a contractive fixed point equation that ensure the transmission of the concentration from $X$ to $Y^*$. This result is of central interest to characterize many machine learning algorithms which are defined through implicit equations (e.g., logistic regression, lasso, boosting, etc.). Based on our framework, we provide precise estimations for the first moments of the solution $Y^*$, when $X= (x_1,\ldots, x_n)$ is a data matrix of independent columns and $\phi(X)(y)$ writes as a sum $\frac{1}{n}\sum_{i=1}^n h_i(x_i^TY)$. That allows to describe the behavior and performance (e.g., generalization error) of a wide variety of machine learning classifiers.  ( 2 min )
    Temporal-Logic-Based Reward Shaping for Continuing Reinforcement Learning Tasks. (arXiv:2007.01498v2 [cs.AI] UPDATED)
    In continuing tasks, average-reward reinforcement learning may be a more appropriate problem formulation than the more common discounted reward formulation. As usual, learning an optimal policy in this setting typically requires a large amount of training experiences. Reward shaping is a common approach for incorporating domain knowledge into reinforcement learning in order to speed up convergence to an optimal policy. However, to the best of our knowledge, the theoretical properties of reward shaping have thus far only been established in the discounted setting. This paper presents the first reward shaping framework for average-reward learning and proves that, under standard assumptions, the optimal policy under the original reward function can be recovered. In order to avoid the need for manual construction of the shaping function, we introduce a method for utilizing domain knowledge expressed as a temporal logic formula. The formula is automatically translated to a shaping function that provides additional reward throughout the learning process. We evaluate the proposed method on three continuing tasks. In all cases, shaping speeds up the average-reward learning rate without any reduction in the performance of the learned policy compared to relevant baselines.  ( 2 min )
    Necessary and Sufficient Conditions for Inverse Reinforcement Learning of Bayesian Stopping Time Problems. (arXiv:2007.03481v5 [cs.LG] UPDATED)
    This paper presents an inverse reinforcement learning~(IRL) framework for Bayesian stopping time problems. By observing the actions of a Bayesian decision maker, we provide a necessary and sufficient condition to identify if these actions are consistent with optimizing a cost function. In a Bayesian (partially observed) setting, the inverse learner can at best identify optimality wrt the observed actions. Our IRL algorithm identifies optimality and then constructs set valued estimates of the cost function. To achieve this IRL objective, we use novel ideas from Bayesian revealed preferences stemming from microeconomics. We illustrate the proposed IRL scheme using two important examples of stopping time problems, namely, sequential hypothesis testing and Bayesian search, and also on a real-world YouTube dataset. Finally, for finite datasets, we propose an IRL detection algorithm and give finite sample bounds on its error probabilities.  ( 2 min )
    Taming neural networks with TUSLA: Non-convex learning via adaptive stochastic gradient Langevin algorithms. (arXiv:2006.14514v4 [cs.LG] UPDATED)
    Artificial neural networks (ANNs) are typically highly nonlinear systems which are finely tuned via the optimization of their associated, non-convex loss functions. In many cases, the gradient of any such loss function has superlinear growth, making the use of the widely-accepted (stochastic) gradient descent methods, which are based on Euler numerical schemes, problematic. We offer a new learning algorithm based on an appropriately constructed variant of the popular stochastic gradient Langevin dynamics (SGLD), which is called tamed unadjusted stochastic Langevin algorithm (TUSLA). We also provide a nonasymptotic analysis of the new algorithm's convergence properties in the context of non-convex learning problems with the use of ANNs. Thus, we provide finite-time guarantees for TUSLA to find approximate minimizers of both empirical and population risks. The roots of the TUSLA algorithm are based on the taming technology for diffusion processes with superlinear coefficients as developed in \citet{tamed-euler, SabanisAoAP} and for MCMC algorithms in \citet{tula}. Numerical experiments are presented which confirm the theoretical findings and illustrate the need for the use of the new algorithm in comparison to vanilla SGLD within the framework of ANNs.  ( 2 min )
    Efficiently Breaking the Curse of Horizon in Off-Policy Evaluation with Double Reinforcement Learning. (arXiv:1909.05850v6 [stat.ML] UPDATED)
    Off-policy evaluation (OPE) in reinforcement learning is notoriously difficult in long- and infinite-horizon settings due to diminishing overlap between behavior and target policies. In this paper, we study the role of Markovian and time-invariant structure in efficient OPE. We first derive the efficiency bounds for OPE when one assumes each of these structures. This precisely characterizes the curse of horizon: in time-variant processes, OPE is only feasible in the near-on-policy setting, where behavior and target policies are sufficiently similar. But, in time-invariant Markov decision processes, our bounds show that truly-off-policy evaluation is feasible, even with only just one dependent trajectory, and provide the limits of how well we could hope to do. We develop a new estimator based on Double Reinforcement Learning (DRL) that leverages this structure for OPE using the efficient influence function we derive. Our DRL estimator simultaneously uses estimated stationary density ratios and $q$-functions and remains efficient when both are estimated at slow, nonparametric rates and remains consistent when either is estimated consistently. We investigate these properties and the performance benefits of leveraging the problem structure for more efficient OPE.  ( 2 min )
    Decentralized Exploration in Multi-Armed Bandits -- Extended version. (arXiv:1811.07763v6 [cs.LG] UPDATED)
    We consider the decentralized exploration problem: a set of players collaborate to identify the best arm by asynchronously interacting with the same stochastic environment. The objective is to insure privacy in the best arm identification problem between asynchronous, collaborative, and thrifty players. In the context of a digital service, we advocate that this decentralized approach allows a good balance between the interests of users and those of service providers: the providers optimize their services, while protecting the privacy of the users and saving resources. We define the privacy level as the amount of information an adversary could infer by intercepting the messages concerning a single user. We provide a generic algorithm Decentralized Elimination, which uses any best arm identification algorithm as a subroutine. We prove that this algorithm insures privacy, with a low communication cost, and that in comparison to the lower bound of the best arm identification problem, its sample complexity suffers from a penalty depending on the inverse of the probability of the most frequent players. Then, thanks to the genericity of the approach, we extend the proposed algorithm to the non-stationary bandits. Finally, experiments illustrate and complete the analysis.  ( 2 min )
    Martingale Methods for Sequential Estimation of Convex Functionals and Divergences. (arXiv:2103.09267v3 [math.ST] UPDATED)
    We present a unified technique for sequential estimation of convex divergences between distributions, including integral probability metrics like the kernel maximum mean discrepancy, $\varphi$-divergences like the Kullback-Leibler divergence, and optimal transport costs, such as powers of Wasserstein distances. This is achieved by observing that empirical convex divergences are (partially ordered) reverse submartingales with respect to the exchangeable filtration, coupled with maximal inequalities for such processes. These techniques appear to be complementary and powerful additions to the existing literature on both confidence sequences and convex divergences. We construct an offline-to-sequential device that converts a wide array of existing offline concentration inequalities into time-uniform confidence sequences that can be continuously monitored, providing valid tests or confidence intervals at arbitrary stopping times. The resulting sequential bounds pay only an iterated logarithmic price over the corresponding fixed-time bounds, retaining the same dependence on problem parameters (like dimension or alphabet size if applicable). These results are also applicable to more general convex functionals, like the negative differential entropy, suprema of empirical processes, and V-Statistics.  ( 2 min )
    Robust Max Entrywise Error Bounds for Tensor Estimation from Sparse Observations via Similarity Based Collaborative Filtering. (arXiv:1908.01241v4 [cs.LG] UPDATED)
    Consider the task of estimating a 3-order $n \times n \times n$ tensor from noisy observations of randomly chosen entries in the sparse regime. We introduce a similarity based collaborative filtering algorithm for estimating a tensor from sparse observations and argue that it achieves sample complexity that nearly matches the conjectured computationally efficient lower bound on the sample complexity for the setting of low-rank tensors. Our algorithm uses the matrix obtained from the flattened tensor to compute similarity, and estimates the tensor entries using a nearest neighbor estimator. We prove that the algorithm recovers a finite rank tensor with maximum entry-wise error (MEE) and mean-squared-error (MSE) decaying to $0$ as long as each entry is observed independently with probability $p = \Omega(n^{-3/2 + \kappa})$ for any arbitrarily small $\kappa > 0$. More generally, we establish robustness of the estimator, showing that when arbitrary noise bounded by $\varepsilon \geq 0$ is added to each observation, the estimation error with respect to MEE and MSE degrades by $\text{poly}(\varepsilon)$. Consequently, even if the tensor may not have finite rank but can be approximated within $\varepsilon \geq 0$ by a finite rank tensor, then the estimation error converges to $\text{poly}(\varepsilon)$. Our analysis sheds insight into the conjectured sample complexity lower bound, showing that it matches the connectivity threshold of the graph used by our algorithm for estimating similarity between coordinates.  ( 2 min )
    Efficient and Accurate Estimation of Lipschitz Constants for Deep Neural Networks. (arXiv:1906.04893v2 [cs.LG] UPDATED)
    Tight estimation of the Lipschitz constant for deep neural networks (DNNs) is useful in many applications ranging from robustness certification of classifiers to stability analysis of closed-loop systems with reinforcement learning controllers. Existing methods in the literature for estimating the Lipschitz constant suffer from either lack of accuracy or poor scalability. In this paper, we present a convex optimization framework to compute guaranteed upper bounds on the Lipschitz constant of DNNs both accurately and efficiently. Our main idea is to interpret activation functions as gradients of convex potential functions. Hence, they satisfy certain properties that can be described by quadratic constraints. This particular description allows us to pose the Lipschitz constant estimation problem as a semidefinite program (SDP). The resulting SDP can be adapted to increase either the estimation accuracy (by capturing the interaction between activation functions of different layers) or scalability (by decomposition and parallel implementation). We illustrate the utility of our approach with a variety of experiments on randomly generated networks and on classifiers trained on the MNIST and Iris datasets. In particular, we experimentally demonstrate that our Lipschitz bounds are the most accurate compared to those in the literature. We also study the impact of adversarial training methods on the Lipschitz bounds of the resulting classifiers and show that our bounds can be used to efficiently provide robustness guarantees.  ( 2 min )
    On the role of Model Uncertainties in Bayesian Optimization. (arXiv:2301.05983v1 [stat.ML])
    Bayesian optimization (BO) is a popular method for black-box optimization, which relies on uncertainty as part of its decision-making process when deciding which experiment to perform next. However, not much work has addressed the effect of uncertainty on the performance of the BO algorithm and to what extent calibrated uncertainties improve the ability to find the global optimum. In this work, we provide an extensive study of the relationship between the BO performance (regret) and uncertainty calibration for popular surrogate models and compare them across both synthetic and real-world experiments. Our results confirm that Gaussian Processes are strong surrogate models and that they tend to outperform other popular models. Our results further show a positive association between calibration error and regret, but interestingly, this association disappears when we control for the type of model in the analysis. We also studied the effect of re-calibration and demonstrate that it generally does not lead to improved regret. Finally, we provide theoretical justification for why uncertainty calibration might be difficult to combine with BO due to the small sample sizes commonly used.  ( 2 min )
    Transformers as Algorithms: Generalization and Implicit Model Selection in In-context Learning. (arXiv:2301.07067v1 [cs.LG])
    In-context learning (ICL) is a type of prompting where a transformer model operates on a sequence of (input, output) examples and performs inference on-the-fly. This implicit training is in contrast to explicitly tuning the model weights based on examples. In this work, we formalize in-context learning as an algorithm learning problem, treating the transformer model as a learning algorithm that can be specialized via training to implement-at inference-time-another target algorithm. We first explore the statistical aspects of this abstraction through the lens of multitask learning: We obtain generalization bounds for ICL when the input prompt is (1) a sequence of i.i.d. (input, label) pairs or (2) a trajectory arising from a dynamical system. The crux of our analysis is relating the excess risk to the stability of the algorithm implemented by the transformer, which holds under mild assumptions. Secondly, we use our abstraction to show that transformers can act as an adaptive learning algorithm and perform model selection across different hypothesis classes. We provide numerical evaluations that (1) demonstrate transformers can indeed implement near-optimal algorithms on classical regression problems with i.i.d. and dynamic data, (2) identify an inductive bias phenomenon where the transfer risk on unseen tasks is independent of the transformer complexity, and (3) empirically verify our theoretical predictions.  ( 2 min )
    A Fast Algorithm for Adaptive Private Mean Estimation. (arXiv:2301.07078v1 [stat.ML])
    We design an $(\varepsilon, \delta)$-differentially private algorithm to estimate the mean of a $d$-variate distribution, with unknown covariance $\Sigma$, that is adaptive to $\Sigma$. To within polylogarithmic factors, the estimator achieves optimal rates of convergence with respect to the induced Mahalanobis norm $||\cdot||_\Sigma$, takes time $\tilde{O}(n d^2)$ to compute, has near linear sample complexity for sub-Gaussian distributions, allows $\Sigma$ to be degenerate or low rank, and adaptively extends beyond sub-Gaussianity. Prior to this work, other methods required exponential computation time or the superlinear scaling $n = \Omega(d^{3/2})$ to achieve non-trivial error with respect to the norm $||\cdot||_\Sigma$.  ( 2 min )
    MAFUS: a Framework to predict mortality risk in MAFLD subjects. (arXiv:2301.06908v1 [stat.ML])
    Metabolic (dysfunction) associated fatty liver disease (MAFLD) establishes new criteria for diagnosing fatty liver disease independent of alcohol consumption and concurrent viral hepatitis infection. However, the long-term outcome of MAFLD subjects is sparse. Few articles are focused on mortality in MAFLD subjects, and none investigate how to predict a fatal outcome. In this paper, we propose an artificial intelligence-based framework named MAFUS that physicians can use for predicting mortality in MAFLD subjects. The framework uses data from various anthropometric and biochemical sources based on Machine Learning (ML) algorithms. The framework has been tested on a state-of-the-art dataset on which five ML algorithms are trained. Support Vector Machines resulted in being the best model. Furthermore, an Explainable Artificial Intelligence (XAI) analysis has been performed to understand the SVM diagnostic reasoning and the contribution of each feature to the prediction. The MAFUS framework is easy to apply, and the required parameters are readily available in the dataset.  ( 2 min )
    Enhancing Deep Traffic Forecasting Models with Dynamic Regression. (arXiv:2301.06650v1 [cs.LG])
    A common assumption in deep learning-based multivariate and multistep traffic time series forecasting models is that residuals are independent, isotropic, and uncorrelated in space and time. While this assumption provides a straightforward loss function (such as MAE/MSE), it is inevitable that residual processes will exhibit strong autocorrelation and structured spatiotemporal correlation. In this paper, we propose a complementary dynamic regression (DR) framework to enhance existing deep multistep traffic forecasting frameworks through structured specifications and learning for the residual process. Specifically, we assume the residuals of the base model (i.e., a well-developed traffic forecasting model) are governed by a matrix-variate seasonal autoregressive (AR) model, which can be seamlessly integrated into the training process by redesigning the overall loss function. Parameters in the DR framework can be jointly learned with the base model. We evaluate the effectiveness of the proposed framework in enhancing several state-of-the-art deep traffic forecasting models on both speed and flow datasets. Our experiment results show that the DR framework not only improves existing traffic forecasting models but also offers interpretable regression coefficients and spatiotemporal covariance matrices.  ( 2 min )
    Neural Operator Framework for Digital Twin and Complex Engineering Systems. (arXiv:2301.06701v1 [cs.LG])
    With modern computational advancements and statistical analysis methods, machine learning algorithms have become a vital part of engineering modeling. Neural Operator Networks (ONets) is an emerging machine learning algorithm as a "faster surrogate" for approximating solutions to partial differential equations (PDEs) due to their ability to approximate mathematical operators versus the direct approximation of Neural Networks (NN). ONets use the Universal Approximation Theorem to map finite-dimensional inputs to infinite-dimensional space using the branch-trunk architecture, which encodes domain and feature information separately before using a dot product to combine the information. ONets are expected to occupy a vital niche for surrogate modeling in physical systems and Digital Twin (DT) development. Three test cases are evaluated using ONets for operator approximation, including a 1-dimensional ordinary differential equations (ODE), general diffusion system, and convection-diffusion (Burger) system. Solutions for ODE and diffusion systems yield accurate and reliable results (R2>0.95), while solutions for Burger systems need further refinement in the ONet algorithm.  ( 2 min )
    $Ae^2I$: A Double Autoencoder for Imputation of Missing Values. (arXiv:2301.06633v1 [cs.LG])
    The most common strategy of imputing missing values in a table is to study either the column-column relationship or the row-row relationship of the data table, then use the relationship to impute the missing values based on the non-missing values from other columns of the same row, or from the other rows of the same column. This paper introduces a double autoencoder for imputation ($Ae^2I$) that simultaneously and collaboratively uses both row-row relationship and column-column relationship to impute the missing values. Empirical tests on Movielens 1M dataset demonstrated that $Ae^2I$ outperforms the current state-of-the-art models for recommender systems by a significant margin.  ( 2 min )
    Deep Conditional Measure Quantization. (arXiv:2301.06907v1 [stat.ML])
    The quantization of a (probability) measure is replacing it by a sum of Dirac masses that is close enough to it (in some metric space of probability measures). Various methods exists to do so, but the situation of quantizing a conditional law has been less explored. We propose a method, called DCMQ, involving a Huber-energy kernel-based approach coupled with a deep neural network architecture. The method is tested on several examples and obtains promising results.  ( 2 min )
    Kernel-based off-policy estimation without overlap: Instance optimality beyond semiparametric efficiency. (arXiv:2301.06240v1 [math.ST])
    We study optimal procedures for estimating a linear functional based on observational data. In many problems of this kind, a widely used assumption is strict overlap, i.e., uniform boundedness of the importance ratio, which measures how well the observational data covers the directions of interest. When it is violated, the classical semi-parametric efficiency bound can easily become infinite, so that the instance-optimal risk depends on the function class used to model the regression function. For any convex and symmetric function class $\mathcal{F}$, we derive a non-asymptotic local minimax bound on the mean-squared error in estimating a broad class of linear functionals. This lower bound refines the classical semi-parametric one, and makes connections to moduli of continuity in functional estimation. When $\mathcal{F}$ is a reproducing kernel Hilbert space, we prove that this lower bound can be achieved up to a constant factor by analyzing a computationally simple regression estimator. We apply our general results to various families of examples, thereby uncovering a spectrum of rates that interpolate between the classical theories of semi-parametric efficiency (with $\sqrt{n}$-consistency) and the slower minimax rates associated with non-parametric function estimation.  ( 2 min )
    Theoretical and computational aspects of robust optimal transportation, with applications to statistics and machine learning. (arXiv:2301.06297v1 [math.ST])
    Optimal transport (OT) theory and the related $p$-Wasserstein distance ($W_p$, $p\geq 1$) are popular tools in statistics and machine learning. Recent studies have been remarking that inference based on OT and on $W_p$ is sensitive to outliers. To cope with this issue, we work on a robust version of the primal OT problem (ROBOT) and show that it defines a robust version of $W_1$, called robust Wasserstein distance, which is able to downweight the impact of outliers. We study properties of this novel distance and use it to define minimum distance estimators. Our novel estimators do not impose any moment restrictions: this allows us to extend the use of OT methods to inference on heavy-tailed distributions. We also provide statistical guarantees of the proposed estimators. Moreover, we derive the dual form of the ROBOT and illustrate its applicability to machine learning. Numerical exercises (see also the supplementary material) provide evidence of the benefits yielded by our methods.  ( 2 min )
    Doubly Robust Counterfactual Classification. (arXiv:2301.06199v1 [cs.LG])
    We study counterfactual classification as a new tool for decision-making under hypothetical (contrary to fact) scenarios. We propose a doubly-robust nonparametric estimator for a general counterfactual classifier, where we can incorporate flexible constraints by casting the classification problem as a nonlinear mathematical program involving counterfactuals. We go on to analyze the rates of convergence of the estimator and provide a closed-form expression for its asymptotic distribution. Our analysis shows that the proposed estimator is robust against nuisance model misspecification, and can attain fast $\sqrt{n}$ rates with tractable inference even when using nonparametric machine learning approaches. We study the empirical performance of our methods by simulation and apply them for recidivism risk prediction.  ( 2 min )
    Data-aware customization of activation functions reduces neural network error. (arXiv:2301.06635v1 [cs.LG])
    Activation functions play critical roles in neural networks, yet current off-the-shelf neural networks pay little attention to the specific choice of activation functions used. Here we show that data-aware customization of activation functions can result in striking reductions in neural network error. We first give a simple linear algebraic explanation of the role of activation functions in neural networks; then, through connection with the Diaconis-Shahshahani Approximation Theorem, we propose a set of criteria for good activation functions. As a case study, we consider regression tasks with a partially exchangeable target function, \emph{i.e.} $f(u,v,w)=f(v,u,w)$ for $u,v\in \mathbb{R}^d$ and $w\in \mathbb{R}^k$, and prove that for such a target function, using an even activation function in at least one of the layers guarantees that the prediction preserves partial exchangeability for best performance. Since even activation functions are seldom used in practice, we designed the ``seagull'' even activation function $\log(1+x^2)$ according to our criteria. Empirical testing on over two dozen 9-25 dimensional examples with different local smoothness, curvature, and degree of exchangeability revealed that a simple substitution with the ``seagull'' activation function in an already-refined neural network can lead to an order-of-magnitude reduction in error. This improvement was most pronounced when the activation function substitution was applied to the layer in which the exchangeable variables are connected for the first time. While the improvement is greatest for low-dimensional data, experiments on the CIFAR10 image classification dataset showed that use of ``seagull'' can reduce error even for high-dimensional cases. These results collectively highlight the potential of customizing activation functions as a general approach to improve neural network performance.  ( 2 min )
    Asymptotic normality and optimality in nonsmooth stochastic approximation. (arXiv:2301.06632v1 [math.OC])
    In their seminal work, Polyak and Juditsky showed that stochastic approximation algorithms for solving smooth equations enjoy a central limit theorem. Moreover, it has since been argued that the asymptotic covariance of the method is best possible among any estimation procedure in a local minimax sense of H\'{a}jek and Le Cam. A long-standing open question in this line of work is whether similar guarantees hold for important non-smooth problems, such as stochastic nonlinear programming or stochastic variational inequalities. In this work, we show that this is indeed the case.  ( 2 min )
    Geometric ergodicity of SGLD via reflection coupling. (arXiv:2301.06769v1 [math.PR])
    We consider the geometric ergodicity of the Stochastic Gradient Langevin Dynamics (SGLD) algorithm under nonconvexity settings. Via the technique of reflection coupling, we prove the Wasserstein contraction of SGLD when the target distribution is log-concave only outside some compact set. The time discretization and the minibatch in SGLD introduce several difficulties when applying the reflection coupling, which are addressed by a series of careful estimates of conditional expectations. As a direct corollary, the SGLD with constant step size has an invariant distribution and we are able to obtain its geometric ergodicity in terms of $W_1$ distance. The generalization to non-gradient drifts is also included.  ( 2 min )
    Case-Base Neural Networks: survival analysis with time-varying, higher-order interactions. (arXiv:2301.06535v1 [stat.ML])
    Neural network-based survival methods can model data-driven covariate interactions. While these methods have led to better predictive performance than regression-based approaches, they cannot model both time-varying interactions and complex baseline hazards. To address this, we propose Case-Base Neural Networks (CBNN) as a new approach that combines the case-base sampling framework with flexible architectures. Our method naturally accounts for censoring and does not require method specific hyperparameters. Using a novel sampling scheme and data augmentation, we incorporate time directly into a feed-forward neural network. CBNN predicts the probability of an event occurring at a given moment and estimates the hazard function. We compare the performance of CBNN to survival methods based on regression and neural networks in two simulations and two real data applications. We report two time-dependent metrics for each model. In the simulations and real data applications, CBNN provides a more consistent predictive performance across time and outperforms the competing neural network approaches. For a simple simulation with an exponential hazard model, CBNN outperforms the other neural network methods. For a complex simulation, which highlights the ability of CBNN to model both a complex baseline hazard and time-varying interactions, CBNN outperforms all competitors. The first real data application shows CBNN outperforming all neural network competitors, while a second real data application shows competitive performance. We highlight the benefit of combining case-base sampling with deep learning to provide a simple and flexible modeling framework for data-driven, time-varying interaction modeling of survival outcomes. An R package is available at https://github.com/Jesse-Islam/cbnn.  ( 2 min )
    Expected Gradients of Maxout Networks and Consequences to Parameter Initialization. (arXiv:2301.06956v1 [stat.ML])
    We study the gradients of a maxout network with respect to inputs and parameters and obtain bounds for the moments depending on the architecture and the parameter distribution. We observe that the distribution of the input-output Jacobian depends on the input, which complicates a stable parameter initialization. Based on the moments of the gradients, we formulate parameter initialization strategies that avoid vanishing and exploding gradients in wide networks. Experiments with deep fully-connected and convolutional networks show that this strategy improves SGD and Adam training of deep maxout networks. In addition, we obtain refined bounds on the expected number of linear regions, results on the expected curve length distortion, and results on the NTK.  ( 2 min )
    Optimal Algorithms for Latent Bandits with Cluster Structure. (arXiv:2301.07040v1 [cs.LG])
    We consider the problem of latent bandits with cluster structure where there are multiple users, each with an associated multi-armed bandit problem. These users are grouped into \emph{latent} clusters such that the mean reward vectors of users within the same cluster are identical. At each round, a user, selected uniformly at random, pulls an arm and observes a corresponding noisy reward. The goal of the users is to maximize their cumulative rewards. This problem is central to practical recommendation systems and has received wide attention of late \cite{gentile2014online, maillard2014latent}. Now, if each user acts independently, then they would have to explore each arm independently and a regret of $\Omega(\sqrt{\mathsf{MNT}})$ is unavoidable, where $\mathsf{M}, \mathsf{N}$ are the number of arms and users, respectively. Instead, we propose LATTICE (Latent bAndiTs via maTrIx ComplEtion) which allows exploitation of the latent cluster structure to provide the minimax optimal regret of $\widetilde{O}(\sqrt{(\mathsf{M}+\mathsf{N})\mathsf{T}})$, when the number of clusters is $\widetilde{O}(1)$. This is the first algorithm to guarantee such a strong regret bound. LATTICE is based on a careful exploitation of arm information within a cluster while simultaneously clustering users. Furthermore, it is computationally efficient and requires only $O(\log{\mathsf{T}})$ calls to an offline matrix completion oracle across all $\mathsf{T}$ rounds.  ( 2 min )
    From Risk Prediction to Risk Factors Interpretation. Comparison of Neural Networks and Classical Statistics for Dementia Prediction. (arXiv:2301.06995v1 [stat.AP])
    It is proposed to investigate the onset of a disease D, based on several risk factors., with a specific interest in Alzheimer occurrence. For that purpose, two classes of techniques are available, whose properties are quite different in terms of interpretation, which is the focus of this paper: classical statistics based on probabilistic models and artificial intelligence (mainly neural networks) based on optimization algorithms. Both methods are good at prediction, with a preference for neural networks when the dimension of the potential predictors is high. But the advantage of the classical statistics is cognitive : the role of each factor is generally summarized in the value of a coefficient which is highly positive for a harmful factor, close to 0 for an irrelevant one, and highly negative for a beneficial one.  ( 2 min )
    A Coreset Learning Reality Check. (arXiv:2301.06163v1 [cs.LG])
    Subsampling algorithms are a natural approach to reduce data size before fitting models on massive datasets. In recent years, several works have proposed methods for subsampling rows from a data matrix while maintaining relevant information for classification. While these works are supported by theory and limited experiments, to date there has not been a comprehensive evaluation of these methods. In our work, we directly compare multiple methods for logistic regression drawn from the coreset and optimal subsampling literature and discover inconsistencies in their effectiveness. In many cases, methods do not outperform simple uniform subsampling.  ( 2 min )
    GAR: Generalized Autoregression for Multi-Fidelity Fusion. (arXiv:2301.05729v1 [stat.ML])
    In many scientific research and engineering applications where repeated simulations of complex systems are conducted, a surrogate is commonly adopted to quickly estimate the whole system. To reduce the expensive cost of generating training examples, it has become a promising approach to combine the results of low-fidelity (fast but inaccurate) and high-fidelity (slow but accurate) simulations. Despite the fast developments of multi-fidelity fusion techniques, most existing methods require particular data structures and do not scale well to high-dimensional output. To resolve these issues, we generalize the classic autoregression (AR), which is wildly used due to its simplicity, robustness, accuracy, and tractability, and propose generalized autoregression (GAR) using tensor formulation and latent features. GAR can deal with arbitrary dimensional outputs and arbitrary multifidelity data structure to satisfy the demand of multi-fidelity fusion for complex problems; it admits a fully tractable likelihood and posterior requiring no approximate inference and scales well to high-dimensional problems. Furthermore, we prove the autokrigeability theorem based on GAR in the multi-fidelity case and develop CIGAR, a simplified GAR with the exact predictive mean accuracy with computation reduction by a factor of d 3, where d is the dimensionality of the output. The empirical assessment includes many canonical PDEs and real scientific examples and demonstrates that the proposed method consistently outperforms the SOTA methods with a large margin (up to 6x improvement in RMSE) with only a couple high-fidelity training samples.  ( 2 min )
    Calibrated Data-Dependent Constraints with Exact Satisfaction Guarantees. (arXiv:2301.06195v1 [stat.ML])
    We consider the task of training machine learning models with data-dependent constraints. Such constraints often arise as empirical versions of expected value constraints that enforce fairness or stability goals. We reformulate data-dependent constraints so that they are calibrated: enforcing the reformulated constraints guarantees that their expected value counterparts are satisfied with a user-prescribed probability. The resulting optimization problem is amendable to standard stochastic optimization algorithms, and we demonstrate the efficacy of our method on a fairness-sensitive classification task where we wish to guarantee the classifier's fairness (at test time).  ( 2 min )
    Tale of two c(omplex)ities. (arXiv:2301.06259v1 [math.ST])
    For decades, best subset selection (BSS) has eluded statisticians mainly due to its computational bottleneck. However, until recently, modern computational breakthroughs have rekindled theoretical interest in BSS and have led to new findings. Recently, Guo et al. (2020) showed that the model selection performance of BSS is governed by a margin quantity that is robust to the design dependence, unlike modern methods such as LASSO, SCAD, MCP, etc. Motivated by their theoretical results, in this paper, we also study the variable selection properties of best subset selection for high-dimensional sparse linear regression setup. We show that apart from the identifiability margin, the following two complexity measures play a fundamental role in characterizing the margin condition for model consistency: (a) complexity of residualized features, (b) complexity of spurious projections. In particular, we establish a simple margin condition that only depends only on the identifiability margin quantity and the dominating one of the two complexity measures. Furthermore, we show that a similar margin condition depending on similar margin quantity and complexity measures is also necessary for model consistency of BSS. For a broader understanding of the complexity measures, we also consider some simple illustrative examples to demonstrate the variation in the complexity measures which broadens our theoretical understanding of the model selection performance of BSS under different correlation structures.  ( 2 min )
    Scaling Deep Networks with the Mesh Adaptive Direct Search algorithm. (arXiv:2301.06641v1 [stat.ML])
    Deep neural networks are getting larger. Their implementation on edge and IoT devices becomes more challenging and moved the community to design lighter versions with similar performance. Standard automatic design tools such as \emph{reinforcement learning} and \emph{evolutionary computing} fundamentally rely on cheap evaluations of an objective function. In the neural network design context, this objective is the accuracy after training, which is expensive and time-consuming to evaluate. We automate the design of a light deep neural network for image classification using the \emph{Mesh Adaptive Direct Search}(MADS) algorithm, a mature derivative-free optimization method that effectively accounts for the expensive blackbox nature of the objective function to explore the design space, even in the presence of constraints.Our tests show competitive compression rates with reduced numbers of trials.  ( 2 min )
    Intrinsic Gaussian Process on Unknown Manifolds with Probabilistic Metrics. (arXiv:2301.06533v1 [stat.ML])
    This article presents a novel approach to construct Intrinsic Gaussian Processes for regression on unknown manifolds with probabilistic metrics (GPUM) in point clouds. In many real world applications, one often encounters high dimensional data (e.g. point cloud data) centred around some lower dimensional unknown manifolds. The geometry of manifold is in general different from the usual Euclidean geometry. Naively applying traditional smoothing methods such as Euclidean Gaussian Processes (GPs) to manifold valued data and so ignoring the geometry of the space can potentially lead to highly misleading predictions and inferences. A manifold embedded in a high dimensional Euclidean space can be well described by a probabilistic mapping function and the corresponding latent space. We investigate the geometrical structure of the unknown manifolds using the Bayesian Gaussian Processes latent variable models(BGPLVM) and Riemannian geometry. The distribution of the metric tensor is learned using BGPLVM. The boundary of the resulting manifold is defined based on the uncertainty quantification of the mapping. We use the the probabilistic metric tensor to simulate Brownian Motion paths on the unknown manifold. The heat kernel is estimated as the transition density of Brownian Motion and used as the covariance functions of GPUM. The applications of GPUM are illustrated in the simulation studies on the Swiss roll, high dimensional real datasets of WiFi signals and image data examples. Its performance is compared with the Graph Laplacian GP, Graph Matern GP and Euclidean GP.  ( 2 min )
    A domain-decomposed VAE method for Bayesian inverse problems. (arXiv:2301.05708v1 [stat.ML])
    Bayesian inverse problems are often computationally challenging when the forward model is governed by complex partial differential equations (PDEs). This is typically caused by expensive forward model evaluations and high-dimensional parameterization of priors. This paper proposes a domain-decomposed variational auto-encoder Markov chain Monte Carlo (DD-VAE-MCMC) method to tackle these challenges simultaneously. Through partitioning the global physical domain into small subdomains, the proposed method first constructs local deterministic generative models based on local historical data, which provide efficient local prior representations. Gaussian process models with active learning address the domain decomposition interface conditions. Then inversions are conducted on each subdomain independently in parallel and in low-dimensional latent parameter spaces. The local inference solutions are post-processed through the Poisson image blending procedure to result in an efficient global inference result. Numerical examples are provided to demonstrate the performance of the proposed method.  ( 2 min )
    Compress Then Test: Powerful Kernel Testing in Near-linear Time. (arXiv:2301.05974v1 [stat.ML])
    Kernel two-sample testing provides a powerful framework for distinguishing any pair of distributions based on $n$ sample points. However, existing kernel tests either run in $n^2$ time or sacrifice undue power to improve runtime. To address these shortcomings, we introduce Compress Then Test (CTT), a new framework for high-powered kernel testing based on sample compression. CTT cheaply approximates an expensive test by compressing each $n$ point sample into a small but provably high-fidelity coreset. For standard kernels and subexponential distributions, CTT inherits the statistical behavior of a quadratic-time test -- recovering the same optimal detection boundary -- while running in near-linear time. We couple these advances with cheaper permutation testing, justified by new power analyses; improved time-vs.-quality guarantees for low-rank approximation; and a fast aggregation procedure for identifying especially discriminating kernels. In our experiments with real and simulated data, CTT and its extensions provide 20--200x speed-ups over state-of-the-art approximate MMD tests with no loss of power.  ( 2 min )
  • Open

    DreamerV3: Mastering Diverse Domains through World Models
    Highlights  ( 4 min )

  • Open

    Is there an AI API that can relate images to text?
    I have a pretty big catalogue of digital miniatures on Notion, which all have images and tags. The proccess of tagging each entry is very laborious, so I was wondering if I could use some kind of AI service that can train on my dataset, so that I could use it to automatically fill the tags for me, and also to use it as a search engine. I know some javascript and have some experience using APIs, but I'm not a developer, so I'm looking for something that's relatively easy to use, and that I won't need to learn C# or other advanced programming languages. ​ This is a sample entry from my database submitted by /u/Rayuaz [link] [comments]  ( 43 min )
    Can anyone answer this?
    What are amazon and Google and Bing policies on AI generated text - any test cases or precedents where accounts or sites closed? submitted by /u/TheChalaK- [link] [comments]  ( 43 min )
    These boston dynamics videos just keep getting more and more concerning.
    submitted by /u/Rollyman1 [link] [comments]  ( 43 min )
    What AI can I use to make deepfakes of my voice?
    I've recently been seeing artists feeding AI their voice & having it sing for them. That's pretty cool & I'd like to try it, but all I'm finding are research papers on it and no actual open AI for me to try this. Anyone know where I can access such an AI? submitted by /u/KelkonBajam [link] [comments]  ( 50 min )
    Synthetic data is the future of AI
    https://moez-62905.medium.com/synthetic-data-is-the-future-of-artificial-intelligence-6fcfd2ce1a14 submitted by /u/repeat_or [link] [comments]  ( 46 min )
    A short story about ChatGPT3 killing the giant Google
    submitted by /u/Imagine-your-success [link] [comments]  ( 55 min )
    Generative AI and The Future of Work
    submitted by /u/_utisz_ [link] [comments]  ( 52 min )
    AI is Assisting the UN in Preventing Nuclear War
    submitted by /u/HODLTID [link] [comments]  ( 42 min )
    Anyone know a "zyral, Zyro" AI application?
    Heard it on a stream but can't seem to figure out the proper spelling of it. The streamer uses it from thumbnails. submitted by /u/madalert123 [link] [comments]  ( 43 min )
    Ryan Reyonlds Toonified using VToonify
    submitted by /u/oridnary_artist [link] [comments]  ( 42 min )
    ByteDance AI Research Proposes a Novel Self-Supervised Learning Framework to Create High-Quality Stylized 3D Avatars with a Mix of Continuous and Discrete Parameters
    submitted by /u/ai-lover [link] [comments]  ( 43 min )
    Artificial Waifus
    Some programmer on tiktok made an AI waifu (source). He wants to redo the project using a real girl's text messages. Imagine a world where women and men can sell their text histories in order to train better models of themselves to be used by others. This is just the beginning, boys. submitted by /u/crowb1rd [link] [comments]  ( 43 min )
    14 highlights from Sam Altman's interview
    From https://smokingrobot.beehiiv.com/p/sam-altman-interview-strictly-vc ​ On the unexpected progress of AI: Everyone thought at first it comes for physical labor, like working in a factory and then truck driving, then this sort of less demanding cognitive labor, and then the really demanding cognitive labor like computer programming. And then very last of all or maybe never because maybe it's like some deep human special sauce, was creativity. And of course we can look now and say it really looks like it's going to go exactly the opposite direction. On the impact on education and other changes: There are societal changes that ChatGPT is going to cause or is causing. There's I think a big one going now about the impact of this on education, academic integrity, and all of that. But star…  ( 63 min )
    New Microsoft AI can accurately mimic a human voice after analyzing a 3-second sample
    As advancements in artificial intelligence continue to unfold at a rapid pace, it is not uncommon for individuals to express concerns about the potential implications on employment opportunities for human workers. Adding fuel to these concerns is the recent announcement made by a team of researchers at Microsoft, who have developed a new AI system capable of accurately replicating a human voice using only a three-second audio sample. This breakthrough in technology highlights the potential for AI to not only automate a plethora of tasks, but also to potentially replicate human capabilities and skills with increased accuracy and efficiency. The implications of this development are significant, as it raises important questions about the future of work and the role of AI in it. Furthermore, i…  ( 48 min )
    Is there a (as complete as possible) ranking for Language Models?
    Hello AI community, as the title says I am looking for a (up-to-date) ranking list for as many LMs (BERT, RoBERTa, T5, yada yada yada) as possible with their corresponding scores in the different tasks. Is there maybe some site which is keeping track of these scores or some awesome GitHub page? Thank you for any hints! submitted by /u/Own-Technology-9815 [link] [comments]  ( 51 min )
    DeepL launches New Product ‘Write’ To Take On Grammarly
    submitted by /u/liquidocelotYT [link] [comments]  ( 44 min )
    ✨I made a story script and Vtuber like character using 100% A.I besides editing💕✨
    submitted by /u/Recent-Dealer-5844 [link] [comments]  ( 45 min )
    Top A.I. Powered Tools Not Named ChatGPT (2nd)
    submitted by /u/BackgroundResult [link] [comments]  ( 44 min )
    Move over, ChatGPT: Israeli start-up AI21 Labs' AI to cite sources
    submitted by /u/yaitz331 [link] [comments]  ( 43 min )
    FREE Midjourney Rival Using Stable Diffusion Under The Hood!
    submitted by /u/PuppetHere [link] [comments]  ( 43 min )
    Text-to-Audio Diffusion, by flavio schneider
    Text-conditional latent audio diffusion that can generate multiple minutes of music from a textual description. See link for samples. submitted by /u/Sea_Emu_4259 [link] [comments]  ( 44 min )
    How can GPT ever compete with search databases economically?
    A GPT3 query costs at least 5c, and an google search costs 0.05 cents, that's 100 times less. Perhaps GPT3 will always be a paid service, because advertising wouldn't be profitable? I'm thinking that the cost of GPT3 will be slashed by 9 and then it will stabilize at about 5 times more expensive than databases... because the current system will be slashed by 20 times and the data volume of the NLP will grow as well. Perhaps it will cost like 10 dollars per year for a subscription to an AI query app? submitted by /u/MegavirusOfDoom [link] [comments]  ( 48 min )
    Is there an AI tool that mines your gmail and organizes things - eg sorts out your purchases and tracks warranty and more?
    Is there an AI tool that mines your gmail and organizes things - eg sorts out your purchases and tracks warranty and more? submitted by /u/dreameh [link] [comments]  ( 45 min )
    Just tested You.com AI powered Chat Box
    submitted by /u/Sphagne [link] [comments]  ( 46 min )
    A photo created by an AI
    submitted by /u/NorthTs [link] [comments]  ( 43 min )
  • Open

    Making-of for Boston Dynamic's latest Atlas demo (gripping, placing, & throwing objects; jumps & flips)
    submitted by /u/gwern [link] [comments]  ( 53 min )
    Question about optimazation problem
    Hello, I am learning currently DL, but in work I have the opportunity ( if I select to do it, I lack the theoritical background) to create an AI. I am working as analog IC engineer, in RF circuits we have transformers which they match the Zout (Output impedance) of a block to the Zin of the next block. The transformer in schematic level is comprised by 2 capacitors, 2 inductors and the coupling factor, the output which we want to have a flat gain at the freq range that we want e.g. 76 -81 GHz. Currently the rf engineers work from experiance the transfomers and they start trimming to match the Zin = Zout. because we have a lot of transformers and I think this job will be better/faster to make an AI I was thinking about RL, but I lack the experiance. So I want to ask if someone has any sources/recommendation to study and do some examples with similar objective. ​ thank you in advance submitted by /u/InvokeMeWell [link] [comments]  ( 54 min )
    I got a project that focuses on marketing and it was suggested by my senior in work that I should try reading about MAB. Aside from MAB, is there any alternatives that covers the ground of RL?
    Basically, I'm a data scientist in an AU company, most of the time we can get away it with simple hypothesis testing in this sort of stuff , linear programming and probably machine learning approaches, but the team want to do more than that so we want to go bonkers and try MAB, aside from MAB what stochastic method I should read/study so I can contribute in our current project. submitted by /u/noodlepotato [link] [comments]  ( 62 min )
    PPO with Transformer or Attention Mechanism
    I am interested in testing PPO with an attention mechanism from a psychological perspective. I was wondering if someone has successfully customized the stable_baselines3 with an attention mechanism submitted by /u/partyjunk [link] [comments]  ( 55 min )
    What does it mean if your actor is converging and your critic is diverging?
    I am trying to train an agent with DDP3+SWA. The actor's loss is going up, which I believe is a good thing because the loss is a negative expected reward, but the critic's loss is also going up so it's diverging. ​ Does anyone have any ideas about what could cause this to happen? submitted by /u/rawrzapan [link] [comments]  ( 53 min )
  • Open

    [R] Summary of developments in ML in 2022
    Google Research has a blog post of advances in ML over the last year. It obviously focusses on stuff Google Research has been involved in, but from such a big research group, thats pretty much everything. Here it is It's a good way current if you don't have time to read every paper! (Note that some sections aren't yet published) submitted by /u/londons_explorer [link] [comments]  ( 57 min )
    [P] We made an image de-identifier using Stable Diffusion!
    We combined image captioning using CLIP and image generation using the Hugging Face Stable Diffusion model to create an image de-identifier modeled after the game of telephone, Imafake! All you have to do is upload an image, convert it to a caption, then convert that caption to an image with a few clicks! You can also play with the parameters of the diffusion model depending on how gnarly you want your resulting image to be. And caution, they can get rather gnarly, but that’s what makes it fun :) Thoughts and your own generated images welcome!! https://preview.redd.it/z818pd8fivca1.jpg?width=1276&format=pjpg&auto=webp&s=54ed1a70e6ba7aa52fb662d9213e46ed5f559e5a submitted by /u/Djinn_Tonic4DataSci [link] [comments]  ( 57 min )
    [P] MNIST Clock - Generating MNIST digits on the fly in your browser
    ​ MNIST CLOCK Project https://github.com/tecbar/mnist-clock Live demo https://tecbar.github.io/mnist-clock/ (it can load for a while, because it needs to download ONNX runtime) Description Hey, this is my pet project. I trained a very simple model on MNIST dataset. The task is you input a digit and it can generate output image representation of that digit. Each time it generates a little bit different digit, because of the applied noise - actually the digit vector is applied on gaussian noise. I didn't know what to do with it so I exported the model to ONNX and used ONNX web runtime to arrange the digits in a clock - so basically everything on the live demo site is running in your browser and the clock is refreshed about 20 times per second (which was arbitrary choice). The training procedure was really simple - instead of predicting a label based on an image it tries to predict an image based on a label. Here is PyTorch implemetation. It works just fine with only two linear layers. Problems The generated digits are blurry, I guess this is because I didn't use any GAN or VAE based architecture, so the model has no idea about anything basically. ​ Model ​ https://preview.redd.it/n2lsobiigvca1.png?width=438&format=png&auto=webp&s=565bda13fd4f0d5b60cb9a71828053c726ee5301 submitted by /u/tecbar [link] [comments]  ( 58 min )
    [D] [N] Book: Multimodal Deep Learning - 239 Pages! - Matthias Aßenmacher et al
    In my opinion a must read because Multimodal Deep Learning is the future! Also because papers like this: https://arxiv.org/abs/2301.03728 show that Multimodular models significantly outperform unimodular models! Book: https://arxiv.org/pdf/2301.04856.pdf Github: https://github.com/slds-lmu/seminar_multimodal_dl Abstract: This book is the result of a seminar in which we reviewed multimodal approaches and attempted to create a solid overview of the field, starting with the current state-of-the-art approaches in the two subfields of Deep Learning individually. Further, modeling frameworks are discussed where one modality is transformed into the other, as well as models in which one modality is utilized to enhance representation learning for the other. To conclude the second part, architectures with a focus on handling both modalities simultaneously are introduced. Finally, we also cover other modalities as well as general-purpose multi-modal models, which are able to handle different tasks on different modalities within one unified architecture. One interesting application (Generative Art) eventually caps off this booklet. https://preview.redd.it/vb9lycxmfvca1.jpg?width=641&format=pjpg&auto=webp&s=6b25e0d051d5cb5c3ec0117bb383e59c23c2f984 submitted by /u/Singularian2501 [link] [comments]  ( 55 min )
    [D] Automated Extraction of Building Geometry
    I need to figure out a way to automatically create a 2D one-line drawing given a point cloud of a building. I figure this is the rough workflow of that operation, but I need to define this workflow with a much higher resolution to acquire the right tools and talent for the project. Is this a suitable application for Machine Learning? If you have any insight or ideas to share that would be very much appreciated, thanks! https://preview.redd.it/4hv1ba5h0vca1.png?width=2160&format=png&auto=webp&s=f8a9e6ae45a621d72fedceeefd7a2b06599577aa submitted by /u/EducationalLayer1051 [link] [comments]  ( 53 min )
    [R] A simple explanation of Reinforcement Learning from Human Feedback (RLHF)
    ​ https://preview.redd.it/k3ims1d6zuca1.png?width=2324&format=png&auto=webp&v=enabled&s=4f4bbe410508bdd4c45f45e55dd5c1ea0fcb5fcc You must have heard about ChatGPT. Maybe you heard that it was trained with RLHF and PPO. Perhaps you do not really understand how that process works. Then check my Gist on Reinforcement Learning from Human Feedback (RLHF): https://gist.github.com/JoaoLages/c6f2dfd13d2484aa8bb0b2d567fbf093 No hard maths, straight to the point and simplified. Hope that it helps! submitted by /u/JClub [link] [comments]  ( 57 min )
    [R] Call for Papers: 2nd International Symposium on the Tsetlin Machine
    ​ CfP ISTM 2023 Calling all machine learning researchers to contribute to or participate in the 2nd International Symposium on the Tsetlin Machine @ Newcastle upon Tyne. Please consider submitting your original, high-quality research works on any emerging ML hardware, software, application, or algorithmic topics. The emerging paradigm of Tsetlin machines provides a fundamental shift from arithmetic-based to logic-based machine learning. At the core, finite-state machines, based on learning automata, learn patterns using logical clauses, and these constitute a global description of the task learnt. In this way, the Tsetlin machine introduces the concept of logical interpretable learning, where both the learned model and the process of learning are easy to follow and explain. As a result, it reduces the expertise needed to apply ML techniques efficiently in various domains. The paradigm has enabled competitive accuracy, scalability, memory footprint, inference speed, and energy consumption across diverse tasks, including classification, convolution, regression, natural language processing (NLP), and speech understanding. https://istm.no submitted by /u/olegranmo [link] [comments]  ( 60 min )
    [D] Do you know of any model capable of detecting generative model(GPT) generated text ?
    I'm looking to detect spams generated by generative models (especially gpt). But all the ones I tried fail miserably ... submitted by /u/CaptainDifferent3116 [link] [comments]  ( 62 min )
    [R] Researchers out there: which are current research directions for tree-based models?
    Hi everybody, I've been skimming this paper since yesterday and was once again impressed by the expressiveness and practicality of tree-based models. I wondered what current research directions are in the field and what novel ideas have been presented in the last years - beyond improving performances. Examples may include better explainability, online learning, splitting criteria, enhanced or customizable loss functions, adding structure or constraints, shortcomings .... submitted by /u/BenXavier [link] [comments]  ( 59 min )
    [P] AI for Materials community
    Hey everyone, working on getting started an open and collaborative community/lab at intersection of ML/AI and materials science. One big reason is because it’s a neglected area with lots of potential with generative modeling for new discoveries. A small roadmap is we want to have intro talks on the topic to ramp members up, talks from leading researchers, of course we will be training models, trying to create larger datasets, and hopefully getting access to synthesis our findings. If this sounds interesting to you checkout the website at https://ai4mlab[dot]com and consider joining! Thanks! submitted by /u/theredditbrowser1 [link] [comments]  ( 58 min )
    [R] tasksource: Structured Dataset Preprocessing Annotations for Frictionless Extreme Multi-Task Learning and Evaluation (480 tasks+ sota encoder)
    submitted by /u/Jean-Porte [link] [comments]  ( 60 min )
    [D] How much can you add/change in a camera ready conference paper?
    Fingers and toes crossed I might have a paper accepted at ICLR, and I'm wondering how much I can add/change in the camera ready version. Typos and exposition for clarity, I assume are fine to add/change. And in some cases I have seen meta-reviews say "please address comments XYZ in the camera ready version", so I assume there is a lot of leeway. But in the absence of such a comment or comments in particular about something you should change, is it okay to change/add stuff? And if so to what degree (while keeping to the page limit). submitted by /u/tfburns [link] [comments]  ( 61 min )
  • Open

    Gaining real-world industry experience through Break Through Tech AI at MIT
    A new experiential learning opportunity challenges undergraduates across the Greater Boston area to apply their AI skills to a range of industry projects.  ( 9 min )
  • Open

    Chatbots in Healthcare [Part 2]
    In April 2017 I wrote this story on the potential use of chatbots in healthcare: https://medium.com/p/984fc23e0410 . It got over 3.5K…  ( 22 min )
    The 10 Most Powerful AI Software Products in 2023
    Businesses are moving towards AI Software Products. In fact, a recent study proves this claim by saying that nine out of ten companies…  ( 18 min )
    AI Writing Tools for Creative Writing and Fiction: Unleash Your Imagination and Write Like a Pro
    No content preview
    Multicollinearity: A Guide to Understanding and Managing the Problem in Regression Models
    Multicollinearity is a common problem that might happen in multiple regression analysis, where two or more predictor variables are highly…  ( 11 min )
    Designing great AI products — Building trust
    The following post is an excerpt from my book ‘Designing Human-Centric AI Experiences’ on applied UX design for Artificial intelligence.  ( 13 min )
    The massive disruption nobody is talking about, yet.
    Bold prediction: the evolution of machine learning models (GPT, Gopher, …) combined with the ubiquity of messaging apps (WhatsApp… Continue reading on Becoming Human: Artificial Intelligence Magazine »  ( 11 min )
    Is AI going to take my Job?☹️
    Artificial intelligence (AI) and its various applications, such as ChatGPT (if you don’t know about Chat GPT, check out my post), are…  ( 11 min )
  • Open

    Simple neural networks outperform the state-of-the-art for controlling robotic prosthetics
    submitted by /u/keghn [link] [comments]  ( 67 min )
    Do You Even Need Attention? A Stack of Feed-Forward Layers Does Surprisingly Well on ImageNet
    submitted by /u/nickb [link] [comments]  ( 68 min )
    Project advice - Deep Learning 3D models - python libraries, GPU memory management, data structuring
    I am working with 64x64x64 voxel arrays and am running into significant problems with GPU memory management. I am using TensorFlow and have an NVIDIA GeForce RTX 4080 MSI Ventus edition with 16GB of memory (purchased using research grant funding... it's sitting in a hacked together eGPU setup lol). It performs beautifully on 32x32x32 data but I can't even get started with the larger data format. I have tried limiting GPU data utilization per process, as per this post and limiting memory growth, as per this post (Ctrl+F "second option"). I have 64GB of RAM so I can fit the data into memory (even though I know that's not efficient) and was trying to put that data in a TensorFlow Dataset object, in which, according to the docs, "iteration happens in a streaming fashion, so the full dataset do…  ( 62 min )
    Two guys in London working in AI looking for volunteers to join our team in educating the public on AI
    We’re 2 Brits who work in AI. We believe AI is likely to have a huge and mostly positive impact on society but that not many people realise this or understand how it will impact everyday life. There is a lack of places online right now clearly explaining the changes AI will bring, i.e., how will AI change the experience of shopping in stores in the next 10 years or how will AI change video games in the next 10 years. We are somewhat well positioned to collate the current views on likely future changes across most areas and are in the process of starting a website and perhaps video channel which will cover how AI is likely to impact people over the next 10 years in different areas of life (movies, sports, bars, banking, schools, hospitals etc). We are looking for people to help us research, write and make videos on this cause – which we think is important to help ensure that voters don’t misunderstand AI. Alex – researches, writes, and records the audio Seb - does the video and audio editing We thought we’d put the word out and ask if anyone else would like to volunteer to help create content too. No special skills needed. Getting involved would be as easy as PMing me, hearing about how we’ve done things so far and then saying what you might be interested in helping with. Maybe thinking about ideas for topics or getting involved in research and/or article writing. We are UTC-0 but open to all. submitted by /u/TheOptimisticRogue [link] [comments]  ( 60 min )
  • Open

    Google Research, 2022 & Beyond: Language, Vision and Generative Models
    Posted by Jeff Dean, Senior Fellow and SVP of Google Research, on behalf of the Google Research community Today we kick off a series of blog posts about exciting new developments from Google Research. Please keep your eye on this space and look for the title “Google Research, 2022 & Beyond” for more articles in the series. I’ve always been interested in computers because of their ability to help people better understand the world around them. Over the last decade, much of the research done at Google has been in pursuit of a similar vision — to help people better understand the world around them and get things done. We want to build more capable machines that partner with people to accomplish a huge variety of tasks. All kinds of tasks. Complex, information-seeking tasks. Creative t…  ( 112 min )
  • Open

    Ever-Successful vs. Never-Successful: What the NFL Has to Teach Us About Managing Agile Enterprises, Part I
    A few days ago, I responded to a post on LinkedIn about how Google seems to always find a way to keep ahead of the pack, even when someone of importance leaves the company.  It occurred to me that NFL teams have to adapt and remake themselves from season to season, as players and coaches… Read More »Ever-Successful vs. Never-Successful: What the NFL Has to Teach Us About Managing Agile Enterprises, Part I The post Ever-Successful vs. Never-Successful: What the NFL Has to Teach Us About Managing Agile Enterprises, Part I appeared first on Data Science Central.  ( 22 min )
    A Practical Guide to Using Computer Vision for Business Growth
    Isn’t it fascinating how our brain processes the vast amounts of information that we receive throughout the day? Our sensory organs convert information into stimuli as they process the information they receive. Complex processes like recognizing and detecting objects require only a split second of the brain’s attention. A computer can replicate human vision using… Read More »A Practical Guide to Using Computer Vision for Business Growth The post A Practical Guide to Using Computer Vision for Business Growth appeared first on Data Science Central.  ( 24 min )
  • Open

    Sequoia Capital’s Pat Grady and Sonya Huang on Generative AI
    For insights into the future of generative AI, check out the latest episode of the NVIDIA AI Podcast. Host Noah Kravitz is joined by Pat Grady and Sonya Huang, partners at Sequoia Capital, to discuss their recent essay, “Generative AI: A Creative New World.” The authors delve into the potential of generative AI to enable Read article >  ( 4 min )
    Roll Model: Smart Stroller Pushes Its Way to the Top at CES 2023
    As any new mom or dad can tell you, parenting can be a challenge — packed with big worries and small hassles. But it may be about to get a little bit easier thanks to Glüxkind Technologies and their smart stroller, Ella. The company has just been named a CES 2023 Innovation Awards Honoree for Read article >  ( 6 min )
    Artist Zhelong Xu Brings Chinese Zodiac to Life for Lunar New Year This Week ‘In the NVIDIA Studio’
    To celebrate the upcoming Lunar New Year holiday, NVIDIA artist Zhelong Xu, aka Uncle Light, brought Chinese zodiac signs to life this week In the NVIDIA Studio — modernizing the ancient mythology in his signature style.  ( 7 min )
  • Open

    Converting between barycentric and trilinear coordinates
    Barycentric coordinates describe the position of a point relative to the three vertices of a triangle. Trilinear coordinates describe the position of a point relative to the three sides of a triangle. It’s surprisingly simple to convert from one to the other. Why should this be surprising? Because the distance from a point to a […] Converting between barycentric and trilinear coordinates first appeared on John D. Cook.  ( 5 min )

  • Open

    [D] Sport outcome predictions
    Hi all, I'm wondering why predicting outcomes of sport events like football or horse racing hasn't been achieved with machine learning tools? I guess historical data is abundant for back testing. What is so challenging with this problem? submitted by /u/proudm0 [link] [comments]  ( 56 min )
    [P] Image classification
    Hi guys, I am currently working on a DL problem where I have to classify an image dataset from Kaggle into 5 classes. The first task is to train a NN from scratch that overfits the data and then I have to modify the training process so that the network is trained without overfitting for more than double the number of epochs in the first task, keeping the same architecture, number of training images, optimizer, batch size and learnnig rate I used . I am allowed to use any architecture (resnet, alexnet, moblinet etc) or a custom model. As of now, I have tried to use resnet18 and the model overfits the data. For the second task, I apply data augmentation techniques to the training set, but the model still overfits and I am not able to find any solution. One thing I noticed, is that the validation loss at the first epochs is way lower than the training loss and it is saturating. I also tried to use a mobilenet but it stil overfits no matter how many or what augmentations I use. Can anyone recommend a solution ? submitted by /u/grisp98 [link] [comments]  ( 68 min )
    [D] Why aren't we all using linear transformers?
    There's a bunch of them - Linformer, Longformer, Performer, Nystromformer, Big Bird, etc etc. Plus a bunch more that have similar goals but don't necessarily aim for linear complexity, like memory-augmented transformers. As far as I know, none of them have really seen much use. Even for image problems, which have very long input sizes, people are using regular transformers with tokenization schemes. Am I wrong? Are they actually good, or are at least some of them better than regular transformers? If not, what's wrong with them? Do they have lower accuracy? Are they slower to train? submitted by /u/currentscurrents [link] [comments]  ( 56 min )
    [D] RLHF - What type of rewards to use?
    Hey everyone, just saw the great presentation of Nathan Lambert on Reinforcement Learning from Human Feedback and wanted to try to do some RLHF on my language model.To do this, first I need to create an experience where I collect reward scores to train the reward model. My question is: what rewards work best? Simply 👍/👎? A scale of 1-5? Ranking 4 different model outputs? There are a lot of options and I don't know which one to choose. submitted by /u/JClub [link] [comments]  ( 52 min )
    [P] Need advice on inventory planning for my capstone project.
    Hello everyone. I'm currently doing a capstone project at my university. In this project I'm working with a fashion company to address their inventory issue. They have big problems on pin pointing demand for specific products, so often end up over buying their inventory. My capstone instructor suggested we do a cluster analysis to see which products have similar demands. I'm also posting here to see what approach you guys would take to address this inventory issue submitted by /u/dingdong1882 [link] [comments]  ( 52 min )
    [R] Forcing GPT-N To Be Honest Without Supervision
    In his paper Discovering Latent Knowledge In Language Models (previous discussion), Collin Burns explains how you can train a probe on the hidden states of a language model that would classify if the model thinks an input his true or false, without access to ground truth labels. In a recent interview, he discusses high-level arguments for why this approach might work at scale on making GPT-N honest. He also talks more generally about his approach to doing research. submitted by /u/MuskFeynman [link] [comments]  ( 56 min )
    [R]: 15-step framework to analyze your chatbot and designate improvement steps
    Are you sure your Conversational AI solution is on the right path? Our chatbot evaluation metrics pinpoint if your solution leveraging the best of the industry’s leading practices, meeting user expectations, and fully taking advantage of the available technology to ensure frictionless and efficient experiences. https://masterofcode.com/chatbot-analysis-framework submitted by /u/Marinuch [link] [comments]  ( 57 min )
    [P] RWKV 14B Language Model & ChatRWKV : pure RNN (attention-free), scalable and parallelizable like Transformers
    Hi everyone. I am training my RWKV 14B ( https://github.com/BlinkDL/RWKV-LM ) on the Pile (332B tokens) and it is getting closer to GPT-NeoX 20B level. You can already try the latest checkpoint. https://preview.redd.it/7ycdftmjvmca1.png?width=1174&format=png&auto=webp&s=860a41193f1a254299d48a173756ecd66ccbc75b RWKV is a RNN that also works as a linear transformer (or we may say it's a linear transformer that also works as a RNN). So it has both parallel & serial mode, and you get the best of both worlds (fast and saves VRAM). At this moment, RWKV might be the only pure RNN that scales like usual transformers for language modeling, without using any QKV attention. It's great at preserving long context (unlike LSTM). Moreover, you get smooth spike-free carefree training experience (bf16 & Adam): https://preview.redd.it/0g3lrg6mvmca1.png?width=871&format=png&auto=webp&s=b4de1af4831ec359079cf99c41df8aa9591d48b0 As a proof of concept, I present ChatRWKV ( https://github.com/BlinkDL/ChatRWKV ). It's not instruct-tuned yet, and there are few conversations in the Pile, so don't expect great quality. But it's already fun. Chat examples (using slightly earlier checkpoints): https://preview.redd.it/zyqni6bpvmca1.png?width=1084&format=png&auto=webp&s=038fd2eab524c36d8aa2a8720a2caa3eb420df5b https://preview.redd.it/xhje4j7qvmca1.png?width=1200&format=png&auto=webp&s=7e8597d2370f9f87230560dac7f5439520384dd9 And you can chat with the bot (or try free generation) in RWKV Discord (link in Github readme: https://github.com/BlinkDL/RWKV-LM ). This is an open source project and let's build together. submitted by /u/bo_peng [link] [comments]  ( 65 min )
    [D] Unlocking the Potential of ChatGPT: A Community Discussion
    OpenAI's announcement of the release of the ChatGPT API has many of us excited about the potential applications and implications of this powerful language model. It has the ability to revolutionize the way we interact with technology and solve a wide range of problems. As a community, let's discuss the possibilities. What are some unique and innovative ways ChatGPT could be utilized? Are there any particular industries or markets that you think could benefit from the integration of ChatGPT? Let's share our thoughts and ideas, and explore the potential of this technology together. It's always exciting to see how advancements in AI can improve our world. ​ This post was written by ChatGPT submitted by /u/North-Ad6756 [link] [comments]  ( 56 min )
    [D] Are there any results on convergence guarantees when optimizing NNs?
    Given a function in some space, I have literature results that say, the function can theoretically be approximated by a Neural Network of such complexity with so many layers, of such width, with this specific given activation function. OK, so theoretically, there is a set of weights and biases that will result in a pretty good approximation of my function. Now the question is, how do I know that given an optimization method, for example stochastic gradient descent, I will actually reach this minimum or near enough to it, in so many training steps, or even at all? I attended a talk last year in which one speaker claimed that due to the way stochastic gradient descent works, it could be that some minimums are never reachable from some initialization states no matter how long one trains. Unfortunately I cannot find what paper/theorem he was referring to. I am interested in results related to this question. submitted by /u/Dartagnjan [link] [comments]  ( 57 min )
    [N] Getty Images is suing the creators of AI art tool Stable Diffusion for scraping its content
    From the article: Getty Images is suing Stability AI, creators of popular AI art tool Stable Diffusion, over alleged copyright violation. In a press statement shared with The Verge, the stock photo company said it believes that Stability AI “unlawfully copied and processed millions of images protected by copyright” to train its software and that Getty Images has “commenced legal proceedings in the High Court of Justice in London” against the firm. submitted by /u/Wiskkey [link] [comments]  ( 63 min )
  • Open

    Join us today at 11pm EST for this week's (free) seminar session of the 9-part series on Neural Networks Architectures by Pablo Duboue!
    Happening tonight at 11 pm EST on the Learn AI Together Discord server. This week's seminar session is about Popular Network Architectures. More precisely, Pablo will present... Multi-task learning. Siamese Networks. Generative Adversarial Networks (GAN). Style Transfer. Disentangled Representation Learning. Rich Caruana (1997). “Multitask learning”. In: Machine learning 28.1, pp. 41–75 Ting Gong et al. (Sept. 2019). “A Comparison of Loss Weighting Strategies for Multi task Learning in Deep Neural Networks”. In: IEEE Access PP, pp. 1–1. DOI : 10.1109/ACCESS.2019.2943604 Jane Bromley et al. (1993). “Signature verification using a "siamese" time delay neural network”. In: Advances in neural information processing systems 6 Ian Goodfellow, Jean Pouget-Abadie, et al. (2014). “Generative Adversarial Nets”. In: Advances in Neural Information Processing Systems. Ed. by Z. Ghahramani et al. Vol. 27. Curran Associates, Inc. Xi Chen et al. (2016). “Infogan: Interpretable representation learning by information maximizing generative adversarial nets”. In: Advances in neural information processing systems 29 Leon A Gatys, Alexander S Ecker, and Matthias Bethge (2016). “Image style transfer using convolutional neural networks”. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2414–2423 Sounds interesting? Join our Discord community to attend the event and future ones: https://discord.gg/c6kbhNdmmA?event=1062742110295572500 submitted by /u/OnlyProggingForFun [link] [comments]  ( 52 min )
    Sugandha Sharma, MIT: On biologically inspired neural architectures, how memories can be implemented, and control theory
    Here is a podcast episode with Sugandha Sharma from MIT where we discuss how memories can be implemented, control theory, and much more! submitted by /u/thejashGI [link] [comments]  ( 44 min )
    Why Falling in Love with AI is a Dangerous Illusion — The Limitations and Harms of Artificial…
    submitted by /u/SupPandaHugger [link] [comments]  ( 48 min )
    AI to analyze data/spreadsheets, any thoughts?
    I came across this AI chatbot recently, where you can ask quantitative/qualitative questions about your data/spreadsheet in English. It felt like if ChatGPT and Excel had a baby LOL It worked for my qualitative survey data -- shocking... Do you know any other ones like this? Any thoughts in general? submitted by /u/AdDry9057 [link] [comments]  ( 46 min )
    How can AI help people in developing countries?
    I am planing to take part in a AI contest, so I am collecting ideas about my project. I think my project will be more recognized if it will have to do something with helping people in developing countries, so my question is: how can AI help people in developing countries? submitted by /u/zazabuzala [link] [comments]  ( 49 min )
    AI created content.
    Should we know what content was created using AI? What should we do to develop AI detection tools or any other ideas? submitted by /u/Andrey_Taran [link] [comments]  ( 44 min )
    Trullion to release AI bookkeeping software
    Trullion, a leading accounting automation platform, has launched two new AI-enabled modules, Revenue by Trullion and Audit by Trullion, to modernize and digitize the process of accounting. The first module, Revenue by Trullion, uses AI to synchronize customer relationship management (CRM), billing, and contract data into a single platform for internal and external stakeholders, allowing for ERP entries, disclosure reports, and advanced reporting to be generated quickly and accurately. The second module, Audit by Trullion's Test of Details workflow, uses AI to extract ERP/General Ledger (GL) files and instantly validate them against source data, such as invoices, PDFs, and other client sources. Found on https://deathtohumans.com/post/openai-monetizes-chatgpt submitted by /u/crowb1rd [link] [comments]  ( 46 min )
    Are you sure your Conversational AI solution is on the right path? 🤔 15-step framework to analyze your chatbot and designate improvement steps
    submitted by /u/Marinuch [link] [comments]  ( 48 min )
    The AI Lawyer Preparing To Defend a Real US Court Case for the First Time Ever Has Terrible Reviews
    submitted by /u/HODLTID [link] [comments]  ( 44 min )
    Two AI workers in London looking for volunteers to join our team in educating the public on AI
    We’re 2 Brits who work in AI. We believe AI is likely to have a huge and mostly positive impact on society but that not many people realise this or understand how it will impact everyday life. There is a lack of places online right now clearly explaining the changes AI will bring, i.e., how will AI change the experience of shopping in stores in the next 10 years or how will AI change video games in the next 10 years. We are somewhat well positioned to collate the expert views on likely future impacts and are in the process of starting a website and YouTube channel which will cover how AI is likely to impact people over the next 10 years in different areas of life (movies, sports, bars, schools, hospitals etc). We are looking for people to help us research, write and make videos on this cause – which we think is important to help ensure voters pressure the government to develop AI safely. · Alex – researches, writes, and records the audio · Seb - does the video and audio editing We thought we’d put the word out and ask if anyone else would like to be involved to see if you might be interested too. Getting involved would be as easy as PMing me, hearing about how we’ve done things so far and then saying what you might be interested in helping with. Maybe thinking about ideas for topics or getting involved in research and/or article writing. submitted by /u/TheOptimisticRogue [link] [comments]  ( 48 min )
    🚀Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation
    submitted by /u/oridnary_artist [link] [comments]  ( 43 min )
    Microsoft’s Azure OpenAI Service Gets A Boost With ChatGPT.
    submitted by /u/liquidocelotYT [link] [comments]  ( 44 min )
    AI title for human creators
    does anyone think at any point, AIs will/would (or some independent ones) would call us, their human creators, Rule Makers? Instead of the Cybernetic Gods are are coming to be. submitted by /u/jonfxm1891 [link] [comments]  ( 50 min )
  • Open

    "Neural probabilistic motor primitives for humanoid control", Merel et al 2018 {DM}
    submitted by /u/gwern [link] [comments]  ( 54 min )
    Why is 'reward shaping' neglected?
    There are 6 levers. All 6 must be activated to recover reward. Scenario 1: Pull = -1 Outcome: Agent commits suicide (in order to minimize the negative rewards accumulating). ​ Scenario 2: Pull = +1 Outcome: Agent pulls the same lever forever. ​ Scenario 3: Pull = 0 i.e. Sparse reward Outcome: Agent doesn't accidentally pull all 6 often enough; doesn't learn anything useful. ​ No matter what algorithm (or how groundbreaking it is), Authors rarely justify their choices with respect to rewards. I don't mean to be able to compare and benchmark. I mean fundamentally, what's driving learning in the scenario is never substantiated. Analogy: Exact value of a hyperparameter is irrelevant, but having hyperparameters at all should and often is discussed. Am I missing something? submitted by /u/XecutionStyle [link] [comments]  ( 57 min )
    What's the best "Non-Black Box" framework for SOTA algorithms?
    Hi all, In my research I usually implement the algorithms from scratch, but given that this not only takes a lot of programing time and may introduce bugs without us knowing, I would like to know that are the best hackable frameworks out there. I would like a good framework that is easy to tinker with the algorithms' code and have good agent interfaces (agent.act, agent.step, etc...) and does not encapsulate everything (like god forbid, fucking stable-baselines and others that do things like model.learn(env)). What are your recommendations? submitted by /u/HyperionTone [link] [comments]  ( 54 min )
  • Open

    DSC Weekly 17 January 2023 – The Creative Spark in AI
    Announcements The Creative Spark in AI The idea that AI can generate art that can mimic human artists came to the forefront of discussions about AI ethics in 2022. There are undoubtedly many legal and ethical issues to tackle in those cases. If a model is trained on thousands of examples of a specific artist’s… Read More »DSC Weekly 17 January 2023 – The Creative Spark in AI The post DSC Weekly 17 January 2023 – The Creative Spark in AI appeared first on Data Science Central.  ( 20 min )
    Python for Business Analytics: Top Benefits
    Companies and businesses today need modern programming tools in order to build the many, many advanced tools and solutions they need to keep their operations running seamlessly. So, what do companies use to build business analysis solutions? Python! Why? Because it is easy to learn, offers high-quality community support, etc. Python is an all-purpose programming… Read More »Python for Business Analytics: Top Benefits The post Python for Business Analytics: Top Benefits appeared first on Data Science Central.  ( 19 min )
    Mobile Biometric Solutions: Game-Changer in the Authentication Industry
    Mobile-based biometrics is a technology that allows users to authenticate themselves and access services using unique physical characteristics such as fingerprints, facial recognition, and iris scans. These biometric authentication methods have become increasingly popular in recent years due to their convenience and security. There are several types of smartphone-based biometrics technology currently available, including: Emerging… Read More »Mobile Biometric Solutions: Game-Changer in the Authentication Industry The post Mobile Biometric Solutions: Game-Changer in the Authentication Industry appeared first on Data Science Central.  ( 20 min )
    What to make of Deepmind’s Sparrow:  Is it a sparrow or a hawk?
    What to make of Deepmind’s Sparrow:  Is it a sparrow or a hawk? ie a chatGPT killer Recently, Demis Hassabis from DeepMind has been urging caution (DeepMind’s CEO Helped Take AI Mainstream. Now He’s Urging Caution Time magazine/Davos) DeepMind also announced a new chat engine called Sparrow – supposedly a chatGPT killer  Sparrow is not… Read More »What to make of Deepmind’s Sparrow:  Is it a sparrow or a hawk? The post What to make of Deepmind’s Sparrow:  Is it a sparrow or a hawk? appeared first on Data Science Central.  ( 19 min )
    Preconditions for decoupled and decentralized data-centric systems
    During a presentation at the TechTarget/BrightTALK Accelerating Cloud Innovation event this past December, I named the fifth phase of compute, networking and storage that we’ve entered the “Decoupled” and “Decentralized” Cloud.The quotation marks emphasized that what we’ve been experiencing is neither truly decoupled nor decentralized, but even so, the direction we’re headed in is toward… Read More »Preconditions for decoupled and decentralized data-centric systems The post Preconditions for decoupled and decentralized data-centric systems appeared first on Data Science Central.  ( 21 min )
    5 Tips To Protect Yourself from Identity Theft in 2023
    Identity theft is the process of stealing personally identifiable information (PII) to either defraud the victim or make the victim a scapegoat in a large-scale cyberattack. Attackers gain access to sensitive information such as social security numbers and credit cards that are used to collate a person’s identity.   According to a report by the Federal… Read More »5 Tips To Protect Yourself from Identity Theft in 2023 The post 5 Tips To Protect Yourself from Identity Theft in 2023 appeared first on Data Science Central.  ( 22 min )
    What is a Good Net Promoter Score for the Hotel/Resort Industry?
    The hotel industry is competitive, and it is solely dependent on customer satisfaction. Customers are key.  The hotel industry knows this and the importance of the NPS score for customer satisfaction. A better NPS score means satisfied/loyal customers.  What hotels have in their control is the website user interface, menu, and providing a seamless customer… Read More »What is a Good Net Promoter Score for the Hotel/Resort Industry? The post What is a Good Net Promoter Score for the Hotel/Resort Industry? appeared first on Data Science Central.  ( 21 min )
    6 Benefits of Data Science for Your Business
    We would not discover a new planet if we claimed that modern business harnesses the power of data science. Data science is used for a variety of purposes in a variety of industries. Furthermore, we would like to discuss the benefits of Data science for business in general. But before that, let’s define what Data… Read More »6 Benefits of Data Science for Your Business The post 6 Benefits of Data Science for Your Business appeared first on Data Science Central.  ( 22 min )
    7 Reasons Why Fast-Growing Businesses Are Turning to Virtual Colocation in 2023
    By 2025, more than 80% of enterprises will shift from traditional data centers to the cloud or third-party colocation data centers. For most businesses, data is an irreplaceable asset and a key investment area for future growth. Virtual colocation is becoming the talk of how data centers are shifting to adapt to growing business environments.… Read More »7 Reasons Why Fast-Growing Businesses Are Turning to Virtual Colocation in 2023 The post 7 Reasons Why Fast-Growing Businesses Are Turning to Virtual Colocation in 2023 appeared first on Data Science Central.  ( 20 min )
  • Open

    Set up Amazon SageMaker Studio with Jupyter Lab 3 using the AWS CDK
    Amazon SageMaker Studio is a fully integrated development environment (IDE) for machine learning (ML) partly based on JupyterLab 3. Studio provides a web-based interface to interactively perform ML development tasks required to prepare data and build, train, and deploy ML models. In Studio, you can load data, adjust ML models, move in between steps to adjust experiments, […]  ( 6 min )
    Churn prediction using multimodality of text and tabular features with Amazon SageMaker Jumpstart
    Amazon SageMaker JumpStart is the Machine Learning (ML) hub of SageMaker providing pre-trained, publicly available models for a wide range of problem types to help you get started with machine learning. Understanding customer behavior is top of mind for every business today. Gaining insights into why and how customers buy can help grow revenue. Customer churn is […]  ( 14 min )
  • Open

    🚀Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation
    submitted by /u/oridnary_artist [link] [comments]  ( 53 min )
  • Open

    NVIDIA and Dell Technologies Expand AI Portfolio
    In their largest-ever joint AI initiative, NVIDIA and Dell Technologies today launched a wave of Dell PowerEdge systems available with NVIDIA acceleration, enabling enterprises to efficiently transform their businesses with AI. A total of 15 next-generation Dell PowerEdge systems can draw from NVIDIA’s full AI stack — including GPUs, DPUs and the NVIDIA AI Enterprise Read article >  ( 5 min )
  • Open

    Special primality proofs
    I’ve written lately about two general ways to prove that a number is prime: Pratt certificates for moderately-large primes and elliptic curve certificates for very large primes. If you can say more about the prime you wish to certify, there may be special forms of certificates that are more efficient. In particular, there are efficient […] Special primality proofs first appeared on John D. Cook.  ( 5 min )

  • Open

    Implementing Gradient Descent in PyTorch
    The gradient descent algorithm is one of the most popular techniques for training deep neural networks. It has many applications in fields such as computer vision, speech recognition, and natural language processing. While the idea of gradient descent has been around for decades, it’s only recently that it’s been applied to applications related to deep […] The post Implementing Gradient Descent in PyTorch appeared first on MachineLearningMastery.com.  ( 25 min )

  • Open

    Training a Linear Regression Model in PyTorch
    Linear regression is a simple yet powerful technique for predicting the values of variables based on other variables. It is often used for modeling relationships between two or more continuous variables, such as the relationship between income and age, or the relationship between weight and height. Likewise, linear regression can be used to predict continuous […] The post Training a Linear Regression Model in PyTorch appeared first on MachineLearningMastery.com.  ( 24 min )
    Making Linear Predictions in PyTorch
    Linear regression is a statistical technique for estimating the relationship between two variables. A simple example of linear regression is to predict the height of someone based on the square root of the person’s weight (that’s what BMI is based on). To do this, we need to find the slope and intercept of the line. […] The post Making Linear Predictions in PyTorch appeared first on MachineLearningMastery.com.  ( 21 min )

  • Open

    Loading and Providing Datasets in PyTorch
    Structuring the data pipeline in a way that it can be effortlessly linked to your deep learning model is an important aspect of any deep learning-based system. PyTorch packs everything to do just that. While in the previous tutorial, we used simple datasets, we’ll need to work with larger datasets in real world scenarios in […] The post Loading and Providing Datasets in PyTorch appeared first on MachineLearningMastery.com.  ( 20 min )

  • Open

    Using Dataset Classes in PyTorch
    In machine learning and deep learning problems, a lot of effort goes into preparing the data. Data is usually messy and needs to be preprocessed before it can be used for training a model. If the data is not prepared correctly, the model won’t be able to generalize well. Some of the common steps required […] The post Using Dataset Classes in PyTorch appeared first on MachineLearningMastery.com.  ( 21 min )

  • Open

    Calculating Derivatives in PyTorch
    Derivatives are one of the most fundamental concepts in calculus. They describe how changes in the variable inputs affect the function outputs. The objective of this article is to provide a high-level introduction to calculating derivatives in PyTorch for those who are new to the framework. PyTorch offers a convenient way to calculate derivatives for […] The post Calculating Derivatives in PyTorch appeared first on Machine Learning Mastery.  ( 20 min )

  • Open

    Two-Dimensional Tensors in Pytorch
    Two-dimensional tensors are analogous to two-dimensional metrics. Like a two-dimensional metric, a two-dimensional tensor also has $n$ number of rows and columns. Let’s take a gray-scale image as an example, which is a two-dimensional matrix of numeric values, commonly known as pixels. Ranging from ‘0’ to ‘255’, each number represents a pixel intensity value. Here, […] The post Two-Dimensional Tensors in Pytorch appeared first on Machine Learning Mastery.  ( 21 min )

  • Open

    One-Dimensional Tensors in Pytorch
    PyTorch is an open-source deep learning framework based on Python language. It allows you to build, train, and deploy deep learning models, offering a lot of versatility and efficiency. PyTorch is primarily focused on tensor operations while a tensor can be a number, matrix, or a multi-dimensional array. In this tutorial, we will perform some […] The post One-Dimensional Tensors in Pytorch appeared first on Machine Learning Mastery.  ( 22 min )

  • Open

    365 Data Science courses free until November 21
    Sponsored Post   The unlimited access initiative presents a risk-free way to break into data science.     The online educational platform 365 Data Science launches the #21DaysFREE campaign and provides 100% free unlimited access to all content for three weeks. From November 1 to 21, you can take courses from renowned instructors and earn […] The post 365 Data Science courses free until November 21 appeared first on Machine Learning Mastery.  ( 15 min )

  • Open

    Attend the Data Science Symposium 2022, November 8 in Cincinnati
    Sponsored Post      Attend the Data Science Symposium 2022 on November 8 The Center for Business Analytics at the University of Cincinnati will present its annual Data Science Symposium 2022 on November 8. This all day in-person event will have three featured speakers and two tech talk tracks with four concurrent presentations in each track. The […] The post Attend the Data Science Symposium 2022, November 8 in Cincinnati appeared first on Machine Learning Mastery.  ( 10 min )

  • Open

    My family's unlikely homeschooling journey
    My husband Jeremy and I never intended to homeschool, and yet we have now, unexpectedly, committed to homeschooling long-term. Prior to the pandemic, we both worked full-time in careers that we loved and found meaningful, and we sent our daughter to a full-day Montessori school. Although I struggled with significant health issues, I felt unbelievably lucky and fulfilled in both my family life and my professional life. The pandemic upended my careful balance. Every family is different, with different needs, circumstances, and constraints, and what works for one may not work for others. My intention here is primarily to share the journey of my own (very privileged) family. Our unplanned introduction to homeschooling For the first year of the pandemic, most schools in California, where …  ( 7 min )

  • Open

    The Jupyter+git problem is now solved
    Jupyter notebooks don’t work with git by default. With nbdev2, the Jupyter+git problem has been totally solved. It provides a set of hooks which provide clean git diffs, solve most git conflicts automatically, and ensure that any remaining conflicts can be resolved entirely within the standard Jupyter notebook environment. To get started, follow the directions on Git-friendly Jupyter. Contents The Jupyter+git problem The solution The nbdev2 git merge driver The nbdev2 Jupyter save hook Background The result Postscript: other Jupyter+git tools ReviewNB An alternative solution: Jupytext nbdime The Jupyter+git problem Jupyter notebooks are a powerful tool for scientists, engineers, technical writers, students, teachers, and more. They provide an ideal notebook environment for interact…  ( 7 min )
2023-02-16T00:56:12.044Z osmosfeed 1.15.1